My SRE Tasks
Automation
Deployment automation - Jenkins and GitHub Actions
AWS automation - Python boto3
Other API automation - Python requests module (Loki, ClickHouse, Okta)
Ansible for user onboarding and offboarding
Bash scripting for Linux tasks like backup and push to S3 / user creation
PowerShell - for Windows automation
Terraform for VPC, EKS
Common
user onboarding, installing software, backup
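The backup-and-push-to-S3 task above can be sketched in Python with boto3. This is a minimal sketch, not the actual script: the key layout, the function names, and the bucket and host values are assumptions made for illustration.

```python
import tarfile
from datetime import datetime
from pathlib import Path


def make_backup_key(prefix: str, host: str, when: datetime) -> str:
    """Build a date-partitioned S3 object key (this layout is an assumption)."""
    return f"{prefix}/{host}/{when:%Y/%m/%d}/backup-{when:%H%M%S}.tar.gz"


def create_archive(source_dir: str, archive_path: str) -> str:
    """Pack source_dir into a gzip-compressed tarball and return its path."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=Path(source_dir).name)
    return archive_path


def push_to_s3(archive_path: str, bucket: str, key: str, s3_client=None) -> None:
    """Upload the archive to S3; a client can be injected for testing."""
    if s3_client is None:
        # Imported lazily so the helpers above work without boto3 or AWS access.
        import boto3
        s3_client = boto3.client("s3")
    s3_client.upload_file(archive_path, bucket, key)
```

Typical use would be create_archive, then make_backup_key, then push_to_s3, run from cron or a scheduler. Credentials are expected to come from the standard AWS credential chain.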
Monitoring
Prometheus, Grafana, Loki
System metrics - node exporter
Logs - Loki
Traces - Python library
Blackbox exporter for SSL, health checks, and other metrics
SLA Dashboard
Understand the ELK Stack flow and system
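The SSL check in the list above comes down to reading a certificate's expiry date and comparing it with the current time. Below is a rough Python sketch of that idea using only the standard library. It is not how the blackbox exporter is implemented, and the function names are made up for illustration.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the server certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> int:
    """Whole days left before the certificate expires (negative once expired)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return int((expires_at - current) // 86400)
```

An alert rule could then fire when days_until_expiry(fetch_not_after(host)) drops below a chosen threshold, for example 14 days.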
Data
SQL - PostgreSQL
Data warehouse - ClickHouse
Data analytics - Apache Superset
Data pipeline - Airbyte
REAL TIME TASKS
Job failed - create ticket
Automated SOP
AI bot - runs scripts and tries to fix the issue; if it is not fixed, the ticket is assigned to an SRE, who will analyse, do the RCA, debug, and fix the issue
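The flow above (ticket, automated fix attempts, escalation to an SRE) could look roughly like this in Python. It is only a sketch: the Ticket fields, the status names, and the "sre-on-call" assignee are assumptions, and the remediation steps stand in for whatever scripts the bot actually runs.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Ticket:
    job_name: str
    status: str = "open"            # open -> resolved | escalated
    assignee: str = "automation"
    notes: List[str] = field(default_factory=list)


def handle_job_failure(job_name: str, remediations: List[Callable[[str], bool]]) -> Ticket:
    """Open a ticket, try each remediation in order, escalate if none fixes the job."""
    ticket = Ticket(job_name=job_name)
    for step in remediations:
        try:
            fixed = step(job_name)
        except Exception as exc:
            # A crashing script counts as "not fixed"; record it and move on.
            ticket.notes.append(f"{step.__name__} raised {exc!r}")
            continue
        ticket.notes.append(f"{step.__name__}: {'fixed' if fixed else 'not fixed'}")
        if fixed:
            ticket.status = "resolved"
            return ticket
    # Nothing worked: hand over to a human for analysis, RCA, and the fix.
    ticket.status = "escalated"
    ticket.assignee = "sre-on-call"
    return ticket
```

Each remediation is a plain function that takes the job name and returns True when the job is healthy again. The notes on the ticket give the SRE a record of what was already tried.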
Monitoring alerts - Alertmanager (SSL, health check)
Job failure alerts (Semaphore UI - Teams, Outlook, and ticket)
Build failure alerts (CI/CD - ticket)
Issues/incidents
Disk full
High memory usage
SSL expiry
Domain down
Server down
Pipeline failed
Job execution failed
New intern onboarding jobs - install software, create user, grant access
New API/web app deployment
Migration
Exploring new tools
Improving system performance
Developer issues - access denied, CORS error
Request blocked - check WAF events
Set up WAF rules
DB backups
DB connection issue - pool timeout
Suspicious activity - rate limit or block bots
Setting up a new environment for a client
Log rotation
Rotate secrets - Vault
Docker image building
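As an example for the "disk full" case, a quick check can be scripted with Python's standard library before it turns into an incident. This is a minimal sketch. The 90% threshold and the function name are assumptions, and in practice the node exporter metrics would normally drive this alert.

```python
import shutil
from typing import Callable, Iterable, List, Tuple


def disk_alerts(
    paths: Iterable[str],
    threshold_pct: float = 90.0,
    usage_fn: Callable = shutil.disk_usage,
) -> List[Tuple[str, float]]:
    """Return (path, used percent) for every path at or above the threshold."""
    alerts = []
    for path in paths:
        usage = usage_fn(path)  # named tuple with total, used, free (bytes)
        used_pct = usage.used * 100 / usage.total
        if used_pct >= threshold_pct:
            alerts.append((path, round(used_pct, 1)))
    return alerts


if __name__ == "__main__":
    for path, pct in disk_alerts(["/"], threshold_pct=90.0):
        print(f"ALERT: {path} is {pct}% full")
```

The usage_fn parameter exists only so the check can be tested without filling a real disk.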
Learning new systems like Kubernetes
SRE principles
Focusing on how to make the system better by leveraging Kubernetes concepts, e.g. using pod affinity to reduce latency by scheduling the DB and API pods close together
monitoring systems - metrics, traces, logs, events, auditing
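The pod affinity idea mentioned above can be written as a small manifest fragment. The Python sketch below builds it as a dict and prints it as JSON. The label app=db is a hypothetical label for the DB pods, and the soft "preferred" form is an assumption, chosen so that the API pod can still be scheduled when no DB node has room.

```python
import json


def pod_affinity_for(labels: dict, topology_key: str = "kubernetes.io/hostname", weight: int = 100) -> dict:
    """Build a soft podAffinity block that prefers nodes already running pods with these labels."""
    return {
        "affinity": {
            "podAffinity": {
                "preferredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "weight": weight,
                        "podAffinityTerm": {
                            "labelSelector": {"matchLabels": labels},
                            "topologyKey": topology_key,
                        },
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical label: the DB pods are assumed to carry app=db.
    print(json.dumps(pod_affinity_for({"app": "db"}), indent=2))
```

The resulting block belongs in the API pod's spec (spec.template.spec in a Deployment). Changing topology_key to topology.kubernetes.io/zone would keep the pods in the same zone rather than on the same node.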