Day to Day Life
Manage container workloads on Docker and Kubernetes.
Manage container deployments.
Manage the Linux environment: disk cleanup, memory, and CPU.
Debug using logs (nginx, containers).
Manage nginx and certbot.
Clean up old/unused Docker images and containers.
Scan Docker images for vulnerabilities (Trivy, AWS scan)
Manage Kubernetes objects: Deployments, Services, Ingress, ConfigMaps, Secrets.
Monitor Pod health using kubectl get pods -o wide.
Identify and resolve failed pods.
Maintain CI/CD pipelines for Docker and Kubernetes.
Back up and restore databases.
Manage DNS and WAF rules for security (Cloudflare).
Document steps.
Check Grafana dashboards for system health.
Monitor the system during heavy events (sync, large queries).
Respond to alerts and incidents.
Investigating logs and metrics to debug system issues.
Conducting post-mortems and writing RCA (Root Cause Analysis).
Writing scripts (Bash, Python) and using Ansible to automate deployments, monitoring, scaling, backups, and SSL renewal.
Implementing auto-scaling mechanisms (Kubernetes HPA, Karpenter, AWS ASG).
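To make the Kubernetes HPA item above more concrete, here is a minimal sketch of the scaling rule described in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), with a default tolerance of 10% around the target. The function name and parameters are illustrative, and the sketch ignores min/max replica bounds and stabilization windows, which the real controller also applies.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Replica count suggested by the documented HPA formula (simplified)."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

This can help when sanity-checking why an HPA scaled to a given size during an event such as a sync or a large query.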
Fix build alerts. Typical causes:
Dependency issues (unsuitable or outdated package version, missing environment variable)
Build timeout (memory or disk full)
DB connection issues
Reduced deployment speed
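Two of the causes above (a missing environment variable and a full disk) can be caught before a build starts. The following is a minimal preflight sketch, not an existing tool: the function names, the 10% free-space threshold, and the variable names `DATABASE_URL` and `REGISTRY_TOKEN` are all assumptions chosen for illustration.

```python
import os
import shutil
from typing import Iterable, List, Mapping


def missing_env_vars(required: Iterable[str],
                     environ: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


def free_disk_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def preflight(required: Iterable[str], path: str = ".",
              min_free: float = 0.10,
              environ: Mapping[str, str] = os.environ) -> List[str]:
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    missing = missing_env_vars(required, environ)
    if missing:
        problems.append("missing environment variables: " + ", ".join(missing))
    if free_disk_fraction(path) < min_free:
        problems.append(f"less than {min_free:.0%} disk free on {path}")
    return problems


if __name__ == "__main__":
    # Hypothetical variable names; replace with what the build really needs.
    for problem in preflight(["DATABASE_URL", "REGISTRY_TOKEN"]):
        print("preflight:", problem)
```

Running a check like this as the first pipeline stage tends to turn a slow timeout into an immediate, readable failure.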
Explore new technologies and implement them.
Manage CI/CD (Jenkins)
Role-based Access Control plugin
https://www.youtube.com/watch?v=HqfWj244x4U
Backup and restore
ThinBackup plugin
https://youtu.be/5Tb-AOUFuKQ?si=uMrEbQZPd13i6q0E
Add agents
Update Jenkins
https://youtu.be/u4tIbFF4thM?si=wlxVNZEWd8d2l-lr
Check Jenkins logs (/var/log/jenkins/jenkins.log or /var/lib/jenkins/logs).
Look for failing jobs and identify issues.
Monitor Jenkins UI: Manage Jenkins → System Information.
Restart Jenkins if needed (systemctl restart jenkins).
Add/remove users using RBAC (Role-Based Access Control).
Ensure plugins are up-to-date but avoid breaking changes.
Assign proper permissions for developers, QA, and admins.
Clean up old builds (Manage Jenkins → Build Discarder).
Use ThinBackup or tar to back up /var/lib/jenkins/.
Delete old jobs & unused workspaces.
Reduce job execution time by caching dependencies.
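The tar-based backup of /var/lib/jenkins/ mentioned above could be scripted roughly as follows. This is a sketch under assumptions: `backup_directory` is a hypothetical helper, the destination path in the example comment is made up, and the script does not stop Jenkins first, so files written during the backup may be captured mid-change.

```python
import tarfile
import time
from pathlib import Path


def backup_directory(source: str, dest_dir: str) -> Path:
    """Write a timestamped .tar.gz of `source` into `dest_dir`; return its path."""
    src = Path(source)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{src.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps archive paths relative (jenkins/...), not absolute.
        tar.add(src, arcname=src.name)
    return archive


# Example call (hypothetical destination; needs read access to the source):
# backup_directory("/var/lib/jenkins", "/backups/jenkins")
```

Restoring is then a matter of extracting the archive back into /var/lib/ while Jenkins is stopped.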
Database (PostgreSQL), caching (Redis)
Grafana (RBAC): view and create panels
Manage Prometheus and Grafana
Monitor firing alerts in Alertmanager.
Fix failing alert rules in the Prometheus alerting rules file (rules.yml).
Delete old metrics if storage is full.
Prometheus: add agents and set up alerts.
If targets show DOWN, restart the Prometheus exporters.
Update Grafana dashboards (adjust panels, fix broken queries).
Optimize PromQL queries to reduce resource usage.
Back up the Prometheus TSDB and Grafana.
Rotate Grafana admin passwords.
Limit Prometheus data retention so that older data is removed automatically:
prometheus --storage.tsdb.retention.time=15d
Manage Grafana user roles and permissions (Manage Users → Org Roles).
Remove Grafana logs older than 30 days.
Upgrade Prometheus and Grafana.
Secure Prometheus & Grafana with basic auth / OAuth / SSO.
Set up alerting to Teams (alert channels: webhook, email, Teams).
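The "remove Grafana logs older than 30 days" task in the list above could be automated along these lines. This is a sketch, not the original command: the helper names are hypothetical, /var/log/grafana is an assumed log location (check the `[paths]` section of your Grafana config), and deletion only happens when `dry_run=False` is passed explicitly.

```python
import time
from pathlib import Path
from typing import List, Optional


def old_files(directory: str, max_age_days: int = 30,
              now: Optional[float] = None) -> List[Path]:
    """Regular files directly under `directory` modified over max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir()
                  if p.is_file() and p.stat().st_mtime < cutoff)


def remove_old_files(directory: str, max_age_days: int = 30,
                     dry_run: bool = True) -> List[Path]:
    """Return the stale files; delete them only when dry_run is False."""
    stale = old_files(directory, max_age_days)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale


# Example call (assumed log directory; the default dry run deletes nothing):
# for path in remove_old_files("/var/log/grafana"):
#     print("would remove:", path)
```

Defaulting to a dry run makes it easier to review the candidate list before anything is deleted.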
Manage the cloud environment (AWS).
Manage services deployed with Docker and Docker Compose.
Scan containers and check for vulnerabilities.
Learning about new trends in SRE & DevOps principles.