Day to Day Life

Manage container workloads on Docker and Kubernetes

manage container deployment

manage Linux environment

  • disk cleanup

  • memory

  • CPU

  • debugging and logging (nginx, containers)
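The disk, memory, and CPU checks above can be sketched as a small script. This is a minimal sketch using only the standard library; the function names are hypothetical, and the memory figure assumes a Linux host where /proc/meminfo exposes MemTotal and MemAvailable.

```python
import os
import shutil


def disk_used_percent(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 if total is 0)."""
    return round(used / total * 100, 1) if total else 0.0


def mem_available_percent(meminfo_text: str) -> float:
    """Parse /proc/meminfo-style text and return MemAvailable as % of MemTotal."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    return round(fields["MemAvailable"] / fields["MemTotal"] * 100, 1)


def host_snapshot(path: str = "/") -> dict:
    """Collect disk, memory, and load figures for a Linux host."""
    usage = shutil.disk_usage(path)   # total, used, free in bytes
    load_1m = os.getloadavg()[0]      # 1-minute load average (Unix only)
    with open("/proc/meminfo") as fh:
        mem_pct = mem_available_percent(fh.read())
    return {
        "disk_used_pct": disk_used_percent(usage.total, usage.used),
        "mem_available_pct": mem_pct,
        "load_per_cpu": round(load_1m / (os.cpu_count() or 1), 2),
    }
```

Calling host_snapshot("/") on a Linux machine returns the three figures as a dict, which can then be compared against whatever thresholds the team uses.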

manage nginx and certbot

Clean up old/unused Docker images and containers.
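A possible way to script this cleanup is to wrap the Docker CLI prune commands. This is a sketch only: the one-week cutoff (168h) is an illustrative assumption, the function names are hypothetical, and the default is a dry run that only prints what would be executed.

```python
import subprocess


def prune_commands(older_than_hours: int = 168) -> list[list[str]]:
    """Build docker prune commands for stopped containers and unused images."""
    until = f"until={older_than_hours}h"
    return [
        ["docker", "container", "prune", "--force", "--filter", until],
        ["docker", "image", "prune", "--all", "--force", "--filter", until],
    ]


def run_prune(older_than_hours: int = 168, dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for cmd in prune_commands(older_than_hours):
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Running run_prune() first and reading the printed commands is a cheap safety check before switching to dry_run=False.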

Scan Docker images for vulnerabilities (Trivy, AWS scan)
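Scan results are easier to act on when summarised by severity. The sketch below assumes a report produced with `trivy image --format json <image>` and assumes its layout is a top-level "Results" list whose entries carry a "Vulnerabilities" list with a "Severity" field; the function names and the blocking severities are hypothetical choices.

```python
from collections import Counter


def count_severities(report: dict) -> Counter:
    """Count vulnerabilities per severity in a Trivy-style JSON report."""
    counts = Counter()
    for result in report.get("Results") or []:
        # "Vulnerabilities" can be absent or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts


def should_fail(report: dict, blocking=("CRITICAL", "HIGH")) -> bool:
    """Return True when the report contains any blocking severity."""
    counts = count_severities(report)
    return any(counts[level] > 0 for level in blocking)
```

In a pipeline, the report would be loaded with json.load and the build failed when should_fail returns True.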

Manage Kubernetes objects: Deployments, Services, Ingress, ConfigMaps, Secrets.

Monitor Pod health using kubectl get pods -o wide.

Identify & resolve failed pods
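Spotting failed pods can be partly automated by filtering the JSON form of the pod list. A minimal sketch, assuming the input is the parsed output of `kubectl get pods -A -o json`; the function name and the restart threshold of 5 are hypothetical.

```python
def unhealthy_pods(pod_list: dict, max_restarts: int = 5) -> list[tuple[str, str, str]]:
    """Return (namespace/name, phase, reason) for pods that need attention."""
    problems = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        name = f'{meta.get("namespace", "default")}/{meta.get("name", "?")}'
        phase = status.get("phase", "Unknown")
        reason = ""
        if phase not in ("Running", "Succeeded"):
            reason = f"phase={phase}"
        for cs in status.get("containerStatuses") or []:
            waiting = (cs.get("state") or {}).get("waiting")
            if waiting:
                # e.g. CrashLoopBackOff, ImagePullBackOff
                reason = waiting.get("reason", "Waiting")
            elif cs.get("restartCount", 0) > max_restarts:
                reason = f'restarts={cs["restartCount"]}'
        if reason:
            problems.append((name, phase, reason))
    return problems
```

Each entry it returns is a starting point for `kubectl describe pod` and `kubectl logs` on that pod.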

Maintain CI/CD pipelines for Docker & Kubernetes

database backup and restore

Manage DNS and WAF rules for security (Cloudflare)

document steps

Checking Grafana dashboards for system health

Monitoring the system during events (sync, large queries)

Responding to alerts and incidents

Investigating logs and metrics to debug system issues.

Conducting post-mortems and writing RCA (Root Cause Analysis).

Writing scripts (Bash, Python) and Ansible playbooks for automating deployments, monitoring, scaling, backups, and SSL renewal.

Implementing auto-scaling mechanisms (Kubernetes HPA, Karpenter, AWS ASG).
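For the HPA case, the scaling rule described in the Kubernetes documentation is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces only that arithmetic; the function name and the min/max defaults are hypothetical, and Karpenter and AWS ASG (which scale nodes, not pods) are not covered.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Apply the HPA ratio rule and clamp the result to [min, max]."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas averaging 90% CPU against a 60% target give a ratio of 1.5, so the rule asks for 6 replicas.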

Fix build alerts

  • Dependency Issues (package version not suitable, outdated version, missing environment variable)

  • build timeout (memory/disk full)

  • db connection issue

  • deployment speed reduction
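Of the causes listed above, a missing environment variable is the easiest to catch before the build starts. A minimal pre-flight sketch; the variable names in the usage note are hypothetical examples, not names from any real pipeline.

```python
import os


def missing_env_vars(required, env=None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return sorted(name for name in required if not env.get(name))
```

A build step could call missing_env_vars(["DATABASE_URL", "API_KEY"]) and abort with a clear message when the returned list is not empty.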

Explore new technologies and implement them

Manage CI/CD

Role-based Access Control plugin

https://www.youtube.com/watch?v=HqfWj244x4U

Backup and restore

ThinBackup plugin

https://youtu.be/5Tb-AOUFuKQ?si=uMrEbQZPd13i6q0E

Add agent

https://youtu.be/9RsmPNs7gT0

Update Jenkins

https://youtu.be/u4tIbFF4thM?si=wlxVNZEWd8d2l-lr

Check Jenkins logs (/var/log/jenkins/jenkins.log or /var/lib/jenkins/logs).

Look for failing jobs and identify issues.
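A quick first pass over the Jenkins log is to count lines per level and pull out the severe ones. This sketch assumes the log uses java.util.logging level names (SEVERE, WARNING, INFO) somewhere on each record line; the exact line layout varies by Jenkins version, and the function name is hypothetical.

```python
import re
from collections import Counter

# First level token on a line decides how the line is classified.
LEVEL_RE = re.compile(r"\b(SEVERE|WARNING|INFO)\b")


def summarize_log(lines) -> tuple[Counter, list[str]]:
    """Return (count per level, list of SEVERE lines) for an iterable of lines."""
    counts = Counter()
    severe = []
    for line in lines:
        match = LEVEL_RE.search(line)
        if not match:
            continue  # continuation lines such as stack traces
        counts[match.group(1)] += 1
        if match.group(1) == "SEVERE":
            severe.append(line.rstrip())
    return counts, severe
```

Passing an open file handle for /var/log/jenkins/jenkins.log works, since the function only iterates over lines.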

Monitor Jenkins UI: Manage Jenkins → System Information.

Restart Jenkins if needed (systemctl restart jenkins).

Add/remove users using RBAC (Role-Based Access Control).

Ensure plugins are up-to-date but avoid breaking changes.

Assign proper permissions for developers, QA, and admins.

Clean up old builds (Manage Jenkins → Build Discarder).

Use ThinBackup or tar to back up /var/lib/jenkins/.
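The tar route can be scripted with Python's tarfile module. A minimal sketch: the function name, archive prefix, and destination directory are hypothetical, and it does not stop or quiesce Jenkins first, so files that change during the copy may be captured in an inconsistent state.

```python
import tarfile
import time
from pathlib import Path


def backup_dir(source, dest_dir, prefix: str = "jenkins-home") -> Path:
    """Write a timestamped .tar.gz of `source` into `dest_dir` and return its path."""
    source = Path(source)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"{prefix}-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname keeps paths relative so the archive restores anywhere
        tar.add(str(source), arcname=source.name)
    return archive
```

A call such as backup_dir("/var/lib/jenkins", "/backups") would produce one archive per run; pruning old archives is left out of the sketch.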

Delete old jobs & unused workspaces.

Reduce job execution time by caching dependencies.
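One common way to cache dependencies safely is to key the cache on a hash of the lockfile, so the cache is reused only while the dependency set is unchanged. This is a generic sketch of that idea, not a Jenkins feature; the function name and prefix are hypothetical.

```python
import hashlib


def cache_key(prefix: str, lockfile_bytes: bytes) -> str:
    """Derive a cache key that changes whenever the lockfile content changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return f"{prefix}-{digest}"
```

A job would read its lockfile (for example package-lock.json or requirements.txt), compute the key, and restore the cached dependency directory only when a cache entry with that key exists.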

Database (PostgreSQL), caching (Redis)

Grafana (RBAC): view and create panels

Manage Prometheus and Grafana

Monitor firing alerts in Alertmanager.
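Firing alerts can also be listed from a script. The sketch below assumes the input is the parsed JSON list returned by Alertmanager's GET /api/v2/alerts endpoint, where each alert carries "labels" and a "status" object whose "state" is "active" while the alert is firing and not silenced or inhibited; the function name is hypothetical.

```python
def firing_alerts(alerts: list) -> list[tuple[str, str, str]]:
    """Return sorted (alertname, severity, startsAt) for active alerts."""
    rows = []
    for alert in alerts:
        if alert.get("status", {}).get("state") != "active":
            continue  # skip suppressed and unprocessed alerts
        labels = alert.get("labels", {})
        rows.append((
            labels.get("alertname", "?"),
            labels.get("severity", "none"),
            alert.get("startsAt", ""),
        ))
    return sorted(rows)
```

The "severity" label is a convention rather than a requirement, so alerts without it are reported as "none".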

Fix failing alert rules in Prometheus Alerting Rules (rules.yml).

Delete old metrics if storage is full.

Prometheus: add agents, set up alerts

If targets show DOWN, restart Prometheus exporters.
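Before restarting anything, it helps to know exactly which targets are down and why. A minimal sketch against the Prometheus HTTP API endpoint /api/v1/targets, whose response lists "activeTargets" with "health", "scrapeUrl", and "lastError" fields; the function names and the localhost:9090 address are assumptions.

```python
import json
import urllib.request


def down_targets(targets_response: dict) -> list[tuple[str, str, str]]:
    """Return (job, scrapeUrl, lastError) for every target whose health is 'down'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", "?"),
            target.get("scrapeUrl", "?"),
            target.get("lastError", ""),
        )
        for target in active
        if target.get("health") == "down"
    ]


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the target list from a Prometheus server (address is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)
```

The lastError text usually says whether the exporter is unreachable or returning bad data, which decides whether a restart is the right fix.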

Update Grafana dashboards (adjust panels, fix broken queries).

Optimize PromQL queries to reduce resource usage.

Back up the Prometheus TSDB and Grafana.

Rotate Grafana admin passwords.

Delete old Prometheus data:

prometheus --storage.tsdb.retention.time=15d

Manage user roles & permissions (Manage Users → Org Roles).

Remove Grafana logs older than 30 days.

Upgrade Prometheus & Grafana

Secure Prometheus & Grafana with basic auth / OAuth / SSO.

Build alerts on Teams (alerting via webhook, email, Teams)

Manage cloud environment (AWS)

Manage services deployed with Docker and Docker Compose

scanning containers and checking vulnerabilities

Learning about new trends in SRE & DevOps principles.
