Day to Day Life

Manage container workloads on Docker and Kubernetes.

Manage container deployments.

Manage Linux environment

Disk cleanup
Memory usage
CPU usage
Debugging and logging (nginx, containers)
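As a rough illustration of the disk and CPU checks above, here is a minimal Python sketch using only the standard library. The 85% disk limit and the load-per-CPU limit of 1.0 are illustrative assumptions, not values from these notes, and memory is not covered here.

```python
import os
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def load_per_cpu():
    """Return the 1-minute load average divided by the CPU count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def health_warnings(disk_pct, load, disk_limit=85.0, load_limit=1.0):
    """Build human-readable warnings; both limits are assumed defaults."""
    warnings = []
    if disk_pct >= disk_limit:
        warnings.append(f"disk usage {disk_pct:.0f}% >= {disk_limit:.0f}%")
    if load >= load_limit:
        warnings.append(f"load per CPU {load:.2f} >= {load_limit:.2f}")
    return warnings
```

Calling `health_warnings(disk_usage_percent("/"), load_per_cpu())` from a cron job would give a list that is empty when both values are under their limits.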

Manage nginx and Certbot.

Clean up old/unused Docker images and containers.

Scan Docker images for vulnerabilities (Trivy, AWS scan)

Manage Kubernetes objects: Deployments, Services, Ingress, ConfigMaps, Secrets.

Monitor Pod health using kubectl get pods -o wide.

Identify & resolve failed pods
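One way to spot failed pods without reading the table by eye is to parse the JSON form of the same command. The sketch below assumes the input is the parsed output of `kubectl get pods -A -o json`; the function name `find_unhealthy_pods` is hypothetical, and treating every non-Running, non-Succeeded phase as a problem is a simplification.

```python
HEALTHY_PHASES = {"Running", "Succeeded"}


def find_unhealthy_pods(pod_list):
    """Return (namespace, name, reason) for pods that look unhealthy.

    `pod_list` is the parsed output of `kubectl get pods -A -o json`.
    """
    problems = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        reason = None
        if status.get("phase") not in HEALTHY_PHASES:
            reason = status.get("reason") or status.get("phase", "Unknown")
        # A pod can report phase Running while a container is stuck waiting,
        # for example in CrashLoopBackOff or ImagePullBackOff.
        for container in status.get("containerStatuses", []):
            waiting = container.get("state", {}).get("waiting")
            if waiting:
                reason = waiting.get("reason", "Waiting")
        if reason:
            problems.append((meta.get("namespace"), meta.get("name"), reason))
    return problems
```

For each pod reported, `kubectl describe pod` and `kubectl logs --previous` are the usual next steps.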

Maintain CI/CD pipelines for Docker & Kubernetes

Back up and restore databases.

Manage DNS and WAF rules for security (Cloudflare).

Document steps.

Check Grafana dashboards for system health.

Monitor the system during heavy events (sync, large queries).

Respond to alerts and incidents.

Investigating logs and metrics to debug system issues.

Conducting post-mortems and writing RCA (Root Cause Analysis).

Writing scripts (Bash, Python) and using Ansible to automate deployments, monitoring, scaling, backups, and SSL renewal.

Implementing auto-scaling mechanisms (Kubernetes HPA, Karpenter, AWS ASG).
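For the HPA case specifically, the Kubernetes documentation describes the scaling rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance (0.1 by default). The sketch below reproduces that arithmetic so a target value can be sanity-checked before it is applied; the `min_replicas` and `max_replicas` defaults are placeholders, and Karpenter and AWS ASG use different mechanisms that are not modelled here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the replica count the HPA would aim for."""
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the HPA object.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas averaging 200m CPU against a 100m target gives 8 replicas, while the same 4 replicas at 105m stay at 4.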

Fix build alerts

Dependency issues (incompatible package version, outdated version, missing environment variable)
Build timeouts (memory or disk full)
Database connection issues
Reduced deployment speed
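Of the causes listed above, a missing environment variable is the easiest to catch before the build starts. A minimal sketch of such a pre-flight check follows; the function name and the variable names in the usage note are hypothetical.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

A pipeline step could call `missing_env_vars(["DATABASE_URL", "API_KEY"])` and fail early with the returned names instead of timing out later in the build.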

Explore new technologies and implement them.

Manage CI/CD (Jenkins)

Role-based Access Control plugin

https://www.youtube.com/watch?v=HqfWj244x4U

Backup and restore

ThinBackup plugin

https://youtu.be/5Tb-AOUFuKQ?si=uMrEbQZPd13i6q0E

Add agent

https://youtu.be/9RsmPNs7gT0

Update Jenkins

https://youtu.be/u4tIbFF4thM?si=wlxVNZEWd8d2l-lr

Check Jenkins logs (/var/log/jenkins/jenkins.log or /var/lib/jenkins/logs).

Look for failing jobs and identify issues.

Monitor Jenkins UI: Manage Jenkins → System Information.

Restart Jenkins if needed (systemctl restart jenkins).

Add/remove users using RBAC (Role-Based Access Control).

Ensure plugins are up-to-date but avoid breaking changes.

Assign proper permissions for developers, QA, and admins.

Clean up old builds (Manage Jenkins → Build Discarder).

Use ThinBackup or tar to back up /var/lib/jenkins/.
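The tar option can be scripted so that every archive gets a timestamp. This is a minimal sketch using Python's `tarfile` module; the function name, the `jenkins-home` prefix, and the destination directory are assumptions, and it does not pause Jenkins or exclude large workspace directories.

```python
import tarfile
import time
from pathlib import Path


def backup_directory(source_dir, dest_dir, prefix="jenkins-home"):
    """Write a timestamped .tar.gz of `source_dir` into `dest_dir`."""
    source = Path(source_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{prefix}-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname stores paths relative to the directory name, not the full path.
        tar.add(str(source), arcname=source.name)
    return archive
```

Something like `backup_directory("/var/lib/jenkins", "/backup/jenkins")` would then be run from cron, with `/backup/jenkins` standing in for wherever backups are kept.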

Delete old jobs & unused workspaces.

Reduce job execution time by caching dependencies.

Database (PostgreSQL), caching (Redis).

Grafana (RBAC): view and create panels.

Manage Prometheus and Grafana

Monitor firing alerts in Alertmanager:
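Besides the Alertmanager web UI, firing alerts can be read from its v2 HTTP API (`GET /api/v2/alerts`), which returns a JSON list where each alert carries `labels` and a `status.state` of `active`, `suppressed`, or `unprocessed`. The sketch below filters an already parsed response; fetching it (for example from `http://localhost:9093`, the default port) is left out, and the `severity` label is an assumed convention.

```python
def firing_alerts(alerts):
    """Return sorted (severity, alertname) pairs for alerts that are active."""
    firing = []
    for alert in alerts:
        # Silenced or inhibited alerts have state "suppressed" and are skipped.
        if alert.get("status", {}).get("state") != "active":
            continue
        labels = alert.get("labels", {})
        firing.append((labels.get("severity", "none"),
                       labels.get("alertname", "unknown")))
    return sorted(firing)
```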

Fix failing alert rules in Prometheus Alerting Rules (rules.yml).

Delete old metrics if storage is full:
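One possible approach, assuming Prometheus was started with `--web.enable-admin-api`, is the TSDB admin API: a POST to `/api/v1/admin/tsdb/delete_series` marks matching series for deletion, and a POST to `/api/v1/admin/tsdb/clean_tombstones` then frees the disk space. The sketch only builds the two URLs so they can be reviewed before anything is sent; the helper names and the example selector are hypothetical.

```python
from urllib.parse import urlencode


def delete_series_url(base_url, matchers, end_unix):
    """Build the delete_series URL for series matching `matchers` up to `end_unix`."""
    params = [("match[]", matcher) for matcher in matchers]
    params.append(("end", str(end_unix)))
    return f"{base_url.rstrip('/')}/api/v1/admin/tsdb/delete_series?{urlencode(params)}"


def clean_tombstones_url(base_url):
    """Build the URL that removes deleted data from disk."""
    return f"{base_url.rstrip('/')}/api/v1/admin/tsdb/clean_tombstones"
```

Both URLs must be called with the POST method, for example `delete_series_url("http://localhost:9090", ['{job="node"}'], 1700000000)` followed by the tombstone cleanup.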

Prometheus: add agents and set up alerts.

If targets show DOWN, restart Prometheus exporters:
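Before restarting anything, it helps to know exactly which targets are down and why. Prometheus exposes this at `GET /api/v1/targets`, where each entry in `data.activeTargets` has a `health` field and a `lastError` message. The sketch below works on the parsed JSON response; the function name is hypothetical, and the actual restart (typically `systemctl restart` on the exporter's unit) is left to the operator.

```python
def down_targets(targets_response):
    """Return (job, instance, last_error) for every active target that is down."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    result = []
    for target in active:
        if target.get("health") != "down":
            continue
        labels = target.get("labels", {})
        result.append((labels.get("job"), labels.get("instance"),
                       target.get("lastError", "")))
    return result
```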

Update Grafana dashboards (adjust panels, fix broken queries).

Optimize PromQL queries to reduce resource usage.

Back up the Prometheus TSDB and Grafana.

Rotate Grafana admin passwords.

Delete old Prometheus data by limiting retention (blocks older than the retention time are removed automatically):

prometheus --storage.tsdb.retention.time=15d

Manage user roles & permissions (Manage Users → Org Roles).

Remove Grafana logs older than 30 days:
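A minimal sketch of such a cleanup, assuming the logs live in a single directory such as `/var/log/grafana` (the default for package installs) and that rotated files keep `.log` in their name. The function name is hypothetical, and it defaults to a dry run so the list can be checked before anything is deleted.

```python
import time
from pathlib import Path


def remove_old_logs(log_dir, max_age_days=30, dry_run=True, now=None):
    """Delete log files older than `max_age_days`; return the affected names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for path in sorted(Path(log_dir).glob("*.log*")):
        # Age is judged by last modification time.
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path.name)
    return removed
```

Running `remove_old_logs("/var/log/grafana")` first shows what would go; passing `dry_run=False` performs the deletion.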

Upgrade Prometheus & Grafana

Secure Prometheus & Grafana with basic auth / OAuth / SSO.

Send build alerts to Microsoft Teams (alerting channels: webhook, email, Teams).

Manage the cloud environment (AWS).

Manage services deployed with Docker and Docker Compose.

Scan containers and check for vulnerabilities.

Learning about new trends in SRE & DevOps principles.


Last updated 10 months ago
