Cluster Components Management

Task 7: Managing and Monitoring Kubernetes Cluster Critical Components

To maintain a healthy Kubernetes cluster, you must monitor and manage its critical components:

✅ API Server (controls all cluster operations)
✅ etcd (stores cluster state)
✅ Controller Manager & Scheduler (manage workloads)
✅ Kubelet & Kube Proxy (handle nodes & networking)


Step 1: Monitor API Server Health

The API server is the most critical component. If it goes down, kubectl won’t work.

1️⃣ Check API Server Logs

kubectl logs -n kube-system -l component=kube-apiserver

Or, on installations where the API server runs as a systemd service, directly on a control-plane (master) node:

journalctl -u kube-apiserver --no-pager | tail -50

2️⃣ Check API Server Health

kubectl get --raw='/readyz'

If the output is ok, the API server is healthy.

3️⃣ Restart API Server (If Unhealthy)

systemctl restart kube-apiserver

Or on a static pod setup:

docker restart $(docker ps | grep kube-apiserver | awk '{print $1}')
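On clusters that use containerd instead of Docker (the default on recent kubeadm versions), the same idea applies with crictl: stop the container and the kubelet recreates the static pod. The container name below is the kubeadm default:

sudo crictl stop $(sudo crictl ps --name kube-apiserver -q)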

Common Issues & Solutions

| Issue | Cause | Solution |
| --- | --- | --- |
| kubectl get nodes hangs | API Server is down. | Restart the API Server. |
| failed to connect to API Server | Firewall or network issue. | Check iptables -L and allow port 6443. |


Step 2: Monitor etcd Health

Since etcd stores cluster state, monitoring it is crucial.

1️⃣ Check etcd Pod Status
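On a kubeadm-based cluster, etcd runs as a static pod in kube-system labelled component=etcd (adjust the selector if you run external etcd):

kubectl get pods -n kube-system -l component=etcd -o wide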

2️⃣ Check etcd Health
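A health probe with etcdctl from a control-plane node; the certificate paths below are the kubeadm defaults and may differ in your environment:

ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health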

3️⃣ Check etcd Leader Election
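The endpoint status table shows which member is currently the leader (IS LEADER column); the same kubeadm certificate paths are assumed:

ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table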

4️⃣ Restart etcd (If Needed)
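How you restart etcd depends on the installation. For a systemd-managed etcd, restart the service; for a kubeadm static pod, stop the container and let the kubelet recreate it (container name is the kubeadm default):

systemctl restart etcd                               # systemd-managed etcd
sudo crictl stop $(sudo crictl ps --name etcd -q)    # kubeadm static pod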

Common Issues & Solutions

| Issue | Cause | Solution |
| --- | --- | --- |
| etcdctl: connection refused | etcd is down. | Restart etcd. |
| etcd database is corrupted | Data corruption. | Restore from a snapshot. |


Step 3: Monitor Controller Manager & Scheduler

The Controller Manager runs the control loops that reconcile cluster state (deployments, replicas, node lifecycle), and the Scheduler assigns pods to nodes.

1️⃣ Check Controller Manager Logs
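On kubeadm clusters the Controller Manager runs as a static pod labelled component=kube-controller-manager (the selector is an assumption for other install methods):

kubectl logs -n kube-system -l component=kube-controller-manager --tail=50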

2️⃣ Check Scheduler Logs
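Same pattern for the Scheduler:

kubectl logs -n kube-system -l component=kube-scheduler --tail=50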

3️⃣ Restart If Required
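For systemd-managed control planes, restart the services; on kubeadm static pod setups, stop the containers and the kubelet recreates them (container names are the kubeadm defaults):

systemctl restart kube-controller-manager kube-scheduler                # systemd installs
sudo crictl stop $(sudo crictl ps --name kube-controller-manager -q)    # kubeadm static pods
sudo crictl stop $(sudo crictl ps --name kube-scheduler -q)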

Common Issues & Solutions

| Issue | Cause | Solution |
| --- | --- | --- |
| Pods stuck in Pending | Scheduler issue. | Restart the Scheduler. |
| failed to create deployment | Controller Manager issue. | Restart the Controller Manager. |


Step 4: Monitor Node Components (Kubelet & Proxy)

1️⃣ Check Kubelet Status on Nodes
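The kubelet runs as a systemd service on every node:

systemctl status kubelet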

If the Kubelet is not running, restart it:
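systemctl restart kubelet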

2️⃣ Check Kube Proxy
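kube-proxy normally runs as a DaemonSet in kube-system with the k8s-app=kube-proxy label (the kubeadm default; adjust if your CNI replaces kube-proxy):

kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide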

To check logs:
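kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50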

3️⃣ Restart Kube Proxy (If Needed)
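Because kube-proxy is a DaemonSet, a rolling restart recreates its pod on every node:

kubectl rollout restart daemonset kube-proxy -n kube-system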

Common Issues & Solutions

| Issue | Cause | Solution |
| --- | --- | --- |
| Node NotReady | Kubelet issue. | Restart the Kubelet. |
| Pods can't communicate | Kube Proxy issue. | Restart Kube Proxy. |


Step 5: Use Prometheus & Grafana for Monitoring

For real-time monitoring, use Prometheus + Grafana.

1️⃣ Install Prometheus
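One common approach (an assumption, not the only option) is the kube-prometheus-stack Helm chart, which bundles Prometheus, Alertmanager, node_exporter, and Grafana:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace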

2️⃣ Install Grafana
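If you installed kube-prometheus-stack above, Grafana is already included. To install it separately instead, the official chart can be used:

helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana -n monitoring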

3️⃣ Expose Grafana
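A port-forward is the quickest way to reach the UI locally; the service name below assumes the kube-prometheus-stack release named prometheus from step 1 above:

kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80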

Now, access the Grafana UI at http://localhost:3000.

Common Issues & Solutions

| Issue | Cause | Solution |
| --- | --- | --- |
| Grafana dashboard empty | Data source not added. | Add Prometheus as a data source. |
| Prometheus alerts missing | Misconfigured rules. | Reapply the monitoring configuration. |


Step 6: Set Up Alerts for Cluster Issues

1️⃣ Alert for High CPU Usage

Add this rule in Prometheus:
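A minimal sketch of such a rule, assuming node_exporter metrics are scraped (kube-prometheus-stack does this by default); the 80% threshold and rule names are examples. Put it in a Prometheus rule file (or a PrometheusRule resource when using the Prometheus Operator):

groups:
- name: node-alerts
  rules:
  - alert: HighNodeCPUUsage
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.instance }} CPU usage above 80% for 5 minutes"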

2️⃣ Alert for etcd Failure
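A minimal sketch; the job label depends on how etcd is scraped in your setup, so treat the selector as an assumption:

groups:
- name: etcd-alerts
  rules:
  - alert: EtcdDown
    expr: up{job=~".*etcd.*"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "etcd target {{ $labels.instance }} is down"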


Step 7: Regular Health Checks & Maintenance

Check cluster status daily:
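For example, confirm node readiness and the API server health endpoints:

kubectl get nodes
kubectl get --raw='/readyz?verbose'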

Check failed pods:
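kubectl get pods --all-namespaces --field-selector=status.phase=Failed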

Check node disk space:
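Kubernetes surfaces low disk as a DiskPressure node condition; for exact usage, run df on the node itself:

kubectl describe nodes | grep -i diskpressure
df -h /var/lib/kubelet    # run directly on the node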

Automate alerts for failures.


Summary

✅ API Server monitored & restarted if needed
✅ etcd checked & restored from snapshot
✅ Controller Manager & Scheduler logs monitored
✅ Kubelet & Kube Proxy issues resolved
✅ Prometheus & Grafana set up for real-time monitoring
✅ Alerting system configured for failures


Next Task: Cluster Security & Access Control.
