
Task 6: Disaster Recovery in Kubernetes

Disaster recovery (DR) ensures that a Kubernetes cluster can be restored quickly in case of failure. This involves:

✅ Backing up etcd (critical cluster data)
✅ Restoring etcd in case of failure
✅ Deploying workloads to a new cluster automatically
✅ Automating disaster recovery with scripts


Step 1: Backup etcd Data

Since etcd stores the state of the cluster, backing it up is crucial.

1️⃣ Check the Status of etcd

kubectl get pods -n kube-system | grep etcd

If etcd is running as a static pod on the control plane, find the container name. On Docker-based nodes:

docker ps | grep etcd

On containerd-based nodes (the default on recent kubeadm clusters), use crictl instead:

crictl ps | grep etcd

2️⃣ Take an etcd Snapshot

Run this on a control plane node:

ETCDCTL_API=3 etcdctl snapshot save /var/lib/etcd/backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

3️⃣ Verify Backup
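
A common way to verify the snapshot is etcdctl snapshot status; the path below matches the backup location used in the previous step:

```shell
# Print hash, revision, total keys, and size of the snapshot
ETCDCTL_API=3 etcdctl snapshot status /var/lib/etcd/backup.db --write-out=table
```

A non-zero key count indicates a usable backup. On etcd v3.5+, `etcdutl snapshot status` is the non-deprecated equivalent.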


Step 2: Restore etcd in Case of Failure

If the control plane node crashes, restore from the snapshot.

1️⃣ Stop the Kubernetes API Server
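
On a kubeadm-based control plane the API server runs as a static pod, so moving its manifest out of the manifests directory stops it (this assumes a kubeadm setup; adjust if your API server runs under systemd):

```shell
# kubelet stops the kube-apiserver static pod once its manifest is gone
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
```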

2️⃣ Restore the etcd Snapshot

Replace the existing etcd data directory:
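
A typical restore, assuming the snapshot path from Step 1 and the default kubeadm data directory:

```shell
# Restore into a fresh directory, then swap it into place
ETCDCTL_API=3 etcdctl snapshot restore /var/lib/etcd/backup.db \
  --data-dir=/var/lib/etcd-restored
mv /var/lib/etcd /var/lib/etcd.old
mv /var/lib/etcd-restored /var/lib/etcd
```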

3️⃣ Restart etcd and API Server
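
Assuming the static-pod setup described above, moving the API server manifest back restarts it, and kubelet restarts etcd once its data directory is back in place:

```shell
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# Wait for the control plane to come back, then confirm:
kubectl get nodes
```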

Common Issues & Solutions

| Issue | Cause | Solution |
| --- | --- | --- |
| etcdctl: connection refused | etcd is down. | Restart etcd: systemctl restart etcd. |
| failed to restore snapshot | Incorrect permissions. | Run chown -R etcd:etcd /var/lib/etcd. |
| cluster is unhealthy | Incorrect snapshot. | Use a valid snapshot and retry. |


Step 3: Automate Disaster Recovery

If the cluster fails completely, you need an automated way to:

✅ Create a new cluster
✅ Restore workloads from backup

1️⃣ Automate Cluster Re-Creation Using Terraform

If you deployed Kubernetes using Terraform, re-create the cluster:
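
A minimal sketch, assuming your cluster infrastructure is defined in a Terraform configuration (the directory name is illustrative):

```shell
cd kubernetes-cluster/         # hypothetical directory holding your Terraform config
terraform init
terraform apply -auto-approve  # re-creates the cluster from the same definitions
```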

2️⃣ Restore Workloads from kubectl Backup

If you have exported all cluster objects:
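
A minimal sketch, assuming you exported cluster objects to a single manifest file before the failure (the filename is an example):

```shell
# Taken before the failure:
kubectl get all --all-namespaces -o yaml > cluster-backup.yaml

# After the new cluster is up:
kubectl apply -f cluster-backup.yaml
```

Note that `kubectl get all` does not cover every resource type (e.g. ConfigMaps, Secrets, CRDs); a dedicated backup tool such as Velero captures more.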


Step 4: DR for EKS Clusters

For EKS, use an automated infrastructure backup.

1️⃣ Backup EKS Cluster Configuration
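
One way to capture the cluster definition, assuming the AWS CLI and eksctl are configured (cluster name and region are placeholders):

```shell
aws eks describe-cluster --name my-cluster --region us-east-1 > eks-cluster-config.json
eksctl get cluster --name my-cluster --region us-east-1 -o yaml > eks-cluster.yaml
```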

2️⃣ Restore an EKS Cluster in a New Region
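
A hedged sketch using eksctl; the name, region, and node count are assumptions to adapt to your saved configuration:

```shell
eksctl create cluster \
  --name my-cluster \
  --region us-west-2 \
  --nodes 3
```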

3️⃣ Restore Persistent Volumes

If using EBS-backed volumes:
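
Assuming you have EBS snapshots of the volumes (the snapshot ID and availability zone below are placeholders):

```shell
# Create a new volume from the snapshot in the target AZ
aws ec2 create-volume \
  --snapshot-id snap-0123456789abcdef0 \
  --availability-zone us-west-2a
```

The new volume ID can then be referenced from a PersistentVolume in the restored cluster.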

Common Issues & Solutions

| Issue | Cause | Solution |
| --- | --- | --- |
| eksctl: cluster already exists | Duplicate cluster. | Use eksctl delete cluster first. |
| failed to mount volume | Volume lost in failure. | Restore from EBS snapshot. |


Step 5: Validate Disaster Recovery

After restoring:

✅ Check Cluster State
✅ Verify Workloads are Running
✅ Check etcd Health
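
The three checks above can be run as follows (the etcd endpoint and certificate paths match those used in Step 1):

```shell
kubectl get nodes                   # cluster state: all nodes Ready
kubectl get pods --all-namespaces   # workloads running
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```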


Summary

✅ etcd backup taken
✅ Cluster restored from etcd snapshot
✅ Automated cluster re-creation using Terraform
✅ EKS disaster recovery with eksctl
✅ Validated cluster recovery


Next Task: Do you want to proceed with Monitoring & Logging for Disaster Detection? 😊
