
Task 6: Disaster Recovery in Kubernetes

Disaster recovery (DR) ensures that a Kubernetes cluster can be restored quickly in case of failure. This involves:

✅ Backing up etcd (critical cluster data)
✅ Restoring etcd in case of failure
✅ Deploying workloads to a new cluster automatically
✅ Automating disaster recovery with scripts


Step 1: Backup etcd Data

Since etcd stores the state of the cluster, backing it up is crucial.

1️⃣ Check the Status of etcd

kubectl get pods -n kube-system | grep etcd

If etcd is running as a static pod on the control plane, find the container name. On Docker-based nodes:

docker ps | grep etcd

On containerd-based nodes (the default on recent kubeadm clusters), use crictl instead:

crictl ps | grep etcd

2️⃣ Take an etcd Snapshot

Run this on a control plane node:

ETCDCTL_API=3 etcdctl snapshot save /var/lib/etcd/backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

3️⃣ Verify Backup
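
A common way to verify the snapshot is etcdctl snapshot status; the path below matches the backup location used in the previous step:

```shell
# Print hash, revision, total keys, and size of the snapshot
ETCDCTL_API=3 etcdctl snapshot status /var/lib/etcd/backup.db --write-out=table
```

A non-zero key count indicates a usable backup. On etcd v3.5+, `etcdutl snapshot status` is the non-deprecated equivalent.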


Step 2: Restore etcd in Case of Failure

If the control plane node crashes, restore from the snapshot.

1️⃣ Stop the Kubernetes API Server
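
On a kubeadm-based control plane the API server runs as a static pod, so moving its manifest out of the manifests directory stops it (this assumes a kubeadm setup; adjust if your API server runs under systemd):

```shell
# kubelet stops the kube-apiserver static pod once its manifest is gone
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
```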

2️⃣ Restore the etcd Snapshot

Replace the existing etcd data directory:
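
A typical restore, assuming the snapshot path from Step 1 and the default kubeadm data directory:

```shell
# Restore into a fresh directory, then swap it into place
ETCDCTL_API=3 etcdctl snapshot restore /var/lib/etcd/backup.db \
  --data-dir=/var/lib/etcd-restored
mv /var/lib/etcd /var/lib/etcd.old
mv /var/lib/etcd-restored /var/lib/etcd
```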

3️⃣ Restart etcd and API Server
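
Assuming the static-pod setup described above, moving the API server manifest back restarts it, and kubelet restarts etcd once its data directory is back in place:

```shell
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# Wait for the control plane to come back, then confirm:
kubectl get nodes
```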

Common Issues & Solutions

| Issue | Cause | Solution |
| --- | --- | --- |
| etcdctl: connection refused | etcd is down. | Restart etcd: systemctl restart etcd. |
| failed to restore snapshot | Incorrect permissions. | Run chown -R etcd:etcd /var/lib/etcd. |
| cluster is unhealthy | Incorrect snapshot. | Use a valid snapshot and retry. |


Step 3: Automate Disaster Recovery

If the cluster fails completely, you need an automated way to:

✅ Create a new cluster
✅ Restore workloads from backup

1️⃣ Automate Cluster Re-Creation Using Terraform

If you deployed Kubernetes using Terraform, re-create the cluster:
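
A minimal sketch, assuming your cluster infrastructure is defined in a Terraform configuration (the directory name is illustrative):

```shell
cd kubernetes-cluster/         # hypothetical directory holding your Terraform config
terraform init
terraform apply -auto-approve  # re-creates the cluster from the same definitions
```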

2️⃣ Restore Workloads from kubectl Backup

If you have exported all cluster objects:
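
A minimal sketch, assuming you exported cluster objects to a single manifest file before the failure (the filename is an example):

```shell
# Taken before the failure:
kubectl get all --all-namespaces -o yaml > cluster-backup.yaml

# After the new cluster is up:
kubectl apply -f cluster-backup.yaml
```

Note that `kubectl get all` does not cover every resource type (e.g. ConfigMaps, Secrets, CRDs); a dedicated backup tool such as Velero captures more.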


Step 4: DR for EKS Clusters

For EKS, use an automated infrastructure backup.

1️⃣ Backup EKS Cluster Configuration
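
One way to capture the cluster definition, assuming the AWS CLI and eksctl are configured (cluster name and region are placeholders):

```shell
aws eks describe-cluster --name my-cluster --region us-east-1 > eks-cluster-config.json
eksctl get cluster --name my-cluster --region us-east-1 -o yaml > eks-cluster.yaml
```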

2️⃣ Restore an EKS Cluster in a New Region
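
A hedged sketch using eksctl; the name, region, and node count are assumptions to adapt to your saved configuration:

```shell
eksctl create cluster \
  --name my-cluster \
  --region us-west-2 \
  --nodes 3
```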

3️⃣ Restore Persistent Volumes

If using EBS-backed volumes:
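
Assuming you have EBS snapshots of the volumes (the snapshot ID and availability zone below are placeholders):

```shell
# Create a new volume from the snapshot in the target AZ
aws ec2 create-volume \
  --snapshot-id snap-0123456789abcdef0 \
  --availability-zone us-west-2a
```

The new volume ID can then be referenced from a PersistentVolume in the restored cluster.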

Common Issues & Solutions

| Issue | Cause | Solution |
| --- | --- | --- |
| eksctl: cluster already exists | Duplicate cluster. | Use eksctl delete cluster first. |
| failed to mount volume | Volume lost in failure. | Restore from EBS snapshot. |


Step 5: Validate Disaster Recovery

After restoring:

✅ Check Cluster State
✅ Verify Workloads are Running
✅ Check etcd Health
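
The three checks above can be run as follows (the etcd endpoint and certificate paths match those used in Step 1):

```shell
kubectl get nodes                   # cluster state: all nodes Ready
kubectl get pods --all-namespaces   # workloads running
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```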


Summary

✅ etcd backup taken
✅ Cluster restored from etcd snapshot
✅ Automated cluster re-creation using Terraform
✅ EKS disaster recovery with eksctl
✅ Validated cluster recovery


Next Task: Do you want to proceed with Monitoring & Logging for Disaster Detection? 😊
