Node Management

Complete Guide to Node Management in AWS EKS

Managing nodes effectively in an EKS cluster ensures optimal scalability, security, and performance. Below is a detailed approach covering node lifecycle management, draining, scaling, troubleshooting, and common issues with solutions.


🛠 Step 1: Check Node Health & Status

Before making any changes, always verify node health.

kubectl get nodes -o wide

🔹 The STATUS column should show Ready.

🔹 Check node details:

kubectl describe node <node-name>

🔹 Check node resource usage:

kubectl top nodes

If a node is in NotReady, check logs to diagnose issues.


🚀 Step 2: Add New Nodes (Scaling Up)

Option 1: Managed Node Groups (Recommended)

AWS Managed Node Groups handle upgrades and scaling automatically.

🔹 Add a new node group:
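A typical eksctl command looks like this (the cluster name, node group name, instance type, and sizes below are placeholder values):

```shell
# Create a managed node group; my-cluster and standard-workers are example names
eksctl create nodegroup \
  --cluster my-cluster \
  --name standard-workers \
  --node-type t3.medium \
  --nodes 2 \
  --nodes-min 1 \
  --nodes-max 4
```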

Nodes are automatically added to the cluster.


Option 2: Self-Managed Nodes

1️⃣ Launch an EC2 instance with an EKS-compatible AMI

2️⃣ Join the new nodes to the cluster
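On an EKS-optimized Amazon Linux AMI, joining is typically done by the bootstrap script (usually invoked from the instance's user data; the cluster name is a placeholder), and the node's IAM role must be mapped in the aws-auth ConfigMap:

```shell
# On the new instance (EKS-optimized AMI): register with the cluster
/etc/eks/bootstrap.sh my-cluster

# From your workstation: map the node IAM role so the node can register
kubectl edit configmap aws-auth -n kube-system
```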

Verify the new nodes:
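For example:

```shell
# Watch until the new nodes report Ready
kubectl get nodes --watch
```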


📉 Step 3: Drain and Remove Nodes (Scaling Down)

Before removing a node, safely drain it to avoid workload disruption.

Step 1: Cordon the Node (Prevent New Pods)
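For example:

```shell
# Mark the node unschedulable; existing pods keep running
kubectl cordon <node-name>
```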

The node will not accept new pods.


Step 2: Drain the Node (Move Running Workloads)
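A typical drain command:

```shell
# Evict pods; skip DaemonSet pods and allow emptyDir volumes to be deleted
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```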

This gracefully evicts running pods before removal. DaemonSet pods (e.g., logging and monitoring agents) are not evicted and remain on the node.

Common Issue: "Cannot evict pod" error (often caused by a PodDisruptionBudget or by pods not managed by a controller)

🔹 Solution: Force drain with --force
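For example (note that --force evicts pods not managed by a controller, so their data is lost):

```shell
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
```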


Step 3: Delete the Node from the Cluster

🔹 For Managed Node Group:
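For example (cluster and node group names are placeholders):

```shell
# Deletes the node group; eksctl drains its nodes as part of deletion
eksctl delete nodegroup --cluster my-cluster --name standard-workers
```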

🔹 For Self-Managed Nodes:
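For example:

```shell
# Remove the node object from the cluster after draining it
kubectl delete node <node-name>
```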

🔹 Terminate the EC2 instance (if applicable):
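For example (the instance ID is a placeholder):

```shell
aws ec2 terminate-instances --instance-ids <instance-id>
```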

Verify node removal:
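For example:

```shell
# The removed node should no longer appear in the list
kubectl get nodes
```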


📦 Step 4: Upgrade Nodes (Rolling Update)

To upgrade nodes for security patches:

Step 1: Get Latest Amazon EKS AMI
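The recommended AMI ID can be read from SSM Parameter Store (the Kubernetes version in the path below is an example; substitute your cluster's version):

```shell
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.29/amazon-linux-2/recommended/image_id \
  --query "Parameter.Value" --output text
```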

Step 2: Upgrade Nodes

🔹 Managed Node Groups (Rolling Update):
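A sketch with eksctl (cluster name, node group name, and version are placeholders):

```shell
# Performs a rolling replacement of nodes in the group
eksctl upgrade nodegroup \
  --cluster my-cluster \
  --name standard-workers \
  --kubernetes-version 1.29
```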

🔹 Self-Managed Nodes:

1️⃣ Launch new nodes with the latest AMI.

2️⃣ Drain old nodes:
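For example:

```shell
kubectl drain <old-node-name> --ignore-daemonsets --delete-emptydir-data
```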

3️⃣ Terminate old instances and update aws-auth-cm.yaml.

Ensure nodes are healthy post-upgrade:
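For example:

```shell
# STATUS should be Ready and VERSION should show the new kubelet version
kubectl get nodes -o wide
```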


⚠️ Step 5: Troubleshooting Common Node Issues

Below are common node-related issues and their fixes.

❌ Issue 1: Nodes in NotReady State

🔹 Check node status

🔹 Check if kubelet is running
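For example (the kubelet check runs on the node itself, e.g. over SSH or SSM Session Manager):

```shell
kubectl describe node <node-name>   # inspect Conditions and Events
sudo systemctl status kubelet       # on the node itself
```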

Solution: Restart kubelet
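On the node:

```shell
sudo systemctl restart kubelet
```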

Solution: If disk is full, clean up logs
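On the node, for example:

```shell
# Check disk usage, then trim journal logs older than one day
df -h
sudo journalctl --vacuum-time=1d
```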


❌ Issue 2: New Pods Stuck in Pending

🔹 Check node capacity:

🔹 Check if there are taints:
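Both can be read from the node description, for example:

```shell
kubectl describe node <node-name> | grep -A 6 "Allocatable"
kubectl describe node <node-name> | grep Taints
```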

Solution: Remove unwanted taints
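For example (the key and effect below are placeholders; the trailing "-" removes the taint):

```shell
kubectl taint nodes <node-name> key1=value1:NoSchedule-
```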


❌ Issue 3: Node Fails to Join Cluster

🔹 Check logs for join errors:
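On the node, the kubelet logs usually show why registration failed:

```shell
sudo journalctl -u kubelet --no-pager | tail -n 50
```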

🔹 Check if IAM role is missing permissions

Solution: Attach proper IAM policies.
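The node role needs at least these AWS-managed policies (the role name is a placeholder):

```shell
aws iam attach-role-policy --role-name <node-role-name> \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam attach-role-policy --role-name <node-role-name> \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
aws iam attach-role-policy --role-name <node-role-name> \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
```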


❌ Issue 4: Disk Pressure Warning

🔹 Check disk usage:
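On the node:

```shell
df -h
```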

🔹 Clean up unused Docker images
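On the node (which command applies depends on the container runtime; recent EKS AMIs use containerd):

```shell
# containerd runtime: remove unused images
sudo crictl rmi --prune

# Docker runtime: remove unused images, containers, and networks
docker system prune -af
```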


❌ Issue 5: Nodes Stuck in Terminating

🔹 Check if node.kubernetes.io/unreachable taint is present:
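For example:

```shell
kubectl describe node <node-name> | grep Taints
```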

Solution: Manually delete the node
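For example:

```shell
kubectl delete node <node-name>
```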


📌 Summary

🔹 Monitor and manage node health

🔹 Add new nodes using eksctl or manually

🔹 Drain and delete nodes properly to prevent downtime

🔹 Upgrade nodes with a rolling update strategy

🔹 Troubleshoot node issues efficiently


🚀 NEXT: Configure Cluster Autoscaler or Karpenter to auto-scale nodes.
