Challenges

A mock interview-style scenario focused on the Karpenter integration in a Kubernetes/EKS setup: what was done, the challenges faced, how they were solved, and what was achieved.


🚀 Context

You are explaining to the interviewer how you migrated from Cluster Autoscaler (or manual node scaling) to Karpenter for dynamic and cost-efficient node provisioning.


👀 Interviewer: You mentioned implementing Karpenter in your EKS cluster. Can you walk me through what triggered the migration and how you handled it?


πŸ‘¨β€πŸ’» Pavan: Sure. Our workloads were scaling fast, but our Cluster Autoscaler was slow to react, especially during traffic spikes. We were also bound to predefined node groups and instance types, which wasn’t optimal in terms of cost or flexibility.

So, we introduced Karpenter to dynamically provision EC2 instances based on real-time pod demand. The first step was creating the necessary IAM roles and the node template (AWSNodeTemplate in the alpha Provisioner API; EC2NodeClass in newer Karpenter releases), and ensuring our EKS cluster had a valid OIDC provider for IRSA. We then deployed Karpenter using Helm with custom values, including spot and on-demand capacity configuration.
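For context, here is a minimal sketch of what such Helm values can look like; the cluster name, account ID, role, and queue name are placeholders, and exact keys vary across Karpenter chart versions:

```yaml
# values.yaml: a hedged sketch, not exact production values
serviceAccount:
  annotations:
    # IRSA: the cluster's OIDC provider lets this service account assume an IAM role
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/KarpenterControllerRole  # placeholder
settings:
  clusterName: my-eks-cluster                    # placeholder
  interruptionQueue: karpenter-my-eks-cluster    # SQS queue for spot interruption events
```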


👀 Interviewer: Nice. What challenges did you face during that rollout?


πŸ‘¨β€πŸ’» Pavan: One key challenge was understanding how Karpenter handles capacity provisioning differently. It doesn’t work like Cluster Autoscaler where you scale node groups. Instead, it uses provisioners and node templates to spin up exactly what’s needed.

Initially, we faced excess node provisioning due to misconfigured consolidation settings and missing taints/tolerations. We solved this by refining our Provisioner (sketched after this list) to include:

  • Specific requirements for instance families (e.g., m5, c6a)

  • Spot instance preference with on-demand fallback

  • Consolidation enabled for idle-node cleanup (in the v1alpha5 API this is consolidation.enabled: true; ttlSecondsAfterEmpty is the mutually exclusive alternative for cleaning up empty nodes)

We also tagged the subnets and security groups correctly so Karpenter's discovery selectors could find them, avoiding scheduling failures.
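A sketch of what such a Provisioner and its companion node template can look like in the v1alpha5 API; the names, CPU limit, and discovery tag value are illustrative:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    # Pin to the instance families validated for price/performance
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: ["m5", "c6a"]
    # Allow both capacity types; Karpenter prefers spot when it is available
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  consolidation:
    enabled: true            # repack and remove under-utilized nodes
  limits:
    resources:
      cpu: "1000"            # illustrative ceiling on total provisioned vCPU
  providerRef:
    name: default
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  # Subnets and security groups are discovered by tag; these tags must
  # actually exist on the AWS resources, or new pods stay unschedulable
  subnetSelector:
    karpenter.sh/discovery: my-eks-cluster     # placeholder tag value
  securityGroupSelector:
    karpenter.sh/discovery: my-eks-cluster
```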


👀 Interviewer: Did you integrate it with workload-specific needs, like GPU or memory-intensive workloads?


πŸ‘¨β€πŸ’» Pavan: Yes, and that’s where Karpenter really helped. For example, our AI workloads required GPU instances, and earlier we kept a separate node group always running, which was expensive.

With Karpenter, we defined a separate Provisioner restricted to GPU instance types like g4dn.xlarge and tainted with nvidia.com/gpu, so only pods requesting that resource land on its nodes. This way, nodes were launched on demand, only when GPU pods were scheduled, and terminated automatically when idle. A sketch of that setup follows.
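A minimal sketch of that GPU Provisioner plus a pod that lands on it; the pod name and image are hypothetical, and it assumes the NVIDIA device plugin is installed so nodes expose the nvidia.com/gpu resource:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["g4dn.xlarge"]      # extend with other validated GPU types
  # The taint keeps non-GPU pods off these expensive nodes
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
  consolidation:
    enabled: true                  # terminate GPU nodes once they sit idle
  providerRef:
    name: default
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference              # hypothetical workload
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: inference
      image: registry.example.com/inference:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1        # the request that triggers GPU provisioning
```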

This brought a 40–50% cost reduction for those workloads alone.


👀 Interviewer: How did you test the reliability of Karpenter in production?


πŸ‘¨β€πŸ’» Pavan: We followed a phased rollout strategy. First, we kept Cluster Autoscaler as-is and labeled a few deployments with karpenter.sh/provisioner-name. We observed how nodes were provisioned.

We also generated synthetic load, using KEDA to scale replicas up and down, and verified scaling behavior, pod-to-node fit, and termination grace periods. Only after two weeks of stable behavior did we disable Cluster Autoscaler and make Karpenter the primary provisioning tool.
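One way to drive that kind of synthetic load is a KEDA cron trigger that scales a load-generator deployment on a schedule; the names and schedule below are hypothetical:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: load-generator-scaler
spec:
  scaleTargetRef:
    name: load-generator           # hypothetical deployment that produces traffic
  minReplicaCount: 0
  maxReplicaCount: 50
  triggers:
    - type: cron
      metadata:
        timezone: Etc/UTC
        start: "0 10 * * *"        # burst up at 10:00...
        end: "0 11 * * *"          # ...and back down at 11:00
        desiredReplicas: "50"
```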


👀 Interviewer: Did you face any scaling or interruption issues with spot instances?


πŸ‘¨β€πŸ’» Pavan: Yes, initially we didn’t handle spot instance interruptions gracefully. Pods would get evicted during AWS spot interruptions.

To fix this, we:

  • Deployed Karpenter's interruption handling, which watches an SQS queue for spot interruption notices and cordons and drains affected nodes before reclamation

  • Enabled pod disruption budgets (PDBs) for critical services

  • Configured pods to handle preemption gracefully and added topologySpreadConstraints for zone distribution

Now, even during interruptions, our apps reschedule quickly and maintain SLOs.
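For illustration, here is a PDB plus zone-spread sketch for one such critical service; the service name and image are hypothetical:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb               # hypothetical critical service
spec:
  minAvailable: 2                  # never drain below two ready replicas
  selector:
    matchLabels:
      app: checkout
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 4
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      # Spread replicas across zones so a spot reclaim in one AZ
      # cannot take out every replica at once
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: checkout
      containers:
        - name: checkout
          image: registry.example.com/checkout:latest   # placeholder
```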


👀 Interviewer: Sounds impressive. What's the overall outcome or impact of this migration?


πŸ‘¨β€πŸ’» Pavan: The impact was significant:

✅ Reduced provisioning time from ~2 minutes (Cluster Autoscaler) to under 30 seconds

✅ 40–60% cost savings by dynamically using spot + right-sized instances

✅ More reliable auto-scaling, especially for bursty workloads

✅ Simplified infra management: no more managing dozens of node groups

✅ Improved developer experience: devs just declare resource needs, and Karpenter handles the infra

