Scaling using Karpenter
Utilizing Spot Instances at a ~60% discount for scaled nodes
Deployed using a Helm chart (values sketch below)
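Below is a minimal sketch of the Helm values for the install, assuming the official Karpenter chart and IRSA for the controller role. The cluster name, queue name, account ID, and role name are placeholders, and the exact key layout varies between chart versions (older versions nest these keys under `settings.aws`):

```yaml
# values.yaml: sketch for the Karpenter Helm chart; keys vary by chart version.
serviceAccount:
  annotations:
    # IRSA: the cluster's OIDC provider lets the controller assume this role.
    # Account ID and role name below are placeholders.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/KarpenterControllerRole
settings:
  clusterName: my-eks-cluster            # placeholder cluster name
  interruptionQueue: my-karpenter-queue  # optional SQS queue for Spot interruption events
controller:
  resources:
    requests:
      cpu: "1"
      memory: 1Gi
```

Applied with something like `helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --namespace karpenter --create-namespace -f values.yaml`.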
🔥 Problems before Karpenter:
~EC2 Auto Scaling Groups required predefining instance types and capacities
~Underutilized nodes due to static sizing and slow scaling decisions
~Provisioning delays → new nodes took time to launch and register
~High AWS costs due to over-provisioned capacity during traffic spikes
~Manual management of ASG launch templates for multiple workloads
~Node fragmentation → pods failed to schedule even with idle resources
~Inflexible scaling during rapid CI/CD deployments or workload bursts
~Hard to optimize for Spot + On-Demand mix without complexity
🛠️ Solutions using Karpenter:
✅ Installed Karpenter via Helm with IAM roles and OIDC trust on our EKS cluster
✅ Defined Provisioners based on workload needs (e.g., general, GPU, Spot); sketches follow this list
✅ Enabled automatic node provisioning with real-time pod-driven scaling
✅ Used consolidation to automatically downsize and bin-pack underutilized nodes
✅ Allowed Karpenter to choose optimal instance types and sizes across Spot and On-Demand
✅ Tuned ttlSecondsAfterEmpty for faster scale-down of idle nodes (second sketch below)
✅ Used taints, tolerations, and labels to isolate workloads by team/priority (third sketch below)
✅ Integrated with Prometheus and CloudWatch for cost visibility and scaling metrics
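As referenced in the list above, here is a sketch of a general-purpose Provisioner using the v1alpha5 API (newer Karpenter releases replace Provisioner with NodePool). The name, CPU limit, and providerRef are illustrative; the requirements leave Karpenter free to choose across Spot and On-Demand, and consolidation handles bin-packing:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: general
spec:
  # Prefer Spot but allow On-Demand fallback; Karpenter picks the cheapest fit.
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
  # Instance types left unconstrained so Karpenter can bin-pack freely.
  limits:
    resources:
      cpu: "1000"        # hard cap on total provisioned vCPU (illustrative)
  consolidation:
    enabled: true        # actively repack and remove underutilized nodes
  providerRef:
    name: default        # AWSNodeTemplate with subnets/security groups (not shown)
```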
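In the v1alpha5 API, consolidation.enabled and ttlSecondsAfterEmpty are mutually exclusive on a single Provisioner, so the empty-node TTL lives on a separate pool; the `batch` Provisioner below is a hypothetical example:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: batch               # hypothetical pool for bursty CI/CD jobs
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
  ttlSecondsAfterEmpty: 30  # reclaim a node ~30s after its last pod leaves
  providerRef:
    name: default
```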
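Finally, a sketch of taint/label isolation: a hypothetical `gpu` Provisioner taints its nodes, and only pods carrying the matching toleration (and node selector) land on them. The instance types and `team` label are illustrative:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["g4dn.xlarge", "g4dn.2xlarge"]  # illustrative GPU instance types
  labels:
    team: ml                                   # hypothetical team label
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule                       # keep non-GPU pods off these nodes
  providerRef:
    name: default
---
# Pod spec fragment: tolerate the taint and pin to the labeled nodes
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
nodeSelector:
  team: ml
```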
Here's a summary of what we achieved:
✅ ~45% cost reduction by leveraging Spot instances dynamically
✅ Improved pod scheduling speed → workloads spin up in under a minute
✅ ~30% better resource utilization via bin-packing and consolidation
✅ Eliminated need for ASG complexity → no static instance mapping
✅ Highly responsive scaling based on actual pod needs
✅ Faster rollout of high-load jobs → CI/CD jobs no longer queue for resources
✅ Simplified ops → no more managing launch templates or capacity planning
✅ Flexible provisioning → different instance types per workload with minimal config