GPU Autoscaling on EKS: Scaling LLM Inference Workloads with Karpenter in Production

By Aditya Krishnakumar

Elevator Pitch

Learn how Karpenter transforms GPU autoscaling on Amazon EKS. We share battle-tested strategies from a production deployment combining LLM inference, image processing, and Temporal orchestration. Provision GPU nodes on demand, optimize resource utilization, and reduce costs.

Description

LLM inference and image processing workloads demand high-performance GPU infrastructure, but GPU costs can spiral quickly without intelligent autoscaling. This talk explores how Karpenter transforms GPU provisioning on Amazon EKS.

Drawing from a production deployment, we’ll share practical strategies for:

• Configuring Karpenter NodePools for GPU workloads with instance family diversification (a minimal configuration sketch follows this list)
• Integrating GPU scaling with Temporal workflow orchestration
• Optimizing cost per inference through efficient resource bin-packing
• Handling cold starts and scaling latency in time-sensitive workloads
• Monitoring and debugging autoscaling behavior
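
To make the first point concrete, here is a minimal sketch of a GPU-oriented Karpenter NodePool (Karpenter v1 API). The instance families, GPU limit, and EC2NodeClass name are illustrative assumptions, not our production values:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    spec:
      requirements:
        # Diversify across GPU instance families so Karpenter can pick
        # whichever capacity is available and cheapest at provisioning time.
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      taints:
        # Keep non-GPU pods off these expensive nodes.
        - key: nvidia.com/gpu
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu            # assumed EC2NodeClass name
  limits:
    nvidia.com/gpu: "16"     # cap on total GPUs this pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m

Listing several instance families in one requirement is what enables the diversification mentioned above, while the taint and the pool-level GPU limit keep cost exposure bounded.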

You’ll learn patterns that work in practice: the failures we encountered, how we solved them, and configuration best practices that prevent costly mistakes. By the end, you’ll understand how to design EKS clusters that scale AI workloads intelligently and cost-effectively.

Notes

Key Takeaways

1. How Karpenter enables faster, more efficient GPU node provisioning than traditional cluster autoscalers (a sample workload manifest that triggers such provisioning is sketched below)
2. Practical patterns for integrating autoscaling with ML pipeline orchestration tools (Temporal)
3. Cost optimization techniques and monitoring strategies for GPU-accelerated inference
4. Common pitfalls and how to avoid them
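
For reviewers, a minimal sketch of the kind of GPU inference Deployment that drives this autoscaling: a pending pod requesting nvidia.com/gpu is what prompts Karpenter to provision a matching node. The image name and resource figures are placeholders, not our production values, and the manifest assumes the NVIDIA device plugin is installed so nodes advertise the nvidia.com/gpu resource:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      tolerations:
        # Tolerate the taint applied by the GPU NodePool above.
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: server
          image: example.com/llm-server:latest   # placeholder image
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1   # extended resources must match requests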