Autoscaling on SageMaker HyperPod EKS
Amazon SageMaker HyperPod provides a managed, Karpenter-based node autoscaling solution for clusters
created with EKS orchestration. Karpenter is an open-source node autoscaler for Kubernetes.
You pay only for what you use. You're responsible for paying for all compute instances
that are automatically provisioned through autoscaling according to standard SageMaker HyperPod
pricing. For detailed pricing information, see Amazon SageMaker AI pricing.
By enabling Karpenter-based autoscaling with HyperPod, you have access to:
- Service-managed lifecycle - HyperPod handles Karpenter installation, updates, and maintenance, eliminating operational overhead.
- Just-in-time provisioning - Karpenter observes your pending pods and provisions the compute your workloads require from the on-demand pool.
- Scale to zero - Scale down to zero nodes without maintaining dedicated controller infrastructure.
- Workload-aware node selection - Karpenter chooses optimal instance types based on pod requirements, Availability Zones, and pricing to minimize costs.
- Automatic node consolidation - Karpenter regularly evaluates the cluster for optimization opportunities, shifting workloads to eliminate underutilized nodes.
- Integrated resilience - Leverages HyperPod's built-in fault tolerance and node recovery mechanisms.
The following topics explain how to enable HyperPod autoscaling with Karpenter.
Prerequisites
- Continuous provisioning enabled on your HyperPod cluster. Enable continuous provisioning by setting --node-provisioning-mode to Continuous when creating your SageMaker HyperPod cluster. For more information, see Continuous provisioning for enhanced cluster operations on Amazon EKS.
- Health Monitoring Agent version 1.0.742.0_1.0.241.0 or above installed. The agent is required for HyperPod cluster operations and monitoring, and it must be configured before you enable Karpenter autoscaling so that cluster health reporting and node lifecycle management work correctly. For more information, see SageMaker HyperPod health-monitoring agent.
- If your Amazon EKS cluster already has Karpenter running on it, the Karpenter NodePool and NodeClaim versions must be v1.
- NodeRecovery set to automatic. For more information, see Automatic node recovery.
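The prerequisites above can be turned into a small pre-flight check. The following is a minimal sketch, not an official tool: it assumes the cluster settings are available as a dictionary shaped like a SageMaker DescribeCluster response (for example, from the boto3 SageMaker client's describe_cluster call), and the field names Orchestrator, NodeProvisioningMode, and NodeRecovery should be verified against the API reference. The helper names and the way the agent version string is compared (numerically, group by group) are assumptions made for illustration. The Karpenter v1 NodePool and NodeClaim requirement is not checked here.

```python
from typing import Dict, List, Tuple

# Minimum Health Monitoring Agent version stated in the prerequisites.
MIN_AGENT_VERSION = "1.0.742.0_1.0.241.0"


def parse_agent_version(version: str) -> Tuple[Tuple[int, ...], ...]:
    """Split 'a.b.c.d_e.f.g.h' into comparable tuples of integers.

    Assumption: each underscore-separated group is a dotted numeric
    version, and groups are compared left to right.
    """
    return tuple(
        tuple(int(part) for part in group.split("."))
        for group in version.split("_")
    )


def unmet_prerequisites(cluster: Dict, agent_version: str) -> List[str]:
    """Return descriptions of the prerequisites that are not satisfied.

    `cluster` is assumed to look like a DescribeCluster response.
    """
    problems = []
    if "Eks" not in cluster.get("Orchestrator", {}):
        problems.append("cluster is not using EKS orchestration")
    if cluster.get("NodeProvisioningMode") != "Continuous":
        problems.append("node provisioning mode is not Continuous")
    if cluster.get("NodeRecovery") != "Automatic":
        problems.append("NodeRecovery is not set to Automatic")
    if parse_agent_version(agent_version) < parse_agent_version(MIN_AGENT_VERSION):
        problems.append(
            f"Health Monitoring Agent is older than {MIN_AGENT_VERSION}"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical cluster description used only to exercise the check.
    example_cluster = {
        "Orchestrator": {"Eks": {}},
        "NodeProvisioningMode": "Continuous",
        "NodeRecovery": "None",
    }
    for problem in unmet_prerequisites(example_cluster, MIN_AGENT_VERSION):
        print("Unmet prerequisite:", problem)
```

Keeping the check as a pure function over a dictionary means it can be run against a saved cluster description without AWS credentials, and the same function can be reused once the dictionary is fetched from the live API.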