Amazon SageMaker HyperPod

SageMaker HyperPod helps you provision resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). It accelerates development of FMs by removing the undifferentiated heavy lifting involved in building and maintaining large-scale compute clusters powered by thousands of accelerators such as AWS Trainium and NVIDIA A100 and H100 Graphics Processing Units (GPUs). When accelerators fail, the resiliency features of SageMaker HyperPod monitor the cluster instances and automatically detect and replace the faulty hardware on the fly, so that you can focus on running ML workloads.
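
The following is a minimal sketch of provisioning a HyperPod cluster with the AWS SDK for Python (Boto3). The cluster name, instance group name, instance type and count, lifecycle-script S3 location, and IAM role ARN are placeholder values that you would replace with your own.

  # Minimal sketch of creating a HyperPod cluster with Boto3.
  # All names, ARNs, and the S3 location below are placeholders.
  import boto3

  sm = boto3.client("sagemaker", region_name="us-west-2")

  response = sm.create_cluster(
      ClusterName="my-hyperpod-cluster",                    # placeholder cluster name
      InstanceGroups=[
          {
              "InstanceGroupName": "worker-group-1",        # placeholder group name
              "InstanceType": "ml.p5.48xlarge",             # H100-based instance type
              "InstanceCount": 4,
              "LifeCycleConfig": {
                  "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/",  # placeholder bucket
                  "OnCreate": "on_create.sh",               # entry-point script in that prefix
              },
              "ExecutionRole": "arn:aws:iam::111122223333:role/MyHyperPodRole",  # placeholder role
          }
      ],
      NodeRecovery="Automatic",  # let HyperPod replace faulty instances automatically
  )
  print(response["ClusterArn"])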

To get started, check Prerequisites for using SageMaker HyperPod, set up AWS Identity and Access Management for SageMaker HyperPod, and choose one of the following orchestrator options supported by SageMaker HyperPod.

Slurm support in SageMaker HyperPod

SageMaker HyperPod provides support for running machine learning workloads on resilient clusters by integrating with Slurm, an open-source workload manager. Slurm support in SageMaker HyperPod enables seamless cluster orchestration through Slurm cluster configuration, allowing you to set up head, login, and worker nodes on SageMaker HyperPod clusters. This integration also facilitates Slurm-based job scheduling for running ML workloads on the cluster, as well as direct access to cluster nodes for job scheduling. With HyperPod's lifecycle configuration support, you can customize the computing environment of the clusters to meet your specific requirements. Additionally, by leveraging the Amazon SageMaker distributed training libraries, you can optimize the clusters' performance on AWS compute and network resources. To learn more, see Orchestrating SageMaker HyperPod clusters with Slurm.
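
For illustration, the following sketch shows how the Slurm topology is commonly declared for HyperPod lifecycle scripts: a provisioning_parameters.json file maps the cluster's instance groups to controller (head), login, and worker roles. The group names, partition name, and exact schema shown here are assumptions based on sample lifecycle scripts; align them with the lifecycle scripts you actually deploy.

  # Sketch of a provisioning_parameters.json file consumed by HyperPod
  # lifecycle scripts; group names, the partition name, and the schema
  # are assumptions -- match them to your own lifecycle scripts.
  import json

  provisioning_parameters = {
      "version": "1.0.0",
      "workload_manager": "slurm",
      "controller_group": "controller-group",   # Slurm head node instance group
      "login_group": "login-group",             # optional login node instance group
      "worker_groups": [
          {
              "instance_group_name": "worker-group-1",
              "partition_name": "dev",           # Slurm partition served by this group
          }
      ],
  }

  # Write the file, then upload it to the same S3 prefix as your lifecycle scripts.
  with open("provisioning_parameters.json", "w") as f:
      json.dump(provisioning_parameters, f, indent=2)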

Amazon EKS support in SageMaker HyperPod

SageMaker HyperPod also integrates with Amazon EKS to enable large-scale training of foundation models on long-running and resilient compute clusters. This allows cluster admin users to provision HyperPod clusters and attach them to an EKS control plane, enabling dynamic capacity management, direct access to cluster instances, and resiliency capabilities. For data scientists, Amazon EKS support in HyperPod makes it possible to run containerized workloads for training foundation models and serving inference on the EKS cluster, and to leverage the job auto-resume capability for Kubeflow PyTorch training jobs. The architecture involves a 1-to-1 mapping between an EKS cluster (control plane) and a HyperPod cluster (worker nodes) within a VPC, providing a tightly integrated solution for running large-scale ML workloads. To learn more, see Orchestrating SageMaker HyperPod clusters with Amazon EKS.
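
To illustrate the 1-to-1 mapping, the following is a minimal sketch of attaching HyperPod worker nodes to an existing EKS control plane at cluster-creation time by referencing the EKS cluster ARN in the Orchestrator parameter. The ARN, subnets, security group, and other values are placeholders for resources you would already have in your account.

  # Minimal sketch of creating an EKS-orchestrated HyperPod cluster with Boto3.
  # The EKS cluster ARN, VPC settings, role, and S3 location are placeholders.
  import boto3

  sm = boto3.client("sagemaker", region_name="us-west-2")

  response = sm.create_cluster(
      ClusterName="my-hyperpod-eks-cluster",
      Orchestrator={
          "Eks": {
              # ARN of an existing EKS cluster that serves as the control plane
              "ClusterArn": "arn:aws:eks:us-west-2:111122223333:cluster/my-eks-cluster"
          }
      },
      InstanceGroups=[
          {
              "InstanceGroupName": "worker-group-1",
              "InstanceType": "ml.p5.48xlarge",
              "InstanceCount": 2,
              "LifeCycleConfig": {
                  "SourceS3Uri": "s3://amzn-s3-demo-bucket/eks-lifecycle/",
                  "OnCreate": "on_create.sh",
              },
              "ExecutionRole": "arn:aws:iam::111122223333:role/MyHyperPodRole",
          }
      ],
      VpcConfig={
          "SecurityGroupIds": ["sg-0123456789abcdef0"],
          "Subnets": ["subnet-0123456789abcdef0"],
      },
      NodeRecovery="Automatic",  # automatic replacement of faulty worker nodes
  )
  print(response["ClusterArn"])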

AWS Regions supported by SageMaker HyperPod

SageMaker HyperPod is available in the following AWS Regions.

  • us-east-1

  • us-east-2

  • us-west-2

  • eu-central-1

  • eu-west-1

  • eu-north-1

  • ap-southeast-1

  • ap-southeast-2

  • ap-northeast-1