Orchestrating SageMaker HyperPod clusters with Slurm
Slurm support in SageMaker HyperPod helps you provision resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). It accelerates development of FMs by removing the undifferentiated heavy lifting involved in building and maintaining large-scale compute clusters powered by thousands of accelerators such as AWS Trainium and NVIDIA A100 and H100 Graphics Processing Units (GPUs). When accelerators fail, the resiliency features of SageMaker HyperPod monitor the cluster instances and automatically detect and replace the faulty hardware on the fly so that you can focus on running ML workloads. Additionally, with lifecycle configuration support in SageMaker HyperPod, you can customize your computing environment to best suit your needs and configure it with the Amazon SageMaker distributed training libraries to achieve optimal performance on AWS.
Operating clusters
You can create, configure, and maintain SageMaker HyperPod clusters graphically through the console user interface (UI) and programmatically through the AWS Command Line Interface (AWS CLI) or the AWS SDK for Python (Boto3). With Amazon VPC, you can secure the cluster network and configure your cluster with resources in your VPC, such as Amazon FSx for Lustre for high-throughput shared storage. You can also assign different IAM roles to cluster instance groups, and limit the actions that your cluster resources and users can perform. To learn more, see SageMaker HyperPod operation.
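As a rough illustration of the programmatic path, the following sketch assembles a request for the SageMaker CreateCluster API and hands it to Boto3. The cluster name, instance group names, instance types, S3 URI, role ARNs, subnet ID, and security group ID are placeholders invented for this example, and the default script name on_create.sh is an assumption. Check the field names against the current API reference before relying on them.

```python
def build_create_cluster_request(cluster_name, instance_groups, subnet_ids, security_group_ids):
    """Assemble keyword arguments for the SageMaker CreateCluster API."""
    if not instance_groups:
        raise ValueError("at least one instance group is required")
    return {
        "ClusterName": cluster_name,
        "InstanceGroups": [
            {
                "InstanceGroupName": group["name"],
                "InstanceType": group["instance_type"],
                "InstanceCount": group["count"],
                "LifeCycleConfig": {
                    # S3 prefix that holds the lifecycle scripts (placeholder).
                    "SourceS3Uri": group["lifecycle_s3_uri"],
                    # Entry-point script name; "on_create.sh" is an assumed default.
                    "OnCreate": group.get("on_create", "on_create.sh"),
                },
                # Each instance group can carry its own IAM role.
                "ExecutionRole": group["role_arn"],
            }
            for group in instance_groups
        ],
        "VpcConfig": {
            "Subnets": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        },
    }


def create_cluster(request):
    """Send the request to SageMaker. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("sagemaker")
    return client.create_cluster(**request)


# All values below are hypothetical placeholders.
request = build_create_cluster_request(
    cluster_name="my-hyperpod-cluster",
    instance_groups=[
        {
            "name": "controller-group",
            "instance_type": "ml.m5.xlarge",
            "count": 1,
            "lifecycle_s3_uri": "s3://amzn-s3-demo-bucket/lifecycle/",
            "role_arn": "arn:aws:iam::111122223333:role/ControllerRole",
        },
        {
            "name": "worker-group",
            "instance_type": "ml.p4d.24xlarge",
            "count": 2,
            "lifecycle_s3_uri": "s3://amzn-s3-demo-bucket/lifecycle/",
            "role_arn": "arn:aws:iam::111122223333:role/WorkerRole",
        },
    ],
    subnet_ids=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
)
```

Calling create_cluster(request) submits the request. Building the request separately makes it easy to review which role each instance group receives before anything is provisioned.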
Configuring your ML environment
SageMaker HyperPod runs SageMaker HyperPod DLAMI, which sets up an ML environment on the HyperPod clusters. You can configure additional customizations to the DLAMI by providing lifecycle scripts to support your use case. To learn more about how to set up lifecycle scripts, see Tutorial for getting started with SageMaker HyperPod and Customize SageMaker HyperPod clusters using lifecycle scripts.
Scheduling jobs
After you successfully create a HyperPod cluster, cluster users can log in to the cluster nodes (such as the head or controller node, login node, and worker nodes) and schedule jobs for running machine learning workloads. To learn more, see Jobs on SageMaker HyperPod clusters.
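To give a sense of what scheduling a job can look like, here is a minimal sketch that renders a Slurm batch script and submits it with sbatch. The job name, node count, and training command are hypothetical, and the script assumes it runs on a node where the sbatch command is available.

```python
import subprocess
import tempfile


def render_sbatch_script(job_name, nodes, command):
    """Return the text of a minimal Slurm batch script."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        # %x expands to the job name and %j to the job ID.
        "#SBATCH --output=%x-%j.out",
        "",
        # srun launches the command on the allocated nodes.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


def submit(script_text):
    """Write the script to a temporary file and submit it with sbatch."""
    with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as handle:
        handle.write(script_text)
        path = handle.name
    result = subprocess.run(
        ["sbatch", path], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


# Hypothetical job: four nodes running a placeholder training script.
script = render_sbatch_script("train-demo", 4, "python train.py")
```

Calling submit(script) from a cluster node hands the job to the Slurm scheduler and returns the confirmation message that sbatch prints.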
Resiliency against hardware failures
SageMaker HyperPod runs health checks on cluster nodes and provides a workload auto-resume functionality. With the cluster resiliency features of HyperPod, in clusters with more than 16 nodes, you can resume your workload from the last checkpoint you saved after faulty nodes are replaced with healthy ones. To learn more, see SageMaker HyperPod cluster resiliency.
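Resuming from the last saved checkpoint depends on your training script being able to find that checkpoint when the job restarts. The sketch below shows one way to do that on the workload side. The checkpoint-<step>.pt naming scheme and the directory layout are assumptions made for this example, not a HyperPod convention.

```python
import re
from pathlib import Path

# Assumed naming scheme: checkpoint-<step>.pt, for example checkpoint-1200.pt.
_CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)\.pt$")


def find_latest_checkpoint(checkpoint_dir):
    """Return the path of the checkpoint with the highest step, or None."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    best_step, best_path = -1, None
    for path in directory.iterdir():
        match = _CHECKPOINT_PATTERN.match(path.name)
        if match is None:
            continue
        step = int(match.group(1))
        # Compare steps numerically so that step 100 ranks above step 20.
        if step > best_step:
            best_step, best_path = step, path
    return best_path
```

A training script can call find_latest_checkpoint at startup and load the returned file when it is not None, so a restarted job continues from its last saved state instead of starting over.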
Logging and managing clusters
You can find SageMaker HyperPod resource utilization metrics and lifecycle logs in Amazon CloudWatch, and manage SageMaker HyperPod resources by tagging them. Each CreateCluster API run creates a distinct log stream, named in <cluster-name>-<timestamp> format. In the log stream, you can check the host names, the names of failed lifecycle scripts, and outputs from the failed scripts such as stdout and stderr. For more information, see SageMaker HyperPod cluster management.
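Because each CreateCluster run produces its own log stream, a common first step when debugging lifecycle scripts is to find the most recent stream for a cluster. The sketch below does this with the CloudWatch Logs DescribeLogStreams API through Boto3. The log group name is left as a parameter because it is not stated here, and the cluster name is a placeholder.

```python
def latest_cluster_log_stream(log_streams, cluster_name):
    """Pick the newest log stream whose name starts with '<cluster-name>-'.

    log_streams is a list of dicts shaped like the entries returned by
    the CloudWatch Logs DescribeLogStreams API.
    """
    prefix = f"{cluster_name}-"
    matching = [s for s in log_streams if s["logStreamName"].startswith(prefix)]
    if not matching:
        return None
    # creationTime is used for ordering so the timestamp format in the
    # stream name does not need to be parsed.
    newest = max(matching, key=lambda s: s.get("creationTime", 0))
    return newest["logStreamName"]


def fetch_log_streams(log_group_name, cluster_name):
    """List candidate streams. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    logs = boto3.client("logs")
    paginator = logs.get_paginator("describe_log_streams")
    streams = []
    for page in paginator.paginate(
        logGroupName=log_group_name,
        logStreamNamePrefix=f"{cluster_name}-",
    ):
        streams.extend(page["logStreams"])
    return streams
```

Note that a plain prefix match also picks up streams from clusters whose names extend the given name (for example, my-cluster-2 when you search for my-cluster), so use cluster names that do not share prefixes or filter the results further.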
Compatible with SageMaker tools
Using SageMaker HyperPod, you can configure clusters with AWS-optimized collective communications libraries offered by SageMaker, such as the SageMaker distributed data parallelism (SMDDP) library. The SMDDP library implements the AllGather operation optimized for the AWS compute and network infrastructure for the most performant SageMaker machine learning instances powered by NVIDIA A100 GPUs. To learn more, see Run distributed training workloads with Slurm on HyperPod.
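For readers new to collective communication, the following pure-Python sketch simulates what an AllGather operation computes: every rank contributes a shard, and every rank ends up with all shards in rank order. It is only a single-process illustration of the semantics. It does not use the SMDDP library or any of its APIs, and it says nothing about how SMDDP optimizes the operation.

```python
def all_gather(shards_by_rank):
    """Simulate AllGather: each rank receives every rank's shard.

    shards_by_rank[i] is the list of items held by rank i before the
    operation. The return value has one entry per rank, and each entry
    is the concatenation of all shards in rank order.
    """
    gathered = [item for shard in shards_by_rank for item in shard]
    # Each rank gets its own copy, mirroring separate memory per device.
    return [list(gathered) for _ in shards_by_rank]


# Three simulated ranks holding different slices of the same data.
result = all_gather([[1, 2], [3], [4, 5]])
```

After the call, each of the three entries in result holds the full list of five items, which is the state that each device reaches at the end of a real AllGather.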
Topics
- Tutorial for getting started with SageMaker HyperPod
- SageMaker HyperPod operation
- Customize SageMaker HyperPod clusters using lifecycle scripts
- Jobs on SageMaker HyperPod clusters
- SageMaker HyperPod cluster resources monitoring
- SageMaker HyperPod cluster resiliency
- SageMaker HyperPod cluster management
- SageMaker HyperPod FAQ