SageMaker HyperPod

SageMaker HyperPod helps you provision resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). It accelerates the development of FMs by removing the undifferentiated heavy lifting involved in building and maintaining large-scale compute clusters powered by thousands of accelerators, such as AWS Trainium and NVIDIA A100 and H100 Graphics Processing Units (GPUs). When accelerators fail, self-healing clusters automatically detect and replace the faulty hardware on the fly so that you can focus on running ML workloads for weeks or months without disruption. With SageMaker HyperPod, you can also customize your computing environment to suit your needs and configure it with the Amazon SageMaker distributed training libraries to achieve optimal performance on AWS.

Operating clusters

You can create, configure, and maintain SageMaker HyperPod clusters graphically through the console user interface (UI) and programmatically through the AWS Command Line Interface (CLI) or the AWS SDK for Python (Boto3). With Amazon VPC, you can secure the cluster network and connect your cluster to resources in your VPC, such as Amazon FSx for Lustre, a high-throughput shared file system. You can also assign different IAM roles to cluster instance groups and restrict the actions that your cluster resources and users can perform. To learn more, see Operate SageMaker HyperPod.
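As a rough illustration of the programmatic path, the following sketch assembles a request for the Boto3 SageMaker client's create_cluster call. The helper names (build_create_cluster_request, create_cluster) and the dictionary keys used for the input groups are hypothetical, and every name, ARN, S3 URI, subnet, and security group you pass in is your own placeholder. Check the request shape against the current CreateCluster API reference before relying on it.

```python
def build_create_cluster_request(cluster_name, instance_groups, subnet_ids, security_group_ids):
    """Assemble keyword arguments for the SageMaker CreateCluster API.

    `instance_groups` is a list of plain dicts (a hypothetical input shape
    chosen for this sketch), one per instance group.
    """
    return {
        "ClusterName": cluster_name,
        "InstanceGroups": [
            {
                "InstanceGroupName": group["name"],
                "InstanceType": group["instance_type"],
                "InstanceCount": group["count"],
                # Lifecycle scripts are read from S3; OnCreate names the entry point.
                "LifeCycleConfig": {
                    "SourceS3Uri": group["lifecycle_s3_uri"],
                    "OnCreate": group["on_create_script"],
                },
                # Each instance group can be given its own IAM role.
                "ExecutionRole": group["execution_role_arn"],
            }
            for group in instance_groups
        ],
        "VpcConfig": {
            "Subnets": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        },
    }


def create_cluster(request):
    """Send the request. Requires boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the builder above works without AWS access

    return boto3.client("sagemaker").create_cluster(**request)
```

Separating the request builder from the API call keeps the configuration easy to review and test before anything is provisioned.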

Configuring your ML environment

SageMaker HyperPod clusters run the SageMaker HyperPod DLAMI, which sets up an ML environment on the cluster instances. You can further customize the DLAMI by providing lifecycle scripts that support your use case. To learn more about how to set up lifecycle scripts, see Getting started with SageMaker HyperPod and SageMaker HyperPod lifecycle configuration best practices.

Scheduling jobs

After you successfully create a HyperPod cluster, cluster users can log in to the cluster nodes (such as the head or controller node, login nodes, and worker nodes) and schedule jobs for running machine learning workloads. To learn more, see Run jobs on SageMaker HyperPod clusters.
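For a cluster orchestrated with Slurm, scheduling a job from a login or controller node might look like the following sketch. It renders a small batch script and pipes it to sbatch, which reads the script from standard input when no file is given. The function names, job name, node count, and training command are hypothetical placeholders.

```python
import subprocess


def render_sbatch_script(job_name, nodes, command):
    """Return the text of a minimal Slurm batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --ntasks-per-node=1",
        "",
        # srun launches the command once per allocated task.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


def submit(script_text):
    """Submit the script with sbatch. Only works on a node where Slurm is installed."""
    result = subprocess.run(
        ["sbatch"],
        input=script_text,
        text=True,
        capture_output=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling submit(render_sbatch_script("train-demo", 4, "python train.py")) on a cluster node would queue the job; elsewhere, the rendered text can be saved to a file and inspected first.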

Resiliency against hardware failures

SageMaker HyperPod runs health checks on cluster nodes and provides workload auto-resume functionality. With the cluster resiliency features of HyperPod, in clusters with more than 16 nodes you can resume your workload from the last checkpoint you saved after faulty nodes are replaced with healthy ones. To learn more, see SageMaker HyperPod cluster resiliency.
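Resuming from the last saved checkpoint depends on the workload writing checkpoints in the first place. The following is a minimal, framework-free sketch of that pattern, not HyperPod's own mechanism: the training state is just a step counter stored as JSON, and the function names, the checkpoint interval, and the fail_at parameter (used only to simulate a node failure) are all hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so a crash never leaves a partial checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as handle:
        return json.load(handle)


def train(path, total_steps, checkpoint_every=10, fail_at=None):
    """Run from the last checkpoint to total_steps, saving periodically."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        # A real workload would run one training step here.
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If a run fails partway through, calling train again with the same path continues from the most recent checkpoint rather than from step 0. On a shared cluster, the checkpoint path would normally point to storage that survives node replacement, such as a shared file system.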

Logging and managing clusters

You can find SageMaker HyperPod resource utilization metrics and lifecycle logs in Amazon CloudWatch, and manage SageMaker HyperPod resources by tagging them. Each CreateCluster API call creates a distinct log stream, named in <cluster-name>-<timestamp> format. In the log stream, you can check the host names, the names of failed lifecycle scripts, and the output of the failed scripts, such as stdout and stderr. For more information, see SageMaker HyperPod cluster management.
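One way to locate those log streams programmatically is sketched below using the CloudWatch Logs client from Boto3 (describe_log_streams and get_log_events). The helper names are hypothetical, and the log group name is left as a caller-supplied argument because it is not stated here. The stream-name prefix follows the <cluster-name>-<timestamp> format described above.

```python
def find_lifecycle_log_streams(logs_client, log_group_name, cluster_name):
    """List log stream names that start with '<cluster-name>-'."""
    kwargs = {
        "logGroupName": log_group_name,
        "logStreamNamePrefix": f"{cluster_name}-",
    }
    names = []
    while True:
        page = logs_client.describe_log_streams(**kwargs)
        names.extend(stream["logStreamName"] for stream in page.get("logStreams", []))
        token = page.get("nextToken")
        if not token:
            return names
        kwargs["nextToken"] = token  # fetch the next page of results


def read_messages(logs_client, log_group_name, stream_name, limit=100):
    """Return up to `limit` log messages from the start of one stream."""
    response = logs_client.get_log_events(
        logGroupName=log_group_name,
        logStreamName=stream_name,
        startFromHead=True,
        limit=limit,
    )
    return [event["message"] for event in response.get("events", [])]
```

With AWS credentials configured, logs_client would be boto3.client("logs"). Passing the client in as an argument also makes the helpers easy to exercise with a stub.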

Compatible with SageMaker tools

Using SageMaker HyperPod, you can configure clusters with the AWS-optimized collective communication libraries offered by SageMaker, such as the SageMaker distributed data parallelism (SMDDP) library. The SMDDP library implements the AllGather operation, optimized for the AWS compute and network infrastructure, for the most performant SageMaker machine learning instances powered by NVIDIA A100 GPUs. To learn more, see Run distributed training workloads with Slurm on HyperPod.
