Run distributed training workloads with Slurm on HyperPod

SageMaker HyperPod is specialized for training large language models (LLMs) and foundation models (FMs). These workloads often require multiple parallelism techniques and operations optimized for ML infrastructure and resources. On SageMaker HyperPod, you can use the following SageMaker distributed training libraries: the SageMaker distributed data parallelism (SMDDP) library and the SageMaker model parallelism (SMP) library.

Using SMDDP on a SageMaker HyperPod cluster

The SMDDP library is a collective communication library that improves the compute performance of distributed data parallel training. It works with open source distributed training frameworks such as PyTorch Distributed Data Parallel (DDP), PyTorch FSDP, and DeepSpeed.

The SMDDP library addresses the communication overhead of key collective communication operations by offering the following for SageMaker HyperPod:

  • The library offers AllGather optimized for AWS. AllGather is a key operation used in sharded data parallel training, a memory-efficient data parallelism technique offered by popular libraries such as the SageMaker model parallelism (SMP) library, DeepSpeed Zero Redundancy Optimizer (ZeRO), and PyTorch Fully Sharded Data Parallel (FSDP).

  • The library performs optimized node-to-node communication by fully utilizing the AWS network infrastructure and the SageMaker ML instance topology.

To run sample data-parallel training jobs

Explore the distributed training samples that implement data parallelism techniques with the SMDDP library.
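
A typical way to launch such a job on a Slurm-managed HyperPod cluster is an sbatch script that starts torchrun on each node. The following is a minimal sketch for a two-node data-parallel job; the job name, node count, and train.py script are hypothetical, and train.py is assumed to activate the SMDDP backend by importing smdistributed.dataparallel.torch.torch_smddp and calling torch.distributed.init_process_group(backend="smddp").

    #!/bin/bash
    #SBATCH --job-name=smddp-train       # hypothetical job name
    #SBATCH --nodes=2                    # number of ml.p4d.24xlarge nodes
    #SBATCH --ntasks-per-node=1          # one torchrun launcher per node
    #SBATCH --exclusive

    # Use the first node in the allocation as the rendezvous host.
    MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
    MASTER_PORT=29500

    # torchrun starts 8 worker processes per node, one per GPU on ml.p4d.24xlarge.
    srun torchrun \
        --nnodes="$SLURM_JOB_NUM_NODES" \
        --nproc_per_node=8 \
        --rdzv_backend=c10d \
        --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" \
        train.py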

To set up an environment for using the SMDDP library on SageMaker HyperPod

The following are the training environment requirements for using the SMDDP library on SageMaker HyperPod; commands to verify them are sketched after this list.

  • PyTorch v2.0.1 and later

  • CUDA v11.8 and later

  • libstdc++ runtime version greater than 3

  • Python v3.10.x and later

  • ml.p4d.24xlarge and ml.p4de.24xlarge, which are the instance types supported by the SMDDP library

  • Instance Metadata Service Version 2 (IMDSv2) enabled on the training host
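
A quick way to confirm these requirements on a compute node is to run a few shell checks. The following is a minimal sketch; the libstdc++ path assumes a typical Ubuntu-based image and may differ on yours.

    python3 --version                    # expect Python 3.10.x or later
    python3 -c "import torch; print(torch.__version__, torch.version.cuda)"    # expect PyTorch 2.0.1 or later built against CUDA 11.8 or later
    strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX | tail -n 1    # libstdc++ runtime version
    # A successful token request confirms that IMDSv2 is enabled on the host.
    curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60"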

Depending on how you want to run the distributed training job, there are two options to install the SMDDP library:

  • Installing the SMDDP binary file directly.

  • Using a SageMaker Deep Learning Container (DLC) that comes with the SMDDP library pre-installed.

The Docker images that come with the SMDDP library pre-installed and the URLs to the SMDDP binary files are listed at Supported Frameworks in the SMDDP library documentation.

To install the SMDDP library on the SageMaker HyperPod DLAMI
  • pip install --no-cache-dir https://smdataparallel.s3.amazonaws.com/binary/pytorch/<pytorch-version>/cuXYZ/YYYY-MM-DD/smdistributed_dataparallel-X.Y.Z-cp310-cp310-linux_x86_64.whl

    Note

    If you work in a Conda environment, ensure that you install PyTorch using conda install instead of pip.

    conda install pytorch==X.Y.Z torchvision==X.Y.Z torchaudio==X.Y.Z pytorch-cuda=X.Y.Z -c pytorch -c nvidia
To use the SMDDP library in a Docker container
  • The SMDDP library is pre-installed in the SageMaker Deep Learning Containers (DLCs). To find the list of SageMaker framework DLCs for PyTorch that include the SMDDP library, see Supported Frameworks in the SMDDP library documentation. You can also bring your own Docker container with the required dependencies installed to use the SMDDP library. To learn more about setting up a custom Docker container for the SMDDP library, see Create your own Docker container with the SageMaker distributed data parallel library.

    Important

    To use the SMDDP library in a Docker container, mount the /var/log directory from the host machine onto /var/log in the container. This can be done by adding the following option when running your container.

    docker run <OTHER_OPTIONS> -v /var/log:/var/log ...
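
    On a Slurm-managed HyperPod cluster, container workloads are often launched through srun with the Pyxis plugin and enroot rather than with docker run directly; in that case, the same mount can be passed with --container-mounts. The following is a minimal sketch that assumes Pyxis and enroot are installed on the cluster; <dlc-image-uri> is a placeholder for a SageMaker DLC image with the SMDDP library pre-installed, and train.py is a hypothetical training script.

    srun --container-image=<dlc-image-uri> \
         --container-mounts=/var/log:/var/log \
         torchrun --nnodes="$SLURM_JOB_NUM_NODES" --nproc_per_node=8 train.py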

To learn how to run data-parallel training jobs with SMDDP in general, see How to run a distributed training job with the SageMaker distributed data parallelism library.

Using SMP on a SageMaker HyperPod cluster

The SageMaker model parallelism (SMP) library offers various state-of-the-art model parallelism techniques, including:

  • fully sharded data parallelism

  • expert parallelism

  • mixed precision training with FP16/BF16 and FP8 data types

  • tensor parallelism

The SMP library is also compatible with open source frameworks such as PyTorch FSDP, NVIDIA Megatron, and NVIDIA Transformer Engine.

To run a sample model-parallel training workload

The SageMaker service teams provide sample training jobs implementing model parallelism with the SMP library at awsome-distributed-training/3.test_cases/17.SM-modelparallelv2.
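
To get started with these samples, you can clone the repository to the head node of your cluster and follow the README in the test case directory. The following is a minimal sketch that assumes the samples live in the public aws-samples GitHub repository of the same name.

    git clone https://github.com/aws-samples/awsome-distributed-training.git
    cd awsome-distributed-training/3.test_cases/17.SM-modelparallelv2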