Compatibility with the SMDDP library optimized for AWS infrastructure
You can use the SageMaker model parallelism library v2 (SMP v2) in conjunction with the SageMaker distributed data parallelism (SMDDP) library, which offers the AllGather collective communication operation optimized for AWS infrastructure. In distributed training, collective communication operations are designed to synchronize multiple GPU workers and exchange information between them. AllGather is one of the core collective communication operations typically used in sharded data parallelism. To learn more about the SMDDP AllGather operation, see SMDDP AllGather collective operation. Optimizing such collective communication operations directly contributes to faster end-to-end training without side effects on convergence.
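For illustration, the following minimal sketch shows what an AllGather collective does using the standard torch.distributed API. The function name allgather_example is hypothetical; SMDDP provides an AWS-optimized implementation of this same collective through the "smddp" backend.

import torch
import torch.distributed as dist

def allgather_example():
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # Each rank contributes its own local shard ...
    local_shard = torch.tensor([float(rank)], device="cuda")
    gathered = [torch.zeros_like(local_shard) for _ in range(world_size)]
    # ... and receives the shards of every other rank.
    dist.all_gather(gathered, local_shard)
    # Every rank ends up with the full, reassembled tensor.
    return torch.cat(gathered)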
Note
The SMDDP library supports P4 and P4de instances (see also Supported frameworks, AWS Regions, and instance types by the SMDDP library).
The SMDDP library integrates natively with PyTorch through the process group backend. To activate SMDDP and use its AllGather operation, you need to add two lines of code to your training script as part of Step 1: Adapt your PyTorch FSDP training script. Note that you need to initialize PyTorch Distributed with the SMDDP backend first, and then run the SMP initialization.
import torch.distributed as dist

# Initialize with SMDDP
import smdistributed.dataparallel.torch.torch_smddp
dist.init_process_group(backend="smddp")  # Replacing "nccl"

# Initialize with SMP
import torch.sagemaker as tsm
tsm.init()
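After this initialization, the rest of the FSDP training script stays unchanged; the sharding collectives run over the SMDDP process group. The following is a minimal sketch, assuming a model object is already constructed and a CUDA device is assigned to the process; the function name wrap_model is hypothetical and not part of the SMP or SMDDP API.

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_model(model: torch.nn.Module) -> FSDP:
    # With the process group initialized using backend="smddp" above,
    # FSDP's AllGather calls go through the SMDDP library.
    return FSDP(model, device_id=torch.cuda.current_device())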