Compatibility with the SMDDP library optimized for AWS infrastructure
You can use the SageMaker model parallelism library v2 (SMP v2) in conjunction with the SageMaker distributed data parallelism (SMDDP) library, which offers the AllGather collective communication operation optimized for AWS infrastructure. In distributed training, collective communication operations are designed to synchronize multiple GPU workers and exchange information between them. AllGather is one of the core collective communication operations typically used in sharded data parallelism. To learn more about the SMDDP AllGather operation, see SMDDP AllGather collective operation. Optimizing such collective communication operations directly contributes to faster end-to-end training without side effects on convergence.
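For illustration, the following minimal sketch shows what an AllGather collective does using the standard torch.distributed API. The function name allgather_example is hypothetical; SMDDP provides an AWS-optimized implementation of this same collective through the "smddp" backend.

import torch
import torch.distributed as dist

def allgather_example():
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # Each rank contributes its own local shard ...
    local_shard = torch.tensor([float(rank)], device="cuda")
    gathered = [torch.zeros_like(local_shard) for _ in range(world_size)]
    # ... and receives the shards of every other rank.
    dist.all_gather(gathered, local_shard)
    # Every rank ends up with the full, reassembled tensor.
    return torch.cat(gathered)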
Note
The SMDDP library supports P4 and P4de instances (see also Supported frameworks, AWS Regions, and instance types by the SMDDP library).
The SMDDP library integrates natively with PyTorch through the process group backend. To activate SMDDP and use its AllGather operation, you need to add two lines of code to your training script as part of Step 1: Adapt your PyTorch FSDP training script. Note that you need to initialize PyTorch Distributed with the SMDDP backend first, and then run the SMP initialization.
import torch.distributed as dist

# Initialize with SMDDP
import smdistributed.dataparallel.torch.torch_smddp
dist.init_process_group(backend="smddp")  # Replacing "nccl"

# Initialize with SMP
import torch.sagemaker as tsm
tsm.init()
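After this initialization, the rest of the FSDP training script stays unchanged; the sharding collectives run over the SMDDP process group. The following is a minimal sketch, assuming a model object is already constructed and a CUDA device is assigned to the process; the function name wrap_model is hypothetical and not part of the SMP or SMDDP API.

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_model(model: torch.nn.Module) -> FSDP:
    # With the process group initialized using backend="smddp" above,
    # FSDP's AllGather calls go through the SMDDP library.
    return FSDP(model, device_id=torch.cuda.current_device())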