Core features of the SageMaker AI model parallelism library v2
The Amazon SageMaker AI model parallelism library v2 (SMP v2) offers distribution strategies and memory-saving techniques, such as sharded data parallelism, tensor parallelism, and checkpointing. The model parallelism strategies and techniques offered by SMP v2 help you distribute large models across multiple devices while optimizing training speed and memory consumption. SMP v2 also provides the torch.sagemaker Python package, which helps you adapt your training script with only a few lines of code changes.
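The following is a minimal sketch of what such an adaptation might look like. The model, optimizer, and input tensor are placeholders for illustration; only the torch.sagemaker calls (tsm.init and tsm.transform) come from the SMP v2 package, and how tsm.transform distributes the model depends on the SMP configuration you pass to the training job.

```python
# Minimal sketch of adapting a PyTorch training script for SMP v2.
# The model, optimizer, and data below are placeholders; only the
# torch.sagemaker lines are SMP-specific.
import torch
import torch.nn as nn
import torch.sagemaker as tsm

tsm.init()  # initialize the SMP v2 runtime for the training job

model = nn.Linear(1024, 1024)  # placeholder model
model = tsm.transform(model)   # apply SMP's distribution strategies per the SMP config

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# The training loop proceeds as in plain PyTorch.
inputs = torch.randn(8, 1024)
loss = model(inputs).sum()
loss.backward()
optimizer.step()
```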
This guide follows the basic two-step flow introduced in Use the SageMaker AI model parallelism library v2. To dive deep into the core features of SMP v2 and how to use them, see the following topics.
Note
These core features are available in SMP v2.0.0 and later and the SageMaker Python SDK v2.200.0 and later, and work with PyTorch v2.0.1 and later. To check the versions of the packages, see Supported frameworks and AWS Regions.
Topics
- Hybrid sharded data parallelism
- Expert parallelism
- Context parallelism
- Compatibility with the SMDDP library optimized for AWS infrastructure
- Mixed precision training
- Delayed parameter initialization
- Activation checkpointing
- Activation offloading
- Tensor parallelism
- Fine-tuning
- FlashAttention
- Checkpointing using SMP