Extended Features of the SageMaker Model Parallel Library for PyTorch

In addition to its core features, the SageMaker distributed model parallel library offers memory-saving features for training deep learning models with PyTorch: tensor parallelism, optimizer state sharding, activation checkpointing, and activation offloading.
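These features are enabled through the model parallel configuration that you pass to the SageMaker PyTorch estimator. The sketch below shows one way such a configuration might look. The parameter names used here (for example tensor_parallel_degree, shard_optimizer_state, and offload_activations) follow the library's documented configuration keys, but the exact keys and supported values depend on the library version in your container, so treat this as an illustrative sketch rather than a definitive reference.

# Illustrative sketch: configuring the extended memory-saving features through
# the SageMaker PyTorch estimator. Parameter names and values are assumptions
# based on documented smdistributed.modelparallel configuration keys and may
# differ across library versions.
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "pipeline_parallel_degree": 1,    # core (pipeline) model parallelism
        "tensor_parallel_degree": 4,      # tensor parallelism across 4 GPUs
        "ddp": True,
        "shard_optimizer_state": True,    # optimizer state sharding
        "offload_activations": True,      # activation offloading
        "activation_loading_horizon": 4,  # how early offloaded activations are reloaded
    },
}

mpi_options = {
    "enabled": True,
    "processes_per_host": 8,
}

estimator = PyTorch(
    entry_point="train.py",              # your training script (hypothetical name)
    role="<your-sagemaker-execution-role-arn>",
    instance_type="ml.p3.16xlarge",
    instance_count=1,
    framework_version="1.8.1",
    py_version="py36",
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options,
    },
)

estimator.fit()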

Note

Extended memory-saving features are available through Deep Learning Containers for PyTorch that include the SageMaker distributed model parallel library v1.6.0 or later.

For each of the following features, you keep the same two-step workflow shown in the Run a SageMaker Distributed Training Job with Model Parallelism section and add a few additional parameters and code lines to the SageMaker PyTorch estimator and to your training script.
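In the training script, the additional code lines typically surround the usual smp.init(), smp.DistributedModel, and smp.DistributedOptimizer calls. The following is a minimal sketch, assuming the smdistributed.modelparallel.torch APIs shown (including smp.set_activation_checkpointing, the @smp.step decorator, and StepOutput.reduce_mean) behave as documented for library v1.6.0 or later; verify the exact signatures and module-wrapping requirements against the API reference for the version installed in your container.

# Minimal training-script sketch (assumed smdistributed.modelparallel.torch APIs):
# activation checkpointing is enabled per module in the script, while tensor
# parallelism, optimizer state sharding, and activation offloading are driven
# by the estimator configuration.
import torch
import torch.nn as nn
import smdistributed.modelparallel.torch as smp

smp.init()  # initialize the model parallel runtime from the estimator configuration
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# A toy model standing in for your real network.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
)
checkpointed_layer = model[0]  # keep a reference to the submodule to checkpoint

model = smp.DistributedModel(model)      # wrap for model parallelism
optimizer = smp.DistributedOptimizer(    # wrap so optimizer state can be sharded
    torch.optim.Adam(model.parameters(), lr=1e-4)
)

# Enable activation checkpointing for a submodule (assumed API; see the library's
# documentation for which modules can be checkpointed and when to call this).
smp.set_activation_checkpointing(checkpointed_layer)

@smp.step
def train_step(model, inputs, targets):
    outputs = model(inputs)
    loss = nn.functional.mse_loss(outputs, targets)
    model.backward(loss)  # smp requires model.backward instead of loss.backward
    return loss

inputs = torch.randn(16, 1024).to(device)
targets = torch.randn(16, 1024).to(device)

optimizer.zero_grad()
step_output = train_step(model, inputs, targets)
loss = step_output.reduce_mean()  # average the loss across microbatches
optimizer.step()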

For an example of how to use the extended features, see Train GPT-2 with PyTorch 1.8.1 and Tensor Parallelism Using the SageMaker Model Parallelism Library.