SageMaker's Distributed Model Parallel


To use new features with an existing notebook instance or Studio app, restart it to get the latest updates.

Amazon SageMaker's distributed model parallel library (the library) can be used to train large deep learning models that were previously difficult to train due to GPU memory limitations. The library automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.
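To make the idea of splitting a model concrete, the following is a toy sketch in plain Python. It is not the library's partitioning algorithm, which is not described here. It only illustrates one simple heuristic: grouping consecutive layers so that each group holds a roughly equal share of the parameters. The function name `partition_layers` and the example layer sizes are hypothetical.

```python
def partition_layers(param_counts, num_partitions):
    """Group consecutive layer indices so each group holds a similar parameter count.

    Illustrative heuristic only; not how the SageMaker library partitions models.
    """
    n = len(param_counts)
    if not 1 <= num_partitions <= n:
        raise ValueError("need 1 <= num_partitions <= number of layers")

    # Ideal share of parameters per partition.
    target = sum(param_counts) / num_partitions

    groups, current, running = [], [], 0
    for i, count in enumerate(param_counts):
        current.append(i)
        running += count
        groups_left = num_partitions - len(groups) - 1  # groups still to open after this one
        layers_left = n - i - 1
        # Close the current group once the cumulative total reaches its share,
        # or when every remaining layer is needed to keep later groups non-empty.
        reached_share = running >= target * (len(groups) + 1)
        if groups_left > 0 and (reached_share or layers_left == groups_left):
            groups.append(current)
            current = []
    groups.append(current)
    return groups


# Hypothetical per-layer parameter counts for a four-layer model.
example = partition_layers([4, 4, 4, 4], num_partitions=2)
```

In this sketch each group of layer indices would be placed on its own GPU. A real partitioner also has to account for activation memory and communication cost between devices, which this heuristic ignores.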

You can use the library to automatically partition your existing TensorFlow and PyTorch workloads across multiple GPUs with minimal code changes. You can access the library's API through the SageMaker Python SDK.
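As a rough illustration of what enabling the library through the SDK can look like, the sketch below builds the kind of `distribution` dictionary that a SageMaker Python SDK framework estimator accepts. The helper name `build_model_parallel_distribution` and all values are hypothetical, and the parameter names (`partitions`, `microbatches`, `pipeline`, `optimize`, `ddp`) are assumptions that may differ between library versions, so verify them against the library's API documentation. The sketch deliberately does not import `sagemaker`, so it runs without AWS credentials.

```python
def build_model_parallel_distribution(partitions, microbatches, processes_per_host):
    """Return a distribution config dict for a model parallel training job.

    Assumed key names; check them against the library's API documentation.
    """
    if partitions < 1 or microbatches < 1 or processes_per_host < 1:
        raise ValueError("all arguments must be positive integers")
    return {
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "partitions": partitions,      # number of model partitions
                    "microbatches": microbatches,  # microbatches per batch for pipelining
                    "pipeline": "interleaved",     # assumed pipeline schedule name
                    "optimize": "speed",           # assumed partitioning objective
                    "ddp": True,                   # assumed PyTorch-specific option
                },
            }
        },
        # The library launches its worker processes through MPI.
        "mpi": {"enabled": True, "processes_per_host": processes_per_host},
    }


distribution = build_model_parallel_distribution(
    partitions=2, microbatches=4, processes_per_host=8
)

# With the SDK installed, the dictionary would be passed to an estimator, for example:
#   from sagemaker.pytorch import PyTorch
#   estimator = PyTorch(entry_point="train.py", ..., distribution=distribution)
# The entry point name and the remaining estimator arguments are placeholders.
```

Keeping the configuration in a plain dictionary lets you inspect or unit test it before launching a training job.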

Use the following sections to learn more about model parallelism and the SageMaker model parallel library. The library's API documentation is located in the SageMaker Python SDK documentation under Distributed Training APIs. To see the latest updates to the library, refer to the release notes.