SageMaker's Data Parallelism Library
The SageMaker data parallelism library extends SageMaker training capabilities on deep learning models with near-linear scaling efficiency, achieving fast time-to-train with minimal code changes.
When training a model on a large amount of data, machine learning practitioners often turn to distributed training to reduce the time to train. In some cases, where time is of the essence, the business requirement is to finish training as quickly as possible, or at least within a constrained time period. Distributed training is then scaled to use a cluster of multiple nodes: not just multiple GPUs in a single computing instance, but multiple instances, each with multiple GPUs. However, as the cluster size increases, performance can drop significantly. This drop is primarily caused by the communication overhead between the nodes in the cluster.
To resolve such overhead problems, SageMaker offers two distributed training options: SageMaker model parallelism and SageMaker data parallelism. This guide focuses on how to train models using the SageMaker data parallelism library.
- The library optimizes your training job for AWS network infrastructure and Amazon EC2 instance topology.
- The library takes advantage of gradient updates to communicate between nodes with a custom AllReduce algorithm.
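The following is a minimal sketch of enabling the library through the SageMaker Python SDK's PyTorch estimator by setting the smdistributed dataparallel distribution option. The IAM role, S3 paths, entry point, framework versions, and instance choices are placeholders and must be adapted to your account and training script.

```python
from sagemaker.pytorch import PyTorch

# Placeholder IAM role; replace with an execution role from your account.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

estimator = PyTorch(
    entry_point="train.py",           # your training script
    role=role,
    framework_version="1.12.0",       # a framework version supported by the library
    py_version="py38",
    instance_count=2,                 # multiple instances, each with multiple GPUs
    instance_type="ml.p4d.24xlarge",  # an instance type supported by the library
    # Enable the SageMaker data parallelism library for this training job.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

# Launch the distributed training job against data in S3 (placeholder path).
estimator.fit("s3://your-bucket/path/to/training-data")
```

With the library enabled this way, gradient communication between nodes goes through the library's AllReduce; see Run a SageMaker Distributed Training Job with Data Parallelism for the full walkthrough, including the changes required in the training script itself.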
To track the latest updates of the library, see the SageMaker Distributed Data Parallel Release Notes.
For more information about training with a model-parallel strategy, see SageMaker's Model Parallelism Library.
Topics
- Introduction to SageMaker's Distributed Data Parallel Library
- Supported Frameworks, AWS Regions, and Instance Types
- Run a SageMaker Distributed Training Job with Data Parallelism
- SageMaker Distributed Data Parallel Configuration Tips and Pitfalls
- Amazon SageMaker Data Parallel Library FAQ
- Data Parallel Troubleshooting