
SageMaker's Data Parallelism Library

The SageMaker data parallelism library extends SageMaker's training capabilities for deep learning models, providing near-linear scaling efficiency and fast time-to-train with minimal code changes.

When training a model on a large amount of data, machine learning practitioners often turn to distributed training to reduce the time to train. In some cases, where time is of the essence, the business requirement is to finish training as quickly as possible, or at least within a constrained time period. To meet that requirement, distributed training is scaled out to a cluster of multiple nodes: not just multiple GPUs in a single compute instance, but multiple instances that each have multiple GPUs. However, as the cluster size increases, performance can drop significantly. This drop is caused primarily by the communication overhead between nodes in the cluster.

To resolve such overhead problems, SageMaker offers two distributed training options: SageMaker model parallelism and SageMaker data parallelism. This guide focuses on how to train models using the SageMaker data parallelism library.
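
As a minimal launch sketch, the library is typically enabled by passing a distribution configuration to a framework estimator in the SageMaker Python SDK. The example below assumes a PyTorch training job; the entry point, IAM role, framework and Python versions, instance settings, and S3 URI are illustrative placeholders to adapt to your account and training script.

```python
# A minimal launch sketch; the entry point, IAM role, versions, instance
# settings, and S3 URI are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                # hypothetical training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",   # placeholder IAM role
    framework_version="1.13.1",
    py_version="py39",
    instance_type="ml.p4d.24xlarge",   # a multi-GPU instance type
    instance_count=2,                  # multiple instances, each with multiple GPUs
    # Enable the SageMaker data parallelism library for this training job.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit("s3://amzn-s3-demo-bucket/training-data/")   # placeholder S3 input
```

With this configuration, the library addresses the communication overhead described above in two main ways: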

  • The library optimizes your training job for AWS network infrastructure and Amazon EC2 instance topology.

  • The library uses a custom AllReduce algorithm to communicate gradient updates between nodes, as sketched below.
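
Inside the training script, the optimized AllReduce is used through the library's PyTorch integration. The following is a minimal sketch, assuming the PyTorch variant of the library in a recent version that registers `smddp` as a `torch.distributed` backend; the model, data, and hyperparameters are placeholders, and the standard `DistributedDataParallel` workflow otherwise applies.

```python
# A minimal training-script sketch; the model, data, and hyperparameters are placeholders.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401  (registers the "smddp" backend)


def main():
    # Use the library's optimized AllReduce for inter-node gradient communication.
    dist.init_process_group(backend="smddp")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = nn.Linear(128, 10).to(local_rank)      # placeholder model
    model = DDP(model, device_ids=[local_rank])    # gradients are AllReduced during backward()

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(10):                            # placeholder training loop
        inputs = torch.randn(32, 128).to(local_rank)
        targets = torch.randint(0, 10, (32,)).to(local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()                            # AllReduce of gradients happens here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Because the rest of the script follows the standard PyTorch distributed workflow, switching between the library's backend and other backends typically requires only the import and the backend name passed to `init_process_group`.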

To track the latest updates of the library, see the SageMaker Distributed Data Parallel Release Notes in the SageMaker Python SDK documentation.

For more information about training with a model-parallel strategy, see SageMaker's Model Parallelism Library.