SageMaker Distributed Model Parallelism Best Practices - Amazon SageMaker

SageMaker Distributed Model Parallelism Best Practices

Use the following guidelines when you run a distributed training job with the SageMaker model parallel library.

Setting Up the Right Configuration for a Given Model

When scaling up a model, we recommend you to go over the following list in order. Each list item discusses the advantage of using the library's techniques along with the tradeoffs that might arise.


If a model can fit well using a subset of the library's features, adding more model parallelism or memory saving features does not usually improve performance.

Using large GPU instance types
  • In the realm of model parallelism, it is best to use powerful instances with large GPU memories to handle overhead from model parallelism operations such as partitioning models across multiple GPUs. We recommend using ml.p4d or ml.p3dn instances for training large DL models. These instances are also equipped with Elastic Fabric Adapter (EFA), which provides higher network bandwidth and enables large-scale training with model parallelism.

Sharding optimizer state
  • The impact of sharding optimizer state depends on the number of data parallel ranks. Typically, a higher degree of data parallelism (proportional to the size of compute node) can improve the efficiency of memory usage.

    When you want to downsize a cluster, make sure you check the optimizer state sharding configuration. For example, a large DL model with optimizer state sharding that fits on a compute cluster with 16 GPUs (for example, two P4d or P4de instances) might not always fit on a node with 8 GPUs (for example, a single P4d or P4de instance). This is because the combined memory of 8 GPUs is lower than the combined memory of 16 GPUs, and the required memory per GPU for sharding over 8 GPUs is also higher than the memory per GPU for sharding over the 16-GPU scenario. As a result, the increased memory requirement might not fit into the smaller cluster.

    For more information, see Optimizer State Sharding.

Activation checkpointing
  • Memory efficiency can be improved by using activation checkpointing for a group of modules. The more you group the modules, the more efficient the memory usage. When checkpointing sequential modules for layers, the strategy argument of the smp.set_activation_checkpointing function groups the layers together for checkpointing. For example, grouping two or more layers together for checkpointing is more memory efficient than checkpointing one layer at a time, and this trades extra computation time for reduced memory usage.

    For more information, see Activation Checkpointing.

Tensor parallelism
  • The degree of tensor parallelism should be a power of two (2, 4, 8, ..., 2n), where the maximum degree must be equal to the number of GPUs per node. For example, if you use a node with 8 GPUs, possible numbers for the degree of tensor parallelism are 2, 4, and 8. We don’t recommend arbitrary numbers (such as 3, 5, 6, and 7) for the degree of tensor parallelism. When you use multiple nodes, misconfiguring the degree of tensor parallelism might result in running tensor parallelism across the nodes; this adds significant overhead from communication of activations across the nodes and can become computationally expensive.

    For more information, see Tensor Parallelism.

Pipeline parallelism across nodes
  • You can run pipeline parallelism both within a single node and across multiple nodes. When you use pipeline parallelism in combination with tensor parallelism, we recommend running pipeline parallelism across multiple nodes and keeping tensor parallelism within individual nodes.

  • Pipeline parallelism comes with the following three knobs: microbatches, active_microbatches, and prescaled_batch.

    • When you use tensor parallelism with pipeline parallelism, we recommend activating prescaled_batch so that the batch size per model parallel group can be increased for efficient pipelining. With prescaled_batch activated, the batch size set in the training script becomes tp_size times the batch size set for each rank without prescaled_batch.

    • Increasing the number of microbatches helps achieve efficient pipelining and better performance. Note that the effective microbatch size is the batch size divided by number of microbatches. If you increase the number of microbatches while keeping batch size constant, each microbatch processes fewer samples.

    • The number of active_microbatches is the maximum number of microbatches that are simultaneously in process during pipelining. For each active microbatch in process, its activations and gradients take up GPU memory. Therefore, increasing active_microbatches takes up more GPU memory.

  • If both GPU and GPU memory are underutilized, increase active_microbatches for better parallelization during pipelining.

  • For more information about how to use tensor parallelism with pipeline parallelism, see Tensor parallelism combined with pipeline parallelism.

  • To find descriptions of the aforementioned parameters, see Parameters for smdistributed in the SageMaker Python SDK documentation.

Offloading activations to CPU
  • Make sure that this is used in combination with activation checkpointing and pipeline parallelism. To ensure that the offloading and preloading happen in the background, specify a value greater than 1 to the microbatches parameter.

  • When offloading activations, you might be able to increase active_microbatches and sometimes match with the total number of microbatches. This depends on which modules are checkpointed and how the model is partitioned.

    For more information, see Activation Offloading.

Reference configurations

The SageMaker model parallelism training team provides the following reference points based on experiments with the GPT-2 model, the sequence length of 512, and the vocabulary size of 50,000.

The number of model parameters Instance type Pipeline parallelism Tensor parallelism Optimizer state sharding Activation checkpointing Prescaled batch Batch size
10 billion 16 ml.p4d.24xlarge 1 4 True Each transformer layer True batch_size=40
30 billion 16 ml.p4d.24xlarge 1 8 True Each transformer layer True batch_size=32
60 billion 32 ml.p4d.24xlarge 2 8 True Each transformer layer True batch_size=56, microbatches=4, active_microbatches=2

You can extrapolate from the preceding configurations to estimate GPU memory usage for your model configuration. For example, if you increase the sequence length for a 10-billion-parameter model or increase the size of the model to 20 billion, you might want to lower batch size first. If the model still doesn’t fit, try increasing the degree of tensor parallelism.

Modifying Your Training Script

  • Before you use the SageMaker model parallel library’s features in your training script, review The SageMaker Distributed Model Parallelism Library Configuration Tips and Pitfalls.

  • To launch a training job faster, use the SageMaker local mode. This helps you quickly run a training job locally on a SageMaker notebook instance. Depending on the scale of the ML instance on which your SageMaker notebook instance is running, you might need to adjust the size of your model by changing the model configurations, such as the hidden width, number of transformer layers, and attention heads. Validate if the reduced model runs well on the notebook instance before using a large cluster for training the full model.

Monitoring and Logging a Training Job Using the SageMaker Console and Amazon CloudWatch

To monitor system-level metrics such as CPU memory utilization, GPU memory utilization, and GPU utilization, use visualization provided through the SageMaker console.

  1. In the left navigation pane, choose Training.

  2. Choose Training jobs.

  3. In the main pane, choose the training job name for which you want to see more details.

  4. Browse the main pane and find the Monitor section to see the automated visualization.

  5. To see training job logs, choose View logs in the Monitor section. You can access the distributed training job logs of the training job in CloudWatch. If you launched multi-node distributed training, you should see multiple log streams with tags in the format of algo-n-1234567890. The algo-1 log stream tracks training logs from the main (0th) node.

For more information, see Monitor and Analyze Training Jobs Using Amazon CloudWatch Metrics.


To run a SageMaker training job with model parallelism or the SageMaker distributed training example notebooks, make sure you have the right permissions in your IAM role, such as the following: