
SageMaker Distributed Model Parallelism Best Practices

Use the following guidelines when you run a distributed training job with the SageMaker model parallel library.

Setting Up the Right Configuration for a Given Model

When scaling up a model, we recommend that you go through the following list in order. Each item discusses the advantage of using the library's techniques along with the tradeoffs that might arise.

Tip

If a model already fits using a subset of the library's features, adding more model parallelism or memory-saving features usually does not improve performance.

Using large GPU instance types
  • In the realm of model parallelism, it is best to use powerful instances with large GPU memories to handle overhead from model parallelism operations such as partitioning models across multiple GPUs. We recommend using ml.p4d or ml.p3dn instances for training large DL models. These instances are also equipped with Elastic Fabric Adapter (EFA), which provides higher network bandwidth and enables large-scale training with model parallelism.
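
The following is a minimal sketch of launching a training job on an EFA-equipped instance type with the SageMaker Python SDK. The entry point script name, IAM role, and framework versions are placeholders; adjust them for your environment.

```python
import sagemaker
from sagemaker.pytorch import PyTorch

# Minimal estimator targeting a large-memory, EFA-equipped GPU instance type.
estimator = PyTorch(
    entry_point="train.py",              # placeholder training script
    role=sagemaker.get_execution_role(),
    instance_type="ml.p4d.24xlarge",     # 8 x A100 GPUs per node, EFA networking
    instance_count=2,
    framework_version="1.12",            # example version; use one supported by the library
    py_version="py38",
)
```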

Sharding optimizer state
  • The impact of sharding optimizer state depends on the number of data parallel ranks. Typically, a higher degree of data parallelism (which grows with the size of the compute cluster) improves the efficiency of memory usage.

    When you want to downsize a cluster, make sure you check the optimizer state sharding configuration. For example, a large DL model with optimizer state sharding that fits on a compute cluster with 16 GPUs (for example, two P4d or P4de instances) might not always fit on a node with 8 GPUs (for example, a single P4d or P4de instance). Not only is the combined memory of 8 GPUs lower than that of 16 GPUs, but sharding the optimizer state over 8 GPUs also requires more memory per GPU than sharding over 16 GPUs. As a result, the increased per-GPU memory requirement might not fit into the smaller cluster.

    For more information, see Optimizer State Sharding.
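
The following is a minimal sketch of enabling optimizer state sharding through the distribution argument of a SageMaker estimator. The parameter values are illustrative, not prescriptive.

```python
# Model parallel options with optimizer state sharding turned on.
smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,                    # data parallelism provides the ranks to shard over
        "shard_optimizer_state": True,  # shard optimizer state across data parallel ranks
    },
}

# Pass this dictionary to the estimator's `distribution` argument.
distribution = {
    "smdistributed": {"modelparallel": smp_options},
    "mpi": {"enabled": True, "processes_per_host": 8},  # 8 GPUs per ml.p4d.24xlarge node
}
```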

Activation checkpointing
  • Memory efficiency can be improved by using activation checkpointing for a group of modules. The more modules you group together per checkpoint, the more efficient the memory usage. When you checkpoint sequential modules, the strategy argument of the smp.set_activation_checkpointing function groups the layers together for checkpointing. For example, grouping two or more layers together for checkpointing is more memory efficient than checkpointing one layer at a time, and this trades extra computation time for reduced memory usage.

    For more information, see Activation Checkpointing.
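
The following is a minimal sketch of grouping sequential layers for activation checkpointing in the training script. The variable transformer_layers is a hypothetical stand-in for a sequential container of transformer layers in your model, and the grouped strategy string is shown for illustration; check the library documentation for the exact strategy values your version supports.

```python
import smdistributed.modelparallel.torch as smp

# `transformer_layers` is a hypothetical reference to a sequential container
# (for example, a torch.nn.Sequential of transformer blocks) in your model.
smp.set_activation_checkpointing(
    transformer_layers,
    strategy="group_2",  # checkpoint layers in groups of two; grouping more layers
                         # trades extra recomputation for lower activation memory
)
```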

Tensor parallelism
  • The degree of tensor parallelism should be a power of two (2, 4, 8, ..., 2^n), where the maximum degree must be equal to the number of GPUs per node. For example, if you use a node with 8 GPUs, possible numbers for the degree of tensor parallelism are 2, 4, and 8. We don’t recommend arbitrary numbers (such as 3, 5, 6, and 7) for the degree of tensor parallelism. When you use multiple nodes, misconfiguring the degree of tensor parallelism might result in running tensor parallelism across the nodes; this adds significant overhead from communication of activations across the nodes and can become computationally expensive.

    For more information, see Tensor Parallelism.
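
The following is a minimal sketch of setting the tensor parallel degree for nodes with 8 GPUs (for example, ml.p4d.24xlarge). The values are illustrative.

```python
gpus_per_node = 8  # for example, ml.p4d.24xlarge

smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "tensor_parallel_degree": 4,  # a power of two, at most gpus_per_node,
                                      # so tensor parallelism stays within a node
    },
}

distribution = {
    "smdistributed": {"modelparallel": smp_options},
    "mpi": {"enabled": True, "processes_per_host": gpus_per_node},
}
```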

Pipeline parallelism across nodes
  • You can run pipeline parallelism both within a single node and across multiple nodes. When you use pipeline parallelism in combination with tensor parallelism, we recommend running pipeline parallelism across multiple nodes and keeping tensor parallelism within individual nodes.

  • Pipeline parallelism comes with the following three knobs: microbatches, active_microbatches, and prescaled_batch.

    • When you use tensor parallelism with pipeline parallelism, we recommend activating prescaled_batch so that the batch size per model parallel group can be increased for efficient pipelining. With prescaled_batch activated, the batch size set in the training script is tp_size times the per-rank batch size that would be used without prescaled_batch.

    • Increasing the number of microbatches helps achieve efficient pipelining and better performance. Note that the effective microbatch size is the batch size divided by the number of microbatches; for example, with a batch size of 64 and 4 microbatches, each microbatch processes 16 samples. If you increase the number of microbatches while keeping the batch size constant, each microbatch processes fewer samples.

    • The number of active_microbatches is the maximum number of microbatches that are simultaneously in process during pipelining. For each active microbatch in process, its activations and gradients take up GPU memory. Therefore, increasing active_microbatches takes up more GPU memory.

  • If both GPU compute and GPU memory are underutilized, increase active_microbatches for better parallelization during pipelining.

  • For more information about how to use tensor parallelism with pipeline parallelism, see Tensor parallelism combined with pipeline parallelism.

  • To find descriptions of the aforementioned parameters, see Parameters for smdistributed in the SageMaker Python SDK documentation. A configuration sketch that combines these parameters follows this list.
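
The following is a minimal sketch of a configuration that runs pipeline parallelism across nodes and keeps tensor parallelism within 8-GPU nodes. The values are illustrative; pipeline_parallel_degree is used here as the pipeline degree parameter (the partitions parameter in earlier versions of the library).

```python
smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "tensor_parallel_degree": 8,    # keep tensor parallelism inside an 8-GPU node
        "pipeline_parallel_degree": 2,  # pipeline stages span the nodes
        "microbatches": 4,              # more microbatches -> better pipeline utilization
        "active_microbatches": 2,       # microbatches in flight; each holds activations
                                        # and gradients in GPU memory
        "prescaled_batch": True,        # the script-level batch size covers the whole
                                        # tensor parallel group
    },
}

distribution = {
    "smdistributed": {"modelparallel": smp_options},
    "mpi": {"enabled": True, "processes_per_host": 8},
}
```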

Offloading activations to CPU
  • Make sure that activation offloading is used in combination with activation checkpointing and pipeline parallelism. To ensure that the offloading and preloading happen in the background, specify a value greater than 1 for the microbatches parameter.

  • When offloading activations, you might be able to increase active_microbatches and sometimes match with the total number of microbatches. This depends on which modules are checkpointed and how the model is partitioned.

    For more information, see Activation Offloading.
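
The following is a minimal sketch of turning on activation offloading alongside pipeline parallelism. It assumes activation checkpointing is configured in the training script, and the values are illustrative.

```python
smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "pipeline_parallel_degree": 2,
        "tensor_parallel_degree": 8,
        "microbatches": 4,            # must be greater than 1 so offloading and
                                      # preloading can happen in the background
        "offload_activations": True,  # offload checkpointed activations to CPU memory
    },
}
```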

Reference configurations

The SageMaker model parallelism training team provides the following reference points based on experiments with the GPT-2 model, a sequence length of 512, and a vocabulary size of 50,000.

| Number of model parameters | Number of instances and instance type | Pipeline parallelism | Tensor parallelism | Optimizer state sharding | Activation checkpointing | Prescaled batch | Batch size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10 billion | 16 ml.p4d.24xlarge | 1 | 4 | True | Each transformer layer | True | batch_size=40 |
| 30 billion | 16 ml.p4d.24xlarge | 1 | 8 | True | Each transformer layer | True | batch_size=32 |
| 60 billion | 32 ml.p4d.24xlarge | 2 | 8 | True | Each transformer layer | True | batch_size=56, microbatches=4, active_microbatches=2 |

You can extrapolate from the preceding configurations to estimate GPU memory usage for your model configuration. For example, if you increase the sequence length for a 10-billion-parameter model or increase the size of the model to 20 billion, you might want to lower batch size first. If the model still doesn’t fit, try increasing the degree of tensor parallelism.
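
As a worked example, the following sketch maps the 10-billion-parameter reference row above to an estimator configuration. The training script name, IAM role, and framework versions are placeholders; the batch size of 40 and per-transformer-layer activation checkpointing are assumed to be set inside the training script.

```python
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "pipeline_parallel_degree": 1,
        "tensor_parallel_degree": 4,
        "shard_optimizer_state": True,
        "prescaled_batch": True,
    },
}

estimator = PyTorch(
    entry_point="train_gpt2.py",      # hypothetical script; sets batch_size=40 and
                                      # activation checkpointing per transformer layer
    role="<your-iam-role>",
    instance_type="ml.p4d.24xlarge",
    instance_count=16,
    framework_version="1.12",         # example version; use one supported by the library
    py_version="py38",
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)
```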

Modifying Your Training Script

  • Before you use the SageMaker model parallel library’s features in your training script, review The SageMaker Distributed Model Parallelism Library Configuration Tips and Pitfalls.

  • To launch a training job faster, use SageMaker local mode. This helps you quickly run a training job locally on a SageMaker notebook instance. Depending on the size of the ML instance that hosts your SageMaker notebook instance, you might need to adjust the size of your model by changing the model configuration, such as the hidden width, number of transformer layers, and number of attention heads. Validate that the reduced model runs well on the notebook instance before using a large cluster to train the full model.
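
The following is a minimal sketch of running in local mode on a notebook instance before moving to a large cluster. The script name, IAM role, and framework versions are placeholders, and the script is assumed to use a reduced model configuration.

```python
from sagemaker.pytorch import PyTorch

local_estimator = PyTorch(
    entry_point="train.py",     # placeholder script with a reduced model configuration
    role="<your-iam-role>",
    instance_type="local_gpu",  # run locally on the notebook instance's GPU(s)
    instance_count=1,
    framework_version="1.12",
    py_version="py38",
)
local_estimator.fit()           # quick validation run before scaling up
```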

Monitoring and Logging a Training Job Using the SageMaker Console and Amazon CloudWatch

To monitor system-level metrics such as CPU memory utilization, GPU memory utilization, and GPU utilization, use the visualizations provided through the SageMaker console.

  1. In the left navigation pane, choose Training.

  2. Choose Training jobs.

  3. In the main pane, choose the training job name for which you want to see more details.

  4. Browse the main pane and find the Monitor section to see the automated visualization.

  5. To see training job logs, choose View logs in the Monitor section. You can access the distributed training job logs of the training job in CloudWatch. If you launched multi-node distributed training, you should see multiple log streams with tags in the format of algo-n-1234567890. The algo-1 log stream tracks training logs from the main (0th) node.

For more information, see Monitor and Analyze Training Jobs Using Amazon CloudWatch Metrics.
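
For programmatic access, the following is a minimal sketch of listing the per-node log streams of a training job with boto3; the training job name is a placeholder. SageMaker training logs are written to the /aws/sagemaker/TrainingJobs log group.

```python
import boto3

logs = boto3.client("logs")

# List the log streams for a distributed training job; each node writes its own
# stream prefixed with the training job name and an algo-<n> tag.
response = logs.describe_log_streams(
    logGroupName="/aws/sagemaker/TrainingJobs",
    logStreamNamePrefix="my-training-job",  # placeholder training job name
)
for stream in response["logStreams"]:
    print(stream["logStreamName"])
```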

Permissions

To run a SageMaker training job with model parallelism or the SageMaker distributed training example notebooks, make sure you have the right permissions in your IAM role, such as the following: