SageMaker distributed model parallel Configuration Tips and Pitfalls

Review the following tips and pitfalls before using Amazon SageMaker's distributed model parallel library. These tips apply across frameworks. For TensorFlow- and PyTorch-specific tips, see Modify a TensorFlow Training Script and Modify a PyTorch Training Script, respectively.

Batch Size and Number of Microbatches

  • The library is most efficient with large batch sizes. If the model fits on a single device but can only be trained with a small batch size, you can and should increase the batch size after integrating the library. Model parallelism saves memory for large models, enabling you to train with batch sizes that previously did not fit in memory.

  • Choosing a number of microbatches that is too small or too large can hurt performance. The library executes each microbatch sequentially on each device, so the microbatch size (batch size divided by the number of microbatches) must be large enough to fully utilize each GPU. At the same time, pipeline efficiency increases with the number of microbatches, so striking the right balance is important. Typically, a good starting point is to try 2 or 4 microbatches, increase the batch size to the memory limit, and then experiment with larger batch sizes and numbers of microbatches. As you increase the number of microbatches, larger batch sizes might become feasible if an interleaved pipeline is used.

  • Your batch size must always be divisible by the number of microbatches. Depending on the size of the dataset, the last batch of each epoch can be smaller than the rest, and this smaller batch must also be divisible by the number of microbatches. If it is not, you can set drop_remainder=True in the tf.data.Dataset.batch() call (in TensorFlow), or set drop_last=True in DataLoader (in PyTorch), so that this last, small batch is not used. If you are using a different API for the data pipeline, you might need to manually skip the last batch whenever it is not divisible by the number of microbatches.
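
For example, here is a minimal PyTorch sketch (the dataset, batch size, and microbatch count below are placeholders) that drops the trailing partial batch so that every batch the library receives is divisible by the number of microbatches:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    num_microbatches = 4    # every batch must be divisible by this value
    batch_size = 64         # 64 / 4 = 16 samples per microbatch

    # Toy dataset of 998 samples: the trailing batch would contain
    # 998 % 64 = 38 samples, which is not divisible by 4 microbatches,
    # so drop_last=True discards it.
    dataset = TensorDataset(torch.randn(998, 32),
                            torch.randint(0, 10, (998,)))

    loader = DataLoader(dataset,
                        batch_size=batch_size,
                        shuffle=True,
                        drop_last=True)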

Manual Partitioning

  • If you use manual partitioning, be mindful of parameters that are consumed by multiple operations and modules in your model, such as the embedding table in transformer architectures. Modules that share the same parameter must be placed on the same device for correctness. When auto-partitioning is used, the library automatically enforces this constraint.
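
As a minimal PyTorch sketch (module names, sizes, and partition indices are illustrative, and it assumes auto-partitioning is disabled in the library configuration), the model below ties the embedding table to the output projection, so both modules are created inside the same smp.partition context:

    import torch.nn as nn
    import smdistributed.modelparallel.torch as smp

    smp.init()

    class TiedLanguageModel(nn.Module):
        def __init__(self, vocab_size=50000, hidden=1024):
            super().__init__()
            # The embedding and the output head share one weight tensor,
            # so both must be created on the same partition.
            with smp.partition(0):
                self.embedding = nn.Embedding(vocab_size, hidden)
                self.lm_head = nn.Linear(hidden, vocab_size, bias=False)
                self.lm_head.weight = self.embedding.weight
            # The rest of the model can be placed on another partition.
            with smp.partition(1):
                self.encoder = nn.TransformerEncoder(
                    nn.TransformerEncoderLayer(d_model=hidden, nhead=16,
                                               batch_first=True),
                    num_layers=4)

        def forward(self, input_ids):
            hidden_states = self.encoder(self.embedding(input_ids))
            return self.lm_head(hidden_states)

    model = smp.DistributedModel(TiedLanguageModel())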

Data Preparation

  • If the model takes multiple inputs, make sure you seed the random operations in your data pipeline (for example, shuffling) with smp.dp_rank(). If the dataset is being deterministically sharded across data parallel devices, make sure that the shard is indexed by smp.dp_rank(). This ensures that all ranks holding the partitions of a single model replica see the data in the same order.
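
For example, here is a minimal PyTorch sketch (the dataset is a placeholder) that shards the data by data parallel rank and seeds shuffling with smp.dp_rank(), so that every rank holding a partition of the same model replica produces identical batches:

    import torch
    from torch.utils.data import DataLoader, Subset, TensorDataset
    import smdistributed.modelparallel.torch as smp

    smp.init()

    # Placeholder dataset; replace with your own.
    dataset = TensorDataset(torch.randn(10000, 32),
                            torch.randint(0, 10, (10000,)))

    # Shard deterministically by data parallel rank. Ranks that hold
    # partitions of the same model replica share the same smp.dp_rank(),
    # so they receive the same shard.
    shard_indices = list(range(smp.dp_rank(), len(dataset), smp.dp_size()))
    shard = Subset(dataset, shard_indices)

    # Seed shuffling with smp.dp_rank() so the sample order is also
    # identical across those ranks.
    generator = torch.Generator()
    generator.manual_seed(smp.dp_rank())

    loader = DataLoader(shard, batch_size=64, shuffle=True,
                        generator=generator, drop_last=True)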

Returning Tensors from smp.DistributedModel

  • Any tensor that is returned from the smp.DistributedModel.call (for TensorFlow) or smp.DistributedModel.forward (for PyTorch) method is broadcast to all other ranks from the rank that computed it. As a result, any tensor that is not needed outside the call and forward methods (intermediate activations, for example) should not be returned, because returning it causes needless communication and memory overhead and hurts performance.
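
For example, here is a minimal PyTorch sketch (the model internals are illustrative) that computes the loss inside forward and returns only the loss, rather than also returning intermediate activations:

    import torch.nn as nn
    import torch.nn.functional as F
    import smdistributed.modelparallel.torch as smp

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.backbone = nn.Linear(32, 128)
            self.head = nn.Linear(128, 10)

        def forward(self, x, labels):
            hidden = F.relu(self.backbone(x))   # intermediate activation
            logits = self.head(hidden)
            loss = F.cross_entropy(logits, labels)
            # Return only what is consumed outside forward. Returning
            # `hidden` or `logits` as well would broadcast them from the
            # rank that computed them to all other ranks, adding needless
            # communication and memory overhead.
            return loss

    smp.init()
    model = smp.DistributedModel(Net())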

The @smp.step Decorator

  • If an smp.step-decorated function has a tensor argument that does not have a batch dimension, the argument name must be provided in the non_split_inputs list when calling smp.step. This prevents the library from attempting to split the tensor into microbatches. For more information, see smp.step in the API documentation.
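
For example, here is a minimal PyTorch sketch (the argument names are illustrative) that passes an attention mask with no batch dimension through non_split_inputs, so that only the batched arguments are split into microbatches:

    import smdistributed.modelparallel.torch as smp

    # `causal_mask` has shape (seq_len, seq_len) with no batch dimension,
    # so it is listed in non_split_inputs. `input_ids` and `labels` have a
    # batch dimension and are split into microbatches as usual.
    @smp.step(non_split_inputs=["causal_mask"])
    def train_step(model, input_ids, causal_mask, labels):
        loss = model(input_ids, causal_mask, labels)
        model.backward(loss)
        return loss

    # The decorated function returns per-microbatch outputs; for example,
    # the average loss is train_step(model, ids, mask, labels).reduce_mean().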