Tensor Parallelism

Tensor parallelism is a type of model parallelism in which specific model weights, gradients, and optimizer states are split across devices. In contrast to pipeline parallelism, which keeps individual weights intact but partitions the set of weights, tensor parallelism splits individual weights. This typically involves distributed computation of specific operations, modules, or layers of the model.
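To make the idea concrete, the following is a minimal conceptual sketch in plain PyTorch (not the SageMaker model parallelism API): a linear layer's weight matrix is split row-wise into two shards, each shard produces a partial output, and the partial outputs are combined to reproduce the full layer's result. Sizes and the two-way split are illustrative assumptions.

import torch

in_features, out_features = 1024, 4096
x = torch.randn(8, in_features)

# The full weight is [out_features, in_features]; shard it two ways,
# as if each shard lived on a different device.
full_weight = torch.randn(out_features, in_features)
shards = full_weight.chunk(2, dim=0)          # each shard: [out_features // 2, in_features]

# Each "device" computes a partial output from its own shard...
partial_outputs = [x @ shard.t() for shard in shards]

# ...and the partial outputs are concatenated (in practice via a collective
# such as all-gather) to form the full layer output.
y = torch.cat(partial_outputs, dim=-1)        # [8, out_features]
assert torch.allclose(y, x @ full_weight.t(), atol=1e-5)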

Tensor parallelism is required in cases in which a single parameter consumes most of the GPU memory (such as a large embedding table with a large vocabulary size or a large softmax layer with a large number of classes). In such cases, treating the large tensor or operation as an atomic unit is inefficient and impedes balancing of the memory load.
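For a sense of scale, here is a rough memory estimate for a hypothetical embedding table; the vocabulary size, hidden dimension, and optimizer assumptions are illustrative, not taken from any specific model.

# Back-of-the-envelope memory footprint of one large embedding table.
vocab_size = 500_000        # assumed vocabulary size
hidden_dim = 12_288         # assumed hidden dimension
bytes_per_param = 4         # fp32

embedding_bytes = vocab_size * hidden_dim * bytes_per_param
print(f"Embedding weight alone: {embedding_bytes / 1e9:.1f} GB")        # ~24.6 GB

# With fp32 gradients plus two Adam optimizer states, the footprint
# roughly quadruples, far exceeding a single GPU's memory.
print(f"Weight + gradient + optimizer states: {4 * embedding_bytes / 1e9:.1f} GB")  # ~98 GB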

Tensor parallelism is also useful for extremely large models for which pure pipelining alone is not enough. For example, with GPT-3-scale models that require partitioning over tens of instances, pure microbatch pipelining is inefficient because the pipeline depth becomes too high and the overhead becomes prohibitively large.

Note

Tensor parallelism is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.
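The sketch below shows one way tensor parallelism might be enabled through a SageMaker PyTorch estimator's distribution configuration. The parameter names and values (for example, "tensor_parallel_degree", the instance type, and the framework and Python versions) are assumptions here; verify them against the model parallelism library's configuration reference for your library version.

# Sketch only: enabling tensor parallelism via the SageMaker Python SDK,
# under assumed configuration parameter names and versions.
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "pipeline_parallel_degree": 1,   # no pipeline partitioning in this sketch
        "tensor_parallel_degree": 4,     # split individual weights across 4 devices
        "ddp": True,
    },
}

estimator = PyTorch(
    entry_point="train.py",              # assumed training script name
    role="<your-iam-role>",
    instance_type="ml.p4d.24xlarge",     # assumed GPU instance type
    instance_count=1,
    framework_version="1.8.1",           # assumed; must support SMP v1.6.0+
    py_version="py36",
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)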