Ranking Mechanism when Using a Combination of Pipeline Parallelism and Tensor Parallelism
This section explains how the ranking mechanism of model parallelism works with
tensor parallelism. This is extended from the Ranking Basics. With tensor
parallelism, the library introduces three rank APIs: smp.tp_rank() for tensor
parallel rank, smp.pp_rank() for pipeline parallel rank, and smp.rdp_rank() for
reduced-data parallel rank. The corresponding communication process groups are
the tensor parallel group (TP_GROUP), the pipeline parallel group (PP_GROUP),
and the reduced-data parallel group (RDP_GROUP). These groups are defined as
follows:
- A tensor parallel group (TP_GROUP) is an evenly divisible subset of the data
  parallel group, over which tensor parallel distribution of modules takes
  place. When the degree of pipeline parallelism is 1, TP_GROUP is the same as
  the model parallel group (MP_GROUP).
- A pipeline parallel group (PP_GROUP) is the group of processes over which
  pipeline parallelism takes place. When the degree of tensor parallelism is 1,
  PP_GROUP is the same as MP_GROUP.
- A reduced-data parallel group (RDP_GROUP) is a set of processes that hold
  both the same pipeline parallelism partitions and the same tensor parallel
  partitions, and that perform data parallelism among themselves. It is called
  the reduced-data parallel group because it is a subset of the entire data
  parallelism group, DP_GROUP. For the model parameters that are distributed
  within the TP_GROUP, the gradient allreduce operation is performed only over
  the reduced-data parallel group, while for the parameters that are not
  distributed, the gradient allreduce takes place over the entire DP_GROUP.
- A model parallel group (MP_GROUP) refers to a group of processes that
  collectively store the entire model. It consists of the union of the
  PP_GROUPs of all the ranks that are in the TP_GROUP of the current process.
  When the degree of tensor parallelism is 1, MP_GROUP is equivalent to
  PP_GROUP. It is also consistent with the existing definition of MP_GROUP from
  previous smdistributed releases. Note that the current TP_GROUP is a subset
  of both the current DP_GROUP and the current MP_GROUP.
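A minimal sketch of querying these ranks from a training script follows. It
assumes the library is imported as smdistributed.modelparallel.torch and that
the job was launched with a model parallel configuration; the three new rank
calls are the ones listed above, while smp.init(), smp.rank(), smp.dp_rank(),
and smp.mp_rank() are the initialization call and the global, data parallel,
and model parallel rank calls from the same library.

    # Minimal sketch: print where each process sits in the communication groups.
    import smdistributed.modelparallel.torch as smp

    # Initialize the library; the tensor parallel and pipeline parallel degrees
    # come from the model parallel configuration passed when the job is launched.
    smp.init()

    # Each process prints its position in every communication group.
    print(
        f"global rank {smp.rank()}: "
        f"pp_rank={smp.pp_rank()}, tp_rank={smp.tp_rank()}, "
        f"rdp_rank={smp.rdp_rank()}, dp_rank={smp.dp_rank()}, "
        f"mp_rank={smp.mp_rank()}"
    )

With the 8-GPU configuration described later in this section (tensor
parallelism degree 2, pipeline parallelism degree 2), this prints one line per
GPU matching the example rank assignments below.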
To learn more about the communication process APIs in the SageMaker model
parallelism library, see the Common API documentation.
For example, consider the process groups for a single node with 8 GPUs, where
the degree of tensor parallelism is 2, the degree of pipeline parallelism is 2,
and the degree of data parallelism is 4. The upper center part of the preceding
figure shows an example of a model with 4 layers. The lower left and lower
right parts of the figure illustrate the 4-layer model distributed across 4
GPUs using both pipeline parallelism and tensor parallelism, where tensor
parallelism is used for the middle two layers. These two lower figures are
simple copies, drawn to illustrate different group boundary lines. The
partitioned model is replicated for data parallelism across GPUs 0-3 and 4-7.
The lower left figure shows the definitions of MP_GROUP, PP_GROUP, and
TP_GROUP. The lower right figure shows RDP_GROUP, DP_GROUP, and WORLD over the
same set of GPUs. The gradients for the layers and layer slices that have the
same color are allreduced together for data parallelism. For example, the first
layer (light blue) gets the allreduce operations across DP_GROUP, whereas the
dark orange slice in the second layer only gets the allreduce operations within
the RDP_GROUP of its process. The bold dark red arrows represent tensors with
the batch of the entire TP_GROUP.
GPU0: pp_rank 0, tp_rank 0, rdp_rank 0, dp_rank 0, mp_rank 0
GPU1: pp_rank 1, tp_rank 0, rdp_rank 0, dp_rank 0, mp_rank 1
GPU2: pp_rank 0, tp_rank 1, rdp_rank 0, dp_rank 1, mp_rank 2
GPU3: pp_rank 1, tp_rank 1, rdp_rank 0, dp_rank 1, mp_rank 3
GPU4: pp_rank 0, tp_rank 0, rdp_rank 1, dp_rank 2, mp_rank 0
GPU5: pp_rank 1, tp_rank 0, rdp_rank 1, dp_rank 2, mp_rank 1
GPU6: pp_rank 0, tp_rank 1, rdp_rank 1, dp_rank 3, mp_rank 2
GPU7: pp_rank 1, tp_rank 1, rdp_rank 1, dp_rank 3, mp_rank 3
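The rank assignments above can be reproduced with a few lines of arithmetic.
The following sketch is not the library's implementation; it is a plain-Python
illustration of one rank ordering that is consistent with this example
(pipeline rank varying fastest, then tensor rank, then reduced-data rank).

    # Illustrative sketch: derive the rank table above from the degrees
    # used in this example (not the library's internal implementation).
    TP_DEGREE = 2   # tensor parallelism degree
    PP_DEGREE = 2   # pipeline parallelism degree
    WORLD_SIZE = 8  # single node with 8 GPUs

    for rank in range(WORLD_SIZE):
        pp_rank = rank % PP_DEGREE                  # varies fastest
        tp_rank = (rank // PP_DEGREE) % TP_DEGREE   # varies next
        rdp_rank = rank // (PP_DEGREE * TP_DEGREE)  # varies slowest
        dp_rank = rdp_rank * TP_DEGREE + tp_rank    # DP_GROUP spans TP x RDP
        mp_rank = tp_rank * PP_DEGREE + pp_rank     # MP_GROUP spans TP x PP
        print(f"GPU{rank}: pp_rank {pp_rank}, tp_rank {tp_rank}, "
              f"rdp_rank {rdp_rank}, dp_rank {dp_rank}, mp_rank {mp_rank}")

Running this prints exactly the eight lines listed above, which makes the
relationship between the group sizes and the rank indices easy to check.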
In this example, pipeline parallelism occurs across the GPU pairs (0,1), (2,3),
(4,5), and (6,7). In addition, data parallelism (allreduce) takes place across
GPUs 0, 2, 4, 6, and independently over GPUs 1, 3, 5, 7. Tensor parallelism
happens over subsets of DP_GROUPs, across the GPU pairs (0,2), (1,3), (4,6),
and (5,7).
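Continuing the rank-arithmetic sketch above, the group memberships listed here
can be enumerated by collecting the global ranks that agree on the coordinates
each group keeps fixed. The helper below is purely illustrative; the group
names mirror the definitions earlier in this section.

    # Purely illustrative: group the 8 global ranks of the example by the
    # coordinates that each process group keeps fixed.
    from collections import defaultdict

    TP_DEGREE, PP_DEGREE, WORLD_SIZE = 2, 2, 8

    def coords(rank):
        pp = rank % PP_DEGREE
        tp = (rank // PP_DEGREE) % TP_DEGREE
        rdp = rank // (PP_DEGREE * TP_DEGREE)
        return {"pp": pp, "tp": tp, "rdp": rdp}

    def build_groups(fixed):
        """Collect ranks that share the same values for the coordinates in `fixed`."""
        groups = defaultdict(list)
        for rank in range(WORLD_SIZE):
            c = coords(rank)
            groups[tuple(c[name] for name in fixed)].append(rank)
        return sorted(groups.values())

    print("PP_GROUPs: ", build_groups(["tp", "rdp"]))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
    print("TP_GROUPs: ", build_groups(["pp", "rdp"]))  # [[0, 2], [1, 3], [4, 6], [5, 7]]
    print("DP_GROUPs: ", build_groups(["pp"]))         # [[0, 2, 4, 6], [1, 3, 5, 7]]
    print("RDP_GROUPs:", build_groups(["pp", "tp"]))   # [[0, 4], [1, 5], [2, 6], [3, 7]]
    print("MP_GROUPs: ", build_groups(["rdp"]))        # [[0, 1, 2, 3], [4, 5, 6, 7]]

The printed memberships match the pipeline, tensor, and data parallel groupings
described in this example.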