You can launch distributed training by adding the distribution argument to the SageMaker AI framework estimators, PyTorch or TensorFlow.
The following launcher options are available for launching PyTorch distributed training.
- pytorchddp – This option runs mpirun and sets up the environment variables needed for running PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the distribution parameter.

  { "pytorchddp": { "enabled": True } }

- torch_distributed – This option runs torchrun and sets up the environment variables needed for running PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the distribution parameter.

  { "torch_distributed": { "enabled": True } }

- smdistributed – This option also runs mpirun, but with smddprun, which sets up the environment variables needed for running PyTorch distributed training on SageMaker AI.

  { "smdistributed": { "dataparallel": { "enabled": True } } }
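Whichever launcher you choose, the training script (the entry_point shown later in this topic) must initialize a PyTorch process group before the SMDDP collectives can run. The following is a minimal sketch of that initialization, assuming the smdistributed.dataparallel.torch.torch_smddp module in the DLC registers the smddp backend; adapt it to your own training script.

# Minimal sketch of process group initialization inside the training script.
# Assumption: importing torch_smddp registers the "smddp" backend in the DLC.
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp

# The launcher (mpirun or torchrun) sets the rank and world-size environment
# variables, so no explicit init_method or rank arguments are needed here.
dist.init_process_group(backend="smddp")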
If you chose to replace NCCL AllGather with SMDDP AllGather, you can use any of the three options. Choose the one that fits your use case.

If you chose to replace NCCL AllReduce with SMDDP AllReduce, you should choose one of the mpirun-based options: smdistributed or pytorchddp. You can also add additional MPI options as follows.
{
    "pytorchddp": {
        "enabled": True,
        "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
    }
}

{
    "smdistributed": {
        "dataparallel": {
            "enabled": True,
            "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
        }
    }
}
The following code sample shows the basic structure of a PyTorch estimator with distributed training options.
from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    base_job_name="training_job_name_prefix",
    source_dir="subdirectory-to-your-code",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    py_version="py310",
    framework_version="2.0.1",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2, 3, 4, ..., 8
    instance_count=2,

    # Instance types supported by the SageMaker AI data parallel library:
    # ml.p4d.24xlarge, ml.p4de.24xlarge
    instance_type="ml.p4d.24xlarge",

    # Activate distributed training with SMDDP
    distribution={ "pytorchddp": { "enabled": True } }  # mpirun, activates SMDDP AllReduce OR AllGather
    # distribution={ "torch_distributed": { "enabled": True } }  # torchrun, activates SMDDP AllGather
    # distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }  # mpirun, activates SMDDP AllReduce OR AllGather
)

pt_estimator.fit("s3://bucket/path/to/training/data")
Note
PyTorch Lightning and its utility libraries, such as Lightning Bolts, are not preinstalled in the SageMaker AI PyTorch DLCs. Create the following requirements.txt file and save it in the source directory where you keep your training script.
# requirements.txt
pytorch-lightning
lightning-bolts
For example, the tree-structured directory should look like the following.
├── pytorch_training_launcher_jupyter_notebook.ipynb
└── sub-folder-for-your-code
    ├── adapted-training-script.py
    └── requirements.txt
For more information about specifying the source directory in which to place the requirements.txt file along with your training script, and about submitting a job, see Using third-party libraries.
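As a minimal sketch of how this fits together, assuming the example directory layout above, point source_dir at the subfolder that contains both files; SageMaker AI installs the packages listed in requirements.txt before it runs the training script.

# Sketch: the estimator points source_dir at the folder holding both the
# training script and requirements.txt (names taken from the example tree above).
from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    source_dir="sub-folder-for-your-code",      # requirements.txt lives here
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    py_version="py310",
    framework_version="2.0.1",
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    distribution={ "torch_distributed": { "enabled": True } },
)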
Considerations for activating SMDDP collective operations and using the right distributed training launcher options
- SMDDP AllReduce and SMDDP AllGather are not mutually compatible at present.

- SMDDP AllReduce is activated by default when using smdistributed or pytorchddp, which are mpirun-based launchers, and NCCL AllGather is used.

- SMDDP AllGather is activated by default when using the torch_distributed launcher, and AllReduce falls back to NCCL.

- SMDDP AllGather can also be activated when using the mpirun-based launchers with an additional environment variable set as follows.

  export SMDATAPARALLEL_OPTIMIZE_SDP=true
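For example, with an mpirun-based launcher you could export this variable through custom_mpi_options, using the same -x mechanism shown earlier for NCCL_DEBUG. This is a sketch under that assumption, not the only way to set the variable.

# Sketch: activate SMDDP AllGather with an mpirun-based launcher by exporting
# SMDATAPARALLEL_OPTIMIZE_SDP through custom_mpi_options (assumes the -x flag
# propagates the variable to the training processes, as with NCCL_DEBUG above).
distribution = {
    "smdistributed": {
        "dataparallel": {
            "enabled": True,
            "custom_mpi_options": "-x SMDATAPARALLEL_OPTIMIZE_SDP=true"
        }
    }
}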