Run a SageMaker Distributed Data Parallel Training Job

Learn how to use the SageMaker Python SDK and your adapted training script to run a distributed data parallel training job with SageMaker's distributed data parallel library. For API specifications, refer to distributed data parallel in the SageMaker Python SDK documentation.

Use SageMaker's Distributed Data Parallel Library

To use SageMaker's distributed data parallel library, you start with a training script. Just as with any training job on SageMaker, you use that training script to launch a SageMaker training job. The following sections describe how to set up your training environment and launch the training job with the library enabled.

You can also use the Amazon SageMaker Distributed Training Notebook Examples provided with this documentation to get started.

Use the Data Parallel Library with SageMaker's Python SDK

To use SageMaker's distributed data parallel library, you must create a training script for one of the supported frameworks and launch the training job using the SageMaker Python SDK. To learn how to incorporate the library into a training script, see Modify Your Training Script Using SageMaker's Distributed Data Parallel Library. For the library API documentation, see SageMaker's data parallel API documentation in the SageMaker Python SDK.
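As a rough orientation, a PyTorch training script adapted for the library (version 1.x of smdistributed.dataparallel) initializes the library's process group and wraps the model with its DistributedDataParallel class. The following is only a minimal sketch; the torch.nn.Linear model is a stand-in for your own module, and the script modification guide referenced above is the authoritative reference.

import torch
import smdistributed.dataparallel.torch.distributed as dist
from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP

# Initialize the SageMaker data parallel process group (one process per GPU).
dist.init_process_group()

# Pin each process to its local GPU.
local_rank = dist.get_local_rank()
torch.cuda.set_device(local_rank)

# Build the model as usual (a stand-in module here), then wrap it with the
# library's DistributedDataParallel so that gradients are averaged across all GPUs.
model = torch.nn.Linear(10, 10).to(local_rank)
model = DDP(model)

# The rest of the training loop (data loading, optimizer, loss) is standard PyTorch.
# Shard the dataset by dist.get_rank() / dist.get_world_size() so that each process
# reads a different slice of the data.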

SageMaker supports the following training environment configurations:

  • You can use a prebuilt TensorFlow or PyTorch container.

  • You can customize SageMaker prebuilt containers or extend them to handle any additional functional requirements for your algorithm or model that the prebuilt SageMaker Docker image doesn't support. For an example of how you can extend a pre-built container, see Extend a Prebuilt Container.

    To extend a pre-built container, or adapt your own container to use the library, you must use one of the PyTorch or TensorFlow GPU general framework base images. The distributed data parallel library is included in all CUDA 11 (cu11x) TensorFlow 2.3.x and PyTorch 1.6.x versioned images and later. See Available Deep Learning Containers Images for a list of available images.

    Important

    It is recommended that you use the image that contains the latest version of TensorFlow or PyTorch to access the most up-to-date version of the SageMaker distributed data parallel library. All images with TensorFlow 2.4.1 or PyTorch 1.8.1 and later versions support EFA instance types (ml.p3dn.24xlarge, ml.p4d.24xlarge).

    For example, if you are using PyTorch 1.8.1, your Dockerfile should contain a FROM statement similar to the following:

    # SageMaker PyTorch image
    FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04

    ENV PATH="/opt/ml/code:${PATH}"

    # This environment variable is used by the SageMaker PyTorch container
    # to determine the user code directory.
    ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

    # /opt/ml and all subdirectories are used by SageMaker; store your user code
    # in the /code subdirectory.
    COPY cifar10.py /opt/ml/code/cifar10.py

    # Defines cifar10.py as the script entry point.
    ENV SAGEMAKER_PROGRAM cifar10.py
  • You can adapt your own Docker container to work with SageMaker using the SageMaker Training toolkit. For an example, see Adapting Your Own Training Container.

In all cases, you launch your job using a SageMaker Python SDK Estimator.

If you are using the SageMaker pre-built PyTorch or TensorFlow containers, see the following sections to learn how to configure an Estimator to use the library.

If you bring your own container or extend a pre-built container, you can create an instance of the Estimator class with the desired Docker image.
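For example, the following minimal sketch shows how you might pass your own image to the PyTorch Estimator through the image_uri parameter while keeping the smdistributed.dataparallel distribution setting. The ECR image URI, entry point script, and role name are placeholders; replace them with values from your own account and Region.

from sagemaker.pytorch import PyTorch

# Placeholder URI for a container extended from a SageMaker PyTorch GPU base image.
custom_image_uri = "111122223333.dkr.ecr.us-east-1.amazonaws.com/my-smdataparallel-training:latest"

pt_estimator = PyTorch(
    base_job_name="byoc_training_job_name_prefix",
    image_uri=custom_image_uri,        # your extended or adapted container
    entry_point="train.py",            # your adapted training script (placeholder name)
    role="SageMakerRole",
    instance_count=2,
    instance_type="ml.p3dn.24xlarge",  # an EFA-supported instance is recommended
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

pt_estimator.fit("s3://bucket/path/to/training/data")

With image_uri set, you typically do not need to pass framework_version or py_version, because the framework is determined by the image.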

TensorFlow Estimator

You can use SageMaker's distributed data parallel library API by specifying smdistributed.dataparallel as the distribution strategy in the SageMaker TensorFlow Estimator API. Refer to the SageMaker TensorFlow Estimator API documentation for full details about the API.

class sagemaker.tensorflow.estimator.TensorFlow(py_version=None, framework_version=None, model_dir=None, image_uri=None, distribution=None, **kwargs)

The following two parameters of the SageMaker Python SDK TensorFlow estimator are required to use smdistributed.dataparallel with TensorFlow.

distribution (dict)(optional): A dictionary with information on how to run distributed training (default: None).

  • To use smdistributed.dataparallel as a distribution strategy, use the following option:

    distribution = { "smdistributed": { "dataparallel": { "enabled": True } } }
  • custom_mpi_options (str)(optional): Custom MPI options. The following is an example of how you can use this parameter when defining distribution. To learn more, see Custom MPI Options.

    distribution = { "smdistributed": { "dataparallel": { "enabled": True, "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION" } } }

instance_type (str)(required): The type of Amazon EC2 instance to use. (This parameter is named train_instance_type in SageMaker Python SDK 1.x.)

  • If the smdistributed.dataparallel distribution strategy is used, the allowed instance types are: ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge. For best performance, it is recommended that you use an EFA-supported instance, ml.p3dn.24xlarge or ml.p4d.24xlarge, with the latest supported version of TensorFlow.

Example

from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    base_job_name="training_job_name_prefix",
    entry_point="tf-train.py",
    role="SageMakerRole",
    framework_version="2.4.1",
    # You must set py_version to py37 for TensorFlow 2.4.1
    py_version="py37",
    # For multi-node distributed training, set this count. Example: 2, 3, 4, ..., 8
    instance_count=2,
    # For training with a p3dn instance, use ml.p3dn.24xlarge
    instance_type="ml.p3.16xlarge",
    # Train using the smdistributed.dataparallel distributed training framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

tf_estimator.fit("s3://bucket/path/to/training/data")
Note

If you are using the SageMaker Python SDK 1.x, you need to use hyperparameters to specify smdistributed.dataparallel as the distributed training strategy.

from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    ...
    # Train using the smdistributed.dataparallel distributed training framework
    hyperparameters={"sagemaker_distributed_dataparallel_enabled": True},
    ...
)

tf_estimator.fit("s3://bucket/path/to/training/data")

PyTorch Estimator

You can use SageMaker's distributed data parallel library API by specifying smdistributed.dataparallel as the distribution strategy in the SageMaker PyTorch Estimator API. Refer to the SageMaker PyTorch Estimator API specification for full details of the API.

class sagemaker.pytorch.estimator.PyTorch(py_version=None, framework_version=None, model_dir=None, image_uri=None, hyperparameters=None, distribution=None, **kwargs)

The following two parameters of the SageMaker PyTorch Estimator are necessary for using smdistributed.dataparallel on SageMaker with PyTorch.

distribution (dict)(optional): A dictionary with information on how to run distributed training (default: None).

  • To use smdistributed.dataparallel as a distribution strategy, use the following option:

    distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }
  • custom_mpi_options (str)(optional): Custom MPI options. The following is an example of how you can use this parameter when defining distribution. To learn more, see Custom MPI Options.

    distribution = { "smdistributed": { "dataparallel": { "enabled": True, "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION" } } }

instance_type (str)(required): The type of Amazon EC2 instance to use. (This parameter is named train_instance_type in SageMaker Python SDK 1.x.)

  • If the smdistributed.dataparallel distribution strategy is used, the allowed instance types are: ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge. For best performance, it is recommended that you use an EFA-supported instance, ml.p3dn.24xlarge or ml.p4d.24xlarge, with the latest supported version of PyTorch.

Example

from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    base_job_name="training_job_name_prefix",
    entry_point="pt-train.py",
    role="SageMakerRole",
    # You must set py_version to py36
    py_version="py36",
    framework_version="1.8.1",
    # For multi-node distributed training, set this count. Example: 2, 3, 4, ..., 8
    instance_count=2,
    # For training with a p3dn instance, use ml.p3dn.24xlarge
    instance_type="ml.p3.16xlarge",
    # Train using the smdistributed.dataparallel distributed training framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

pt_estimator.fit("s3://bucket/path/to/training/data")
Note

If you are using SageMaker Python SDK 1.x, you need to use hyperparameters to specify smdistributed.dataparallel as the distributed training strategy.

from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    ...
    # Train using the smdistributed.dataparallel distributed training framework
    hyperparameters={"sagemaker_distributed_dataparallel_enabled": True},
    ...
)

pt_estimator.fit("s3://bucket/path/to/training/data")