Step 2: Launch a Training Job Using the SageMaker Python SDK
The SageMaker Python SDK supports managed training of models with ML frameworks such as
TensorFlow and PyTorch. To launch a training job using one of these frameworks, you define a
SageMaker TensorFlow estimator or a SageMaker PyTorch estimator that is configured to use the
SageMaker model parallel library.
Using the SageMaker TensorFlow and PyTorch Estimators
The TensorFlow and PyTorch estimator classes contain the distribution parameter, which you
can use to specify configuration parameters for distributed training frameworks. The SageMaker
model parallel library internally uses MPI for hybrid data and model parallelism, so you must
use the MPI option with the library.

The following template of a TensorFlow or PyTorch estimator shows how to configure the
distribution parameter for using the SageMaker model parallel library with MPI. To enable the
library, you need to pass configuration dictionaries to the "smdistributed" and "mpi" keys
through the distribution argument of the SageMaker estimator constructors.
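As a sketch, such a distribution configuration might look like the following. The parameter values here (a pipeline parallel degree of 2 and 8 processes per host) are illustrative assumptions, not recommendations; choose them based on your model and instance type.

```python
# Configuration dictionaries for the distribution argument of a TensorFlow or
# PyTorch estimator. The specific values below are illustrative only.

smp_options = {
    "enabled": True,
    "parameters": {
        # Number of model partitions; named "partitions" in
        # smdistributed-modelparallel < v1.6.0.
        "pipeline_parallel_degree": 2,
    },
}

mpi_options = {
    "enabled": True,
    # One MPI process per GPU; must not exceed the number of GPUs per instance.
    "processes_per_host": 8,
}

# This dictionary is what you pass as distribution=... to the estimator constructor.
distribution = {
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options,
}
print(distribution["smdistributed"]["modelparallel"]["enabled"])  # True
```

The meaning of each key in these dictionaries is described in the parameter list that follows.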
Configuration parameters for SageMaker model parallelism

- For the "smdistributed" key, pass a dictionary with the "modelparallel" key and the following inner dictionaries.

  Note: Using "modelparallel" and "dataparallel" in one training job is not supported.

  - "enabled" – Required. To enable model parallelism, set "enabled": True.
  - "parameters" – Required. Specify a set of parameters for SageMaker model parallelism.
    - For a complete list of common parameters, see Parameters for smdistributed in the SageMaker Python SDK documentation. For TensorFlow, see TensorFlow-specific Parameters. For PyTorch, see PyTorch-specific Parameters.
    - "pipeline_parallel_degree" (or "partitions" in smdistributed-modelparallel<v1.6.0) – Required. Among the parameters for smdistributed, this parameter is required to specify how many partitions you want to split the model into.

      Important: There is a breaking change in the parameter name. The "pipeline_parallel_degree" parameter replaces "partitions" since smdistributed-modelparallel v1.6.0. For more information, see Common Parameters for SageMaker model parallelism configuration and SageMaker Distributed Model Parallel Release Notes in the SageMaker Python SDK documentation.
- For the "mpi" key, pass a dictionary that contains the following:

  - "enabled" – Required. Set True to launch the distributed training job with MPI.
  - "processes_per_host" – Required. Specify the number of processes MPI should launch on each host. In SageMaker, a host is a single Amazon EC2 ML instance. The SageMaker Python SDK maintains a one-to-one mapping between processes and GPUs across model and data parallelism. This means that SageMaker schedules each process on a single, separate GPU and no GPU contains more than one process. If you are using PyTorch, you must restrict each process to its own device through torch.cuda.set_device(smp.local_rank()). To learn more, see Automated splitting with PyTorch.

    Important: "processes_per_host" must not be greater than the number of GPUs per instance and typically equals the number of GPUs per instance.
  - "custom_mpi_options" (optional) – Use this key to pass any custom MPI options you might need. If you do not pass any custom MPI options to the key, the MPI option is set by default to the following flag:

    --mca btl_vader_single_copy_mechanism none

    Note: You do not need to explicitly specify this default flag to the key. If you explicitly specify it, your distributed model parallel training job might fail with the following error:

    The following MCA parameter has been listed multiple times on the command line: MCA param: btl_vader_single_copy_mechanism. MCA parameters can only be listed once on a command line to ensure there is no ambiguity as to its value. Please correct the situation and try again.

    Tip: If you launch a training job using an EFA-enabled instance type, such as ml.p4d.24xlarge and ml.p3dn.24xlarge, use the following flag for best performance:

    -x FI_EFA_USE_DEVICE_RDMA=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1
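The one-to-one process-to-GPU mapping that "processes_per_host" implies can be sketched in plain Python. This is a simulation for illustration only; in a real PyTorch training script the binding is done with torch.cuda.set_device(smp.local_rank()) as noted above, and the helper function below is a hypothetical stand-in, not part of any SageMaker API.

```python
# Sketch of the one-process-per-GPU mapping implied by "processes_per_host".
# device_for_process is a hypothetical helper that mimics what smp.local_rank()
# provides in a real training job: the index of the GPU a process binds to
# on its own host.

def device_for_process(global_rank: int, processes_per_host: int) -> int:
    """Return the local GPU index that a process should bind to on its host."""
    return global_rank % processes_per_host

# Two 8-GPU hosts (e.g., ml.p3.16xlarge) -> 16 processes, local ranks 0-7 per host.
processes_per_host = 8
assignments = [device_for_process(rank, processes_per_host) for rank in range(16)]
print(assignments)  # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7]
```

Each host sees every local GPU index exactly once, which is why "processes_per_host" must not exceed the number of GPUs per instance.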
To launch the training job using the estimator and your training script configured for the
SageMaker model parallel library, run the estimator.fit() function.
Use the following resources to learn more about using the model parallelism features in the SageMaker Python SDK:

- If you are a new user, we recommend that you use a SageMaker notebook instance. To see an example of how you can launch a training job using a SageMaker notebook instance, see Amazon SageMaker model parallelism library v2 examples.
- You can also submit a distributed training job from your machine using the AWS CLI. To set up the AWS CLI on your machine, see Set up your AWS credentials and Region for development.
Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel Library
To extend a pre-built container and use SageMaker's model parallelism library, you must
use one of the available AWS Deep Learning Containers (DLC) images for PyTorch or
TensorFlow. The SageMaker model parallelism library is included in the TensorFlow (2.3.0 and later)
and PyTorch (1.6.0 and later) DLC images with CUDA (cuxyz). For a
complete list of DLC images, see Available Deep Learning Containers Images.
Tip
We recommend that you use the image that contains the latest version of TensorFlow or PyTorch to access the most up-to-date version of the SageMaker model parallelism library.
For example, your Dockerfile should contain a FROM statement similar to the following:

```dockerfile
# Use the SageMaker DLC image URI for TensorFlow or PyTorch
FROM aws-dlc-account-id.dkr.ecr.aws-region.amazonaws.com/framework-training:{framework-version-tag}

# Add your dependencies here
RUN ...

ENV PATH="/opt/ml/code:${PATH}"

# this environment variable is used by the SageMaker container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
```
Additionally, when you define a PyTorch or TensorFlow estimator, you must specify the
entry_point for your training script. This should be the same path
identified with ENV SAGEMAKER_SUBMIT_DIRECTORY
in your Dockerfile.
Tip
You must push this Docker container to Amazon Elastic Container Registry (Amazon ECR) and
use the image URI (image_uri) to define a SageMaker estimator for training. For more
information, see Extend a Pre-built Container.
After you finish hosting the Docker container and retrieving the image URI of the
container, create a SageMaker PyTorch
estimator object as follows. This
example assumes that you have already defined smp_options
and
mpi_options
.
```python
smd_mp_estimator = Estimator(
    entry_point="your_training_script.py",
    role=sagemaker.get_execution_role(),
    instance_type="ml.p3.16xlarge",
    sagemaker_session=sagemaker_session,
    image_uri="your_aws_account_id.dkr.ecr.region.amazonaws.com/name:tag",
    instance_count=1,
    distribution={
        "smdistributed": smp_options,
        "mpi": mpi_options,
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit("s3://my_bucket/my_training_data/")
```
Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library
To build your own Docker container for training and use the SageMaker model parallel library, you must include the correct dependencies and the binary files of the SageMaker distributed parallel libraries in your Dockerfile. This section provides the minimum set of code blocks you must include to properly prepare a SageMaker training environment and the model parallel library in your own Docker container.
Note
This custom Docker option with the SageMaker model parallel library as a binary is available only for PyTorch.
To create a Dockerfile with the SageMaker training toolkit and the model parallel library
- Start with one of the NVIDIA CUDA base images.

  ```dockerfile
  FROM <cuda-cudnn-base-image>
  ```

  Tip: The official AWS Deep Learning Container (DLC) images are built from the NVIDIA CUDA base images. We recommend you look into the official Dockerfiles of AWS Deep Learning Containers for PyTorch to find which versions of the libraries you need to install and how to configure them. The official Dockerfiles are complete, benchmark tested, and managed by the SageMaker and Deep Learning Containers service teams. In the provided link, choose the PyTorch version you use, choose the CUDA (cuxyz) folder, and choose the Dockerfile ending with .gpu or .sagemaker.gpu.
- To set up a distributed training environment, you need to install software for communication and network devices, such as Elastic Fabric Adapter (EFA), the NVIDIA Collective Communications Library (NCCL), and Open MPI. Depending on the PyTorch and CUDA versions you choose, you must install compatible versions of the libraries.

  Important: Because the SageMaker model parallel library requires the SageMaker data parallel library in the subsequent steps, we highly recommend that you follow the instructions at Create your own Docker container with the SageMaker distributed data parallel library to properly set up a SageMaker training environment for distributed training.

  For more information about setting up EFA with NCCL and Open MPI, see Get started with EFA and MPI and Get started with EFA and NCCL.
- Add the following arguments to specify the URLs of the SageMaker distributed training packages for PyTorch. The SageMaker model parallel library requires the SageMaker data parallel library to use the cross-node Remote Direct Memory Access (RDMA).

  ```dockerfile
  ARG SMD_MODEL_PARALLEL_URL=https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-02-21-19-26/smdistributed_modelparallel-1.7.0-cp38-cp38-linux_x86_64.whl
  ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.10.2/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
  ```
- Install dependencies that the SageMaker model parallel library requires.

  - Install the METIS library.

    ```dockerfile
    ARG METIS=metis-5.1.0

    RUN rm /etc/apt/sources.list.d/* \
     && wget -nv http://glaros.dtc.umn.edu/gkhome/fetch/sw/metis/${METIS}.tar.gz \
     && gunzip -f ${METIS}.tar.gz \
     && tar -xvf ${METIS}.tar \
     && cd ${METIS} \
     && apt-get update \
     && make config shared=1 \
     && make install \
     && cd .. \
     && rm -rf ${METIS}.tar* \
     && rm -rf ${METIS} \
     && rm -rf /var/lib/apt/lists/* \
     && apt-get clean
    ```

  - Install the RAPIDS Memory Manager library. This requires CMake 3.14 or later.

    ```dockerfile
    ARG RMM_VERSION=0.15.0

    RUN wget -nv https://github.com/rapidsai/rmm/archive/v${RMM_VERSION}.tar.gz \
     && tar -xvf v${RMM_VERSION}.tar.gz \
     && cd rmm-${RMM_VERSION} \
     && INSTALL_PREFIX=/usr/local ./build.sh librmm \
     && cd .. \
     && rm -rf v${RMM_VERSION}.tar* \
     && rm -rf rmm-${RMM_VERSION}
    ```
- Install the SageMaker model parallel library.

  ```dockerfile
  RUN pip install --no-cache-dir -U ${SMD_MODEL_PARALLEL_URL}
  ```

- Install the SageMaker data parallel library.

  ```dockerfile
  RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
  ```

- Install the sagemaker-training toolkit. The toolkit contains the common functionality that's necessary to create a container compatible with the SageMaker training platform and the SageMaker Python SDK.

  ```dockerfile
  RUN pip install sagemaker-training
  ```
- After you finish creating the Dockerfile, see Adapting Your Own Training Container to learn how to build the Docker container and host it in Amazon ECR.
Tip
For more general information about creating a custom Dockerfile for training in SageMaker, see Use Your Own Training Algorithms.