Supported frameworks, AWS Regions, and instance types
Before using the SageMaker distributed data parallelism (SMDDP) library, check which ML frameworks and instance types are supported, and confirm that your AWS account has sufficient quotas in your AWS Region.
Supported frameworks
The following tables show the deep learning frameworks and their versions that SageMaker and SMDDP support. The SMDDP library is available in SageMaker Framework Containers.
Note
To check the latest updates and release notes of the SMDDP library, see the SageMaker data parallelism library release notes.
PyTorch
PyTorch version | SMDDP library version | SageMaker Framework Container images pre-installed with SMDDP | SMP Docker images pre-installed with SMDDP | URL of the binary file** |
---|---|---|---|---|
v2.3.0 | smdistributed-dataparallel==v2.3.0 | 763104351884.dkr.ecr. | Currently not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed_dataparallel-2.3.0-cp311-cp311-linux_x86_64.whl |
v2.2.0 | smdistributed-dataparallel==v2.2.0 | 763104351884.dkr.ecr. | 658645717510.dkr.ecr. | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed_dataparallel-2.2.0-cp310-cp310-linux_x86_64.whl |
v2.1.0 | smdistributed-dataparallel==v2.1.0 | 763104351884.dkr.ecr. | 658645717510.dkr.ecr. | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed_dataparallel-2.1.0-cp310-cp310-linux_x86_64.whl |
v2.0.1 | smdistributed-dataparallel==v2.0.1 | 763104351884.dkr.ecr. | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl |
v2.0.0 | smdistributed-dataparallel==v1.8.0 | 763104351884.dkr.ecr. | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.0/cu118/2023-03-20/smdistributed_dataparallel-1.8.0-cp310-cp310-linux_x86_64.whl |
v1.13.1 | smdistributed-dataparallel==v1.7.0 | 763104351884.dkr.ecr. | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.13.1/cu117/2023-01-09/smdistributed_dataparallel-1.7.0-cp39-cp39-linux_x86_64.whl |
v1.12.1 | smdistributed-dataparallel==v1.6.0 | 763104351884.dkr.ecr. | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.1/cu113/2022-12-05/smdistributed_dataparallel-1.6.0-cp38-cp38-linux_x86_64.whl |
v1.12.0 | smdistributed-dataparallel==v1.5.0 | 763104351884.dkr.ecr. | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed_dataparallel-1.5.0-cp38-cp38-linux_x86_64.whl |
v1.11.0 | smdistributed-dataparallel==v1.4.1 | 763104351884.dkr.ecr. | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.11.0/cu113/2022-04-14/smdistributed_dataparallel-1.4.1-cp38-cp38-linux_x86_64.whl |
** The URLs of the binary files are for installing the SMDDP library in custom containers. For more information, see Create your own Docker container with the SageMaker distributed data parallel library.
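When choosing a binary file for a custom container, it can help to confirm that the wheel's Python tag and CUDA tag match the container. The following is a minimal sketch that reads those fields from a URL in the table above. The names `SmddpWheel`, `parse_wheel_url`, and `matches_interpreter` are hypothetical helpers for illustration and are not part of the SMDDP library, and the URL layout is inferred only from the entries listed in the table.

```python
import re
import sys
from typing import NamedTuple


class SmddpWheel(NamedTuple):
    """Fields read from an SMDDP binary URL (hypothetical helper type)."""
    framework: str
    framework_version: str
    cuda_tag: str
    build_date: str
    smddp_version: str
    python_tag: str


# Pattern inferred from the URLs in the table; it is an assumption, not a documented format.
_WHEEL_URL = re.compile(
    r"/binary/(?P<framework>[a-z]+)/(?P<framework_version>[\d.]+)/"
    r"(?P<cuda_tag>cu\d+)/(?P<build_date>\d{4}-\d{2}-\d{2})/"
    r"smdistributed_dataparallel-(?P<smddp_version>[\d.]+)-"
    r"(?P<python_tag>cp\d+)-cp\d+-linux_x86_64\.whl$"
)


def parse_wheel_url(url: str) -> SmddpWheel:
    """Split a binary URL into framework, CUDA, and Python tags."""
    match = _WHEEL_URL.search(url)
    if match is None:
        raise ValueError(f"unrecognized SMDDP wheel URL: {url}")
    return SmddpWheel(**match.groupdict())


def matches_interpreter(wheel: SmddpWheel) -> bool:
    """Return True if the wheel's Python tag matches the running CPython version."""
    current = f"cp{sys.version_info.major}{sys.version_info.minor}"
    return wheel.python_tag == current


if __name__ == "__main__":
    url = (
        "https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/"
        "2024-05-23/smdistributed_dataparallel-2.3.0-cp311-cp311-linux_x86_64.whl"
    )
    wheel = parse_wheel_url(url)
    print(wheel)
    print("Python tag matches this interpreter:", matches_interpreter(wheel))
```

Running a check like this inside the custom container before `pip install` surfaces a Python version mismatch early, rather than as an installation failure.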
Note
The SMDDP library is available in AWS Regions where the SageMaker Framework Containers are available.
Note
The SMDDP library v1.4.0 and later works as a backend of PyTorch distributed (torch.distributed) data parallelism (torch.nn.parallel.DistributedDataParallel). In accordance with the change, the following smdistributed APIs are deprecated:
- smdistributed.dataparallel.torch.distributed is deprecated. Use the torch.distributed package instead.
- smdistributed.dataparallel.torch.parallel.DistributedDataParallel is deprecated. Use the torch.nn.parallel.DistributedDataParallel API instead.
If you need to use the previous versions of the library (v1.3.0 or before), see the archived SageMaker distributed data parallelism documentation.
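The migration described in this note can be sketched as follows. The training script uses the native `torch.distributed` and `torch.nn.parallel.DistributedDataParallel` APIs and selects SMDDP only as the backend. The helper names `select_backend` and `init_and_wrap` are hypothetical, and the fallback to `gloo` when the SMDDP module is not importable is an assumption added so the sketch can also run outside a SageMaker container.

```python
import importlib.util


def select_backend(fallback: str = "gloo") -> str:
    """Return "smddp" if the SMDDP PyTorch backend module is importable, else `fallback`."""
    try:
        spec = importlib.util.find_spec("smdistributed.dataparallel.torch.torch_smddp")
    except ImportError:
        spec = None  # a parent package is missing or failed to import
    return "smddp" if spec is not None else fallback


def init_and_wrap(model):
    """Initialize torch.distributed and wrap `model` with the native PyTorch DDP API.

    Assumes the launcher has set the usual rendezvous environment variables
    and that the caller has already moved `model` to the local device.
    """
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    backend = select_backend()
    if backend == "smddp":
        # Importing this module registers "smddp" as a torch.distributed backend.
        import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401
    dist.init_process_group(backend=backend)
    return DistributedDataParallel(model)


if __name__ == "__main__":
    print("Backend that would be used:", select_backend())
```

Because only the backend string changes, the rest of the script stays portable between SMDDP and other `torch.distributed` backends.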
PyTorch Lightning
The SMDDP library is available for PyTorch Lightning in the following SageMaker Framework Containers for PyTorch and the SMP Docker containers.
PyTorch Lightning v2
PyTorch Lightning version | PyTorch version | SMDDP library version | SageMaker Framework Container images pre-installed with SMDDP | SMP Docker images pre-installed with SMDDP | URL of the binary file** |
---|---|---|---|---|---|
2.2.5 | 2.3.0 | smdistributed-dataparallel==v2.3.0 | 763104351884.dkr.ecr. | Currently not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed_dataparallel-2.3.0-cp311-cp311-linux_x86_64.whl |
2.2.0 | 2.2.0 | smdistributed-dataparallel==v2.2.0 | 763104351884.dkr.ecr. | 658645717510.dkr.ecr. | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed_dataparallel-2.2.0-cp310-cp310-linux_x86_64.whl |
2.1.2 | 2.1.0 | smdistributed-dataparallel==v2.1.0 | 763104351884.dkr.ecr. | 658645717510.dkr.ecr. | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed_dataparallel-2.1.0-cp310-cp310-linux_x86_64.whl |
2.1.0 | 2.0.1 | smdistributed-dataparallel==v2.0.1 | 763104351884.dkr.ecr. | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl |
PyTorch Lightning v1
PyTorch Lightning version | PyTorch version | SMDDP library version | SageMaker Framework Container images pre-installed with SMDDP | URL of the binary file** |
---|---|---|---|---|
1.7.2, 1.7.0, 1.6.4, 1.6.3, 1.5.10 | 1.12.0 | smdistributed-dataparallel==v1.5.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed_dataparallel-1.5.0-cp38-cp38-linux_x86_64.whl |
** The URLs of the binary files are for installing the SMDDP library in custom containers. For more information, see Create your own Docker container with the SageMaker distributed data parallel library.
Note
PyTorch Lightning and its utility libraries, such as Lightning Bolts, are not preinstalled in the PyTorch DLCs. When you construct a SageMaker PyTorch estimator and submit a training job request in Step 2, you need to provide a requirements.txt file to install pytorch-lightning and lightning-bolts in the SageMaker PyTorch training container.

    # requirements.txt
    pytorch-lightning
    lightning-bolts

For more information about specifying the source directory in which to place the requirements.txt file along with your training script, and about submitting a job, see Using third-party libraries.
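As a rough illustration of the note above, the following sketch writes requirements.txt into a source directory and assembles keyword arguments for a SageMaker PyTorch estimator with SMDDP enabled. The helper names `write_requirements` and `estimator_kwargs`, the `train.py` entry point, and the instance count are hypothetical. The framework and Python versions are taken from the PyTorch Lightning v1 table row, and the sketch does not import the SageMaker Python SDK itself.

```python
from pathlib import Path


def write_requirements(source_dir, packages=("pytorch-lightning", "lightning-bolts")):
    """Write requirements.txt next to the training script and return its path."""
    source_dir = Path(source_dir)
    source_dir.mkdir(parents=True, exist_ok=True)
    path = source_dir / "requirements.txt"
    path.write_text("\n".join(packages) + "\n")
    return path


def estimator_kwargs(source_dir, entry_point="train.py"):
    """Build keyword arguments for sagemaker.pytorch.PyTorch (values are illustrative)."""
    return {
        "entry_point": entry_point,          # hypothetical script name
        "source_dir": str(source_dir),       # directory that holds requirements.txt
        "framework_version": "1.12.0",       # PyTorch version from the Lightning v1 table
        "py_version": "py38",
        "instance_type": "ml.p4d.24xlarge",  # one of the supported instance types
        "instance_count": 2,                 # assumption for illustration
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }


if __name__ == "__main__":
    req = write_requirements("src")
    print(req.read_text())
    print(estimator_kwargs("src"))
```

With the SageMaker Python SDK installed, these arguments could be passed as `PyTorch(role=<your-iam-role>, **estimator_kwargs("src"))`, where the IAM role is a value you supply.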
Hugging Face Transformers
The AWS Deep Learning Containers for Hugging Face use the SageMaker Training Containers for PyTorch and TensorFlow as their base images. To look up the Hugging Face Transformers library versions and paired PyTorch and TensorFlow versions, see the latest Hugging Face Containers.
TensorFlow (deprecated)
Important
The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow later than v2.11.0. The following table lists previous DLCs for TensorFlow with the SMDDP library installed.
TensorFlow version | SMDDP library version |
---|---|
2.9.1, 2.10.1, 2.11.0 | smdistributed-dataparallel==v1.4.1 |
2.8.3 | smdistributed-dataparallel==v1.3.0 |
AWS Regions
The SMDDP library is available in all of the AWS Regions where the AWS Deep Learning Containers for SageMaker are available.
Supported instance types
The SMDDP library requires one of the following instance types.
Instance type |
---|
ml.p3dn.24xlarge * |
ml.p4d.24xlarge |
ml.p4de.24xlarge |
Tip
To properly run distributed training on the EFA-enabled instance types, you should enable traffic between the instances by setting up the security group of your VPC to allow all inbound and outbound traffic to and from the security group itself. To learn how to set up the security group rules, see Step 1: Prepare an EFA-enabled security group in the Amazon EC2 User Guide.
Important
* The SMDDP library has discontinued support for optimizing its collective communication operations on P3 instances. You can still use the SMDDP optimized AllReduce collective on ml.p3dn.24xlarge instances, but there will be no further development to improve performance on this instance type. Note that the SMDDP optimized AllGather collective is available only on P4 instances.
For specs of the instance types, see the Accelerated Computing section in the Amazon EC2 Instance Types page.
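The instance-type constraints above can be encoded as a small pre-flight check before submitting a job. This is a minimal sketch: `SMDDP_COLLECTIVES` and `optimized_collectives` are hypothetical names, and listing AllReduce for the P4 instance types is an inference from the note above rather than an explicit statement in it.

```python
# Mapping built from the supported instance table and the note on collectives.
# AllReduce on P4 instances is inferred; AllGather is stated as P4-only.
SMDDP_COLLECTIVES = {
    "ml.p3dn.24xlarge": {"AllReduce"},
    "ml.p4d.24xlarge": {"AllReduce", "AllGather"},
    "ml.p4de.24xlarge": {"AllReduce", "AllGather"},
}


def optimized_collectives(instance_type: str) -> list:
    """Return the SMDDP optimized collectives for a supported instance type, sorted by name."""
    try:
        return sorted(SMDDP_COLLECTIVES[instance_type])
    except KeyError:
        supported = ", ".join(sorted(SMDDP_COLLECTIVES))
        raise ValueError(
            f"{instance_type} is not supported by SMDDP; choose one of: {supported}"
        ) from None


if __name__ == "__main__":
    for name in SMDDP_COLLECTIVES:
        print(name, optimized_collectives(name))
```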
If you encounter an error message similar to the following, follow the instructions at Request a service quota increase for SageMaker resources.
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.
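To decide how large a quota increase to request, the numbers in this message can be extracted programmatically. The sketch below assumes the message follows the wording of the example above. `parse_quota_error` is a hypothetical helper, not part of any AWS SDK, and the `shortfall` field is a derived value (utilization plus requested delta minus limit).

```python
import re

# Pattern based only on the example message above; other wordings will not match.
_QUOTA_ERROR = re.compile(
    r"The account-level service limit '(?P<quota>[^']+)' is (?P<limit>\d+) Instances, "
    r"with current utilization of (?P<used>\d+) Instances "
    r"and a request delta of (?P<delta>\d+) Instances"
)


def parse_quota_error(message: str) -> dict:
    """Extract the quota name and instance counts from a ResourceLimitExceeded message."""
    match = _QUOTA_ERROR.search(message)
    if match is None:
        raise ValueError("message does not look like a ResourceLimitExceeded quota error")
    limit = int(match["limit"])
    used = int(match["used"])
    delta = int(match["delta"])
    return {
        "quota": match["quota"],
        "limit": limit,
        "used": used,
        "delta": delta,
        # Minimum number of additional instances the quota must allow.
        "shortfall": max(0, used + delta - limit),
    }


if __name__ == "__main__":
    example = (
        "ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling "
        "the CreateTrainingJob operation: The account-level service limit "
        "'ml.p3dn.24xlarge for training job usage' is 0 Instances, with current "
        "utilization of 0 Instances and a request delta of 1 Instances. "
        "Please contact AWS support to request an increase for this limit."
    )
    print(parse_quota_error(example))
```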