Supported Frameworks and AWS Regions - Amazon SageMaker

Supported Frameworks and AWS Regions

Before using the SageMaker model parallel library, check the supported frameworks and instance types, and confirm that your AWS account has sufficient service quotas in your AWS Region.

Supported Frameworks

The SageMaker model parallel library supports the following deep learning frameworks and is available in AWS Deep Learning Containers (DLC) or downloadable as a binary file.

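PyTorch versions supported by SageMaker and the SageMaker distributed model parallel library

PyTorch version: v1.12.0*
SageMaker distributed model parallel library version: smdistributed-modelparallel==v1.11.0
smdistributed-modelparallel integrated DLC image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker
URL of the binary file**: https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.0/build-artifacts/2022-08-12-16-58/smdistributed_modelparallel-1.11.0-cp38-cp38-linux_x86_64.whl

PyTorch version: v1.11.0*
SageMaker distributed model parallel library version: smdistributed-modelparallel==v1.10.0
smdistributed-modelparallel integrated DLC image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker
URL of the binary file**: https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.11.0/build-artifacts/2022-07-11-19-23/smdistributed_modelparallel-1.10.0-cp38-cp38-linux_x86_64.whl

PyTorch version: v1.10.2*
SageMaker distributed model parallel library version: smdistributed-modelparallel==v1.7.0
smdistributed-modelparallel integrated DLC image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker
URL of the binary file**: -

PyTorch version: v1.10.0
SageMaker distributed model parallel library version: smdistributed-modelparallel==v1.5.0
smdistributed-modelparallel integrated DLC image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker
URL of the binary file**: -

PyTorch version: v1.9.1
SageMaker distributed model parallel library version: smdistributed-modelparallel==v1.4.0
smdistributed-modelparallel integrated DLC image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
URL of the binary file**: -

PyTorch version: v1.8.1*
SageMaker distributed model parallel library version: smdistributed-modelparallel==v1.6.0
smdistributed-modelparallel integrated DLC image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04
URL of the binary file**: -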

* Versions 1.6.0 and later of the SageMaker distributed model parallel library provide extended features for PyTorch. For more information, see Extended Features of the SageMaker Model Parallel Library for PyTorch.

** The URLs of the binary files are for installing the SageMaker distributed model parallelism library in custom containers. For more information, see Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library.
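The version pairings in the PyTorch table above can be captured in a small lookup, which is handy when a training script needs to pick the matching image for a given AWS Region. The following is a minimal sketch, not part of the library: the names `PYTORCH_SUPPORT`, `pytorch_dlc_image_uri`, and `library_version` are hypothetical, and the sketch assumes that the registry account ID shown in the table applies to the Region you pass in. Check Available Deep Learning Containers Images for your Region before relying on the result.

```python
# Pairings copied from the PyTorch table above:
# PyTorch version -> (model parallel library version, DLC image tag).
PYTORCH_SUPPORT = {
    "1.12.0": ("1.11.0", "1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker"),
    "1.11.0": ("1.10.0", "1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker"),
    "1.10.2": ("1.7.0", "1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker"),
    "1.10.0": ("1.5.0", "1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker"),
    "1.9.1": ("1.4.0", "1.9.1-gpu-py38-cu111-ubuntu20.04"),
    "1.8.1": ("1.6.0", "1.8.1-gpu-py36-cu111-ubuntu18.04"),
}

# Registry pattern taken from the table. Assumption: the account ID is
# valid for the Region that the caller supplies.
_REGISTRY = "763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training"


def _lookup(pytorch_version: str) -> tuple:
    """Return the table row for a PyTorch version, with or without a 'v' prefix."""
    version = pytorch_version.lstrip("v")
    if version not in PYTORCH_SUPPORT:
        raise ValueError(f"PyTorch {version} is not listed in the table")
    return PYTORCH_SUPPORT[version]


def library_version(pytorch_version: str) -> str:
    """Return the smdistributed-modelparallel version paired with a PyTorch version."""
    return _lookup(pytorch_version)[0]


def pytorch_dlc_image_uri(pytorch_version: str, region: str) -> str:
    """Fill the <region> placeholder and return the full DLC image URI."""
    tag = _lookup(pytorch_version)[1]
    return f"{_REGISTRY.format(region=region)}:{tag}"
```

For example, `pytorch_dlc_image_uri("v1.12.0", "us-west-2")` fills in the `<region>` placeholder of the first row of the table, and an unlisted version raises `ValueError` instead of producing a URI that does not exist.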

TensorFlow versions supported by SageMaker and the SageMaker distributed model parallel library

TensorFlow version: v2.6.0
SageMaker distributed model parallel library version: smdistributed-modelparallel==v1.4.0
smdistributed-modelparallel integrated DLC image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.6.0-gpu-py38-cu112-ubuntu20.04

TensorFlow version: v2.5.1
SageMaker distributed model parallel library version: smdistributed-modelparallel==v1.4.0
smdistributed-modelparallel integrated DLC image URI: 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.5.1-gpu-py37-cu112-ubuntu18.04

Hugging Face Transformers versions supported by SageMaker and the SageMaker distributed model parallel library

The AWS Deep Learning Containers for Hugging Face use the SageMaker Training Containers for PyTorch and TensorFlow as their base images. To look up the Hugging Face Transformers library versions and paired PyTorch and TensorFlow versions, see the latest Hugging Face Containers and the Prior Hugging Face Container Versions.

Note

To check the latest updates and release history of the library, see the SageMaker Distributed Model Parallel Release Notes in the SageMaker Python SDK documentation.

AWS Regions

The SageMaker model parallel library is available in all of the AWS Regions where the AWS Deep Learning Containers for SageMaker are in service. For more information, see Available Deep Learning Containers Images.

Supported Instance Types

The SageMaker model parallelism library requires one of the following ML instance types.

Instance type
ml.p3.16xlarge
ml.p3dn.24xlarge
ml.p4d.24xlarge
ml.g4dn.12xlarge

For the specifications of these instance types, see the Accelerated Computing section on the Amazon EC2 Instance Types page. For information about instance pricing, see Amazon SageMaker Pricing.
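Because a training job on an unsupported instance type fails only after it is submitted, it can help to check the instance type locally first. The following is a minimal sketch of such a check. The function name `require_supported_instance_type` is hypothetical, and the set mirrors only the instance types listed above, so it needs updating if the supported list changes.

```python
# Instance types copied from the list above.
SUPPORTED_INSTANCE_TYPES = frozenset(
    {
        "ml.p3.16xlarge",
        "ml.p3dn.24xlarge",
        "ml.p4d.24xlarge",
        "ml.g4dn.12xlarge",
    }
)


def require_supported_instance_type(instance_type: str) -> str:
    """Return instance_type unchanged, or raise if it is not in the list."""
    if instance_type not in SUPPORTED_INSTANCE_TYPES:
        supported = ", ".join(sorted(SUPPORTED_INSTANCE_TYPES))
        raise ValueError(
            f"{instance_type!r} is not supported by the model parallel "
            f"library; choose one of: {supported}"
        )
    return instance_type
```

Returning the validated value lets the check be used inline, for example when assembling the arguments for a training job.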

If you encounter an error message similar to the following, follow the instructions at Request a service quota increase for SageMaker resources.

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.
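When filing the quota increase request, the useful details in this message are the resource name, the current limit, and the requested delta. The following is a minimal sketch that extracts them with a regular expression. The function name `parse_resource_limit_error` is hypothetical, and the pattern assumes the exact wording of the sample message above; messages with different wording return `None`.

```python
import re
from typing import Optional

# Pattern modeled on the sample message above. Assumption: other
# ResourceLimitExceeded messages follow the same wording.
_LIMIT_PATTERN = re.compile(
    r"The account-level service limit '(?P<resource>[^']+)' is "
    r"(?P<limit>\d+) Instances, with current utilization of "
    r"(?P<utilization>\d+) Instances and a request delta of "
    r"(?P<delta>\d+) Instances"
)


def parse_resource_limit_error(message: str) -> Optional[dict]:
    """Extract quota details from a ResourceLimitExceeded message.

    Returns None if the message does not match the expected wording.
    """
    match = _LIMIT_PATTERN.search(message)
    if match is None:
        return None
    return {
        "resource": match.group("resource"),
        "limit": int(match.group("limit")),
        "utilization": int(match.group("utilization")),
        "request_delta": int(match.group("delta")),
    }
```

For the sample message, the `resource` field is the quota name to look up in the Service Quotas console, and `utilization + request_delta` is the minimum limit the job needs.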