Supported frameworks and AWS Regions

Before using the SageMaker model parallelism library v2 (SMP v2), check the supported frameworks and instance types and determine if there are enough quotas in your AWS account and AWS Region.

Note

To check the latest updates and release notes of the library, see Release notes for the SageMaker model parallelism library.

Supported frameworks

SMP v2 supports the following deep learning frameworks, which are available through SMP Docker containers and an SMP Conda channel. When you use the framework estimator classes in the SageMaker Python SDK and specify the distribution configuration to use SMP v2, SageMaker automatically picks up the SMP Docker containers. To use SMP v2, we recommend that you always keep the SageMaker Python SDK up to date in your development environment.
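As a minimal sketch, the distribution configuration passed to a SageMaker Python SDK framework estimator (for example, the `distribution` argument of `sagemaker.pytorch.PyTorch`) might look like the following. The `hybrid_shard_degree` parameter is one illustrative SMP setting, not a complete configuration.

```python
# Sketch of the `distribution` dict that tells SageMaker to use SMP v2.
# Pass this as the `distribution` argument of a framework estimator such as
# sagemaker.pytorch.PyTorch. `hybrid_shard_degree` is an example parameter;
# see the SMP v2 configuration reference for the full set of options.
smp_options = {
    "torch_distributed": {"enabled": True},  # launch the job with torchrun
    "smdistributed": {
        "modelparallel": {
            "enabled": True,  # SageMaker picks up the SMP Docker container
            "parameters": {
                "hybrid_shard_degree": 8,  # illustrative sharding setting
            },
        }
    },
}
```

With this dict in place, the rest of the estimator setup (entry point, instance type, framework version) follows the usual SageMaker Python SDK pattern.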

PyTorch versions that the SageMaker model parallelism library supports

PyTorch version  SMP library version                   SMP Docker image URI
v2.4.1           smdistributed-modelparallel==v2.6.1   658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121
                 smdistributed-modelparallel==v2.6.0
v2.3.1           smdistributed-modelparallel==v2.5.0   658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.3.1-gpu-py311-cu121
                 smdistributed-modelparallel==v2.4.0
v2.2.0           smdistributed-modelparallel==v2.3.0   658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121
                 smdistributed-modelparallel==v2.2.0
v2.1.2           smdistributed-modelparallel==v2.1.0   658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121
v2.0.1           smdistributed-modelparallel==v2.0.0   658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.0.1-gpu-py310-cu121

SMP Conda channel

The following S3 bucket is a public Conda channel hosted by the SMP service team. If you want to install the SMP v2 library in an environment such as a SageMaker HyperPod cluster, use this Conda channel to properly install the SMP library.

https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/
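A hedged sketch of installing from this channel follows. The package name matches the `smdistributed-modelparallel` entries in the table above; the Python version and companion channels are illustrative assumptions, so verify the exact package builds available in the channel listing for your environment.

```shell
# Create an isolated Conda environment, then install the SMP v2 package from
# the public channel above. The Python pin and extra channels shown here are
# illustrative; match them to a supported combination from the table above.
conda create -n smp-v2 python=3.11 -y
conda activate smp-v2
conda install smdistributed-modelparallel \
    -c https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/ \
    -c pytorch -c nvidia -c conda-forge
```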

For more information about Conda channels in general, see Channels in the Conda documentation.

Note

To find previous versions of the SMP library v1.x and pre-packaged DLCs, see Supported Frameworks in the SMP v1 documentation.

Use SMP v2 with open source libraries

The SMP v2 library works with other PyTorch-based open source libraries such as PyTorch Lightning, Hugging Face Transformers, and Hugging Face Accelerate, because SMP v2 is compatible with the PyTorch FSDP APIs. If you have further questions on using the SMP library with other third party libraries, contact the SMP service team at sm-model-parallel-feedback@amazon.com.

AWS Regions

SMP v2 is available in the following AWS Regions. To use the SMP Docker image URIs or the SMP Conda channel, check the following list, choose the AWS Region that matches yours, and update the image URI or channel URL accordingly.

  • ap-northeast-1

  • ap-northeast-2

  • ap-northeast-3

  • ap-south-1

  • ap-southeast-1

  • ap-southeast-2

  • ca-central-1

  • eu-central-1

  • eu-north-1

  • eu-west-1

  • eu-west-2

  • eu-west-3

  • sa-east-1

  • us-east-1

  • us-east-2

  • us-west-1

  • us-west-2
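As a sketch, adapting the image URI to another Region is a string substitution. Note an assumption here: the ECR registry account shown in the table (658645717510) is the one published for us-west-2, and other Regions may host the image under a different registry account, so confirm the full URI for your Region before using it.

```python
# Hypothetical helper that builds an SMP image URI for a given Region.
# The default account ID and tag come from the us-west-2 entries in the
# table above; other Regions may use a different registry account, so
# verify the resulting URI against the documentation for your Region.
def smp_image_uri(region, account="658645717510", tag="2.4.1-gpu-py311-cu121"):
    return (
        f"{account}.dkr.ecr.{region}.amazonaws.com/"
        f"smdistributed-modelparallel:{tag}"
    )

print(smp_image_uri("eu-west-1"))
```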

Supported instance types

SMP v2 requires one of the following ML instance types.

  • ml.p4d.24xlarge

  • ml.p4de.24xlarge

  • ml.p5.48xlarge

  • ml.p5e.48xlarge
Tip

Starting with SMP v2.2.0, which supports PyTorch v2.2.0 and later, mixed precision training with FP8 on P5 instances using Transformer Engine is available.

For specs of the SageMaker machine learning instance types in general, see the Accelerated Computing section in the Amazon EC2 Instance Types page. For information about instance pricing, see Amazon SageMaker Pricing.

If you encounter an error message similar to the following, follow the instructions at Requesting a quota increase in the AWS Service Quotas User Guide.

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.
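A quota increase can also be requested from the AWS CLI. The sketch below first looks up the quota code for the instance type, then submits the request; the quota name filter and the `L-XXXXXXXX` code are placeholders you must replace with the values returned by the first command.

```shell
# Find the Service Quotas code for the instance type's training-job quota.
# The QuotaName filter is an example; adjust it to the instance type you need.
aws service-quotas list-service-quotas \
    --service-code sagemaker \
    --query "Quotas[?contains(QuotaName, 'ml.p4d.24xlarge for training job usage')]"

# Submit the increase request using the quota code found above.
# L-XXXXXXXX is a placeholder; substitute the real QuotaCode from the output.
aws service-quotas request-service-quota-increase \
    --service-code sagemaker \
    --quota-code L-XXXXXXXX \
    --desired-value 2
```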