Supported frameworks and AWS Regions
Before using the SageMaker model parallelism library v2 (SMP v2), check the supported frameworks and instance types, and confirm that your AWS account has sufficient quotas in your AWS Region.
Note
To check the latest updates and release notes of the library, see Release notes for the SageMaker model parallelism library.
Supported frameworks
SMP v2 supports the following deep learning frameworks, which are available through SMP Docker containers and an SMP Conda channel. When you use the framework estimator classes in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker automatically picks up the SMP Docker containers. To use SMP v2, we recommend that you always keep the SageMaker Python SDK up to date in your development environment.
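As a minimal sketch of the distribution configuration mentioned above, the following helper builds the dictionary that you would pass as the `distribution` argument of the `PyTorch` framework estimator in the SageMaker Python SDK. The helper name `build_smp_v2_distribution` is hypothetical, and the `parameters` dictionary is left empty here; verify the key layout and the available parameters against the SMP v2 configuration reference before you rely on it.

```python
from typing import Any, Dict, Optional


def build_smp_v2_distribution(
    parameters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a distribution config that turns on SMP v2 (hypothetical helper)."""
    return {
        # SMP v2 jobs are assumed to be launched through torch_distributed.
        "torch_distributed": {"enabled": True},
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                # SMP v2 options go here; this sketch leaves them empty by default.
                "parameters": dict(parameters or {}),
            }
        },
    }


distribution = build_smp_v2_distribution()
# Assumed usage: sagemaker.pytorch.PyTorch(..., distribution=distribution)
```

Building the dictionary in one place keeps the estimator call short and makes it easier to review the SMP settings separately from the rest of the job definition.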
PyTorch versions that the SageMaker model parallelism library supports
PyTorch version | SageMaker model parallelism library version | SMP Docker image URI |
---|---|---|
v2.3.1 | smdistributed-modelparallel==v2.4.0 | 658645717510.dkr.ecr. |
v2.2.0 | smdistributed-modelparallel==v2.3.0 | 658645717510.dkr.ecr. |
v2.2.0 | smdistributed-modelparallel==v2.2.0 | Not available. Use the image of SMP v2.3.0, which is backward compatible. |
v2.1.2 | smdistributed-modelparallel==v2.1.0 | 658645717510.dkr.ecr. |
v2.0.1 | smdistributed-modelparallel==v2.0.0 | 658645717510.dkr.ecr. |
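The version pairs in the table above can be kept in a small lookup so that a setup script fails early on an unsupported PyTorch version. This is only a sketch: the names `SMP_VERSIONS_BY_PYTORCH` and `smp_requirement` are hypothetical, and the mapping simply copies the table, so it must be updated by hand when new releases appear.

```python
# PyTorch version -> SMP v2 library versions (newest first), copied from the table.
SMP_VERSIONS_BY_PYTORCH = {
    "2.3.1": ["2.4.0"],
    "2.2.0": ["2.3.0", "2.2.0"],
    "2.1.2": ["2.1.0"],
    "2.0.1": ["2.0.0"],
}


def smp_requirement(pytorch_version: str) -> str:
    """Return the newest listed SMP v2 identifier for a PyTorch version."""
    key = pytorch_version.lstrip("v")  # accept both "v2.3.1" and "2.3.1"
    try:
        newest = SMP_VERSIONS_BY_PYTORCH[key][0]
    except KeyError:
        raise ValueError(
            f"PyTorch {pytorch_version} is not listed for SMP v2"
        ) from None
    # Same string format as the library version column of the table.
    return f"smdistributed-modelparallel==v{newest}"


print(smp_requirement("v2.2.0"))
```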
SMP Conda channel
The following S3 bucket is a public Conda channel hosted by the SMP service team. If you want to install the SMP v2 library in an environment such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.
https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/
For more information about Conda channels in general, see Channels.
Note
To find previous versions of the SMP library v1.x and pre-packaged DLCs, see Supported Frameworks in the SMP v1 documentation.
Use SMP v2 with open source libraries
The SMP v2 library works with other PyTorch-based open source libraries such as PyTorch Lightning, Hugging Face Transformers, and Hugging Face Accelerate, because SMP v2 is compatible with the PyTorch FSDP APIs. If you have further questions on using the SMP library with other third-party libraries, contact the SMP service team at sm-model-parallel-feedback@amazon.com.
AWS Regions
SMP v2 is available in the following AWS Regions. To use the SMP Docker image URIs or the SMP Conda channel, choose the AWS Region from the following list that matches yours, and update the image URI or the channel URL accordingly.
- ap-northeast-1
- ap-northeast-2
- ap-northeast-3
- ap-south-1
- ap-southeast-1
- ap-southeast-2
- ca-central-1
- eu-central-1
- eu-north-1
- eu-west-1
- eu-west-2
- eu-west-3
- sa-east-1
- us-east-1
- us-east-2
- us-west-1
- us-west-2
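Updating the Conda channel URL for your Region can be sketched as follows. The Region set copies the list above and the URL pattern follows the `us-west-2` example given for the SMP Conda channel; the names `SMP_V2_REGIONS` and `smp_conda_channel_url` are hypothetical and are not part of any AWS SDK.

```python
# AWS Regions where SMP v2 is available, copied from the list above.
SMP_V2_REGIONS = frozenset({
    "ap-northeast-1", "ap-northeast-2", "ap-northeast-3", "ap-south-1",
    "ap-southeast-1", "ap-southeast-2", "ca-central-1", "eu-central-1",
    "eu-north-1", "eu-west-1", "eu-west-2", "eu-west-3", "sa-east-1",
    "us-east-1", "us-east-2", "us-west-1", "us-west-2",
})


def smp_conda_channel_url(region: str) -> str:
    """Return the SMP v2 Conda channel URL for a supported AWS Region."""
    if region not in SMP_V2_REGIONS:
        raise ValueError(f"SMP v2 is not listed as available in {region}")
    # Assumed pattern: only the Region segment of the documented URL changes.
    return (
        f"https://sagemaker-distributed-model-parallel.s3.{region}"
        ".amazonaws.com/smp-v2/"
    )


print(smp_conda_channel_url("eu-west-1"))
```

Rejecting unknown Regions up front gives a clearer failure than a Conda install that cannot reach the channel.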
Supported instance types
SMP v2 requires one of the following ML instance types.
Instance type |
---|
ml.p4d.24xlarge |
ml.p4de.24xlarge |
ml.p5.48xlarge |
Tip
Starting from SMP v2.2.0, which supports PyTorch v2.2.0 and later, mixed precision training with FP8 using Transformer Engine is available on P5 instances.
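The instance type requirement and the FP8 tip can be combined into one pre-flight check, sketched below. The function names are hypothetical, and the rule encodes only what this page states (P5 instances, SMP v2.2.0 or later); it does not check anything about Transformer Engine itself.

```python
from typing import Tuple

# Instance types that SMP v2 requires, copied from the table above.
SUPPORTED_INSTANCE_TYPES = (
    "ml.p4d.24xlarge",
    "ml.p4de.24xlarge",
    "ml.p5.48xlarge",
)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn 'v2.2.0' or '2.2.0' into (2, 2, 0) for comparison."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def fp8_available(instance_type: str, smp_version: str) -> bool:
    """Return True if FP8 mixed precision is expected to be available."""
    if instance_type not in SUPPORTED_INSTANCE_TYPES:
        raise ValueError(f"{instance_type} is not a supported SMP v2 instance type")
    # Per the tip: FP8 needs a P5 instance and SMP v2.2.0 or later.
    return instance_type.startswith("ml.p5.") and parse_version(smp_version) >= (2, 2, 0)


print(fp8_available("ml.p5.48xlarge", "v2.4.0"))
```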
For specs of the SageMaker machine learning instance types in general, see the Accelerated Computing section in the Amazon EC2 Instance Types page.
If you encounter an error message similar to the following, follow the instructions at Requesting a quota increase in the AWS Service Quotas User Guide.
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.
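When you prepare a quota increase request, it can help to pull the quota name and the numbers out of the error text. The following is a rough sketch that assumes the message keeps the wording of the example above; the function name `parse_resource_limit_error` is hypothetical, and the regular expression returns `None` if AWS phrases the message differently.

```python
import re
from typing import Dict, Optional, Union

# Pattern modeled on the example message above (an assumption, not a stable format).
_LIMIT_PATTERN = re.compile(
    r"service limit '(?P<quota_name>[^']+)' is (?P<limit>\d+) Instances, "
    r"with current utilization of (?P<utilization>\d+) Instances "
    r"and a request delta of (?P<requested>\d+) Instances"
)


def parse_resource_limit_error(
    message: str,
) -> Optional[Dict[str, Union[str, int]]]:
    """Extract quota details from a ResourceLimitExceeded message, if it matches."""
    match = _LIMIT_PATTERN.search(message)
    if match is None:
        return None
    return {
        "quota_name": match.group("quota_name"),
        "limit": int(match.group("limit")),
        "utilization": int(match.group("utilization")),
        "requested": int(match.group("requested")),
    }


sample = (
    "The account-level service limit 'ml.p3dn.24xlarge for training job usage' "
    "is 0 Instances, with current utilization of 0 Instances and a request "
    "delta of 1 Instances."
)
print(parse_resource_limit_error(sample))
```

The `quota_name` value is the text to look for in the Service Quotas console, and `utilization` plus `requested` gives the minimum limit to ask for.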