
Run a SageMaker Distributed Training Job with Model Parallelism


Learn how to run a model-parallel training job with your own training script using the SageMaker Python SDK and the SageMaker model parallelism library.

There are three scenarios for running a SageMaker training job.

  1. You can use one of the pre-built AWS Deep Learning Containers for TensorFlow and PyTorch. This option is recommended if you are using the model parallelism library for the first time. For a tutorial on running a SageMaker model-parallel training job, see the example notebooks at PyTorch training with Amazon SageMaker AI's model parallelism library.

  2. You can extend the pre-built containers to handle any additional functional requirements for your algorithm or model that the pre-built SageMaker Docker image doesn't support. To find an example of how you can extend a pre-built container, see Extend a Pre-built Container.

  3. You can adapt your own Docker container to work with SageMaker AI using the SageMaker Training toolkit. For an example, see Adapting Your Own Training Container.

For options 2 and 3 in the preceding list, refer to Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel Library to learn how to install the model parallel library in an extended or customized Docker container.

In all cases, you launch your training job by configuring a SageMaker TensorFlow or PyTorch estimator to activate the library. To learn more, see the following topics.
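As a rough sketch of the estimator configuration described above, the snippet below builds the `distribution` argument that activates the model parallelism library (`smdistributed`/`modelparallel`) together with MPI. The script name, IAM role, instance type, and parameter values are placeholders chosen for illustration; consult the library's configuration reference for the parameters valid in your version.

```python
# Sketch: the distribution configuration that activates the SageMaker
# model parallelism library. Values below are illustrative placeholders.

smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,         # number of model partitions
        "microbatches": 4,       # microbatches per batch for pipelining
        "pipeline": "interleaved",
        "optimize": "speed",
        "ddp": True,             # combine with data parallelism
    },
}

mpi_options = {
    "enabled": True,
    "processes_per_host": 8,     # match the GPU count of the instance type
}

distribution = {
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options,
}

# Passing the configuration to a PyTorch estimator requires the
# SageMaker Python SDK and AWS credentials, so it is shown commented out:
#
# from sagemaker.pytorch import PyTorch
#
# estimator = PyTorch(
#     entry_point="train.py",                       # your training script
#     role="arn:aws:iam::<account-id>:role/<role>",  # your execution role
#     instance_type="ml.p3.16xlarge",
#     instance_count=1,
#     framework_version="1.13.1",
#     py_version="py39",
#     distribution=distribution,
# )
# estimator.fit("s3://<your-bucket>/<training-data-prefix>")
```

Calling `fit` on the configured estimator then launches the training job with the library enabled inside the container.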
