Run a SageMaker Distributed Training Job with Model Parallelism

Learn how to run a model-parallel training job of your own training script using the SageMaker Python SDK with the SageMaker model parallelism library.

There are three use-case scenarios for running a SageMaker training job.

You can use one of the pre-built AWS Deep Learning Container for TensorFlow and PyTorch. This option is recommended if it is the first time for you to use the model parallel library. To find a tutorial for how to run a SageMaker model parallel training job, see the example notebooks at PyTorch training with Amazon SageMaker's model parallelism library.
You can extend the pre-built containers to handle any additional functional requirements for your algorithm or model that the pre-built SageMaker Docker image doesn't support. To find an example of how you can extend a pre-built container, see Extend a Pre-built Container.
You can adapt your own Docker container to work with SageMaker using the SageMaker Training toolkit. For an example, see Adapting Your Own Training Container.

For options 2 and 3 in the preceding list, refer to Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel Library to learn how to install the model parallel library in an extended or customized Docker container.

In all cases, you launch your training job configuring a SageMaker TensorFlow or PyTorch estimator to activate the library. To learn more, see the following topics.

Topics

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Support for FlashAttention

Step 1: Modify Your Own Training Script