Run a SageMaker Distributed Training Job with Model Parallelism

Learn how to run a distributed model parallel training job using the SageMaker Python SDK with your own training script and SageMaker's distributed model parallel library.

There are three use cases for running a SageMaker training job:

  1. You can use one of the prebuilt AWS Deep Learning Containers for TensorFlow and PyTorch. This option is recommended if this is your first time using the model parallel library. For a tutorial on running a SageMaker model parallel training job, see MNIST with PyTorch 1.6 and Amazon SageMaker's distributed model parallel library.

  2. You can extend the prebuilt containers to handle any additional functional requirements for your algorithm or model that the prebuilt SageMaker Docker image doesn't support. To find an example of how you can extend a pre-built container, see Extend a Prebuilt Container.

  3. You can adapt your own Docker container to work with SageMaker using the SageMaker Training toolkit. For an example, see Adapting Your Own Training Container.

For options 2 and 3 in the preceding list, refer to Extend a Prebuilt Docker Container that Contains SageMaker's Distributed Model Parallel Library to learn how to install the model parallel library in an extended or customized Docker container.

In all of these cases, you launch your training job by configuring a SageMaker TensorFlow or PyTorch estimator to initialize the library. To learn more, see the following topics.
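As an illustration, the following sketch shows the shape of the `distribution` configuration that enables the model parallel library in a SageMaker estimator. The parameter values below (partitions, microbatches, processes per host) are placeholders for illustration, not recommendations; consult the library's configuration reference for the parameters supported by your framework version.

```python
# Distribution configuration that activates SageMaker's distributed model
# parallel library ("smdistributed") along with MPI, which the library uses
# to launch its worker processes. All numeric values are placeholders.
distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                "partitions": 2,    # number of model partitions (placeholder)
                "microbatches": 4,  # microbatches for pipelined execution (placeholder)
            },
        }
    },
    "mpi": {
        "enabled": True,
        "processes_per_host": 8,  # typically matches the GPU count per instance
    },
}

print(distribution["smdistributed"]["modelparallel"]["enabled"])  # → True
```

You would pass this dictionary as the `distribution` argument of a `sagemaker.pytorch.PyTorch` or `sagemaker.tensorflow.TensorFlow` estimator, together with your `entry_point` script, IAM role, instance type, and instance count, and then call the estimator's `fit()` method to launch the training job.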