Amazon SageMaker Distributed Training Notebook Examples
The following case studies and notebooks provide examples of implementing the SageMaker distributed training libraries for the supported deep learning frameworks (PyTorch, TensorFlow, and HuggingFace) and models, such as CNN and MaskRCNN for vision, and BERT for natural language processing.
These notebooks are provided in the SageMaker examples GitHub repository.
Blogs and Case Studies
The following blogs discuss case studies about using the SageMaker distributed training libraries.
The SageMaker data parallelism library
- Enable faster training with Amazon SageMaker data parallel library, AWS Machine Learning Blog (December 05, 2023)
- How I trained 10TB for Stable Diffusion on SageMaker, Medium (November 29, 2022)
- Run PyTorch Lightning and native PyTorch DDP on Amazon SageMaker Training, featuring Amazon Search, AWS Machine Learning Blog (August 18, 2022)
- Training YOLOv5 on AWS with PyTorch and the SageMaker distributed data parallel library, Medium (May 6, 2022)
- Speed up EfficientNet model training on SageMaker with PyTorch and the SageMaker distributed data parallel library, Medium (March 21, 2022)
- Speed up EfficientNet training on AWS with the SageMaker distributed data parallel library, Towards Data Science (January 12, 2022)
- Hyundai reduces ML model training time for autonomous driving models using Amazon SageMaker, AWS Machine Learning Blog (June 25, 2021)
- Distributed Training: Train BART/T5 for Summarization using Transformers and Amazon SageMaker, the Hugging Face website (April 8, 2021)
The SageMaker model parallelism library
- New performance improvements in the Amazon SageMaker model parallelism library, AWS Machine Learning Blog (December 16, 2022)
- Train gigantic models with near-linear scaling using sharded data parallelism on Amazon SageMaker, AWS Machine Learning Blog (October 31, 2022)
PyTorch Examples
The SageMaker data parallelism library
The SageMaker model parallelism library
TensorFlow Examples
The SageMaker data parallelism library
The SageMaker model parallelism library
HuggingFace Examples
The following HuggingFace on SageMaker examples are available in the HuggingFace notebooks repository.
The SageMaker data parallelism library
The SageMaker model parallelism library
How to Access or Download the SageMaker Distributed Training Notebook Examples
Use one of the following options to access or download the SageMaker distributed training example notebooks.
Option 1: Use a SageMaker notebook instance
To use the aforementioned examples, we recommend that you use an Amazon SageMaker notebook instance. A notebook instance runs Jupyter Notebook and JupyterServer apps on Amazon EC2 instances, which are optimized for machine learning. If you do not have an active notebook instance, follow the instructions in Create a Notebook Instance in the SageMaker developer guide to create one.
After you have created an instance, go to the Notebook instances page of the SageMaker console and do the following:
- Open JupyterLab.
- Select the examples icon in the left tray.
- Browse the examples for Training and look for notebooks titled Distributed Data Parallel or Distributed Model Parallel.
Option 2: Clone the SageMaker example repository to SageMaker Studio or a notebook instance
To download and use the aforementioned example notebooks, do the following to clone the example GitHub repositories:
- Open a terminal.
- In the command line, navigate to the SageMaker folder:
    cd SageMaker
- Clone the SageMaker examples GitHub repository:
    git clone https://github.com/aws/amazon-sagemaker-examples.git
  Note: To download the HuggingFace example notebooks, clone the HuggingFace notebooks GitHub repository:
    git clone https://github.com/huggingface/notebooks huggingface-notebooks
- In the JupyterLab interface, navigate into the amazon-sagemaker-examples folder.
- In the training/distributed_training folder, there are folders for frameworks, and in each of these, there are folders for data_parallel and model_parallel. Choose an example and follow its instructions to launch distributed training with a SageMaker distributed training library.
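Once the repository is cloned, it can help to list the available notebooks from the terminal before opening them in JupyterLab. The following is a minimal sketch, not part of any AWS tooling: `list_distributed_examples` is a hypothetical helper name, and it assumes the repository still uses the training/distributed_training/<framework>/<library> layout described above, which may change between repository versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list example notebooks for one distributed training library.
# Assumed layout (taken from the steps above, not verified against the live repo):
#   <repo>/training/distributed_training/<framework>/<data_parallel|model_parallel>/...
list_distributed_examples() {
  local repo_root="$1"   # path to the cloned amazon-sagemaker-examples folder
  local library="$2"     # data_parallel or model_parallel
  local base="$repo_root/training/distributed_training"

  if [ ! -d "$base" ]; then
    echo "No training/distributed_training folder under $repo_root" >&2
    return 1
  fi

  # Print notebooks located under the chosen library folder,
  # skipping the checkpoint copies that Jupyter creates.
  find "$base" -type f -name '*.ipynb' -path "*/$library/*" \
    ! -path '*/.ipynb_checkpoints/*' | sort
}
```

For example, after sourcing the function, `list_distributed_examples ~/SageMaker/amazon-sagemaker-examples data_parallel` prints the data parallel notebooks for every framework folder, one path per line. If the folder layout differs from the one assumed here, the function prints nothing or reports that the folder is missing.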