Amazon SageMaker Distributed Training Notebook Examples

The following case studies and notebooks provide examples of implementing the SageMaker distributed training libraries for the supported deep learning frameworks (PyTorch, TensorFlow, and HuggingFace) and models, such as CNN and MaskRCNN for vision, and BERT for natural language processing.

These notebooks are provided in the SageMaker examples GitHub repository. You can also browse them on the SageMaker examples website.

The examples are set up to use ml.p3.16xlarge instances for the worker nodes, but you can also choose the ml.p3dn.24xlarge or ml.p4d.24xlarge instance types, for which the SageMaker distributed training libraries are optimized. You can test the notebooks on a single-node cluster; however, to see the performance improvements shown in the Training Benchmarks section, use a cluster of two or more nodes. Each example calls out the section in which you modify this configuration.
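As a rough illustration of the cluster configuration described above, the following sketch builds the instance-related keyword arguments that a notebook would typically pass to a framework estimator. The helper name `cluster_kwargs` is hypothetical, and the `distribution` dictionary reflects our understanding of how the SageMaker Python SDK enables the distributed data parallel library; check the notebook you are running for the exact arguments it uses.

```python
# Instance types named in this section; other types may work but are not assumed here.
INSTANCE_TYPES = ("ml.p3.16xlarge", "ml.p3dn.24xlarge", "ml.p4d.24xlarge")


def cluster_kwargs(instance_type="ml.p3.16xlarge", instance_count=2):
    """Build cluster-related keyword arguments for a framework estimator.

    Hypothetical helper for illustration only; it is not part of any SDK.
    """
    if instance_type not in INSTANCE_TYPES:
        raise ValueError(f"{instance_type!r} is not one of {INSTANCE_TYPES}")
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "instance_type": instance_type,
        # One node is enough for a smoke test; use two or more to see a speedup.
        "instance_count": instance_count,
        # Assumed option for turning on the distributed data parallel library.
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }


kwargs = cluster_kwargs("ml.p4d.24xlarge", instance_count=2)
print(kwargs)

# With the SageMaker Python SDK installed, these arguments could be combined
# with your own entry point, IAM role, and framework version, for example:
#   from sagemaker.pytorch import PyTorch
#   estimator = PyTorch(entry_point="train.py", role=role,
#                       framework_version=framework_version,
#                       py_version=py_version, **kwargs)
```

Keeping the instance settings in one place makes it easy to switch between a single-node test run and a multi-node benchmark run without touching the rest of the notebook.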

Blogs and Case Studies

The following blogs discuss case studies about using the SageMaker distributed training libraries.

PyTorch Examples

SageMaker Distributed Data Parallel

SageMaker Distributed Model Parallel

TensorFlow Examples

SageMaker Distributed Data Parallel

SageMaker Distributed Model Parallel

HuggingFace Examples

The following HuggingFace on SageMaker examples are available in the HuggingFace notebooks repository.

SageMaker Distributed Data Parallel

SageMaker Distributed Model Parallel

How to Access or Download the SageMaker Distributed Training Notebook Examples

Follow the instructions below to access or download the SageMaker distributed training example notebooks.

Option 1: Use a SageMaker notebook instance

To use the aforementioned examples, we recommend that you use an Amazon SageMaker notebook instance. A notebook instance runs Jupyter Notebook and JupyterServer apps on Amazon EC2 instances, which are optimized for machine learning. If you do not have an active notebook instance, follow the instructions in Create a Notebook Instance in the SageMaker developer guide to create one.

After you have created a notebook instance, go to the Notebook instances page of the SageMaker console and do the following:

  1. Open JupyterLab.

  2. Select the examples icon in the left tray.

  3. Browse the examples for Training and look for notebooks titled Distributed Data Parallel or Distributed Model Parallel.

Option 2: Clone the SageMaker examples repository to SageMaker Studio or a notebook instance

To download and use the aforementioned example notebooks, do the following to clone the example GitHub repositories:

  1. Open a terminal.

  2. In the command line, navigate to the SageMaker folder.

    cd SageMaker

  3. Clone the SageMaker examples GitHub repository.

    git clone https://github.com/aws/amazon-sagemaker-examples.git

    Note: To download the HuggingFace example notebooks, clone the HuggingFace notebooks GitHub repository:

    git clone https://github.com/huggingface/notebooks huggingface-notebooks

  4. In the JupyterLab interface, navigate into the amazon-sagemaker-examples folder.

  5. The training/distributed_training folder contains a folder for each framework, and each framework folder contains data_parallel and model_parallel folders. Choose an example and follow its instructions to launch distributed training with a SageMaker distributed training library.
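
To get a quick overview of what is available after cloning, a short script can list the notebooks under the folder layout described in the last step. This is a minimal sketch: the function name `list_example_notebooks` is hypothetical, and it assumes the repository keeps the training/distributed_training/<framework>/<data_parallel|model_parallel> layout described above, which may change over time.

```python
from pathlib import Path


def list_example_notebooks(repo_root, library="data_parallel"):
    """Return notebook paths (relative to training/distributed_training)
    that sit under a folder named after the given library."""
    base = Path(repo_root) / "training" / "distributed_training"
    if not base.is_dir():
        # Repository not cloned here, or the layout differs from the assumption.
        return []
    found = []
    for path in base.rglob("*.ipynb"):
        parts = path.relative_to(base).parts
        # Skip Jupyter autosave copies and keep only the requested library.
        if ".ipynb_checkpoints" in parts or library not in parts:
            continue
        found.append(path.relative_to(base).as_posix())
    return sorted(found)


if __name__ == "__main__":
    # Run from the directory that contains the cloned repository.
    for notebook in list_example_notebooks("amazon-sagemaker-examples"):
        print(notebook)
```

Pass `library="model_parallel"` to list the model parallel examples instead.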