Amazon SageMaker Distributed Training Notebook Examples

The following case studies and notebooks show how to use the SageMaker distributed training libraries with the supported deep learning frameworks (PyTorch, TensorFlow, and HuggingFace) and with models such as CNNs and Mask R-CNN for vision and BERT for natural language processing.

These notebooks are provided in the SageMaker examples GitHub repository. You can also browse them on the SageMaker examples website.
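
For orientation, the notebooks generally turn on one of the two libraries through the `distribution` argument of a SageMaker Python SDK framework estimator. The following is a minimal sketch of that dictionary only, assuming the SDK's `smdistributed` configuration keys. The helper name `smdistributed_config` and its default of 8 processes per host are illustrative assumptions, not part of any library, and the model parallel `parameters` depend on the library version used by a given notebook.

```python
from typing import Any, Dict, Optional


def smdistributed_config(
    library: str,
    parameters: Optional[Dict[str, Any]] = None,
    processes_per_host: int = 8,
) -> Dict[str, Any]:
    """Build a `distribution` dict for a SageMaker framework estimator.

    `library` is "dataparallel" or "modelparallel". This helper is
    hypothetical; it only assembles the dictionary and calls no AWS API.
    """
    if library == "dataparallel":
        # The data parallelism library needs only the enable flag.
        return {"smdistributed": {"dataparallel": {"enabled": True}}}
    if library == "modelparallel":
        # The model parallelism library is launched through MPI, so the
        # MPI section is enabled alongside it.
        return {
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": dict(parameters or {}),
                }
            },
            "mpi": {"enabled": True, "processes_per_host": processes_per_host},
        }
    raise ValueError(f"unknown library: {library!r}")
```

In a notebook, the returned dictionary would be passed as `distribution=...` when constructing an estimator such as `sagemaker.pytorch.PyTorch`. The remaining estimator arguments (entry point, role, instance type and count, framework version) come from the individual example.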

Blogs and Case Studies

The following blogs discuss case studies about using the SageMaker distributed training libraries.

The SageMaker data parallelism library

The SageMaker model parallelism library

PyTorch Examples

The SageMaker data parallelism library

The SageMaker model parallelism library

TensorFlow Examples

The SageMaker data parallelism library

The SageMaker model parallelism library

HuggingFace Examples

The following HuggingFace on SageMaker examples are available in the HuggingFace notebooks repository.

The SageMaker data parallelism library

The SageMaker model parallelism library

How to Access or Download the SageMaker Distributed Training Notebook Examples

Use one of the following options to access or download the SageMaker distributed training example notebooks.

Option 1: Use a SageMaker notebook instance

To use the examples above, we recommend an Amazon SageMaker notebook instance. A notebook instance runs the Jupyter Notebook and Jupyter Server apps on an Amazon EC2 instance that is optimized for machine learning. If you do not have an active notebook instance, follow the instructions in Create a Notebook Instance in the SageMaker developer guide to create one.

After you have created an instance, go to the Notebook instances page of the SageMaker console and do the following:

  1. Open JupyterLab.

  2. Select the examples icon in the left tray.

  3. Browse the examples for Training and look for notebooks titled Distributed Data Parallel or Distributed Model Parallel.

Option 2: Clone the SageMaker example repository to SageMaker Studio or a notebook instance

To download and use the example notebooks, clone the example GitHub repositories as follows:

  1. Open a terminal.

  2. In the command line, navigate to the SageMaker folder.

    cd SageMaker

  3. Clone the SageMaker examples GitHub repository.

    git clone https://github.com/aws/amazon-sagemaker-examples.git

    Note

    To download the HuggingFace example notebooks, clone the HuggingFace notebooks GitHub repository:

    git clone https://github.com/huggingface/notebooks huggingface-notebooks

  4. In the JupyterLab interface, navigate into the amazon-sagemaker-examples folder.

  5. The training/distributed_training folder contains a folder for each framework, and each framework folder contains data_parallel and model_parallel folders. Choose an example and follow its instructions to launch distributed training with a SageMaker distributed training library.
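
Once the repository is cloned, the folder layout described in the last step can be scanned to see which examples are available. The following is a rough sketch that assumes the layout training/distributed_training/<framework>/<data_parallel or model_parallel>/... as stated above. The function name `list_distributed_notebooks` is hypothetical, and the repository layout may differ from this description in later revisions.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple

LIBRARY_FOLDERS = ("data_parallel", "model_parallel")


def list_distributed_notebooks(repo_root: str) -> Dict[Tuple[str, str], List[str]]:
    """Group example notebooks by (framework, library) folder.

    Returns an empty dict if the expected folder does not exist.
    """
    base = Path(repo_root) / "training" / "distributed_training"
    if not base.is_dir():
        return {}

    found: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for notebook in sorted(base.rglob("*.ipynb")):
        parts = notebook.relative_to(base).parts
        # Skip Jupyter autosave copies.
        if ".ipynb_checkpoints" in parts:
            continue
        # Expect at least <framework>/<library>/<notebook>.
        if len(parts) < 3 or parts[1] not in LIBRARY_FOLDERS:
            continue
        framework, library = parts[0], parts[1]
        found[(framework, library)].append("/".join(parts[2:]))
    return dict(found)


if __name__ == "__main__":
    # Assumes the clone from step 3 sits in the current directory.
    for (framework, library), notebooks in list_distributed_notebooks(
        "amazon-sagemaker-examples"
    ).items():
        print(f"{framework} / {library}")
        for name in notebooks:
            print(f"  {name}")
```

Run from the SageMaker folder, this prints one group per framework and library, which can be quicker than clicking through the folders in JupyterLab.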