
Training for Amazon EC2

This section shows how to run training on AWS Deep Learning Containers for Amazon EC2 using Apache MXNet (Incubating), PyTorch, TensorFlow, and TensorFlow 2.

For a complete list of Deep Learning Containers, refer to Deep Learning Containers images.

Note

MKL users: Read the AWS Deep Learning Containers Intel Math Kernel Library (MKL) recommendations to get the best training or inference performance.

TensorFlow training

After you log in to your Amazon EC2 instance, you can run TensorFlow and TensorFlow 2 containers with the following commands. You must use nvidia-docker for GPU images.

  • For CPU-based training, run the following.

    $ docker run -it <CPU training container>
  • For GPU-based training, run the following.

    $ nvidia-docker run -it <GPU training container>

The previous command runs the container in interactive mode and provides a shell prompt inside the container. You can then run the following to import TensorFlow.

$ python
>>> import tensorflow

Press Ctrl+D to return to the bash prompt. Run the following to begin training:

$ git clone https://github.com/fchollet/keras.git
$ cd keras
$ python examples/mnist_cnn.py
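To confirm that training works end to end inside the container without cloning a repository, a short self-contained script can be used instead. This is a minimal sketch assuming a TensorFlow 2.x container; it trains a small dense network on synthetic data rather than on MNIST.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for MNIST: 256 grayscale 28x28 images, 10 classes.
x = np.random.rand(256, 28, 28).astype("float32")
y = np.random.randint(0, 10, size=(256,))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# One epoch is enough to verify the framework and container are working.
history = model.fit(x, y, epochs=1, batch_size=32, verbose=0)
print("final loss:", history.history["loss"][-1])
```

If this script completes and prints a finite loss, the container's TensorFlow installation is functional and the repository example should run as well.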

Next steps

To learn inference on Amazon EC2 using TensorFlow with Deep Learning Containers, see TensorFlow inference.

Apache MXNet (Incubating) training

To begin training with Apache MXNet (Incubating) from your Amazon EC2 instance, use one of the following commands to run the container:

  • For CPU

    $ docker run -it <CPU training container>
  • For GPU

    $ nvidia-docker run -it <GPU training container>

In the terminal of the container, run the following to begin training.

  • For CPU

    $ git clone -b v1.4.x https://github.com/apache/incubator-mxnet.git
    $ python incubator-mxnet/example/image-classification/train_mnist.py
  • For GPU

    $ git clone -b v1.4.x https://github.com/apache/incubator-mxnet.git
    $ python incubator-mxnet/example/image-classification/train_mnist.py --gpus 0

MXNet training with GluonCV

In the terminal of the container, run the following to begin training using GluonCV. GluonCV v0.6.0 is included in the Deep Learning Containers.

  • For CPU

    $ git clone -b v0.6.0 https://github.com/dmlc/gluon-cv.git
    $ python gluon-cv/scripts/classification/cifar/train_cifar10.py --model resnet18_v1b
  • For GPU

    $ git clone -b v0.6.0 https://github.com/dmlc/gluon-cv.git
    $ python gluon-cv/scripts/classification/cifar/train_cifar10.py --num-gpus 1 --model resnet18_v1b

Next steps

To learn inference on Amazon EC2 using MXNet with Deep Learning Containers, see Apache MXNet (Incubating) inference.

PyTorch training

To begin training with PyTorch from your Amazon EC2 instance, use the following commands to run the container. You must use nvidia-docker for GPU images.

  • For CPU

    $ docker run -it <CPU training container>
  • For GPU

    $ nvidia-docker run -it <GPU training container>
  • If you have docker-ce version 19.03 or later, you can use the --gpus flag with Docker:

    $ docker run -it --gpus all <GPU training container>

Run the following to begin training.

  • For CPU

    $ git clone https://github.com/pytorch/examples.git
    $ python examples/mnist/main.py --no-cuda
  • For GPU

    $ git clone https://github.com/pytorch/examples.git
    $ python examples/mnist/main.py
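The examples/mnist/main.py script downloads MNIST and runs a full training loop. Its core structure can be sketched with synthetic data (a minimal sketch, not the actual script; the device selection below mirrors what the --no-cuda flag controls):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Mirror of the script's --no-cuda logic: fall back to CPU when no GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic stand-in for MNIST batches: 28x28 single-channel images, 10 classes.
x = torch.randn(256, 1, 28, 28, device=device)
y = torch.randint(0, 10, (256,), device=device)

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

losses = []
for epoch in range(2):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
print("losses:", losses)
```

Running this inside the container verifies that the PyTorch installation can train on the selected device before you launch the full MNIST example.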

PyTorch distributed GPU training with NVIDIA Apex

NVIDIA Apex is a PyTorch extension with utilities for mixed precision and distributed training. For more information about the utilities offered with Apex, see the NVIDIA Apex website. Apex is supported on select Amazon EC2 GPU instance families.

To begin distributed training using NVIDIA Apex, run the following in the terminal of the GPU training container. This example requires at least two GPUs on your Amazon EC2 instance to run parallel distributed training.

$ git clone https://github.com/NVIDIA/apex.git && cd apex
$ python -m torch.distributed.launch --nproc_per_node=2 examples/simple/distributed/distributed_data_parallel.py

Amazon S3 Plugin for PyTorch

Deep Learning Containers include a plugin that lets you use data from an Amazon S3 bucket for PyTorch training.

  1. To begin using the Amazon S3 plugin in Deep Learning Containers, verify that your Amazon EC2 instance has full access to Amazon S3. Create an IAM role that grants Amazon S3 access to an Amazon EC2 instance and attach the role to your instance. You can use the AmazonS3FullAccess or AmazonS3ReadOnlyAccess policies.

  2. Set up your AWS_REGION environment variable with the AWS Region of your choice.

    export AWS_REGION=us-east-1
  3. Use the following commands to run a container that's compatible with the Amazon S3 plugin. You must use nvidia-docker for GPU images.

    • For CPU

      docker run -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-cpu-py36-ubuntu18.04-v1.6
    • For GPU

      nvidia-docker run -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04-v1.7
  4. Run the following to test an example.

    git clone https://github.com/aws/amazon-s3-plugin-for-pytorch.git
    cd amazon-s3-plugin-for-pytorch/examples
    python s3_cv_iterable_shuffle_example.py

For more information and additional examples, see the Amazon S3 Plugin for PyTorch repository.
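What an S3-backed iterable dataset does under the hood (stream samples shard by shard and split the shards across DataLoader workers) can be sketched with plain PyTorch, using in-memory lists as stand-ins for S3 objects. The class name and shard layout here are illustrative, not the plugin's API.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class ShardStreamDataset(IterableDataset):
    """Streams samples shard by shard; each DataLoader worker
    reads a disjoint subset of shards instead of duplicating them.
    `shards` is an in-memory stand-in for a list of S3 object keys."""

    def __init__(self, shards):
        self.shards = shards

    def __iter__(self):
        info = get_worker_info()
        wid = info.id if info else 0
        nworkers = info.num_workers if info else 1
        # Round-robin shard assignment across workers.
        for shard in self.shards[wid::nworkers]:
            # With the real plugin, this is where the S3 object is fetched.
            yield from shard


# Four "objects" of three samples each.
shards = [[torch.tensor(i * 10 + j) for j in range(3)] for i in range(4)]
loader = DataLoader(ShardStreamDataset(shards), batch_size=4, num_workers=0)
batches = [b.tolist() for b in loader]
print(batches)  # [[0, 1, 2, 10], [11, 12, 20, 21], [22, 30, 31, 32]]
```

Streaming this way avoids downloading a whole dataset before training starts, which is the main benefit the Amazon S3 plugin provides for large datasets.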

Next steps

To learn inference on Amazon EC2 using PyTorch with Deep Learning Containers, see PyTorch inference.