Training

This section shows how to run training on AWS Deep Learning Containers for Amazon EC2 using MXNet, PyTorch, TensorFlow, and TensorFlow 2.

For a complete list of Deep Learning Containers, refer to Deep Learning Containers Images.

Note

MKL users: Read the AWS Deep Learning Containers Intel Math Kernel Library (MKL) Recommendations to get the best training or inference performance.

TensorFlow training

After you log in to your Amazon EC2 instance, you can run TensorFlow and TensorFlow 2 containers with the following commands. You must use nvidia-docker for GPU images.

  • For CPU-based training, run the following.

    $ docker run -it <CPU training container>
  • For GPU-based training, run the following.

    $ nvidia-docker run -it <GPU training container>

The previous command runs the container in interactive mode and provides a shell prompt inside the container. You can then run the following to import TensorFlow.

$ python
>>> import tensorflow

Press Ctrl+D to return to the bash prompt. Run the following to begin training:

$ git clone https://github.com/fchollet/keras.git
$ cd keras
$ python examples/mnist_cnn.py
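
Before starting a long run, it can help to confirm that the Python interpreter in the container sees TensorFlow and, on GPU images, the GPUs. The following is a minimal sketch, not part of the container tooling. The helper name tensorflow_status is hypothetical, and the GPU query assumes tf.config.list_physical_devices, which is available in TensorFlow 2.1 and later; on older versions the sketch reports no GPUs rather than failing.

```python
import importlib.util


def tensorflow_status():
    """Report whether TensorFlow is importable and which GPUs it sees.

    Hypothetical helper for illustration only.
    """
    if importlib.util.find_spec("tensorflow") is None:
        return {"installed": False, "version": None, "gpus": []}

    import tensorflow as tf

    # Assumption: list_physical_devices exists (TensorFlow 2.1 and later).
    # If it is missing, fall back to an empty GPU list.
    config = getattr(tf, "config", None)
    list_devices = getattr(config, "list_physical_devices", None)
    gpus = [device.name for device in list_devices("GPU")] if list_devices else []
    return {"installed": True, "version": tf.__version__, "gpus": gpus}


if __name__ == "__main__":
    print(tensorflow_status())
```

An empty gpus list inside a GPU image usually suggests the container was started without GPU access, for example with docker instead of nvidia-docker.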

Next steps

To learn about inference on Amazon EC2 using TensorFlow with Deep Learning Containers, see TensorFlow Inference.

MXNet training

To begin training with MXNet from your Amazon EC2 instance, use one of the following commands to start the container:

  • For CPU

    $ docker run -it <CPU training container>
  • For GPU

    $ nvidia-docker run -it <GPU training container>

In the terminal of the container, run the following to begin training.

  • For CPU

    $ git clone -b v1.4.x https://github.com/apache/incubator-mxnet.git
    $ python incubator-mxnet/example/image-classification/train_mnist.py
  • For GPU

    $ git clone -b v1.4.x https://github.com/apache/incubator-mxnet.git
    $ python incubator-mxnet/example/image-classification/train_mnist.py --gpus 0

MXNet training with GluonCV

In the terminal of the container, run the following to begin training using GluonCV. GluonCV v0.6.0 is included in the Deep Learning Containers.

  • For CPU

    $ git clone -b v0.6.0 https://github.com/dmlc/gluon-cv.git
    $ python gluon-cv/scripts/classification/cifar/train_cifar10.py --model resnet18_v1b
  • For GPU

    $ git clone -b v0.6.0 https://github.com/dmlc/gluon-cv.git
    $ python gluon-cv/scripts/classification/cifar/train_cifar10.py --num-gpus 1 --model resnet18_v1b

Next steps

To learn about inference on Amazon EC2 using MXNet with Deep Learning Containers, see MXNet Inference.

PyTorch training

To begin training with PyTorch from your Amazon EC2 instance, use the following commands to run the container. You must use nvidia-docker for GPU images.

  • For CPU

    $ docker run -it <CPU training container>
  • For GPU

    $ nvidia-docker run -it <GPU training container>
  • If you have docker-ce version 19.03 or later, you can use the --gpus flag with docker:

    $ docker run -it --gpus all <GPU training container>
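
The choice between nvidia-docker and the --gpus flag described above depends only on the Docker version, so it can be scripted. The sketch below is illustrative: the function names supports_gpus_flag and gpu_run_command are hypothetical, the version string is assumed to look like the output of docker --version, and the value all passed to --gpus is an assumption (it exposes every GPU on the host).

```python
import re


def supports_gpus_flag(version_output):
    """Return True if the reported Docker version is 19.03 or later.

    `version_output` is assumed to resemble the output of `docker --version`,
    for example "Docker version 19.03.8, build afacb8b".
    """
    match = re.search(r"(\d+)\.(\d+)", version_output)
    if match is None:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (19, 3)


def gpu_run_command(image, version_output):
    """Build the GPU launch command as an argument list (hypothetical helper)."""
    if supports_gpus_flag(version_output):
        # Assumption: expose all GPUs on the host to the container.
        return ["docker", "run", "-it", "--gpus", "all", image]
    # Older Docker versions fall back to the nvidia-docker wrapper.
    return ["nvidia-docker", "run", "-it", image]


if __name__ == "__main__":
    print(" ".join(gpu_run_command("<GPU training container>",
                                   "Docker version 19.03.8, build afacb8b")))
```

Returning an argument list rather than a single string keeps the result usable with subprocess.run without shell quoting concerns.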

Run the following to begin training.

  • For CPU

    $ git clone https://github.com/pytorch/examples.git
    $ python examples/mnist/main.py --no-cuda
  • For GPU

    $ git clone https://github.com/pytorch/examples.git
    $ python examples/mnist/main.py
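
The only difference between the CPU and GPU invocations above is the --no-cuda flag, so a script can pick the right one by asking PyTorch whether CUDA is usable. This is a rough sketch under that assumption; cuda_available and mnist_command are hypothetical names, and the script path assumes the examples repository was cloned into the current directory.

```python
import importlib.util


def cuda_available():
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


def mnist_command(use_cuda):
    """Build the MNIST example command for CPU or GPU (hypothetical helper)."""
    command = ["python", "examples/mnist/main.py"]
    if not use_cuda:
        # The example script trains on the CPU when --no-cuda is passed.
        command.append("--no-cuda")
    return command


if __name__ == "__main__":
    print(" ".join(mnist_command(cuda_available())))
```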

PyTorch distributed GPU training with NVIDIA Apex

NVIDIA Apex is a PyTorch extension with utilities for mixed precision and distributed training. For more information on the utilities offered with Apex, see the NVIDIA Apex website. Apex is currently supported by Amazon EC2 instances in the following families:

To begin distributed training using NVIDIA Apex, run the following in the terminal of the GPU training container. This example requires at least two GPUs on your Amazon EC2 instance to run parallel distributed training.

$ git clone https://github.com/NVIDIA/apex.git && cd apex
$ python -m torch.distributed.launch --nproc_per_node=2 examples/simple/distributed/distributed_data_parallel.py
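
Because the example needs at least two GPUs, a short pre-flight check can fail early with a clear message instead of partway through the launch. The sketch below is one possible way to do that; visible_gpu_count and launch_args are hypothetical helpers, and the script path assumes the current directory is the cloned apex repository.

```python
import importlib.util


def visible_gpu_count():
    """Return the number of CUDA devices PyTorch can see (0 if torch is missing)."""
    if importlib.util.find_spec("torch") is None:
        return 0
    import torch

    return torch.cuda.device_count()


def launch_args(gpu_count, nproc_per_node=2):
    """Build the distributed launch command, or raise if there are too few GPUs."""
    if gpu_count < nproc_per_node:
        raise RuntimeError(
            f"need at least {nproc_per_node} GPUs, found {gpu_count}"
        )
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        "examples/simple/distributed/distributed_data_parallel.py",
    ]


if __name__ == "__main__":
    try:
        print(" ".join(launch_args(visible_gpu_count())))
    except RuntimeError as error:
        print(f"cannot start distributed training: {error}")
```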

Next steps

To learn about inference on Amazon EC2 using PyTorch with Deep Learning Containers, see PyTorch Inference.