Creating a GPU Workload AMI

To run GPU workloads on your AWS Batch compute resources, you can start with the Deep Learning AMI (Amazon Linux) as a base AMI and configure it to run AWS Batch jobs.

The Deep Learning AMI is based on Amazon Linux, so you can install the ecs-init package to make it compatible as an AWS Batch compute resource AMI. The nvidia-docker2 RPM installs the components that Docker containers in AWS Batch jobs need in order to access the GPUs on supported instance types.
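
When the finished AMI is working correctly, both components are easy to verify from the instance itself. As a quick sanity check (assuming the package names used in the procedure below), the following commands confirm that the packages are installed and that Docker has the nvidia runtime registered:

    # Verify that ecs-init and nvidia-docker2 are installed
    rpm -q ecs-init nvidia-docker2

    # Verify that Docker lists nvidia as a registered (and default) runtime
    sudo docker info | grep -i runtime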

To configure the Deep Learning AMI for AWS Batch

  1. Launch a GPU instance type (for example, P3) with the Deep Learning AMI (Amazon Linux) in a region that AWS Batch supports.
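
    If you use the AWS CLI, the launch command looks similar to the following. The AMI ID, key pair, and security group shown here are placeholders; look up the current Deep Learning AMI (Amazon Linux) ID for your Region before launching.

    aws ec2 run-instances \
        --region us-east-1 \
        --image-id ami-0123456789abcdef0 \
        --instance-type p3.2xlarge \
        --key-name my-key-pair \
        --security-group-ids sg-0123456789abcdef0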

  2. Connect to your instance with SSH. For more information, see Connecting to Your Linux Instance Using SSH in the Amazon EC2 User Guide for Linux Instances.
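
    For example, the Deep Learning AMI (Amazon Linux) uses the ec2-user login name; substitute your own key pair and your instance's public DNS name.

    ssh -i my-key-pair.pem ec2-user@ec2-198-51-100-1.compute-1.amazonaws.com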

  3. With your favorite text editor, create a file called configure-gpu.sh with the following contents:

    #!/bin/bash
    # Install ecs-init, start docker, and install nvidia-docker 2
    sudo yum install -y ecs-init
    sudo service docker start
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
      sudo tee /etc/yum.repos.d/nvidia-docker.repo
    sudo yum install -y nvidia-docker2
    sudo pkill -SIGHUP dockerd

    # Run test container to verify installation
    sudo docker run --privileged --runtime=nvidia --rm nvidia/cuda nvidia-smi

    # Update Docker daemon.json to use nvidia-container-runtime by default
    sudo tee /etc/docker/daemon.json <<EOF
    {
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        },
        "default-runtime": "nvidia"
    }
    EOF

    sudo service docker restart
  4. Run the script.

    bash ./configure-gpu.sh
  5. Validate that you can run a Docker container and access the installed drivers with the following command.

    sudo docker run nvidia/cuda:latest nvidia-smi

    You should see something similar to the following output.

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 384.111                Driver Version: 384.111                   |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
    | N/A   32C    P0    20W / 300W |      0MiB / 16152MiB |      1%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
  6. Remove any Docker containers and images on the instance to reduce the size of your AMI.

    1. Remove containers.

      sudo docker rm $(sudo docker ps -aq)
    2. Remove images.

      sudo docker rmi $(sudo docker images -q)
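
    If you prefer a single cleanup command, Docker can also remove all stopped containers and unused images in one pass; this is an equivalent alternative to the two commands above.

      sudo docker system prune -a -f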
  7. If you started the Amazon ECS container agent on your instance, you must stop it and remove the persistent data checkpoint file before creating your AMI; otherwise, the agent will not start on instances that are launched from your AMI.

    1. Stop the Amazon ECS container agent.

      sudo stop ecs
    2. Remove the persistent data checkpoint file. By default, this file is located at /var/lib/ecs/data/ecs_agent_data.json. Use the following command to remove the file.

      sudo rm -rf /var/lib/ecs/data/ecs_agent_data.json
  8. Create a new AMI from your running instance. For more information, see Creating an Amazon EBS-Backed Linux AMI in the Amazon EC2 User Guide for Linux Instances.
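
    If you prefer the AWS CLI, the equivalent call is ec2 create-image; the instance ID below is a placeholder, and the name and description are examples you can change.

    aws ec2 create-image \
        --instance-id i-0123456789abcdef0 \
        --name "aws-batch-gpu-ami" \
        --description "Deep Learning AMI configured as an AWS Batch compute resource AMI"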