Monitor GPUs with CloudWatch - Deep Learning AMI

Monitor GPUs with CloudWatch

When you run GPU workloads on your DLAMI, you may want to track GPU usage during training or inference. This information can help you optimize your data pipeline and tune your deep learning network.

A utility called gpumon.py is preinstalled on your DLAMI. It integrates with CloudWatch and monitors per-GPU usage: GPU memory, GPU temperature, and GPU power. The script periodically sends the monitored data to CloudWatch. You can configure the granularity of the data sent to CloudWatch by changing a few settings in the script. Before starting the script, however, you need to set up CloudWatch to receive the metrics.
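
As a rough illustration of what periodic, per-GPU reporting involves, the sketch below shapes one GPU's readings into the entry format accepted by the CloudWatch PutMetricData API. It is not the code in gpumon.py: the function names, the keys of the readings dictionary, the metric names, and the dimension names are all assumptions made for this example. The publish step additionally assumes that the third-party boto3 package is installed and that credentials are already configured.

```python
from datetime import datetime, timezone


def build_metric_data(instance_id, gpu_index, readings,
                      storage_resolution=60, timestamp=None):
    """Return a list of CloudWatch MetricData entries for one GPU.

    `readings` is a hypothetical dict with the keys gpu_util, mem_util,
    power_watts, and temp_c (names chosen for this sketch only).
    """
    if timestamp is None:
        timestamp = datetime.now(timezone.utc)

    # Dimensions let CloudWatch separate the series per instance and per GPU.
    dimensions = [
        {"Name": "InstanceId", "Value": instance_id},
        {"Name": "GPUNumber", "Value": str(gpu_index)},
    ]

    # (metric name, key in `readings`, CloudWatch unit). CloudWatch has no
    # unit for watts or degrees Celsius, so those two use "None".
    spec = [
        ("GPU Usage", "gpu_util", "Percent"),
        ("Memory Usage", "mem_util", "Percent"),
        ("Power Usage (Watts)", "power_watts", "None"),
        ("Temperature (C)", "temp_c", "None"),
    ]

    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Timestamp": timestamp,
            "Unit": unit,
            "StorageResolution": storage_resolution,
            "Value": float(readings[key]),
        }
        for name, key, unit in spec
    ]


def publish(metric_data, namespace="DeepLearningTrain", region="us-east-1"):
    """Send the entries to CloudWatch (requires boto3 and credentials)."""
    import boto3  # third-party; imported lazily so the builder runs without it

    client = boto3.client("cloudwatch", region_name=region)
    client.put_metric_data(Namespace=namespace, MetricData=metric_data)


if __name__ == "__main__":
    sample = {"gpu_util": 87, "mem_util": 42.5, "power_watts": 120.3, "temp_c": 65}
    for entry in build_metric_data("i-0123456789abcdef0", 0, sample):
        print(entry["MetricName"], entry["Value"])
```

Keeping the payload construction separate from the network call makes it possible to inspect what would be sent before any credentials or permissions are in place.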

How to set up and run GPU monitoring with CloudWatch

  1. Create an IAM user, or modify an existing one, with a policy that allows publishing metrics to CloudWatch. If you create a new user, note the credentials; you will need them in the next step.

    The IAM permission to search for is “cloudwatch:PutMetricData”. The policy to add is as follows:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Action": [
            "cloudwatch:PutMetricData"
          ],
          "Effect": "Allow",
          "Resource": "*"
        }
      ]
    }

    Tip

    For more information on creating an IAM user and adding policies for CloudWatch, refer to the CloudWatch documentation.

  2. On your DLAMI, run aws configure and specify the IAM user credentials.

    $ aws configure
  3. You might need to make some modifications to the gpumon utility before you run it. You can find the gpumon utility and README in the location defined in the following code block. For more information on the gpumon.py script, see the Amazon S3 location of the script.

    Folder: ~/tools/GPUCloudWatchMonitor
    Files:  ~/tools/GPUCloudWatchMonitor/gpumon.py
            ~/tools/GPUCloudWatchMonitor/README

    Options:

    • Change the region in gpumon.py if your instance is NOT in us-east-1.

    • Change other parameters, such as the CloudWatch namespace, or the reporting period (set with store_reso).

  4. Currently, the script supports only Python 3. Activate your preferred framework’s Python 3 environment, or activate the DLAMI general Python 3 environment.

    $ source activate python3
  5. Run the gpumon utility in the background.

    (python3)$ python gpumon.py &
  6. Open your browser to https://console.aws.amazon.com/cloudwatch/, then select Metrics. The GPU metrics appear under the namespace 'DeepLearningTrain'.

    Tip

    You can change the namespace by modifying gpumon.py. You can also modify the reporting interval by adjusting store_reso.
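
Before running the script, it can help to check the policy from step 1 and the settings from step 3. The following is a minimal sketch of such a pre-flight check, not part of gpumon.py. The helper names are hypothetical, and the policy check is deliberately simple: it looks only for an exact "cloudwatch:PutMetricData" entry in an Allow statement and ignores wildcards, Deny statements, and conditions. The settings check relies on two CloudWatch rules: StorageResolution accepts only 1 or 60, and custom namespaces should not begin with "AWS/".

```python
import json

# CloudWatch accepts only these StorageResolution values (seconds).
VALID_STORAGE_RESOLUTIONS = (1, 60)


def policy_allows_put_metric_data(policy_json):
    """Return True if an Allow statement lists cloudwatch:PutMetricData."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        if "cloudwatch:PutMetricData" in actions:
            return True
    return False


def check_settings(region, namespace, store_reso):
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    if not region:
        problems.append("region is empty")
    if not namespace:
        problems.append("namespace is empty")
    elif namespace.startswith("AWS/"):
        problems.append("namespace must not start with 'AWS/'")
    if store_reso not in VALID_STORAGE_RESOLUTIONS:
        problems.append("store_reso must be 1 or 60")
    return problems


if __name__ == "__main__":
    policy = """
    {
      "Version": "2012-10-17",
      "Statement": [
        {"Action": ["cloudwatch:PutMetricData"], "Effect": "Allow", "Resource": "*"}
      ]
    }
    """
    print(policy_allows_put_metric_data(policy))
    print(check_settings("us-east-1", "DeepLearningTrain", 60))
```

A check like this only catches obvious mistakes locally; whether CloudWatch accepts the metrics still depends on the credentials supplied to aws configure.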

The following is an example CloudWatch chart reporting on a run of gpumon.py monitoring a training job on a p2.8xlarge instance.


[Figure: GPU monitoring on CloudWatch]

You might be interested in these other topics on GPU monitoring and optimization: