Collect NVIDIA GPU metrics
You can use the CloudWatch agent to collect NVIDIA GPU metrics from Linux servers. To set this up, add an nvidia_gpu section inside the metrics_collected section of the CloudWatch agent configuration file. For more information, see Linux section.
Additionally, the instance must have an NVIDIA driver installed. NVIDIA drivers are pre-installed on some Amazon Machine Images (AMIs). Otherwise, you can manually install the driver. For more information, see Install NVIDIA drivers on Linux instances.
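As a minimal sketch of the configuration step described above, the following Python snippet builds an agent configuration fragment with an nvidia_gpu section nested under metrics_collected and prints it as JSON. The top-level metrics key and the measurement and metrics_collection_interval fields are assumed to follow the agent's usual configuration layout. The helper name build_gpu_config, the three metric names, and the 60-second interval are illustrative choices, not requirements. Check the result against the agent configuration reference before deploying it.

```python
import json


def build_gpu_config(measurements, interval_seconds=60):
    """Return a CloudWatch agent config fragment that enables nvidia_gpu collection.

    `measurements` is the list of GPU metric names to collect. The names
    passed in below are illustrative picks, and the 60-second default
    interval is an assumption.
    """
    return {
        "metrics": {
            "metrics_collected": {
                "nvidia_gpu": {
                    "measurement": list(measurements),
                    "metrics_collection_interval": interval_seconds,
                }
            }
        }
    }


config = build_gpu_config(["utilization_gpu", "temperature_gpu", "memory_used"])
print(json.dumps(config, indent=2))
```

The printed JSON can be merged into an existing agent configuration file. Generating it from code like this is only one option; editing the file by hand works equally well.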
The following metrics can be collected. All of these metrics are collected with no CloudWatch Unit, but you can specify a unit for each metric by adding a parameter to the CloudWatch agent configuration file. For more information, see Linux section.
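One way to attach units, sketched here under the assumption that each measurement entry may be written as an object with name and unit fields instead of a plain string. The UNITS mapping and the with_units helper are hypothetical, and the unit values shown are standard CloudWatch unit names chosen for illustration.

```python
import json

# Hypothetical mapping from GPU metric name to a CloudWatch unit.
# Assumption: the agent accepts {"name": ..., "unit": ...} measurement entries.
UNITS = {
    "utilization_gpu": "Percent",
    "memory_used": "Megabytes",
}


def with_units(units):
    """Turn a {metric: unit} mapping into measurement entries that carry a unit."""
    return [{"name": name, "unit": unit} for name, unit in units.items()]


gpu_section = {"nvidia_gpu": {"measurement": with_units(UNITS)}}
print(json.dumps(gpu_section, indent=2))
```

Metrics left out of the mapping would continue to be published without a unit.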
| Metric | Metric name in CloudWatch | Description |
|---|---|---|
| | | The percentage of time over the past sample period during which one or more kernels on the GPU were running. |
| | | The core GPU temperature in degrees Celsius. |
| | | The last measured power draw for the entire board, in watts. |
| | | The percentage of time over the past sample period during which global (device) memory was being read or written. |
| | | The percentage of maximum fan speed that the device's fan is currently intended to run at. |
| | | Reported total memory, in MB. |
| | | Memory used, in MB. |
| | | Memory free, in MB. |
| | | The current link generation. |
| | | The current link width. |
| | | Current number of encoder sessions. |
| | | The moving average of the encode frames per second. |
| | | The moving average of the encode latency in microseconds. |
| | | The current frequency of the graphics (shader) clock. |
| | | The current frequency of the Streaming Multiprocessor (SM) clock. |
| | | The current frequency of the memory clock. |
| | | The current frequency of the video (encoder plus decoder) clocks. |
All of these metrics are collected with the following dimensions:
| Dimension | Description |
|---|---|
| | A unique identifier for the GPU on this server. Represents the NVIDIA Management Library (NVML) index of the device. |
| | The type of GPU. |
| | The server host name. |
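Because every metric carries these three dimensions, a query for a single GPU has to supply all of them. The sketch below builds the identifying part of such a query in the Namespace / MetricName / Dimensions shape used by CloudWatch API requests. Several things here are assumptions to verify against your own published metrics: the dimension names index, name, and host, the CWAgent namespace, and the metric name nvidia_smi_utilization_gpu. The GPU name and host values are hypothetical, as is the helper gpu_metric_query.

```python
def gpu_metric_query(metric_name, gpu_index, gpu_name, host, namespace="CWAgent"):
    """Build the identifying part of a CloudWatch metric query for one GPU.

    Assumption: the agent publishes the dimensions as "index", "name",
    and "host", under the "CWAgent" namespace.
    """
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "index", "Value": str(gpu_index)},  # NVML device index
            {"Name": "name", "Value": gpu_name},         # GPU type
            {"Name": "host", "Value": host},             # server host name
        ],
    }


# Hypothetical GPU name and host values, for illustration only.
query = gpu_metric_query("nvidia_smi_utilization_gpu", 0, "Tesla T4", "ip-10-0-0-12")
print(query)
```

A dictionary like this could be combined with a time range, period, and statistic and passed to a CloudWatch API operation such as GetMetricStatistics.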