SageMaker HyperPod health-monitoring agent
The SageMaker HyperPod health-monitoring agent continuously monitors the health status of each GPU-based or Trainium-based instance. When it detects an instance or GPU failure, the agent marks the instance as unhealthy.
Health checks done by the SageMaker HyperPod health-monitoring agent
The SageMaker HyperPod health-monitoring agent checks the following.
NVIDIA GPUs
- Errors in the nvidia-smi output
- Various errors in the logs generated by the Amazon Elastic Compute Cloud (EC2) platform
- GPU count validation: if there is a mismatch between the expected number of GPUs for a particular instance type (for example, 8 GPUs for the ml.p5.48xlarge instance type) and the count returned by nvidia-smi, the health-monitoring agent (HMA) reboots the node. A sketch of this kind of check follows this list.
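The agent's internal implementation is not public, so the following is only a minimal sketch of a GPU count validation written in Python. It assumes nvidia-smi is on the PATH, takes the expected count from the caller (8 here, matching ml.p5.48xlarge), and replaces the reboot action with a log message.

```python
import subprocess

def gpu_count_matches(expected_gpus: int) -> bool:
    """Compare the GPU count reported by nvidia-smi with the expected count.

    Sketch only: the HyperPod health-monitoring agent performs an equivalent
    validation internally; this is not its actual code.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    detected = len([line for line in result.stdout.splitlines() if line.strip()])
    if detected != expected_gpus:
        # The real agent reboots the node on a mismatch; the sketch only reports it.
        print(f"GPU count mismatch: expected {expected_gpus}, nvidia-smi reports {detected}")
        return False
    return True

if __name__ == "__main__":
    # 8 is the documented GPU count for ml.p5.48xlarge.
    gpu_count_matches(8)
```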
AWS Trainium
- Errors in the output from the AWS Neuron monitor
- Outputs generated by the Neuron node problem detector (for more information about the AWS Neuron node problem detector, see Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters)
- Various errors in the logs generated by the Amazon EC2 platform
- Neuron device count validation: if there is a mismatch between the expected number of Neuron devices for a particular instance type and the count returned by neuron-ls, HMA reboots the node. A similar sketch follows this list.
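Again, this is a sketch rather than the agent's code. The agent compares against the count returned by neuron-ls; the sketch below approximates the same validation by counting the /dev/neuron<N> character devices that the Neuron driver exposes on Trainium instances, and it replaces the reboot action with a log message.

```python
import glob
import re

def neuron_device_count_matches(expected_devices: int) -> bool:
    """Approximate the Neuron device count validation described above.

    Sketch only: the health-monitoring agent compares against neuron-ls;
    here we count the /dev/neuron<N> devices created by the Neuron driver,
    which reflect the same hardware.
    """
    devices = [p for p in glob.glob("/dev/neuron*") if re.fullmatch(r"/dev/neuron\d+", p)]
    detected = len(devices)
    if detected != expected_devices:
        # The real agent reboots the node on a mismatch; the sketch only reports it.
        print(f"Neuron device count mismatch: expected {expected_devices}, found {detected}")
        return False
    return True

if __name__ == "__main__":
    # Example: a trn1.32xlarge instance has 16 Neuron devices.
    neuron_device_count_matches(16)
```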
Logs generated by the SageMaker HyperPod health-monitoring agent
The SageMaker HyperPod health-monitoring agent is an out-of-the-box health check feature that runs continuously on all HyperPod clusters. The agent publishes the health events it detects on GPU or Trainium (Trn) instances to CloudWatch, under the cluster log group /aws/sagemaker/Clusters/.
The detection logs from the HyperPod health-monitoring agent are written to a separate log stream named SagemakerHealthMonitoringAgent for each node. You can query the detection logs with CloudWatch Logs Insights as follows.
fields @timestamp, @message | filter @message like /HealthMonitoringAgentDetectionEvent/
This should return an output similar to the following.
2024-08-21T11:35:35.532-07:00 {"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z","reason":"XidHardwareFailure","message":"Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
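If you prefer to run the same query programmatically, the following sketch uses the boto3 CloudWatch Logs client. The log group name is a placeholder; substitute your cluster's log group under /aws/sagemaker/Clusters/.

```python
import time

import boto3

# Placeholder: substitute your cluster's log group under /aws/sagemaker/Clusters/.
LOG_GROUP = "/aws/sagemaker/Clusters/<your-cluster-log-group>"

# Same Logs Insights query as shown above.
QUERY = "fields @timestamp, @message | filter @message like /HealthMonitoringAgentDetectionEvent/"

def query_detection_events(hours: int = 24):
    """Run the Logs Insights query and return matching health-detection events."""
    logs = boto3.client("logs")
    now = int(time.time())
    query_id = logs.start_query(
        logGroupName=LOG_GROUP,
        startTime=now - hours * 3600,
        endTime=now,
        queryString=QUERY,
    )["queryId"]

    # Poll until the query finishes, then return its results.
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] in ("Complete", "Failed", "Cancelled"):
            return response["results"]
        time.sleep(1)

if __name__ == "__main__":
    for row in query_detection_events():
        print(row)
```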