Troubleshooting cluster health metrics - AWS ParallelCluster

Troubleshooting cluster health metrics

Cluster health metrics are added to the AWS ParallelCluster Amazon CloudWatch dashboard starting with AWS ParallelCluster version 3.6.0. In the following sections, you can learn about the dashboard health metrics, and actions you can take to troubleshoot and resolve issues.

Seeing the Instance Provisioning Errors graph

A non-zero value in the Instance Provisioning Errors graph means that Amazon EC2 instances that back Slurm nodes failed to launch through the CreateFleet or RunInstances API.

Seeing IAMPolicyErrors

  • What happened?

    A number of instances failed to launch because of insufficient permissions. The error code is UnauthorizedOperation.

  • How to resolve?

    If you have configured a custom InstanceRole or InstanceProfile, check your IAM policies and verify that you are using the correct credentials.

    Check the clustermgtd file for static node error details. Check the slurm_resume.log file for dynamic node error details. Use the details to learn more about the missing permissions that must be added.

Seeing VcpuLimitErrors

  • What happened?

    AWS ParallelCluster failed to launch instances because it reached the vCPU limit on your AWS account for a specific Amazon EC2 instance type that you configured for cluster compute nodes.

  • How to resolve?

    Check for the VcpuLimitExceeded error in the clustermgtd file for static nodes, and check in the slurm_resume.log file for dynamic nodes to get additional details. To resolve this issue, you can request an increase to your vCPU limits. For more information about how to view current limits and request new limits, see Amazon Elastic Compute Cloud service quotas in the Amazon Elastic Compute Cloud User Guide for Linux Instances.

Seeing VolumeLimitErrors

  • What happened?

    Your AWS account has reached its Amazon EBS volume limit, so AWS ParallelCluster can't launch instances. The launch fails with error code InsufficientVolumeCapacity or VolumeLimitExceeded.

  • How to resolve?

    Check the clustermgtd file for static nodes, and check the slurm_resume.log file for dynamic nodes to get additional volume limit details. To resolve this issue, you can use a different AWS Region, clean up existing volumes, or contact the AWS Support Center to submit a request to increase your Amazon EBS volume limit.

Seeing InsufficientCapacityErrors

Seeing OtherInstanceLaunchFailures

  • What happened?

    An Amazon EC2 instance that backs a compute node failed to launch through the CreateFleet or RunInstances API.

  • How to resolve?

    Check the clustermgtd file for static nodes, and check the slurm_resume.log file for dynamic nodes to get error details.
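The error codes named in the sections above can be tallied from a log file to see which dashboard metric a failure most likely belongs to. The following Python sketch is illustrative only: the function names and the lookup table are hypothetical helpers, not part of AWS ParallelCluster. It also assumes that the error code appears verbatim on the log line. Codes that this page doesn't name are left uncounted.

```python
from collections import Counter

# Error codes quoted on this page, mapped to the dashboard metric they are
# described under. This table is an illustrative assumption, not an
# exhaustive or official mapping; unlisted codes are not counted.
ERROR_CODE_TO_METRIC = {
    "UnauthorizedOperation": "IAMPolicyErrors",
    "VcpuLimitExceeded": "VcpuLimitErrors",
    "InsufficientVolumeCapacity": "VolumeLimitErrors",
    "VolumeLimitExceeded": "VolumeLimitErrors",
}


def tally_launch_errors(log_lines):
    """Count log lines per dashboard metric, based on the error code they contain."""
    counts = Counter()
    for line in log_lines:
        for code, metric in ERROR_CODE_TO_METRIC.items():
            if code in line:
                counts[metric] += 1
                break  # count each line at most once
    return counts


def tally_log_file(path):
    """Apply tally_launch_errors to a log file on disk."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return tally_launch_errors(handle)
```

For example, calling tally_log_file("/var/log/parallelcluster/clustermgtd") on the head node would summarize static node launch errors. Run it against the slurm_resume.log file for dynamic nodes.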

Seeing the Unhealthy Instance Errors graph

Seeing InstanceBootstrapTimeoutError

  • What happened?

    An instance can't join the cluster within the resume_timeout (for dynamic nodes) or node_replacement_timeout (for static nodes). This can occur if the network isn't configured correctly for the compute nodes, or it can occur if custom scripts running on the compute node take too long to finish.

  • How to resolve?

    For dynamic nodes, check the clustermgtd log (/var/log/parallelcluster/clustermgtd) for the compute node IP address and errors such as the following:

    Node bootstrap error: Resume timeout expires for node

    For static nodes, check the clustermgtd log (/var/log/parallelcluster/clustermgtd) for the compute node IP address and errors such as the following:

    Node bootstrap error: Replacement timeout expires for node ... in replacement.

    For additional details, check the /var/log/cloud-init-output.log file for errors. You can retrieve problematic compute node IP addresses from the clustermgtd and slurm_resume log files.
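The log search described above can be sketched in Python. This example is a minimal illustration under two assumptions: the two messages appear in the log exactly as quoted on this page, and the compute node IP address is written on the same line as the message. The function name and the generic IPv4 pattern are hypothetical and aren't part of AWS ParallelCluster.

```python
import re

# The two messages quoted on this page; the rest of the line is not parsed.
BOOTSTRAP_ERROR = re.compile(
    r"Node bootstrap error: (Resume|Replacement) timeout expires for node"
)
# Generic dotted-quad matcher (assumption); it does not validate octet ranges.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_bootstrap_timeouts(log_lines):
    """Return (kind, ip_addresses, line) tuples for bootstrap timeout lines.

    kind is "dynamic" for Resume timeouts and "static" for Replacement timeouts.
    """
    results = []
    for line in log_lines:
        match = BOOTSTRAP_ERROR.search(line)
        if match is None:
            continue
        kind = "dynamic" if match.group(1) == "Resume" else "static"
        results.append((kind, IPV4.findall(line), line.rstrip("\n")))
    return results
```

To use it, open /var/log/parallelcluster/clustermgtd and pass the file object to find_bootstrap_timeouts. You can then use the returned IP addresses to locate the compute nodes whose /var/log/cloud-init-output.log file you want to inspect.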

Seeing EC2HealthCheckErrors

Seeing ScheduledEventHealthCheckErrors

  • What happened?

    An instance failed an Amazon EC2 scheduled event health check, and it's unhealthy.

  • How to resolve?

    For information about how to troubleshoot this issue, see Scheduled events for your instances.

Seeing NoCorrespondingInstanceErrors

  • What happened?

    AWS ParallelCluster can't find the instances that back the nodes. The nodes have likely self-terminated during bootstrap operations. Errors in a SlurmQueues / CustomActions / OnNodeStart | OnNodeConfigured script, or network errors, can produce NoCorrespondingInstanceErrors.

  • How to resolve?

    For additional details, check the /var/log/cloud-init-output.log for the compute node.

Seeing the Compute Fleet Idle Time graph

Seeing a MaxDynamicNodeIdleTime that is significantly longer than the Idle Time Scaledown threshold

  • What happened?

    Your instance isn't terminating properly. MaxDynamicNodeIdleTime shows the maximum time in seconds that a dynamic node, backed by an Amazon EC2 instance, is idle. The Idle Time Scaledown threshold is derived from the cluster configuration ScaledownIdletime parameter. When a compute node has been idle for more than Idle Time Scaledown seconds, Slurm powers down the node and AWS ParallelCluster terminates the backing instance. In this case, something is preventing the instance termination.

  • How to resolve?

    For more information about this issue, see Replacing, terminating, or powering down problematic instances and nodes in Troubleshooting scaling issues.
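The comparison between MaxDynamicNodeIdleTime and the Idle Time Scaledown threshold can be expressed as a small check. The following Python sketch is hypothetical: it assumes that ScaledownIdletime is given in minutes in the cluster configuration while MaxDynamicNodeIdleTime is reported in seconds, and the tolerance factor that stands in for "significantly longer" is an arbitrary choice, not a value defined by AWS ParallelCluster.

```python
def idle_time_exceeds_threshold(max_idle_seconds, scaledown_idletime_minutes,
                                tolerance=2.0):
    """Return True if the maximum idle time is well above the scale-down threshold.

    max_idle_seconds: the MaxDynamicNodeIdleTime value read from the dashboard.
    scaledown_idletime_minutes: the ScaledownIdletime value from the cluster
        configuration (assumed to be in minutes).
    tolerance: hypothetical multiplier that defines "significantly longer".
    """
    threshold_seconds = scaledown_idletime_minutes * 60
    return max_idle_seconds > tolerance * threshold_seconds
```

With a ScaledownIdletime of 10, the threshold is 600 seconds. With the default tolerance, the check returns True only when the maximum idle time is above 1200 seconds.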