Troubleshooting scaling issues - AWS ParallelCluster

Troubleshooting scaling issues

This section is relevant to clusters that were installed using AWS ParallelCluster version 3.0.0 and later with the Slurm job scheduler. For more information about configuring multiple queues, see Configuration of multiple queues.

If one of your running clusters is experiencing issues, place the cluster in a STOPPED state by running the following command before you begin to troubleshoot. This prevents incurring any unexpected costs.

$ pcluster update-compute-fleet --cluster-name mycluster \ --status STOP_REQUESTED

You can list the log streams available from the cluster nodes by using the pcluster list-cluster-log-streams command and filtering by using the private-dns-name of one of the failing nodes or the head node:

$ pcluster list-cluster-log-streams --cluster-name mycluster --region eu-west-1 \ --filters 'Name=private-dns-name,Values=ip-10-0-0-101'

Then, you can retrieve the content of the log stream to analyze it by using the pcluster get-cluster-log-events command and passing the --log-stream-name corresponding to one of the key logs mentioned in the following section:

$ pcluster get-cluster-log-events --cluster-name mycluster \ --region eu-west-1 --log-stream-name ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init

AWS ParallelCluster creates cluster CloudWatch log streams in log groups. You can view these logs in the CloudWatch console Custom Dashboards or Log groups. For more information, see Integration with Amazon CloudWatch Logs and Amazon CloudWatch dashboard.

Key logs for debugging

The following table provides an overview of the key logs for the head node:

  • /var/log/cfn-init.log - This is the AWS CloudFormation init log. It contains all commands that were run when an instance was set up. Use it to troubleshoot initialization issues.

  • /var/log/chef-client.log - This is the Chef client log. It contains all commands that were run through Chef/CINC. Use it to troubleshoot initialization issues.

  • /var/log/parallelcluster/slurm_resume.log - This is a ResumeProgram log. It launches instances for dynamic nodes. Use it to troubleshoot dynamic nodes launch issues.

  • /var/log/parallelcluster/slurm_suspend.log - This is the SuspendProgram log. It's called when instances are terminated for dynamic nodes. Use it to troubleshoot dynamic nodes termination issues. When you check this log, you should also check the clustermgtd log.

  • /var/log/parallelcluster/clustermgtd - This is the clustermgtd log. It runs as the centralized daemon that manages most cluster operation actions. Use it to troubleshoot any launch, termination, or cluster operation issues.

  • /var/log/slurmctld.log - This is the Slurm control daemon log. AWS ParallelCluster doesn't make scaling decisions. Rather, it only attempts to launch resources to satisfy the Slurm requirements. It's useful for scaling and allocation issues, job-related issues, and any scheduler-related launch and termination issues.

  • /var/log/parallelcluster/compute_console_output - This log records the console output from a sample subset of static compute nodes that have unexpectedly terminated. Use this log if static compute nodes terminate and the compute node logs aren't available in CloudWatch. The compute_console_output log content you receive is the same when you use the Amazon EC2 console or AWS CLI to retrieve the instance console output.

These are the key logs for the compute nodes:

  • /var/log/cloud-init-output.log - This is the cloud-init log. It contains all commands that were run when an instance was set up. Use it to troubleshoot initialization issues.

  • /var/log/parallelcluster/computemgtd - This is the computemgtd log. It runs on each compute node to monitor the node in the uncommon event that clustermgtd daemon on the head node is offline. Use it to troubleshoot unexpected termination issues.

  • /var/log/slurmd.log - This is the Slurm compute daemon log. Use it to troubleshoot initialization and compute failure issues.

Seeing InsufficientInstanceCapacity error in slurm_resume.log when I fail to run a job, or in clustermgtd.log when I fail to create a cluster

If the cluster uses a Slurm scheduler, you are experiencing an insufficient capacity issue. If there aren't enough instances available when an instance launch request is made, an InsufficientInstanceCapacity error is returned.

For static instance capacity, you can find the error in the clustermgtd log at /var/log/parallelcluster/clustermgtd.

For dynamic instance capacity, you can find the error in the ResumeProgram log at /var/log/parallelcluster/slurm_resume.log.

The message looks similar to the following example:

An error occurred (InsufficientInstanceCapacity) when calling the RunInstances/CreateFleet operation...

Based on your use case, consider using one of the following methods to avoid getting these types of error messages:

Troubleshooting node initialization issues

This section covers how you can troubleshoot node initialization issues. This includes issues where the node fails to launch, power up, or join a cluster.

Head node

Applicable logs:

  • /var/log/cfn-init.log

  • /var/log/chef-client.log

  • /var/log/parallelcluster/clustermgtd

  • /var/log/parallelcluster/slurm_resume.log

  • /var/log/slurmctld.log

Check the /var/log/cfn-init.log and /var/log/chef-client.log logs or corresponding log streams. These logs contain all the actions that were run when the head node was set up. Most errors that occur during setup should have error messages located in the /var/log/chef-client.log log. If OnNodeStart or OnNodeConfigured scripts are specified in the configuration of the cluster, double check that the script runs successfully through log messages.

When a cluster is created, the head node must wait for the compute nodes to join the cluster before it can join the cluster. Because of this, if the compute nodes fail to join the cluster, then the head node also fails. You can follow one of these sets of procedures, depending on the type of compute notes you use, to troubleshoot this type of issue:

Compute nodes

  • Applicable logs:

    • /var/log/cloud-init-output.log

    • /var/log/slurmd.log

  • If a compute node is launched, first check /var/log/cloud-init-output.log, which should contain the setup logs similar to the /var/log/chef-client.log log on the head node. Most errors that occur during setup should have error messages located at the /var/log/cloud-init-output.log log. If pre-install or post-install scripts are specified in cluster configuration, check that they ran successfully.

  • If you’re using a custom AMI with modification to the Slurm configuration, then there might be a Slurm-related error that prevents the compute node from joining the cluster. For scheduler-related errors, check the /var/log/slurmd.log log.

Dynamic compute nodes:

  • Search the ResumeProgram log (/var/log/parallelcluster/slurm_resume.log) for your compute node name to see if ResumeProgram was ever called with the node. (If ResumeProgram wasn’t ever called, you can check the slurmctld log (/var/log/slurmctld.log) to determine if Slurm ever tried to call ResumeProgram with the node).

  • Note that incorrect permissions for ResumeProgram might cause ResumeProgram to fail silently. If you’re using a custom AMI with modification to ResumeProgram setup, check that the ResumeProgram is owned by the slurm user and has the 744 (rwxr--r--) permission.

  • If ResumeProgram is called, check to see if an instance is launched for the node. If no instance was launched, you can see an error message that describes the launch failure.

  • If the instance is launched, then there might have been a problem during the setup process. You should see the corresponding private IP address and instance ID from the ResumeProgram log. Moreover, you can look at corresponding setup logs for the specific instance. For more information about troubleshooting a setup error with a compute node, see the next section.

Static compute nodes:

  • Check the clustermgtd (/var/log/parallelcluster/clustermgtd) log to see if instances were launched for the node. If they weren’t launched, there should be clear error message detailing the launch failure.

  • If instance is launched, there's some issue during setup process. You should see the corresponding private IP address and instance ID from the ResumeProgram log. Moreover, you can look at the corresponding setup logs for the specific instance.

Compute nodes backed by Spot Instances:

  • If it's the first time you use Spot Instances and the job remains in a PD (pending state), double check the /var/log/parallelcluster/slurm_resume.log file. You'll probably find an error like the following:

    2022-05-20 13:06:24,796 - [slurm_plugin.common:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x1) ['spot-dy-t2micro-2']: An error occurred (AuthFailure.ServiceLinkedRoleCreationNotPermitted) when calling the RunInstances operation: The provided credentials do not have permission to create the service-linked role for Amazon EC2 Spot Instances.

    When using Spot Instances, an AWSServiceRoleForEC2Spot service-linked role must exist in your account. To create this role in your account using the AWS CLI, run the following command:

    $ aws iam create-service-linked-role --aws-service-name spot.amazonaws.com

    For more information, see Working with Spot Instances in the AWS ParallelCluster User Guide and Service-linked role for Spot Instance requests in the Amazon EC2 User Guide.

Troubleshooting unexpected node replacements and terminations

This section continues to explore how you can troubleshoot node related issues, specifically when a node is replaced or terminated unexpectedly.

  • Applicable logs:

    • /var/log/parallelcluster/clustermgtd (head node)

    • /var/log/slurmctld.log (head node)

    • /var/log/parallelcluster/computemgtd (compute node)

Nodes replaced or terminated unexpectedly

  • Check in the clustermgtd log (/var/log/parallelcluster/clustermgtd) to see if clustermgtd replaced or terminated a node. Note that clustermgtd handles all normal node maintenance action.

  • If clustermgtd replaced or terminated the node, there should be a message detailing why this action was taken on the node. If the reason is scheduler related (for example, because the node is in DOWN), check in slurmctld log for more information. If the reason is Amazon EC2 related, there should be informative message detailing the Amazon EC2 related issue that required the replacement.

  • If clustermgtd didn’t terminate the node, first check if this was an expected termination by Amazon EC2 , more specifically a spot termination. computemgtd, running on a compute node, can also terminate a node if clustermgtd is determined as unhealthy. Check computemgtd log (/var/log/parallelcluster/computemgtd) to see if computemgtd terminated the node.

Nodes failed

  • Check in slurmctld log (/var/log/slurmctld.log) to see why a job or a node failed. Note that jobs are automatically re-queued if a node failed.

  • If slurm_resume reports that node is launched and clustermgtdreports after several minutes that there’s no corresponding instance in Amazon EC2 for that node, the node might fail during setup. To retrieve the log from a compute (/var/log/cloud-init-output.log), do the following steps:

    • Submit a job to let Slurm spin up a new node.

    • Wait for the compute node to start.

    • Modify the instance initiated shutdown behavior so that a failing compute node will be stopped rather than terminated.

      $ aws ec2 modify-instance-attribute \ --instance-id i-1234567890abcdef0 \ --instance-initiated-shutdown-behavior "{\"Value\": \"stop\"}"
    • Enable termination protection.

      $ aws ec2 modify-instance-attribute \ --instance-id i-1234567890abcdef0 \ --disable-api-termination
    • Tag the node to be easily identifiable.

      $ aws ec2 create-tags \ --resources i-1234567890abcdef0 \ --tags Key=Name,Value=QUARANTINED-Compute
    • Detach the node from the cluster by changing the parallelcluster:cluster-name tag.

      $ aws ec2 create-tags \ --resources i-1234567890abcdef0 \ --tags Key=parallelcluster:clustername,Value=QUARANTINED-ClusterName
    • Retrieve the console output from the node with this command.

      $ aws ec2 get-console-output --instance-id i-1234567890abcdef0 --output text

Replacing, terminating, or powering down problematic instances and nodes

  • Applicable logs:

    • /var/log/parallelcluster/clustermgtd (head node)

    • /var/log/parallelcluster/slurm_suspend.log (head node)

  • In most cases, clustermgtd handles all expected instance termination action. Check in the clustermgtd log to see why it failed to replace or terminate a node.

  • For dynamic nodes failing SlurmSettings Properties, check in the SuspendProgram log to see if SuspendProgram was called by slurmctld with the specific node as argument. Note that SuspendProgram doesn’t actually perform any action. Rather, it only logs when it’s called. All instance termination and NodeAddr reset is done by clustermgtd. Slurm puts nodes back into a POWER_SAVING state after SuspendTimeout automatically.

  • If compute nodes are failing continuously due to bootstrap failures, verify if they are being launched with Slurm cluster protected mode enabled. If protected mode isn't enabled, modify the protected mode settings to enable protected mode. Troubleshoot and fix the bootstrap script.

Queue (partition) Inactive status

If you run sinfo and the output shows queues with AVAIL status of inact, your cluster might have Slurm cluster protected mode enabled and the queue has been set to the INACTIVE state for a pre-defined period of time.

Troubleshooting other known node and job issues

Another type of known issue is that AWS ParallelCluster might fail to allocate jobs or make scaling decisions. With this type of issue, AWS ParallelCluster only launches, terminates, or maintains resources according to Slurm instructions. For these issues, check the slurmctld log to troubleshoot them.