AWS ParallelCluster troubleshooting

The AWS ParallelCluster community maintains a Wiki page that provides many troubleshooting tips on the AWS ParallelCluster GitHub Wiki. For a list of known issues, see Known issues.

Retrieving and preserving logs

Logs are a useful resource for troubleshooting issues. Before you can use logs to troubleshoot issues for your AWS ParallelCluster resources, you should first create an archive of cluster logs. Follow the steps described in the Creating an Archive of a Cluster’s Logs topic on the AWS ParallelCluster GitHub Wiki to start this process.

If one of your running clusters is experiencing issues, you should place the cluster in a STOPPED state by running the pcluster stop <cluster_name> command before you begin to troubleshoot. This prevents incurring any unexpected costs.

If pcluster stops functioning or if you want to delete a cluster while still preserving its logs, run the pcluster delete --keep-logs <cluster_name> command. Running this command deletes the cluster yet retains the log group that's stored in Amazon CloudWatch. For more information about this command, see the pcluster delete documentation.
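For example, assuming a cluster named mycluster (the name here is only a placeholder), the commands look like the following:

$ pcluster stop mycluster
$ pcluster delete --keep-logs mycluster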

Troubleshooting stack deployment issues

If your cluster fails to be created and rolls back stack creation, you can look through the following log files to diagnose the issue. Look for ROLLBACK_IN_PROGRESS in the output of these logs. The failure message should look like the following:

$ pcluster create mycluster
Creating stack named: parallelcluster-mycluster
Status: parallelcluster-mycluster - ROLLBACK_IN_PROGRESS
Cluster creation failed.  Failed events:
  - AWS::EC2::Instance MasterServer Received FAILURE signal with UniqueId i-07af1cb218dd6a081

To diagnose the issue, create the cluster again using pcluster create, including the --norollback flag. Then, SSH into the cluster:

$ pcluster create mycluster --norollback
...
$ pcluster ssh mycluster

After you're logged into the head node, you should find three primary log files that you can use to pinpoint the error.

  • /var/log/cfn-init.log is the log for the cfn-init script. Check this log first. You're likely to see an error like Command chef failed in this log. Look at the lines immediately before this line for more specific details about the error. For more information, see cfn-init.

  • /var/log/cloud-init.log is the log for cloud-init. If you don't see anything in cfn-init.log, then try checking this log next.

  • /var/log/cloud-init-output.log is the output of commands that were run by cloud-init. This includes the output from cfn-init. In most cases, you don't need to look at this log to troubleshoot this type of issue.
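If you want a quick first pass over all three logs, a simple keyword search is often enough to surface the failure. The following is only a sketch; the exact strings that appear depend on the failure, and sudo might be needed to read some of the files:

$ sudo grep -iE 'error|fail' /var/log/cfn-init.log /var/log/cloud-init.log /var/log/cloud-init-output.log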

Troubleshooting issues in multiple queue mode clusters

This section is relevant to clusters that were installed using AWS ParallelCluster version 2.9.0 and later with the Slurm job scheduler. For more information about multiple queue mode, see Multiple queue mode.

Key logs

The following table provides an overview of the key logs for the head node:

/var/log/cfn-init.log

This is the AWS CloudFormation init log. It contains all commands that were run when an instance is set up. It's useful for troubleshooting initialization issues.

/var/log/chef-client.log

This is the Chef client log. It contains all commands that were run through Chef/CINC. It's useful for troubleshooting initialization issues.

/var/log/parallelcluster/slurm_resume.log

This is the ResumeProgram log. ResumeProgram launches instances for dynamic nodes, so this log is useful for troubleshooting dynamic node launch issues.

/var/log/parallelcluster/slurm_suspend.log

This is the SuspendProgram log. SuspendProgram is called when instances for dynamic nodes are terminated, so this log is useful for troubleshooting dynamic node termination issues. When you check this log, you should also check the clustermgtd log.

/var/log/parallelcluster/clustermgtd

This is the clustermgtd log. clustermgtd runs as the centralized daemon that manages most cluster operation actions. It's useful for troubleshooting any launch, termination, or cluster operation issue.

/var/log/slurmctld.log

This is the Slurm control daemon log. AWS ParallelCluster doesn't make scaling decisions. Rather, it only attempts to launch resources to satisfy the Slurm requirements. It's useful for scaling and allocation issues, job-related issues, and any scheduler-related launch and termination issues.
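While reproducing an issue, it can help to follow several of these head node logs at the same time. The following is a minimal sketch that uses the paths listed above; sudo might be needed depending on file permissions:

$ sudo tail -f /var/log/parallelcluster/clustermgtd /var/log/parallelcluster/slurm_resume.log /var/log/slurmctld.log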

The following are the key logs for the compute nodes:

/var/log/cloud-init-output.log

This is the cloud-init log. It contains all commands that were run when an instance is set up. It's useful for troubleshooting initialization issues.

/var/log/parallelcluster/computemgtd

This is the computemgtd log. It runs on each compute node to monitor the node in the rare event that the clustermgtd daemon on the head node is offline. It's useful for troubleshooting unexpected termination issues.

/var/log/slurmd.log

This is the Slurm compute daemon log. It's useful for troubleshooting initialization and compute failure related issues.
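If a compute node is still running, you can usually read these logs from the head node over SSH, because AWS ParallelCluster configures passwordless SSH between nodes for the cluster user. The following is a sketch; the private IP address is a placeholder that you would take from the scheduler or from the head node logs:

$ ssh <compute-node-private-ip> 'sudo tail -n 50 /var/log/slurmd.log'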

Troubleshooting node initialization issues

This section covers how you can troubleshoot node initialization issues. This includes issues where the node fails to launch, power up, or join a cluster.

Head node:

Applicable logs:

  • /var/log/cfn-init.log

  • /var/log/chef-client.log

  • /var/log/parallelcluster/clustermgtd

  • /var/log/parallelcluster/slurm_resume.log

  • /var/log/slurmctld.log

Check the /var/log/cfn-init.log and /var/log/chef-client.log logs. These logs contain all the actions that were run when the head node was set up. Most errors that occur during setup produce error messages in the /var/log/chef-client.log log. If pre-install or post-install scripts are specified in the cluster configuration, check the log messages to confirm that the scripts ran successfully.
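For example, a keyword search of the head node setup logs is often the fastest way to locate the failing step. This is only a sketch; the exact messages depend on your recipes and scripts:

$ sudo grep -iE 'error|fail|fatal' /var/log/chef-client.log /var/log/cfn-init.log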

When a cluster is created, the head node waits for the compute nodes to join the cluster before it can finish its own setup. As such, if the compute nodes fail to join the cluster, the head node also fails. You can follow one of these sets of procedures, depending on the type of compute nodes you use, to troubleshoot this type of issue:

Dynamic compute nodes:

  • Search the ResumeProgram log (/var/log/parallelcluster/slurm_resume.log) for your compute node name to see if ResumeProgram was ever called with the node. (If ResumeProgram wasn't ever called, you can check the slurmctld log (/var/log/slurmctld.log) to determine if Slurm ever tried to call ResumeProgram with the node.) For example search commands, see the sketch after this list.

  • Note that incorrect permissions for ResumeProgram might cause it to fail silently. If you're using a custom AMI with modifications to the ResumeProgram setup, check that ResumeProgram is owned by the slurm user and has the 744 (rwxr--r--) permission.

  • If ResumeProgram is called, check to see if an instance is launched for the node. If no instance was launched, you should be able to see an error message that describes the launch failure.

  • If the instance is launched, then there might have been a problem during the setup process. You should see the corresponding private IP address and instance ID from the ResumeProgram log. Moreover, you can look at corresponding setup logs for the specific instance. For more information about troubleshooting a setup error with a compute node, see the next section.
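The following sketch shows the kind of searches this procedure describes; replace <node_name> with the Slurm name of the problem compute node:

$ sudo grep '<node_name>' /var/log/parallelcluster/slurm_resume.log
$ sudo grep '<node_name>' /var/log/slurmctld.log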

Static compute nodes:

  • Check the clustermgtd log (/var/log/parallelcluster/clustermgtd) to see if instances were launched for the node. If they weren't launched, there should be a clear error message detailing the launch failure.

  • If the instance was launched, then there might have been a problem during the setup process. You should see the corresponding private IP address and instance ID in the clustermgtd log. Moreover, you can look at the corresponding setup logs for the specific instance.

Compute nodes:

Applicable logs:

  • /var/log/cloud-init-output.log

  • /var/log/slurmd.log

If the compute node was launched, first check /var/log/cloud-init-output.log, which contains setup logs similar to the /var/log/chef-client.log log on the head node. Most errors that occur during setup produce error messages in the /var/log/cloud-init-output.log log. If pre-install or post-install scripts are specified in the cluster configuration, check that they ran successfully.

If you're using a custom AMI with modifications to the Slurm configuration, then there might be a Slurm-related error that prevents the compute node from joining the cluster. For scheduler-related errors, check the /var/log/slurmd.log log.

Troubleshooting unexpected node replacements and terminations

This section continues to explore how you can troubleshoot node related issues, specifically when a node is replaced or terminated unexpectedly.

  • Applicable logs:

    • /var/log/parallelcluster/clustermgtd (head node)

    • /var/log/slurmctld.log (head node)

    • /var/log/parallelcluster/computemgtd (compute node)

  • Nodes replaced or terminated unexpectedly

    • Check the clustermgtd log (/var/log/parallelcluster/clustermgtd) to see if clustermgtd took the action to replace or terminate a node. Note that clustermgtd handles all normal node maintenance actions.

    • If clustermgtd replaced or terminated the node, there should be a message detailing why this action was taken on the node. If the reason is scheduler related (for example, because the node is in DOWN), check the slurmctld log for more information. If the reason is Amazon EC2 related, there should be an informative message detailing the Amazon EC2 related issue that required the replacement.

    • If clustermgtd didn't terminate the node, first check whether this was an expected termination by Amazon EC2, more specifically a Spot Instance termination. computemgtd, which runs on each compute node, can also terminate a node if clustermgtd is determined to be unhealthy. Check the computemgtd log (/var/log/parallelcluster/computemgtd) to see if computemgtd terminated the node.

  • Nodes failed

    • Check the slurmctld log (/var/log/slurmctld.log) to see why a job or a node failed. Note that jobs are automatically requeued if a node fails.

    • If slurm_resume reports that a node was launched and clustermgtd reports after several minutes that there's no corresponding instance in Amazon EC2 for that node, the node might have failed during setup. To retrieve the log from a compute node (/var/log/cloud-init-output.log), do the following steps:

      • Submit a job to let Slurm spin up a new node.

      • After the node starts, enable termination protection using this command.

        aws ec2 modify-instance-attribute --instance-id i-xyz --disable-api-termination
      • Retrieve the console output from the node with this command.

        aws ec2 get-console-output --instance-id i-xyz --output text

Replacing, terminating, or powering down problem instances and nodes

  • Applicable logs:

    • /var/log/parallelcluster/clustermgtd (head node)

    • /var/log/parallelcluster/slurm_suspend.log (head node)

  • In most cases, clustermgtd handles all expected instance termination actions. Check the clustermgtd log to see why it failed to replace or terminate a node.

  • For dynamic nodes failing scaledown_idletime, check the SuspendProgram log to see if SuspendProgram was called by slurmctld with the specific node as an argument (see the example search below). Note that SuspendProgram doesn't actually perform any action. Rather, it only logs when it's called. All instance termination and NodeAddr resets are done by clustermgtd. Slurm automatically puts nodes back into a POWER_SAVING state after SuspendTimeout.
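For example, to confirm whether slurmctld called SuspendProgram for a specific node and what clustermgtd then did with it, you can search both logs. This is only a sketch; <node_name> is a placeholder:

$ sudo grep '<node_name>' /var/log/parallelcluster/slurm_suspend.log /var/log/parallelcluster/clustermgtd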

Troubleshooting other known node and job issues

Another known type of issue is that the cluster fails to allocate jobs or make scaling decisions. For this type of issue, keep in mind that AWS ParallelCluster only launches, terminates, or maintains resources according to Slurm instructions. To troubleshoot these issues, check the slurmctld log.
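Before digging into that log, it often helps to look at what Slurm itself reports from the head node. These are standard Slurm commands, shown here only as a starting point; output formats vary with the Slurm version:

$ sinfo      # partition and node states
$ squeue     # pending and running jobs, including reason codes for pending jobs
$ sudo grep -i error /var/log/slurmctld.log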

Troubleshooting issues in single queue mode clusters

This section applies to clusters that don't have multiple queue mode and that were set up with one of the following two configurations:

  • Launched using an AWS ParallelCluster version earlier than 2.9.0 and the SGE, Torque, or Slurm job scheduler.

  • Launched using AWS ParallelCluster version 2.9.0 or later and the SGE or Torque job scheduler.

Key logs

The following log files are the key logs for the head node.

For AWS ParallelCluster version 2.9.0 or later:

/var/log/chef-client.log

This is the CINC (chef) client log. It contains all commands that were run through CINC. It's useful for troubleshooting initialization issues.

For all AWS ParallelCluster versions:

/var/log/cfn-init.log

This is the cfn-init log. It contains all commands that were run when an instance is set up, and is therefore useful for troubleshooting initialization issues. For more information, see cfn-init.

/var/log/jobwatcher

This is the jobwatcher log. jobwatcher monitors the scheduler queue and updates the Auto Scaling Group. It's useful for troubleshooting issues related to scaling up nodes.

/var/log/sqswatcher

This is the sqswatcher log. sqswatcher processes the instance ready event sent by a compute instance after successful initialization. It also adds compute nodes to the scheduler configuration. This log is useful for troubleshooting why a node or nodes failed to join a cluster.

The following are the key logs for the compute nodes.

AWS ParallelCluster version 2.9.0 or later

/var/log/cloud-init-output.log

This is the cloud-init log. It contains all commands that were run when an instance is set up. It's useful for troubleshooting initialization issues.

AWS ParallelCluster versions before 2.9.0

/var/log/cfn-init.log

This is the CloudFormation init log. It contains all commands that were run when an instance is set up. It's useful for troubleshooting initialization issues.

All versions

/var/log/nodewatcher

This is the nodewatcher log. nodewatcher daemons run on each compute node and scale the node down if it's idle. This log is useful for any issues related to scaling down resources.

Troubleshooting failed launch and join operations

  • Applicable logs:

    • /var/log/cfn-init-cmd.log (head node and compute node)

    • /var/log/sqswatcher (head node)

  • If nodes failed to launch, check in the /var/log/cfn-init-cmd.log log to see the specific error message. In most cases, node launch failures are due to a setup failure.

  • If compute nodes failed to join the scheduler configuration despite a successful setup, check the /var/log/sqswatcher log to see whether sqswatcher processed the event, as shown in the example that follows. In most cases, these issues occur because sqswatcher didn't process the event.
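For example, you can search the sqswatcher log for the problem node's hostname or instance ID to see whether the instance ready event arrived and was processed. This is only a sketch; the placeholder is whatever identifier you have for the node:

$ sudo grep '<hostname-or-instance-id>' /var/log/sqswatcher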

Troubleshooting scaling issues

  • Applicable logs:

    • /var/log/jobwatcher (head node)

    • /var/log/nodewatcher (compute node)

  • Scale up issues: For the head node, check the /var/log/jobwatcher log to see if the jobwatcher daemon calculated the proper number of required nodes and updated the Auto Scaling Group. Note that jobwatcher monitors the scheduler queue and updates the Auto Scaling Group.

  • Scale down issues: For compute nodes, check the /var/log/nodewatcher log on the problem node to see why the node was scaled down. Note that nodewatcher daemons scale down a compute node if it’s idle.

One known issue is random compute node failures on large-scale clusters, specifically those with 500 or more compute nodes. This issue is related to a limitation of the scaling architecture of single queue clusters. If you want to use a large-scale cluster, are using AWS ParallelCluster version 2.9.0 or later with Slurm, and want to avoid this issue, you should upgrade and switch to a cluster with multiple queue mode support. You can do so by running pcluster-config convert.
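A minimal invocation is shown below. This assumes your cluster section is in the default AWS ParallelCluster configuration file; run pcluster-config convert --help to see the options that your version supports, and review the generated configuration before you use it:

$ pcluster-config convert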

For ultra-large scale clusters, additional tuning to your system might be required. For more information, contact AWS Support.

Placement groups and instance launch issues

To get the lowest inter-node latency, use a placement group. A placement group guarantees that your instances are on the same networking backbone. If there aren't enough instances available when a request is made, an InsufficientInstanceCapacity error is returned. To reduce the possibility of receiving this error when using cluster placement groups, set the placement_group parameter to DYNAMIC and set the placement parameter to compute.
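In the cluster section of the AWS ParallelCluster (version 2) configuration file, that combination looks like the following sketch; the section name default is only an example:

[cluster default]
placement_group = DYNAMIC
placement = compute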

If you require a high performance shared filesystem, consider using Amazon FSx for Lustre.

If the head node must be in the placement group, use the same instance type and subnet for the head node and all of the compute nodes. By doing this, the compute_instance_type parameter has the same value as the master_instance_type parameter, the placement parameter is set to cluster, and the compute_subnet_id parameter isn't specified. With this configuration, the value of the master_subnet_id parameter is used for the compute nodes.
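Expressed in the same configuration format, such a setup might look like the following sketch; the instance type and subnet ID are placeholder values:

[cluster default]
placement_group = DYNAMIC
placement = cluster
master_instance_type = c5n.18xlarge
compute_instance_type = c5n.18xlarge
master_subnet_id = subnet-0123456789abcdef0
# compute_subnet_id is intentionally omitted so that compute nodes use master_subnet_id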

For more information, see Troubleshooting instance launch issues and Placement group rules and limitations in the Amazon EC2 User Guide for Linux Instances.

Directories that cannot be replaced

The following directories are shared between the nodes and cannot be replaced.

/home

This includes the default user home folder (/home/ec2-user on Amazon Linux, /home/centos on CentOS, and /home/ubuntu on Ubuntu).

/opt/intel

This includes Intel MPI, Intel Parallel Studio, and related files.

/opt/sge

This includes Son of Grid Engine and related files. (Conditional, only if scheduler = sge.)

/opt/slurm

This includes Slurm Workload Manager and related files. (Conditional, only if scheduler = slurm.)

/opt/torque

This includes Torque Resource Manager and related files. (Conditional, only if scheduler = torque.)

Troubleshooting issues in NICE DCV

The logs for NICE DCV are written to files in the /var/log/dcv/ directory. Reviewing these logs can help to troubleshoot issues.
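For example, you can list the available log files and then follow the server log. The file names under /var/log/dcv/ can vary with the NICE DCV version, so treat server.log here as an assumption and check the directory listing first:

$ sudo ls /var/log/dcv/
$ sudo tail -n 50 /var/log/dcv/server.log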

The instance type should have at least 1.7 gibibyte (GiB) of RAM to run NICE DCV. Nano and micro instance types don't have enough memory to run NICE DCV.

Troubleshooting issues in clusters with AWS Batch integration

This section is relevant to clusters with AWS Batch scheduler integration.

Head node issues

You can troubleshoot head node related setup issues in the same way as for a single queue cluster. For more information about these issues, see Troubleshooting issues in single queue mode clusters.

AWS Batch multi-node parallel jobs submission issues

If you have problems submitting multi-node parallel jobs when using AWS Batch as the job scheduler, you should upgrade to AWS ParallelCluster version 2.5.0. If that's not feasible, you can use the workaround that's detailed in the topic: Self patch a cluster used for submitting multi-node parallel jobs through AWS Batch.

Compute issues

AWS Batch manages the scaling and compute aspects of your services. If you encounter compute related issues, see the AWS Batch troubleshooting documentation for help.

Job failures

If a job fails, you can run the awsbout command to retrieve the job output. You can also run the awsbstat -d command to obtain a link to the job logs stored by Amazon CloudWatch.
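For example, assuming the job ID that was returned when you submitted the job (shown below as a placeholder), the commands look like the following:

$ awsbout <job_id>
$ awsbstat -d <job_id>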

Additional support

For a list of known issues, see the main GitHub Wiki page or the issues page. For more urgent issues, contact AWS Support or open a new GitHub issue.