Trying to create a cluster - AWS ParallelCluster

Trying to create a cluster

When using AWS ParallelCluster version 3.5.0 and later to create a cluster, and a cluster creation failed with --rollback-on-failure set to false, use the pcluster describe-cluster CLI command to get status and failure information. In this case, the expected clusterStatus of the pcluster describe-cluster output is CREATE_FAILED. Check the failures section in the output to find the failureCode and failureReason. Then, in the following section, find the matching failureCode for additional troubleshooting help. For more information, see pcluster describe-cluster.

In the following sections, we recommend that you check the logs on the head node, such as the /var/log/cfn-init.log and /var/log/chef-client.log files. For more information about AWS ParallelCluster logs and how to view them, see Key logs for debugging and Retrieving and preserving logs.

If you don't have a failureCode, navigate to the AWS CloudFormation console to view the cluster stack. Check the Status Reason for the HeadNodeWaitCondition or failures on other resources to find additional failure details. For more information, see View AWS CloudFormation events on CREATE_FAILED. Check the /var/log/cfn-init.log and /var/log/chef-client.log files on the head node.

failureCode is OnNodeConfiguredExecutionFailure

  • Why did it fail?

    You provided a custom script in OnNodeConfigured of the head node section in the configuration to create a cluster. However, the custom script failed to run.

  • How to resolve?

    Check the /var/log/cfn-init.log file to learn more about the failure and how to fix the issue in your custom script. Near the end of this log, you might see run information related to the OnNodeConfigured script after the Running command runpostinstall message.

failureCode is OnNodeConfiguredDownloadFailure

  • Why did it fail?

    You provided a custom script in OnNodeConfigured of the head node section in the configuration to create a cluster. However, the custom script failed to download.

  • How to resolve?

    Make sure that the URL is valid and that the access is correctly configured. For more information on the configuration of custom bootstrap scripts, see Custom bootstrap actions.

    Check the /var/log/cfn-init.logfile. Near the end of this log, you might see run information related to OnNodeConfigured script processing, including downloading, after the Running command runpostinstall message.

failureCode is OnNodeConfiguredFailure

  • Why did it fail?

    You provided a custom script in OnNodeConfigured of the head node section in the configuration to create a cluster. However, the use of the custom script failed in the cluster deployment. An immediate cause can't be determined and additional investigation is needed.

  • How to resolve?

    Check the /var/log/cfn-init.logfile. Near the end of this log, you might see run information related to OnNodeConfigured script processing after the Running command runpostinstall message.

failureCode is OnNodeStartExecutionFailure

  • Why did it fail?

    You provided a custom script in OnNodeStart of the head node section in the configuration to create a cluster. However, the custom script failed to run.

  • How to resolve?

    Check the /var/log/cfn-init.log file to learn more about the failure and how to fix the issue in your custom script. Near the end of this log, you might see run information related to the OnNodeStart script after the Running command runpreinstall message.

failureCode is OnNodeStartDownloadFailure

  • Why did it fail?

    You provided a custom script in OnNodeStart of the head node section in the configuration to create a cluster. However, the custom script failed to download.

  • How to resolve?

    Make sure that the URL is valid and that the access is correctly configured. For more information on the configuration of custom bootstrap scripts, see Custom bootstrap actions.

    Check the /var/log/cfn-init.logfile. Near the end of this log, you might see run information related to OnNodeStart script processing, including downloading, after the Running command runpreinstall message.

failureCode is OnNodeStartFailure

  • Why did it fail?

    You provided a custom script in the OnNodeStart of the head node section in the configuration to create a cluster. However, the use of the custom script failed in the cluster deployment. An immediate cause can't be determined and additional investigation is needed.

  • How to resolve?

    Check the /var/log/cfn-init.logfile. Near the end of this log, you might see run information related to OnNodeStart script processing after the Running command runpreinstall message.

failureCode is EbsMountFailure

  • Why did it fail?

    The EBS volume defined in the cluster configuration failed to mount.

  • How to resolve?

    Check the /var/log/chef-client.log file for failure details.

failureCode is EfsMountFailure

  • Why did it fail?

    The Amazon EFS volume defined in the cluster configuration failed to mount.

  • How to resolve?

    If you defined an existing Amazon EFS file system, make sure that traffic is allowed between the cluster and the file system. For more information, see SharedStorage / EfsSettings / FileSystemId.

    Check the /var/log/chef-client.log file for failure details.

failureCode is FsxMountFailure

  • Why did it fail?

    The Amazon FSx file system defined in the cluster configuration failed to mount.

  • How to resolve?

    If you defined an existing Amazon FSx file system, make sure that traffic is allowed between the cluster and the file system. For more information, see SharedStorage / FsxLustreSettings / FileSystemId.

    Check the /var/log/chef-client.log file for failure details.

failureCode is RaidMountFailure

  • Why did it fail?

    The RAID volumes defined in the cluster configuration failed to mount.

  • How to resolve?

    Check the /var/log/chef-client.log file for failure details.

failureCode is AmiVersionMismatch

  • Why did it fail?

    The AWS ParallelCluster version used to create the custom AMI is different than the AWS ParallelCluster version used to configure the cluster. In the CloudFormation console, view the cluster CloudFormation stack details and check the Status Reason for the HeadNodeWaitCondition to get additional details on the AWS ParallelCluster versions and the AMI. For more information, see View AWS CloudFormation events on CREATE_FAILED.

  • How to resolve?

    Make sure the AWS ParallelCluster version used to create the custom AMI is the same AWS ParallelCluster version used to configure the cluster. You can change either the custom AMI version or the pcluster CLI version to make them the same.

failureCode is InvalidAmi

  • Why did it fail?

    The custom AMI is invalid because it wasn't built using AWS ParallelCluster.

  • How to resolve?

    Use the pcluster build-image command to create an AMI by making your AMI the parent image. For more information, see pcluster build-image.

failureCode is HeadNodeBootstrapFailure with failureReason Failed to set up the head node.

  • Why did it fail?

    An immediate cause can't be determined and additional investigation is needed. For example, it could be that the cluster is in protected status, and this could be caused by a failure to provision the static compute fleet.

  • How to resolve?

    Check the /var/log/chef-client.log. file for failure details.

    Note

    If you see RuntimeError exception Cluster state has been set to PROTECTED mode due to failures detected in static node provisioning, the cluster is in protected status. For more information, see How to debug protected mode.

failureCode is HeadNodeBootstrapFailure with failureReason Cluster creation timed out.

  • Why did it fail?

    By default, there is a 30 minute time limit for cluster creation to complete. If cluster creation hasn't completed within this time frame, the cluster creation fails with a timeout error. The cluster creation can timeout for different reasons. For example, timeout failures can be caused by a head node creation failure, a network issue, custom scripts that take too long to run in the head node, an error in a custom script that runs in compute nodes, or long wait times for compute node provisioning. An immediate cause can't be determined and additional investigation is needed.

  • How to resolve?

    Check the /var/log/cfn-init.log and /var/log/chef-client.log files for failure details. For more information about AWS ParallelCluster logs and how to get them, see Key logs for debugging and Retrieving and preserving logs.

    You might discover the following in these logs.

    • Seeing Waiting for static fleet capacity provisioning near the end of the chef-client.log

      This indicates that the cluster creation timed out when waiting for static nodes to power up. For more information, see Seeing errors in compute node initializations.

    • Seeing OnNodeConfigured or OnNodeStart head node script hasn't finished at the end of the cfn-init.log

      This indicates that the OnNodeConfigured or OnNodeStart custom script took a long time to run and caused a timeout error. Check your custom script for issues that might cause it to run for a long time. If your custom script requires a long time to run, consider changing the timeout limit by adding a DevSettings section to your cluster configuration file, as shown in the following example:

      DevSettings: Timeouts: HeadNodeBootstrapTimeout: 1800 # default setting: 1800 seconds
    • Can't find the logs, or the head node wasn't created successfully

      It's possible that the head node wasn't created successfully and the logs can't be found. In the CloudFormation console, view the cluster stack details to check for additional failure details.

failureCode is HeadNodeBootstrapFailure with failureReason Failed to bootstrap the head node.

  • Why did it fail?

    An immediate cause can't be determined and additional investigation is needed.

  • How to resolve?

    Check the /var/log/cfn-init.log and /var/log/chef-client.log files.

failureCode is ResourceCreationFailure

  • Why did it fail?

    The creation of some resources failed during the cluster creation process. The failure can occur for various reasons. For example, resource creation failures can be caused by capacity issues or a misconfigured IAM policy.

  • How to resolve?

    In the CloudFormation console, view the cluster stack to check for additional resource creation failure details.

failureCode is ClusterCreationFailure

  • Why did it fail?

    An immediate cause can't be determined and additional investigation is needed.

  • How to resolve?

    In the CloudFormation console, view the cluster stack and check the Status Reason for the HeadNodeWaitCondition to find additional failure details.

    Check the /var/log/cfn-init.log and /var/log/chef-client.log files.

Seeing WaitCondition timed out... in CloudFormation stack

For more information, see failureCode is HeadNodeBootstrapFailure with failureReason Cluster creation timed out..

Seeing Resource creation cancelled in CloudFormation stack

For more information, see failureCode is ResourceCreationFailure.

Seeing Failed to run cfn-init... or other errors in the AWS CloudFormation stack

Check the /var/log/cfn-init.log and /var/log/chef-client.log for additional failure details.

Seeing chef-client.log ends with INFO: Waiting for static fleet capacity provisioning

This is related to cluster creation timeout when waiting for static nodes to power up. For more information, see Seeing errors in compute node initializations.

Seeing Failed to run preinstall or postinstall in cfn-init.log

You have an OnNodeConfigured or OnNodeStart script in the cluster configuration HeadNode section. The script isn't working correctly. Check the /var/log/cfn-init.log file for custom script error details.

Seeing This AMI was created with xxx, but is trying to be used with xxx... in CloudFormation stack

For more information, see failureCode is AmiVersionMismatch.

Seeing This AMI was not baked by AWS ParallelCluster... in CloudFormation stack

For more information, see failureCode is InvalidAmi.

Seeing pcluster create-cluster command fails to run locally

Check the ~/.parallelcluster/pcluster-cli.log in your local file system for failure details.

Additional support

Follow the troubleshooting guidance in Troubleshooting cluster deployment issues.

Check to see if your scenario is covered in GitHub Known Issues at AWS ParallelCluster on GitHub.