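Troubleshooting cluster deployment issues

If your cluster fails to create and the stack creation rolls back, you can diagnose the problem by looking through the log files. The failure message might look like the following output: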


$ pcluster create-cluster --cluster-name mycluster --region eu-west-1 \
 --cluster-configuration cluster-config.yaml
{
  "cluster": {
    "clusterName": "mycluster",
    "cloudformationStackStatus": "CREATE_IN_PROGRESS",
    "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f01-11ec-a3b9-024fcc6f3387",
    "region": "eu-west-1",
    "version": "3.13.2",
    "clusterStatus": "CREATE_IN_PROGRESS"
  }
}

$ pcluster describe-cluster --cluster-name mycluster --region eu-west-1
{
  "creationTime": "2021-09-06T11:03:47.696Z",
  ...
  "cloudFormationStackStatus": "ROLLBACK_IN_PROGRESS",
  "clusterName": "mycluster",
  "computeFleetStatus": "UNKNOWN",
  "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f01-11ec-a3b9-024fcc6f3387",
  "lastUpdatedTime": "2021-09-06T11:03:47.696Z",
  "region": "eu-west-1",
  "clusterStatus": "CREATE_FAILED"
}
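
The status check above can also be scripted. The following is a minimal sketch, assuming the JSON printed by pcluster describe-cluster has already been parsed into a Python dict. The function name failure_summary is hypothetical, and the sample dict contains only fields shown in the output above.

```python
def failure_summary(description):
    """Return (clusterStatus, cloudFormationStackStatus) if creation failed, else None."""
    if description.get("clusterStatus") != "CREATE_FAILED":
        return None
    return (
        description["clusterStatus"],
        description.get("cloudFormationStackStatus"),
    )


# Fields copied from the describe-cluster output shown above.
sample = {
    "cloudFormationStackStatus": "ROLLBACK_IN_PROGRESS",
    "clusterName": "mycluster",
    "computeFleetStatus": "UNKNOWN",
    "region": "eu-west-1",
    "clusterStatus": "CREATE_FAILED",
}

print(failure_summary(sample))
```

A None result only means that clusterStatus is not CREATE_FAILED at the moment of the call; it does not mean that creation has succeeded.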

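Topics

Viewing AWS CloudFormation events on CREATE_FAILED
Using the CLI to view log streams
Recreating the failed cluster with rollback-on-failure

Viewing AWS CloudFormation events on `CREATE_FAILED`

You can use the console or the AWS ParallelCluster CLI to view the CloudFormation events for CREATE_FAILED errors and help find the root cause.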

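Topics

Viewing events in the CloudFormation console
Using the CLI to view and filter CloudFormation events on CREATE_FAILED

Viewing events in the CloudFormation console

To see more information about what caused the "CREATE_FAILED" status, you can use the CloudFormation console.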

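View CloudFormation error messages from the console.

Sign in to the AWS Management Console and navigate to https://console.aws.amazon.com/cloudformation.
Select the stack named cluster_name.
Choose the Events tab.
Check the status of the resource that failed to create by scrolling through the list of resource events by Logical ID. If a subtask failed to create, work backwards to find the failed resource event.

For example, if you see the following status message, you must either use an instance type that doesn't exceed your current vCPU limit or request more vCPU capacity.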


2022-02-04 16:09:44 UTC-0800	HeadNode	CREATE_FAILED	You have requested more vCPU capacity than your current vCPU limit of 0 allows 
     for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. 
     (Service: AmazonEC2; Status Code: 400; Error Code: VcpuLimitExceeded; Request ID: a9876543-b321-c765-d432-dcba98766789; Proxy: null).
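
When many events have to be triaged, the service error code can be pulled out of a status reason like the one above. This is a minimal sketch that assumes the reason text contains an "Error Code: ..." field, as it does in this example. Other services may format their messages differently. The helper name error_code is hypothetical.

```python
import re

# Matches the "Error Code: <name>" field seen in the status message above.
ERROR_CODE = re.compile(r"Error Code: (\w+)")


def error_code(status_reason):
    """Return the error code embedded in a status reason, or None if absent."""
    match = ERROR_CODE.search(status_reason)
    return match.group(1) if match else None


# Last line of the status message shown above.
reason = (
    "(Service: AmazonEC2; Status Code: 400; Error Code: VcpuLimitExceeded; "
    "Request ID: a9876543-b321-c765-d432-dcba98766789; Proxy: null)."
)

print(error_code(reason))
```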

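Using the CLI to view and filter CloudFormation events on `CREATE_FAILED`

To diagnose a cluster creation issue, you can use the pcluster get-cluster-stack-events command and filter for the CREATE_FAILED status. For more information, see Filtering AWS CLI output in the AWS Command Line Interface User Guide.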


$ pcluster get-cluster-stack-events --cluster-name mycluster --region eu-west-1 \
    --query 'events[?resourceStatus==`CREATE_FAILED`]'
  [
    {
      "eventId": "3ccdedd0-0f03-11ec-8c06-02c352fe2ef9",
      "physicalResourceId": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f02-11ec-a3b9-024fcc6f3387",
      "resourceStatus": "CREATE_FAILED",
      "resourceStatusReason": "The following resource(s) failed to create: [HeadNode]. ",
      "stackId": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f02-11ec-a3b9-024fcc6f3387",
      "stackName": "mycluster",
      "logicalResourceId": "mycluster",
      "resourceType": "AWS::CloudFormation::Stack",
      "timestamp": "2021-09-06T11:11:51.780Z"
    },
    {
      "eventId": "HeadNode-CREATE_FAILED-2021-09-06T11:11:50.127Z",
      "physicalResourceId": "i-04e91cc1f4ea796fe",
      "resourceStatus": "CREATE_FAILED",
      "resourceStatusReason": "Received FAILURE signal with UniqueId i-04e91cc1f4ea796fe",
      "resourceProperties": "{\"LaunchTemplate\":{\"Version\":\"1\",\"LaunchTemplateId\":\"lt-057d2b1e687f05a62\"}}",
      "stackId": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f02-11ec-a3b9-024fcc6f3387",
      "stackName": "mycluster",
      "logicalResourceId": "HeadNode",
      "resourceType": "AWS::EC2::Instance",
      "timestamp": "2021-09-06T11:11:50.127Z"
    }
  ]

In the previous example, the failure occurred during head node setup.
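
The stack-level event only reports that a child resource failed, so the earliest CREATE_FAILED event is usually the one to investigate. The following is a minimal sketch of that selection. It assumes the events array from the command output has been parsed into a list of dicts and that all timestamps share the ISO 8601 format shown above, so they sort correctly as strings. The function name first_failed_event is hypothetical.

```python
def first_failed_event(events):
    """Return the earliest CREATE_FAILED event, or None if there is none."""
    failed = [e for e in events if e.get("resourceStatus") == "CREATE_FAILED"]
    if not failed:
        return None
    # Same-format ISO 8601 UTC timestamps compare correctly as plain strings.
    return min(failed, key=lambda e: e["timestamp"])


# Trimmed copies of the two events shown in the output above.
events = [
    {
        "resourceStatus": "CREATE_FAILED",
        "resourceStatusReason": "The following resource(s) failed to create: [HeadNode]. ",
        "logicalResourceId": "mycluster",
        "resourceType": "AWS::CloudFormation::Stack",
        "timestamp": "2021-09-06T11:11:51.780Z",
    },
    {
        "resourceStatus": "CREATE_FAILED",
        "resourceStatusReason": "Received FAILURE signal with UniqueId i-04e91cc1f4ea796fe",
        "logicalResourceId": "HeadNode",
        "resourceType": "AWS::EC2::Instance",
        "timestamp": "2021-09-06T11:11:50.127Z",
    },
]

root = first_failed_event(events)
print(root["logicalResourceId"], "-", root["resourceStatusReason"])
```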

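Using the CLI to view log streams

To debug this kind of issue, you can list the log streams available from the head node by running pcluster list-cluster-log-streams with a node-type filter, and then analyze the content of the log streams.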


$ pcluster list-cluster-log-streams --cluster-name mycluster --region eu-west-1 \
--filters 'Name=node-type,Values=HeadNode'
{
  "logStreams": [
    {
      "logStreamArn": "arn:aws:logs:eu-west-1:xxx:log-group:/aws/parallelcluster/mycluster-202109061103:log-stream:ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init",
      "logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init",
      ...
    },
    {
      "logStreamArn": "arn:aws:logs:eu-west-1:xxx:log-group:/aws/parallelcluster/mycluster-202109061103:log-stream:ip-10-0-0-13.i-04e91cc1f4ea796fe.chef-client",
      "logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.chef-client",
      ...
    },
    {
      "logStreamArn": "arn:aws:logs:eu-west-1:xxx:log-group:/aws/parallelcluster/mycluster-202109061103:log-stream:ip-10-0-0-13.i-04e91cc1f4ea796fe.cloud-init",
      "logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cloud-init",
      ...
    },
    ...
  ]
}

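The two primary log streams that you can use to find initialization errors are the following:

cfn-init is the log for the cfn-init script. Check this log stream first. You're likely to see a "Command chef failed" error in this log. Look at the lines immediately before that line for more specifics about the error message. For more information, see cfn-init.
cloud-init is the log for cloud-init. If you don't see anything in cfn-init, check this log next.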
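
The order described above (cfn-init first, then cloud-init) can be applied to the listing programmatically. This is a minimal sketch, assuming the logStreams array has been parsed into a list of dicts and that stream names end with the log name, as in the example output. The helper name streams_to_check is hypothetical.

```python
# Log names in the order recommended above.
PRIORITY = ("cfn-init", "cloud-init")


def streams_to_check(log_streams):
    """Return stream names for cfn-init first, then cloud-init."""
    names = [stream["logStreamName"] for stream in log_streams]
    ordered = []
    for log_name in PRIORITY:
        ordered.extend(n for n in names if n.endswith("." + log_name))
    return ordered


# Stream names copied from the list-cluster-log-streams output above.
log_streams = [
    {"logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init"},
    {"logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.chef-client"},
    {"logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cloud-init"},
]

for name in streams_to_check(log_streams):
    print(name)
```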

You can retrieve the content of a log stream by using pcluster get-cluster-log-events (note the --limit 5 option, which limits the number of retrieved events):


$ pcluster get-cluster-log-events --cluster-name mycluster \
  --region eu-west-1 --log-stream-name ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init \
  --limit 5
{
  "nextToken": "f/36370880979637159565202782352491087067973952362220945409/s",
  "prevToken": "b/36370880752972385367337528725601470541902663176996585497/s",
  "events": [
    {
      "message": "2021-09-06 11:11:39,049 [ERROR] Unhandled exception during build: Command runpostinstall failed",
      "timestamp": "2021-09-06T11:11:39.049Z"
    },
    {
      "message": "Traceback (most recent call last):\n  File \"/opt/aws/bin/cfn-init\", line 176, in <module>\n    worklog.build(metadata, configSets)\n  File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 135, in build\n    Contractor(metadata).build(configSets, self)\n  File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 561, in build\n    self.run_config(config, worklog)\n  File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 573, in run_config\n    CloudFormationCarpenter(config, self._auth_config).build(worklog)\n  File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 273, in build\n    self._config.commands)\n  File \"/usr/lib/python3.7/site-packages/cfnbootstrap/command_tool.py\", line 127, in apply\n    raise ToolError(u\"Command %s failed\" % name)",
      "timestamp": "2021-09-06T11:11:39.049Z"
    },
    {
      "message": "cfnbootstrap.construction_errors.ToolError: Command runpostinstall failed",
      "timestamp": "2021-09-06T11:11:39.049Z"
    },
    {
      "message": "2021-09-06 11:11:49,212 [DEBUG] CloudFormation client initialized with endpoint https://cloudformation.eu-west-1.amazonaws.com",
      "timestamp": "2021-09-06T11:11:49.212Z"
    },
    {
      "message": "2021-09-06 11:11:49,213 [DEBUG] Signaling resource HeadNode in stack mycluster with unique ID i-04e91cc1f4ea796fe and status FAILURE",
      "timestamp": "2021-09-06T11:11:49.213Z"
    }
  ]
}

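In the previous example, the failure is caused by a runpostinstall failure, so it is closely tied to the content of the custom bootstrap script specified in the OnNodeConfigured configuration parameter of CustomActions.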
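
To find which command failed without reading every event, the messages can be scanned for the "Command ... failed" pattern. The following is a minimal sketch, assuming the events array from pcluster get-cluster-log-events has been parsed into a list of dicts. The helper name failed_commands is hypothetical, and the pattern is inferred from the messages in this example only.

```python
import re

# Command names are assumed to be word characters or hyphens; this also skips
# the literal "Command %s failed" format string that appears in the traceback.
FAILED_COMMAND = re.compile(r"Command ([\w-]+) failed")


def failed_commands(events):
    """Return the distinct failed command names, in order of first appearance."""
    found = []
    for event in events:
        for name in FAILED_COMMAND.findall(event.get("message", "")):
            if name not in found:
                found.append(name)
    return found


# Messages copied from the get-cluster-log-events output above.
events = [
    {"message": "2021-09-06 11:11:39,049 [ERROR] Unhandled exception during build: Command runpostinstall failed"},
    {"message": "cfnbootstrap.construction_errors.ToolError: Command runpostinstall failed"},
    {"message": "2021-09-06 11:11:49,213 [DEBUG] Signaling resource HeadNode in stack mycluster with unique ID i-04e91cc1f4ea796fe and status FAILURE"},
]

print(failed_commands(events))
```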

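Recreating the failed cluster with `rollback-on-failure`

AWS ParallelCluster creates cluster log streams in CloudWatch log groups. You can view these logs in the CloudWatch console under Custom Dashboards or Log groups. For more information, see Integration with Amazon CloudWatch Logs and Amazon CloudWatch dashboard.

If no log streams are available, the failure might be caused by the CustomActions custom bootstrap script or by an AMI-related issue. To diagnose the creation issue in this case, recreate the cluster by using pcluster create-cluster with the --rollback-on-failure parameter set to false. Then, connect to the cluster with SSH, as shown in the following example: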


$ pcluster create-cluster --cluster-name mycluster --region eu-west-1 \
   --cluster-configuration cluster-config.yaml --rollback-on-failure false
 {
   "cluster": {
     "clusterName": "mycluster",
     "cloudformationStackStatus": "CREATE_IN_PROGRESS",
     "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f01-11ec-a3b9-024fcc6f3387",
     "region": "eu-west-1",
     "version": "3.13.2",
     "clusterStatus": "CREATE_IN_PROGRESS"
   }
 } 
 $ pcluster ssh --cluster-name mycluster

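After you're logged in to the head node, you should find three primary log files that you can use to find the error.

/var/log/cfn-init.log is the log for the cfn-init script. Check this log first. You're likely to see an error like "Command chef failed" in this log. Look at the lines immediately before that line for more specifics about the error message. For more information, see cfn-init.
/var/log/cloud-init.log is the log for cloud-init. If you don't see anything in cfn-init.log, check this log next.
/var/log/cloud-init-output.log is the output of commands that were run by cloud-init. This includes the output from cfn-init. In most cases, you don't need to look at this log to troubleshoot this type of issue.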
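
The advice to read the lines just before the error can be sketched as a small helper that returns the first "Command ... failed" line together with a few preceding lines. This is an illustration under assumptions: the function name error_with_context is hypothetical, and the sample uses placeholder lines in place of real log content, followed by the error line from the earlier example.

```python
import re

# Same "Command <name> failed" pattern as in the error messages above.
COMMAND_FAILED = re.compile(r"Command [\w-]+ failed")


def error_with_context(lines, before=3):
    """Return the first matching line plus up to `before` lines preceding it."""
    for index, line in enumerate(lines):
        if COMMAND_FAILED.search(line):
            start = max(0, index - before)
            return lines[start:index + 1]
    return []


# Placeholder lines stand in for real log content; only the last line is real.
sample = [
    "<earlier log line 1>",
    "<earlier log line 2>",
    "<earlier log line 3>",
    "<earlier log line 4>",
    "2021-09-06 11:11:39,049 [ERROR] Unhandled exception during build: Command runpostinstall failed",
]

for line in error_with_context(sample):
    print(line)
```

On the head node, the input could be read with open("/var/log/cfn-init.log").read().splitlines(), provided the user has permission to read the file.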
