Troubleshooting cluster deployment issues
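If your cluster fails to be created and stack creation is rolled back, you can look through the log files to diagnose the issue. The failure message is likely to look like the following output: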
$ pcluster create-cluster --cluster-name mycluster --region eu-west-1 \
    --cluster-configuration cluster-config.yaml
{
  "cluster": {
    "clusterName": "mycluster",
    "cloudformationStackStatus": "CREATE_IN_PROGRESS",
    "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f01-11ec-a3b9-024fcc6f3387",
    "region": "eu-west-1",
    "version": "3.5.1",
    "clusterStatus": "CREATE_IN_PROGRESS"
  }
}
$ pcluster describe-cluster --cluster-name mycluster --region eu-west-1
{
  "creationTime": "2021-09-06T11:03:47.696Z",
  ...
  "cloudFormationStackStatus": "ROLLBACK_IN_PROGRESS",
  "clusterName": "mycluster",
  "computeFleetStatus": "UNKNOWN",
  "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f01-11ec-a3b9-024fcc6f3387",
  "lastUpdatedTime": "2021-09-06T11:03:47.696Z",
  "region": "eu-west-1",
  "clusterStatus": "CREATE_FAILED"
}
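If you script the cluster creation, the status check above can be automated. The following is a minimal Python sketch, assuming you have captured the JSON printed by pcluster describe-cluster as a string. The function name creation_failed and the trimmed sample are illustrative and not part of the AWS ParallelCluster CLI.

```python
import json

# Status value copied from the example output above; no other values are assumed.
FAILED_STATUS = "CREATE_FAILED"


def creation_failed(describe_output: str) -> bool:
    """Return True when the describe-cluster JSON reports CREATE_FAILED."""
    cluster = json.loads(describe_output)
    return cluster.get("clusterStatus") == FAILED_STATUS


# Trimmed copy of the describe-cluster output shown above.
sample_output = """
{
  "cloudFormationStackStatus": "ROLLBACK_IN_PROGRESS",
  "clusterName": "mycluster",
  "computeFleetStatus": "UNKNOWN",
  "region": "eu-west-1",
  "clusterStatus": "CREATE_FAILED"
}
"""

if creation_failed(sample_output):
    print("Cluster creation failed; inspect the CloudFormation events next.")
```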
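View AWS CloudFormation events on CREATE_FAILED
You can use the console or the AWS ParallelCluster CLI to view the CloudFormation events for CREATE_FAILED errors to help find the root cause.
View events in the CloudFormation console
To see more information about what caused the "CREATE_FAILED" status, you can use the CloudFormation console.
View CloudFormation error messages from the console.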
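- Sign in to the AWS Management Console and navigate to https://console.aws.amazon.com/cloudformation.
- Select the stack named cluster_name.
- Choose the Events tab.
- Check the status of the resource that failed to create by scrolling through the list of resource events by Logical ID. If a subtask failed to create, work backwards to find the failed resource event.
- For example, if you see the following status message, you must use instance types that won't exceed your current vCPU limit or request more vCPU capacity.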
2022-02-04 16:09:44 UTC-0800 HeadNode CREATE_FAILED You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. (Service: AmazonEC2; Status Code: 400; Error Code: VcpuLimitExceeded; Request ID: a9876543-b321-c765-d432-dcba98766789; Proxy: null).
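When you collect many status messages, it can help to pull out the error code so that similar failures group together. The following Python sketch assumes the status reason contains an "Error Code: ..." fragment formatted like the example message above; other services might format their messages differently. The function name extract_error_code is illustrative.

```python
import re
from typing import Optional

# Matches the "Error Code: <name>" fragment seen in the example message above.
ERROR_CODE_PATTERN = re.compile(r"Error Code: (\w+)")


def extract_error_code(status_reason: str) -> Optional[str]:
    """Return the error code embedded in a status reason, or None if absent."""
    match = ERROR_CODE_PATTERN.search(status_reason)
    return match.group(1) if match else None


# Shortened copy of the status reason from the example event above.
reason = (
    "You have requested more vCPU capacity than your current vCPU limit of 0 "
    "allows for the instance bucket that the specified instance type belongs to. "
    "(Service: AmazonEC2; Status Code: 400; Error Code: VcpuLimitExceeded; "
    "Request ID: a9876543-b321-c765-d432-dcba98766789; Proxy: null)"
)

print(extract_error_code(reason))  # VcpuLimitExceeded
```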
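Use the CLI to view and filter CloudFormation events on CREATE_FAILED
To diagnose the cluster creation issue, you can use the pcluster get-cluster-stack-events command and filter for the CREATE_FAILED status. For more information, see Filtering AWS CLI output in the AWS Command Line Interface User Guide.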
$ pcluster get-cluster-stack-events --cluster-name mycluster --region eu-west-1 \
    --query 'events[?resourceStatus==`CREATE_FAILED`]'
[
  {
    "eventId": "3ccdedd0-0f03-11ec-8c06-02c352fe2ef9",
    "physicalResourceId": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f02-11ec-a3b9-024fcc6f3387",
    "resourceStatus": "CREATE_FAILED",
    "resourceStatusReason": "The following resource(s) failed to create: [HeadNode]. ",
    "stackId": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f02-11ec-a3b9-024fcc6f3387",
    "stackName": "mycluster",
    "logicalResourceId": "mycluster",
    "resourceType": "AWS::CloudFormation::Stack",
    "timestamp": "2021-09-06T11:11:51.780Z"
  },
  {
    "eventId": "HeadNode-CREATE_FAILED-2021-09-06T11:11:50.127Z",
    "physicalResourceId": "i-04e91cc1f4ea796fe",
    "resourceStatus": "CREATE_FAILED",
    "resourceStatusReason": "Received FAILURE signal with UniqueId i-04e91cc1f4ea796fe",
    "resourceProperties": "{\"LaunchTemplate\":{\"Version\":\"1\",\"LaunchTemplateId\":\"lt-057d2b1e687f05a62\"}}",
    "stackId": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f02-11ec-a3b9-024fcc6f3387",
    "stackName": "mycluster",
    "logicalResourceId": "HeadNode",
    "resourceType": "AWS::EC2::Instance",
    "timestamp": "2021-09-06T11:11:50.127Z"
  }
]
In the preceding example, the failure was in the head node setup.
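The same filtering can be done after the fact on a saved, unfiltered event list. The following Python sketch assumes the unfiltered command output is a JSON object with a top-level "events" list, as the --query expression above implies. The helper names and the CREATE_IN_PROGRESS event in the sample are hypothetical and added only for illustration.

```python
import json

STACK_TYPE = "AWS::CloudFormation::Stack"


def failed_events(events_output: str) -> list:
    """Keep only the events whose resourceStatus is CREATE_FAILED."""
    events = json.loads(events_output)["events"]
    return [e for e in events if e.get("resourceStatus") == "CREATE_FAILED"]


def failed_resources(events_output: str) -> list:
    """Logical IDs of failed resources, skipping the stack-level event."""
    return [
        e["logicalResourceId"]
        for e in failed_events(events_output)
        if e.get("resourceType") != STACK_TYPE
    ]


# Trimmed events from the example above, plus one hypothetical
# CREATE_IN_PROGRESS event to show that it is filtered out.
sample_output = json.dumps({
    "events": [
        {"logicalResourceId": "mycluster", "resourceStatus": "CREATE_FAILED",
         "resourceType": "AWS::CloudFormation::Stack"},
        {"logicalResourceId": "HeadNode", "resourceStatus": "CREATE_FAILED",
         "resourceType": "AWS::EC2::Instance"},
        {"logicalResourceId": "HeadNode", "resourceStatus": "CREATE_IN_PROGRESS",
         "resourceType": "AWS::EC2::Instance"},
    ]
})

print(failed_resources(sample_output))  # ['HeadNode']
```

Skipping the stack-level event is a design choice: that event only restates which child resource failed, so the child resource event is usually the more useful starting point.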
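Use the CLI to view log streams
To debug this kind of issue, you can list the log streams available from the head node by using pcluster list-cluster-log-streams with a filter on node-type, and then analyze the content of the log streams.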
$ pcluster list-cluster-log-streams --cluster-name mycluster --region eu-west-1 \
    --filters 'Name=node-type,Values=HeadNode'
{
  "logStreams": [
    {
      "logStreamArn": "arn:aws:logs:eu-west-1:xxx:log-group:/aws/parallelcluster/mycluster-202109061103:log-stream:ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init",
      "logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init",
      ...
    },
    {
      "logStreamArn": "arn:aws:logs:eu-west-1:xxx:log-group:/aws/parallelcluster/mycluster-202109061103:log-stream:ip-10-0-0-13.i-04e91cc1f4ea796fe.chef-client",
      "logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.chef-client",
      ...
    },
    {
      "logStreamArn": "arn:aws:logs:eu-west-1:xxx:log-group:/aws/parallelcluster/mycluster-202109061103:log-stream:ip-10-0-0-13.i-04e91cc1f4ea796fe.cloud-init",
      "logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cloud-init",
      ...
    },
    ...
  ]
}
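The two primary log streams that you can use to find initialization errors are cfn-init, the log for the cfn-init script, and cloud-init, the log for cloud-init. Check cfn-init first, and then cloud-init if cfn-init shows nothing useful.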
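To pick the stream names to inspect from a saved listing, you can match on the suffix of logStreamName. The following Python sketch assumes that the cfn-init and cloud-init streams are the ones of interest and that stream names end with the log name, as in the example output above. The helper name primary_stream_names is illustrative.

```python
import json

# Suffixes of the log streams to inspect first (an assumption of this sketch).
PRIMARY_SUFFIXES = (".cfn-init", ".cloud-init")


def primary_stream_names(streams_output: str) -> list:
    """Return the names of the cfn-init and cloud-init log streams."""
    streams = json.loads(streams_output)["logStreams"]
    return [
        s["logStreamName"]
        for s in streams
        if s["logStreamName"].endswith(PRIMARY_SUFFIXES)
    ]


# Stream names copied from the example output above.
sample_output = json.dumps({
    "logStreams": [
        {"logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init"},
        {"logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.chef-client"},
        {"logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cloud-init"},
    ]
})

for name in primary_stream_names(sample_output):
    print(name)
```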
You can retrieve the content of a log stream by using pcluster get-cluster-log-events (note the --limit 5 option, which limits the number of retrieved events):
$ pcluster get-cluster-log-events --cluster-name mycluster \
    --region eu-west-1 --log-stream-name ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init \
    --limit 5
{
  "nextToken": "f/36370880979637159565202782352491087067973952362220945409/s",
  "prevToken": "b/36370880752972385367337528725601470541902663176996585497/s",
  "events": [
    {
      "message": "2021-09-06 11:11:39,049 [ERROR] Unhandled exception during build: Command runpostinstall failed",
      "timestamp": "2021-09-06T11:11:39.049Z"
    },
    {
      "message": "Traceback (most recent call last):\n File \"/opt/aws/bin/cfn-init\", line 176, in <module>\n worklog.build(metadata, configSets)\n File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 135, in build\n Contractor(metadata).build(configSets, self)\n File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 561, in build\n self.run_config(config, worklog)\n File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 573, in run_config\n CloudFormationCarpenter(config, self._auth_config).build(worklog)\n File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 273, in build\n self._config.commands)\n File \"/usr/lib/python3.7/site-packages/cfnbootstrap/command_tool.py\", line 127, in apply\n raise ToolError(u\"Command %s failed\" % name)",
      "timestamp": "2021-09-06T11:11:39.049Z"
    },
    {
      "message": "cfnbootstrap.construction_errors.ToolError: Command runpostinstall failed",
      "timestamp": "2021-09-06T11:11:39.049Z"
    },
    {
      "message": "2021-09-06 11:11:49,212 [DEBUG] CloudFormation client initialized with endpoint https://cloudformation.eu-west-1.amazonaws.com",
      "timestamp": "2021-09-06T11:11:49.212Z"
    },
    {
      "message": "2021-09-06 11:11:49,213 [DEBUG] Signaling resource HeadNode in stack mycluster with unique ID i-04e91cc1f4ea796fe and status FAILURE",
      "timestamp": "2021-09-06T11:11:49.213Z"
    }
  ]
}
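In the preceding example, the failure is caused by a runpostinstall failure, so it relates directly to the content of the custom bootstrap script specified in the OnNodeConfigured configuration parameter of CustomActions.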
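Finding the failed command in a longer list of log events can be scripted. The following Python sketch assumes the saved output has a top-level "events" list whose entries carry a "message" field, as in the example above, and that failures are logged as "[ERROR] ... Command <name> failed". The helper name failed_commands is illustrative.

```python
import json
import re

# Matches the "Command <name> failed" fragment seen in the cfn-init log above.
FAILED_COMMAND = re.compile(r"Command (\S+) failed")


def failed_commands(log_events_output: str) -> list:
    """Return the distinct command names reported as failed in [ERROR] events."""
    events = json.loads(log_events_output)["events"]
    names = []
    for event in events:
        message = event["message"]
        if "[ERROR]" not in message:
            continue  # skip DEBUG lines and traceback continuation events
        match = FAILED_COMMAND.search(message)
        if match and match.group(1) not in names:
            names.append(match.group(1))
    return names


# Messages copied from the example output above (traceback event omitted).
sample_output = json.dumps({
    "events": [
        {"message": "2021-09-06 11:11:39,049 [ERROR] Unhandled exception during build: Command runpostinstall failed"},
        {"message": "cfnbootstrap.construction_errors.ToolError: Command runpostinstall failed"},
        {"message": "2021-09-06 11:11:49,213 [DEBUG] Signaling resource HeadNode in stack mycluster with unique ID i-04e91cc1f4ea796fe and status FAILURE"},
    ]
})

print(failed_commands(sample_output))  # ['runpostinstall']
```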
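Recreate the failed cluster with rollback-on-failure
AWS ParallelCluster creates cluster CloudWatch log streams in log groups. You can view these logs in the CloudWatch console under Custom Dashboards or Log groups. For more information, see Integration with Amazon CloudWatch Logs and Amazon CloudWatch dashboard. If no log streams are available, the failure might be caused by the CustomActions custom bootstrap script or by an AMI-related issue. To diagnose the creation issue in this case, create the cluster again with pcluster create-cluster, including the --rollback-on-failure parameter set to false. Then, use SSH to inspect the cluster, as shown in the following: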
$ pcluster create-cluster --cluster-name mycluster --region eu-west-1 \
    --cluster-configuration cluster-config.yaml --rollback-on-failure false
{
  "cluster": {
    "clusterName": "mycluster",
    "cloudformationStackStatus": "CREATE_IN_PROGRESS",
    "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f01-11ec-a3b9-024fcc6f3387",
    "region": "eu-west-1",
    "version": "3.5.1",
    "clusterStatus": "CREATE_IN_PROGRESS"
  }
}
$ pcluster ssh --cluster-name mycluster
After you're logged in to the head node, you should find three primary log files that you can use to find the error.
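- /var/log/cfn-init.log is the log for the cfn-init script. Check this log first. You're likely to see an error such as Command chef failed in this log. Look at the lines immediately before this line for more details about the error. For more information, see cfn-init.
- /var/log/cloud-init.log is the log for cloud-init. If you don't see anything in cfn-init.log, check this log next.
- /var/log/cloud-init-output.log is the output of commands that were run by cloud-init. This includes the output from cfn-init. In most cases, you don't need to look at this log to troubleshoot this type of issue.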
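Looking at the lines that precede an error can be scripted on the head node. The following Python sketch assumes error lines contain the text "[ERROR]", as in the cfn-init output shown earlier. The helper name error_context and the placeholder context lines in the sample are hypothetical.

```python
from collections import deque


def error_context(lines, before=3, marker="[ERROR]"):
    """Return (preceding_lines, error_line) pairs for each line with the marker."""
    recent = deque(maxlen=before)  # rolling window of the most recent lines
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if marker in line:
            hits.append((list(recent), line))
        recent.append(line)
    return hits


# Placeholder lines stand in for real log content; only the error text
# "Command chef failed" comes from the description above.
sample_lines = [
    "context line 1\n",
    "context line 2\n",
    "context line 3\n",
    "[ERROR] Command chef failed\n",
]

for preceding, error in error_context(sample_lines, before=2):
    print("\n".join(preceding))
    print(error)
```

Because the function accepts any iterable of lines, you can pass it an open file object, for example the result of open("/var/log/cfn-init.log"), without loading the whole file into memory.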