Troubleshooting cluster deployment issues - AWS ParallelCluster



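If your cluster fails to be created and rolls back stack creation, you can look through the log files to diagnose the issue. The failure message likely looks like the following output: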

$ pcluster create-cluster --cluster-name mycluster --region eu-west-1 \
    --cluster-configuration cluster-config.yaml
{
  "cluster": {
    "clusterName": "mycluster",
    "cloudformationStackStatus": "CREATE_IN_PROGRESS",
    "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f01-11ec-a3b9-024fcc6f3387",
    "region": "eu-west-1",
    "version": "3.7.0",
    "clusterStatus": "CREATE_IN_PROGRESS"
  }
}
$ pcluster describe-cluster --cluster-name mycluster --region eu-west-1
{
  "creationTime": "2021-09-06T11:03:47.696Z",
  ...
  "cloudFormationStackStatus": "ROLLBACK_IN_PROGRESS",
  "clusterName": "mycluster",
  "computeFleetStatus": "UNKNOWN",
  "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f01-11ec-a3b9-024fcc6f3387",
  "lastUpdatedTime": "2021-09-06T11:03:47.696Z",
  "region": "eu-west-1",
  "clusterStatus": "CREATE_FAILED"
}
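If you script cluster creation, the status check can be automated. The following is a minimal Python sketch, assuming you have already captured the JSON text printed by pcluster describe-cluster. The function name creation_failed and the choice of which statuses count as a failure are assumptions made for illustration, based only on the sample output in this section.

```python
import json

# Status values taken from the sample output in this section. Treating only
# these as failures is an assumption of this sketch, not an exhaustive list.
FAILED_CLUSTER_STATUS = "CREATE_FAILED"
ROLLBACK_PREFIX = "ROLLBACK"


def creation_failed(describe_output: str) -> bool:
    """Return True if describe-cluster JSON text indicates a failed creation."""
    data = json.loads(describe_output)
    cluster_status = data.get("clusterStatus", "")
    # The sample outputs spell the stack status key as both
    # "cloudFormationStackStatus" and "cloudformationStackStatus",
    # so accept either spelling.
    stack_status = (
        data.get("cloudFormationStackStatus")
        or data.get("cloudformationStackStatus")
        or ""
    )
    return (
        cluster_status == FAILED_CLUSTER_STATUS
        or stack_status.startswith(ROLLBACK_PREFIX)
    )


# Trimmed copy of the describe-cluster output from this section.
SAMPLE_OUTPUT = """
{
  "cloudFormationStackStatus": "ROLLBACK_IN_PROGRESS",
  "clusterName": "mycluster",
  "clusterStatus": "CREATE_FAILED"
}
"""

if __name__ == "__main__":
    print(creation_failed(SAMPLE_OUTPUT))
```

A caller could poll describe-cluster and stop as soon as this function returns True, then move on to the event and log inspection steps.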

View AWS CloudFormation events on CREATE_FAILED

You can use the console or the AWS ParallelCluster CLI to view the CloudFormation events for CREATE_FAILED errors and help find the root cause.

View events in the CloudFormation console

To see more information about what caused the "CREATE_FAILED" status, you can use the CloudFormation console.

To view CloudFormation error messages from the console:
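  1. Sign in to the AWS Management Console and navigate to https://console.aws.amazon.com/cloudformation

  2. Select the stack that has the same name as your cluster.

  3. Choose the Events tab.

  4. Scroll through the list of resource events by Logical ID and check the Status of the resource that failed to create. If a subtask failed to create, work backwards to find the failed resource event.

  5. For example, if you see the following status message, you must use an instance type that doesn't exceed your current vCPU limit, or request more vCPU capacity.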

    2022-02-04 16:09:44 UTC-0800 HeadNode CREATE_FAILED You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. (Service: AmazonEC2; Status Code: 400; Error Code: VcpuLimitExceeded; Request ID: a9876543-b321-c765-d432-dcba98766789; Proxy: null).
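When you collect many status messages, it can help to pull out the error code programmatically. The following Python sketch extracts the value after "Error Code:" from a status reason like the one above. The helper names and the SUGGESTED_ACTIONS table are hypothetical; the single entry only restates the guidance from step 5, and the regular expression assumes the message layout shown in this example.

```python
import re
from typing import Optional

# Matches the "Error Code: <Name>" fragment seen in the sample message.
ERROR_CODE_PATTERN = re.compile(r"Error Code: (\w+)")

# Hypothetical lookup table. The only entry restates the guidance in step 5.
SUGGESTED_ACTIONS = {
    "VcpuLimitExceeded": (
        "Use an instance type within your current vCPU limit, "
        "or request more vCPU capacity."
    ),
}


def extract_error_code(status_reason: str) -> Optional[str]:
    """Return the service error code in a status reason, or None if absent."""
    match = ERROR_CODE_PATTERN.search(status_reason)
    return match.group(1) if match else None


def suggest_action(status_reason: str) -> str:
    """Map a status reason to a suggested action, with a generic fallback."""
    code = extract_error_code(status_reason)
    return SUGGESTED_ACTIONS.get(
        code, "No suggestion recorded; read the full status reason."
    )


# Trimmed copy of the status reason from the example above.
SAMPLE_REASON = (
    "You have requested more vCPU capacity than your current vCPU limit of 0 "
    "allows for the instance bucket that the specified instance type belongs to. "
    "(Service: AmazonEC2; Status Code: 400; Error Code: VcpuLimitExceeded; "
    "Request ID: a9876543-b321-c765-d432-dcba98766789; Proxy: null)"
)

if __name__ == "__main__":
    print(extract_error_code(SAMPLE_REASON))
    print(suggest_action(SAMPLE_REASON))
```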

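Use the CLI to view and filter CloudFormation events on CREATE_FAILED

To diagnose a cluster creation issue, you can use the pcluster get-cluster-stack-events command and filter for the CREATE_FAILED status. For more information, see Filtering AWS CLI output in the AWS Command Line Interface User Guide.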

$ pcluster get-cluster-stack-events --cluster-name mycluster --region eu-west-1 \
    --query 'events[?resourceStatus==`CREATE_FAILED`]'
[
  {
    "eventId": "3ccdedd0-0f03-11ec-8c06-02c352fe2ef9",
    "physicalResourceId": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f02-11ec-a3b9-024fcc6f3387",
    "resourceStatus": "CREATE_FAILED",
    "resourceStatusReason": "The following resource(s) failed to create: [HeadNode]. ",
    "stackId": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f02-11ec-a3b9-024fcc6f3387",
    "stackName": "mycluster",
    "logicalResourceId": "mycluster",
    "resourceType": "AWS::CloudFormation::Stack",
    "timestamp": "2021-09-06T11:11:51.780Z"
  },
  {
    "eventId": "HeadNode-CREATE_FAILED-2021-09-06T11:11:50.127Z",
    "physicalResourceId": "i-04e91cc1f4ea796fe",
    "resourceStatus": "CREATE_FAILED",
    "resourceStatusReason": "Received FAILURE signal with UniqueId i-04e91cc1f4ea796fe",
    "resourceProperties": "{\"LaunchTemplate\":{\"Version\":\"1\",\"LaunchTemplateId\":\"lt-057d2b1e687f05a62\"}}",
    "stackId": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f02-11ec-a3b9-024fcc6f3387",
    "stackName": "mycluster",
    "logicalResourceId": "HeadNode",
    "resourceType": "AWS::EC2::Instance",
    "timestamp": "2021-09-06T11:11:50.127Z"
  }
]

In the previous example, the failure occurred during head node setup.
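The same filtering can be done in a script on the unfiltered JSON. The following Python sketch assumes the unfiltered command output is a JSON object with an "events" list, as the --query expression above implies. The function names are hypothetical, and the CREATE_COMPLETE event in the sample data is invented for illustration. The second helper drops the stack-level event, which only summarizes child failures, so that the resource that actually failed stands out.

```python
from typing import Dict, List

STACK_RESOURCE_TYPE = "AWS::CloudFormation::Stack"


def failed_events(stack_events: Dict) -> List[Dict]:
    """Keep only the events whose resourceStatus is CREATE_FAILED."""
    return [
        event
        for event in stack_events.get("events", [])
        if event.get("resourceStatus") == "CREATE_FAILED"
    ]


def root_cause_candidates(events: List[Dict]) -> List[Dict]:
    """Drop stack-level events, which only restate that a child resource failed."""
    return [
        event for event in events
        if event.get("resourceType") != STACK_RESOURCE_TYPE
    ]


# Trimmed events from the example above, plus one invented
# CREATE_COMPLETE event to show that it is filtered out.
SAMPLE_EVENTS = {
    "events": [
        {
            "logicalResourceId": "mycluster",
            "resourceType": "AWS::CloudFormation::Stack",
            "resourceStatus": "CREATE_FAILED",
            "resourceStatusReason": "The following resource(s) failed to create: [HeadNode]. ",
        },
        {
            "logicalResourceId": "HeadNode",
            "resourceType": "AWS::EC2::Instance",
            "resourceStatus": "CREATE_FAILED",
            "resourceStatusReason": "Received FAILURE signal with UniqueId i-04e91cc1f4ea796fe",
        },
        {
            "logicalResourceId": "ExampleResource",  # hypothetical
            "resourceType": "AWS::EC2::SecurityGroup",
            "resourceStatus": "CREATE_COMPLETE",
        },
    ]
}

if __name__ == "__main__":
    for event in root_cause_candidates(failed_events(SAMPLE_EVENTS)):
        print(event["logicalResourceId"], "-", event["resourceStatusReason"])
```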

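Use the CLI to view log streams

To debug this kind of issue, you can list the log streams available from the head node by running pcluster list-cluster-log-streams with a filter on node-type, and then analyze the log stream contents.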

$ pcluster list-cluster-log-streams --cluster-name mycluster --region eu-west-1 \
    --filters 'Name=node-type,Values=HeadNode'
{
  "logStreams": [
    {
      "logStreamArn": "arn:aws:logs:eu-west-1:xxx:log-group:/aws/parallelcluster/mycluster-202109061103:log-stream:ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init",
      "logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init",
      ...
    },
    {
      "logStreamArn": "arn:aws:logs:eu-west-1:xxx:log-group:/aws/parallelcluster/mycluster-202109061103:log-stream:ip-10-0-0-13.i-04e91cc1f4ea796fe.chef-client",
      "logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.chef-client",
      ...
    },
    {
      "logStreamArn": "arn:aws:logs:eu-west-1:xxx:log-group:/aws/parallelcluster/mycluster-202109061103:log-stream:ip-10-0-0-13.i-04e91cc1f4ea796fe.cloud-init",
      "logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cloud-init",
      ...
    },
    ...
  ]
}

The two primary log streams that you can use to find initialization errors are as follows:

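  • cfn-init is the log for the cfn-init script. Check this log stream first. You're likely to see the Command chef failed error in this log. Look at the lines immediately before that line for more details about the error. For more information, see cfn-init.

  • cloud-init is the log for cloud-init. If you don't see anything in cfn-init, check this log next.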
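The order described above (cfn-init first, then cloud-init) can be applied to the list-cluster-log-streams output in a script. The following Python sketch assumes that stream names end with the log name after a final dot, as in the example output. The function name streams_to_check is hypothetical.

```python
from typing import Dict, List

# Order in which the streams are worth reading, per the list above.
STREAM_PRIORITY = ["cfn-init", "cloud-init"]


def streams_to_check(list_output: Dict) -> List[str]:
    """Return the cfn-init and cloud-init stream names, in reading order."""
    names = [
        stream["logStreamName"]
        for stream in list_output.get("logStreams", [])
    ]
    ordered: List[str] = []
    for suffix in STREAM_PRIORITY:
        # Match on ".<suffix>" so that "cloud-init" does not match other
        # stream names that merely contain the text.
        ordered.extend(name for name in names if name.endswith("." + suffix))
    return ordered


# Stream names from the example output above.
SAMPLE_STREAMS = {
    "logStreams": [
        {"logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cloud-init"},
        {"logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.chef-client"},
        {"logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init"},
    ]
}

if __name__ == "__main__":
    for name in streams_to_check(SAMPLE_STREAMS):
        print(name)
```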

You can retrieve the contents of a log stream by using pcluster get-cluster-log-events (note the --limit 5 option, which limits the number of retrieved events):

$ pcluster get-cluster-log-events --cluster-name mycluster \
    --region eu-west-1 --log-stream-name ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init \
    --limit 5
{
  "nextToken": "f/36370880979637159565202782352491087067973952362220945409/s",
  "prevToken": "b/36370880752972385367337528725601470541902663176996585497/s",
  "events": [
    {
      "message": "2021-09-06 11:11:39,049 [ERROR] Unhandled exception during build: Command runpostinstall failed",
      "timestamp": "2021-09-06T11:11:39.049Z"
    },
    {
      "message": "Traceback (most recent call last):\n File \"/opt/aws/bin/cfn-init\", line 176, in <module>\n worklog.build(metadata, configSets)\n File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 135, in build\n Contractor(metadata).build(configSets, self)\n File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 561, in build\n self.run_config(config, worklog)\n File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 573, in run_config\n CloudFormationCarpenter(config, self._auth_config).build(worklog)\n File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 273, in build\n self._config.commands)\n File \"/usr/lib/python3.7/site-packages/cfnbootstrap/command_tool.py\", line 127, in apply\n raise ToolError(u\"Command %s failed\" % name)",
      "timestamp": "2021-09-06T11:11:39.049Z"
    },
    {
      "message": "cfnbootstrap.construction_errors.ToolError: Command runpostinstall failed",
      "timestamp": "2021-09-06T11:11:39.049Z"
    },
    {
      "message": "2021-09-06 11:11:49,212 [DEBUG] CloudFormation client initialized with endpoint https://cloudformation.eu-west-1.amazonaws.com",
      "timestamp": "2021-09-06T11:11:49.212Z"
    },
    {
      "message": "2021-09-06 11:11:49,213 [DEBUG] Signaling resource HeadNode in stack mycluster with unique ID i-04e91cc1f4ea796fe and status FAILURE",
      "timestamp": "2021-09-06T11:11:49.213Z"
    }
  ]
}

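In the previous example, the failure is caused by a runpostinstall failure, so it is strictly related to the content of the custom bootstrap script used in the OnNodeConfigured configuration parameter of CustomActions.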
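To find the failing command without reading every event, you can scan the messages for the "Command ... failed" pattern. The following Python sketch assumes the get-cluster-log-events output has been parsed into a dictionary with an "events" list, as in the example above. The function name failed_commands is hypothetical. The pattern uses \w+ so that the literal "Command %s failed" text inside the traceback is not reported as a command name.

```python
import re
from typing import Dict, List

# \w+ deliberately excludes "%s", which appears in the traceback source line.
FAILED_COMMAND_PATTERN = re.compile(r"Command (\w+) failed")


def failed_commands(log_output: Dict) -> List[str]:
    """Return the distinct command names reported as failed, in first-seen order."""
    seen: List[str] = []
    for event in log_output.get("events", []):
        for name in FAILED_COMMAND_PATTERN.findall(event.get("message", "")):
            if name not in seen:
                seen.append(name)
    return seen


# Trimmed messages from the example output above.
SAMPLE_LOG = {
    "events": [
        {"message": "2021-09-06 11:11:39,049 [ERROR] Unhandled exception during build: Command runpostinstall failed"},
        {"message": 'Traceback (most recent call last):\n raise ToolError(u"Command %s failed" % name)'},
        {"message": "cfnbootstrap.construction_errors.ToolError: Command runpostinstall failed"},
    ]
}

if __name__ == "__main__":
    print(failed_commands(SAMPLE_LOG))
```

For the sample data this reports a single name, runpostinstall, which points at the OnNodeConfigured script.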

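Recreate the failed cluster with rollback-on-failure

AWS ParallelCluster creates cluster log streams in CloudWatch log groups. You can view these logs in the CloudWatch console, under Custom Dashboards or Log groups. For more information, see Integration with Amazon CloudWatch Logs and Amazon CloudWatch dashboard. If no log streams are available, the failure might be caused by the CustomActions custom bootstrap script or by an AMI-related issue. To diagnose the creation issue in this case, create the cluster again with pcluster create-cluster, including the --rollback-on-failure parameter set to false. Then, connect to the cluster over SSH, as follows: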

$ pcluster create-cluster --cluster-name mycluster --region eu-west-1 \
    --cluster-configuration cluster-config.yaml --rollback-on-failure false
{
  "cluster": {
    "clusterName": "mycluster",
    "cloudformationStackStatus": "CREATE_IN_PROGRESS",
    "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f01-11ec-a3b9-024fcc6f3387",
    "region": "eu-west-1",
    "version": "3.7.0",
    "clusterStatus": "CREATE_IN_PROGRESS"
  }
}
$ pcluster ssh --cluster-name mycluster

After you log in to the head node, you should find three primary log files that you can use to find the error.

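  • /var/log/cfn-init.log is the log for the cfn-init script. Check this log first. You're likely to see an error like Command chef failed in this log. Look at the lines immediately before that line for more details about the error. For more information, see cfn-init.

  • /var/log/cloud-init.log is the log for cloud-init. If you don't see anything in cfn-init.log, check this log next.

  • /var/log/cloud-init-output.log is the output of commands that were run by cloud-init. This includes the output from cfn-init. In most cases, you don't need to look at this log to troubleshoot this type of issue.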
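Because the useful details usually sit just before the failure line, a small helper can print that context for you. The following Python sketch returns the lines leading up to the first occurrence of a marker such as Command chef failed. The function name, the default context size, and the sample log lines are all hypothetical. On the head node you could pass it the contents of /var/log/cfn-init.log.

```python
from typing import List


def lines_before_failure(
    log_text: str,
    marker: str = "Command chef failed",
    context: int = 5,
) -> List[str]:
    """Return up to `context` lines before the first line containing
    `marker`, followed by that line itself. Return [] if no line matches."""
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if marker in line:
            start = max(0, index - context)
            return lines[start:index + 1]
    return []


# Hypothetical log excerpt, invented for illustration only.
SAMPLE_LOG_TEXT = "\n".join(
    [
        "[INFO] step one finished",
        "[INFO] step two finished",
        "[ERROR] example detail about what went wrong",
        "[ERROR] Command chef failed",
        "[DEBUG] signaling failure",
    ]
)

if __name__ == "__main__":
    # On the head node, you might instead read the real file:
    #   from pathlib import Path
    #   text = Path("/var/log/cfn-init.log").read_text()
    for line in lines_before_failure(SAMPLE_LOG_TEXT, context=2):
        print(line)
```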