Running jobs in a multiple queue mode cluster

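This tutorial shows how to run your first "Hello World" job on AWS ParallelCluster with multiple queue mode.

When you use the AWS ParallelCluster command line interface (CLI) or API, you pay only for the AWS resources that are created when you create or update AWS ParallelCluster images and clusters. For more information, see AWS services used by AWS ParallelCluster.

The AWS ParallelCluster UI is built on a serverless architecture, and you can use it within the AWS Free Tier category. For more information, see AWS ParallelCluster UI costs.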

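Prerequisites

AWS ParallelCluster is installed.
The AWS CLI is installed and configured.
You have an Amazon EC2 key pair.
You have an IAM role with the permissions that are required to run the pcluster CLI.

Configure your cluster

First, verify that AWS ParallelCluster is installed correctly by running the following command.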


$ pcluster version

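For more information about pcluster version, see pcluster version.

This command returns the running version of AWS ParallelCluster.

Next, run pcluster configure to generate a basic configuration file. Follow all of the prompts for this command.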


$ pcluster configure --config multi-queue-mode.yaml

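For more information about the pcluster configure command, see pcluster configure.

After you complete this step, a basic configuration file named multi-queue-mode.yaml appears. This file contains a basic cluster configuration.

In the next step, you modify your new configuration file and launch a cluster with multiple queues.

Note

Some instances used in this tutorial aren't free tier eligible.

For this tutorial, modify your configuration file to match the following configuration. The placeholder values (region-id, the subnet IDs, and yourkeypair) stand in for your own configuration file values. Keep your own values.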


Region: region-id
Image:
 Os: alinux2
HeadNode:
 InstanceType: c5.xlarge
 Networking:
   SubnetId: subnet-abcdef01234567890
 Ssh:
   KeyName: yourkeypair
Scheduling:
 Scheduler: slurm
 SlurmQueues:
 - Name: spot
   ComputeResources:
   - Name: c5xlarge
     InstanceType: c5.xlarge
     MinCount: 1
     MaxCount: 10
   - Name: t2micro
     InstanceType: t2.micro
     MinCount: 1
     MaxCount: 10
   Networking:
     SubnetIds:
     - subnet-abcdef01234567890
 - Name: ondemand
   ComputeResources:
   - Name: c52xlarge
     InstanceType: c5.2xlarge
     MinCount: 0
     MaxCount: 10
   Networking:
     SubnetIds:
     - subnet-021345abcdef6789
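Because some of these instances aren't free tier eligible, it can help to know how many nodes the configuration keeps running. The following is a minimal sketch, assuming that each MinCount value sits on its own "MinCount: <number>" line as in the configuration above. The helper name count_static_nodes is hypothetical and not part of the pcluster CLI.

```shell
#!/bin/bash
# Sum every MinCount value in a cluster configuration read from stdin.
# Assumption: each value is written as "MinCount: <number>" on its own line,
# as in the configuration shown in this tutorial.
count_static_nodes() {
  awk '$1 == "MinCount:" { total += $2 } END { print total + 0 }'
}

# Example (not run here): count_static_nodes < multi-queue-mode.yaml
```

For the configuration in this tutorial, the sum is 2, which corresponds to the two nodes with -st- in their names that sinfo reports later.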

Create your cluster

Create a cluster named multi-queue-cluster based on your configuration file.


$ pcluster create-cluster --cluster-name multi-queue-cluster --cluster-configuration multi-queue-mode.yaml
{
 "cluster": {
   "clusterName": "multi-queue-cluster",
   "cloudformationStackStatus": "CREATE_IN_PROGRESS",
   "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:123456789012:stack/multi-queue-cluster/1234567-abcd-0123-def0-abcdef0123456",
   "region": "eu-west-1",
   "version": "3.7.0",
   "clusterStatus": "CREATE_IN_PROGRESS"
 }
}

For more information about the pcluster create-cluster command, see pcluster create-cluster.

To check the status of the cluster, run the following command.


$ pcluster list-clusters
{
 "clusters": [
   {
     "clusterName": "multi-queue-cluster",
     "cloudformationStackStatus": "CREATE_IN_PROGRESS",
     "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:123456789012:stack/multi-queue-cluster/1234567-abcd-0123-def0-abcdef0123456",
     "region": "eu-west-1",
     "version": "3.7.0",
     "clusterStatus": "CREATE_IN_PROGRESS"
   }
 ]
}

When the cluster is created, the clusterStatus field shows CREATE_COMPLETE.
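Rather than rerunning the status command by hand, the check can be scripted. The following is a minimal sketch, assuming that the JSON output places "clusterStatus" on its own line as in the samples above. The helper names cluster_status and wait_for_cluster are hypothetical, and the 30-second default polling interval is an arbitrary choice.

```shell
#!/bin/bash
# Extract the first clusterStatus value from pcluster JSON output on stdin.
# Assumption: the field is printed as "clusterStatus": "<VALUE>" on one line.
cluster_status() {
  sed -n 's/.*"clusterStatus": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll pcluster describe-cluster until the cluster leaves CREATE_IN_PROGRESS.
# Returns success only if the final status is CREATE_COMPLETE.
wait_for_cluster() {
  local name="$1" interval="${2:-30}" status
  while true; do
    status=$(pcluster describe-cluster --cluster-name "$name" | cluster_status)
    echo "clusterStatus: ${status}"
    if [ "$status" != "CREATE_IN_PROGRESS" ]; then
      break
    fi
    sleep "$interval"
  done
  [ "$status" = "CREATE_COMPLETE" ]
}

# Example (not run here): wait_for_cluster multi-queue-cluster
```

The loop stops on any status other than CREATE_IN_PROGRESS, so a failed creation or a failed command ends the wait with a non-zero exit code instead of polling forever.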

Log in to the head node

Use your private SSH key file to log in to the head node.


$ pcluster ssh --cluster-name multi-queue-cluster -i ~/path/to/yourkeyfile.pem

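For more information about pcluster ssh, see pcluster ssh.

After you log in, run the sinfo command to verify that your scheduler queues are set up and configured.

For more information about sinfo, see sinfo in the Slurm documentation.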


$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
spot*        up   infinite     18  idle~ spot-dy-c5xlarge-[1-9],spot-dy-t2micro-[1-9]
spot*        up   infinite      2  idle  spot-st-c5xlarge-1,spot-st-t2micro-1
ondemand     up   infinite     10  idle~ ondemand-dy-c52xlarge-[1-10]

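The output shows that your cluster has one t2.micro and one c5.xlarge compute node in the idle state, and both are available.

All other nodes are in the power-saving state, indicated by the ~ suffix on the node state, with no Amazon EC2 instances backing them. The default queue is indicated by a * suffix after its queue name. spot is your default job queue.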
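To see at a glance how many nodes in each queue are powered down, the sinfo output can be summarized with awk. The following is a minimal sketch, assuming the default column layout shown above (PARTITION AVAIL TIMELIMIT NODES STATE NODELIST). The helper name summarize_sinfo is hypothetical.

```shell
#!/bin/bash
# Summarize default sinfo output read from stdin. For each partition, print
# the number of nodes whose state ends in "~" (power saving) and the number
# of nodes in any other state.
summarize_sinfo() {
  awk 'NR > 1 {
         part = $1
         sub(/\*$/, "", part)                 # drop the default-queue marker
         if (!(part in seen)) { seen[part] = 1; order[++n] = part }
         if ($5 ~ /~$/) { down[part] += $4 } else { other[part] += $4 }
       }
       END {
         for (i = 1; i <= n; i++)
           printf "%s powered_down=%d other=%d\n", order[i], down[order[i]], other[order[i]]
       }'
}

# Example (not run here): sinfo | summarize_sinfo
```

With the output shown above, this prints "spot powered_down=18 other=2" followed by "ondemand powered_down=10 other=0".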

Run a job in multiple queue mode

Next, run a job that sleeps for a while and then outputs its own hostname. Make sure that the current user can run this script.


$ tee <<EOF hellojob.sh
#!/bin/bash
sleep 30
echo "Hello World from \$(hostname)"
EOF

$ chmod +x hellojob.sh
$ ls -l hellojob.sh
-rwxrwxr-x 1 ec2-user ec2-user 57 Sep 23 21:57 hellojob.sh

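Submit the job using the sbatch command. Request two nodes for this job with the -N 2 option, and verify that the job submits successfully. For more information about sbatch, see sbatch in the Slurm documentation.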


$ sbatch -N 2 --wrap "srun hellojob.sh"
Submitted batch job 1

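You can view your queue and check the status of the job with the squeue command. Because you didn't specify a queue, the default queue (spot) is used. For more information about squeue, see squeue in the Slurm documentation.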


$ squeue
JOBID PARTITION     NAME     USER  ST      TIME  NODES NODELIST(REASON)
   1      spot     wrap ec2-user  R       0:10      2 spot-st-c5xlarge-1,spot-st-t2micro-1

The output shows that the job is currently in a running state. Wait for the job to finish. This takes about 30 seconds. Then, run squeue again.


$ squeue
JOBID PARTITION     NAME     USER          ST       TIME  NODES NODELIST(REASON)
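The wait between the two squeue calls can also be scripted. The following is a minimal sketch, assuming the default squeue column layout shown above, where the fifth column (ST) holds the job state. The helper names job_state and wait_for_job are hypothetical, and the 5-second default polling interval is an arbitrary choice.

```shell
#!/bin/bash
# Print the state code (ST column) of one job from default squeue output
# read from stdin. Prints nothing when the job is no longer listed.
job_state() {
  awk -v id="$1" 'NR > 1 && $1 == id { print $5 }'
}

# Poll squeue until the given job ID no longer appears in the queue.
wait_for_job() {
  local job_id="$1" interval="${2:-5}"
  while [ -n "$(squeue | job_state "$job_id")" ]; do
    sleep "$interval"
  done
}

# Example (not run here): wait_for_job 1 && cat slurm-1.out
```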

Now that the jobs in the queue have all finished, look for the output file named slurm-1.out in your current directory.


$ cat slurm-1.out
Hello World from spot-st-t2micro-1
Hello World from spot-st-c5xlarge-1

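The output shows that the job ran successfully on the spot-st-t2micro-1 and spot-st-c5xlarge-1 nodes.

Now submit the same job, this time specifying constraints for specific instances, with the following command.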


$ sbatch -N 3 -p spot -C "[c5.xlarge*1&t2.micro*2]" --wrap "srun hellojob.sh"
Submitted batch job 2

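You used these parameters for sbatch:

-N 3 – Requests three nodes.
-p spot – Submits the job to the spot queue. You can also submit a job to the ondemand queue by specifying -p ondemand.
-C "[c5.xlarge*1&t2.micro*2]" – Specifies the node constraints for this job. This requests one c5.xlarge node and two t2.micro nodes to be used for this job.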
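The node count passed to -N has to agree with the counts inside the -C constraint. The following is a minimal sketch of a helper that derives both from the same list of "instance-type:count" pairs. The name build_sbatch_args and the pair syntax are hypothetical conveniences, not part of Slurm or AWS ParallelCluster.

```shell
#!/bin/bash
# Build the -N, -p, and -C arguments for sbatch from a queue name followed
# by "instance-type:count" pairs, so the node total always matches the
# counts in the constraint.
build_sbatch_args() {
  local queue="$1" constraint="" total=0 pair feature count
  shift
  for pair in "$@"; do
    feature="${pair%%:*}"      # text before the first colon
    count="${pair##*:}"        # text after the last colon
    constraint="${constraint:+${constraint}&}${feature}*${count}"
    total=$((total + count))
  done
  printf '%s\n' "-N ${total} -p ${queue} -C \"[${constraint}]\""
}

# Prints the same arguments that this tutorial passes to sbatch.
build_sbatch_args spot c5.xlarge:1 t2.micro:2
```

The printed line can be pasted in front of --wrap "srun hellojob.sh" when you submit the job.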

Run the sinfo command to view the nodes and queues. Queues in AWS ParallelCluster are called partitions in Slurm.


$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
spot*        up   infinite      1  alloc# spot-dy-t2micro-1
spot*        up   infinite     17  idle~ spot-dy-c5xlarge-[2-10],spot-dy-t2micro-[2-9]
spot*        up   infinite      1  mix   spot-st-c5xlarge-1
spot*        up   infinite      1  alloc spot-st-t2micro-1
ondemand     up   infinite     10  idle~ ondemand-dy-c52xlarge-[1-10]

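The nodes are powering up. This is indicated by the # suffix on the node state. Run the squeue command to view information about the jobs in the cluster.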


$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   2      spot     wrap ec2-user CF       0:04      3 spot-dy-c5xlarge-1,spot-dy-t2micro-1,spot-st-t2micro-1

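Your job is in the CF (CONFIGURING) state, waiting for instances to scale up and join the cluster.

After about three minutes, the nodes are available and the job enters the R (RUNNING) state.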


$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   2      spot     wrap ec2-user  R       0:07      3 spot-dy-t2micro-1,spot-st-c5xlarge-1,spot-st-t2micro-1

The job finishes, and all three nodes are in the idle state.


$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
spot*        up   infinite     17  idle~ spot-dy-c5xlarge-[1-9],spot-dy-t2micro-[2-9]
spot*        up   infinite      3  idle  spot-dy-t2micro-1,spot-st-c5xlarge-1,spot-st-t2micro-1
ondemand     up   infinite     10  idle~ ondemand-dy-c52xlarge-[1-10]

Then, after no jobs remain in the queue, check for slurm-2.out in your local directory.


$ cat slurm-2.out 
Hello World from spot-st-t2micro-1
Hello World from spot-dy-t2micro-1
Hello World from spot-st-c5xlarge-1

This is the final state of the cluster.


$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
spot*        up   infinite     17  idle~ spot-dy-c5xlarge-[1-9],spot-dy-t2micro-[2-9]
spot*        up   infinite      3  idle  spot-dy-t2micro-1,spot-st-c5xlarge-1,spot-st-t2micro-1
ondemand     up   infinite     10  idle~ ondemand-dy-c52xlarge-[1-10]

After you log off of the cluster, you can clean up by running pcluster delete-cluster. For more information, see pcluster list-clusters and pcluster delete-cluster.


$ pcluster list-clusters
{
 "clusters": [
   {
     "clusterName": "multi-queue-cluster",
     "cloudformationStackStatus": "CREATE_COMPLETE",
     "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:123456789012:stack/multi-queue-cluster/1234567-abcd-0123-def0-abcdef0123456",
     "region": "eu-west-1",
     "version": "3.1.4",
     "clusterStatus": "CREATE_COMPLETE"
   }
 ]
}
$ pcluster delete-cluster -n multi-queue-cluster
{
 "cluster": {
   "clusterName": "multi-queue-cluster",
   "cloudformationStackStatus": "DELETE_IN_PROGRESS",
   "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:123456789012:stack/multi-queue-cluster/1234567-abcd-0123-def0-abcdef0123456",
   "region": "eu-west-1",
   "version": "3.1.4",
   "clusterStatus": "DELETE_IN_PROGRESS"
 }
}
