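Running jobs in a multiple queue mode cluster

This tutorial covers how to run your first "Hello World" job on AWS ParallelCluster with multiple queue mode.

When using the AWS ParallelCluster command line interface (CLI) or API, you only pay for the AWS resources that are created when you create or update AWS ParallelCluster images and clusters. For more information, see AWS services used by AWS ParallelCluster.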

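Prerequisites

AWS ParallelCluster is installed.
The AWS CLI is installed and configured.
You have an Amazon EC2 key pair.
You have an IAM role with the permissions that are required to run the pcluster CLI.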

Configure your cluster

First, verify that AWS ParallelCluster is installed correctly by running the following command.


$ pcluster version

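For more information about pcluster version, see pcluster version.

This command returns the running version of AWS ParallelCluster.

Next, run pcluster configure to generate a basic configuration file. Follow all the prompts that appear after you run this command.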


$ pcluster configure --config multi-queue-mode.yaml

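For more information about the pcluster configure command, see pcluster configure.

After you complete this step, a basic configuration file named multi-queue-mode.yaml appears. This file contains a basic cluster configuration.

In the next step, you modify your new configuration file and launch a cluster with multiple queues.

Note

Some instances that are used in this tutorial aren't eligible for the free tier.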

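For this tutorial, modify your configuration file to match the following configuration. The region-id, subnet ID, and key pair name values are placeholders for your configuration file values. Replace them with your own values.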


Region: region-id
Image:
 Os: alinux2
HeadNode:
 InstanceType: c5.xlarge
 Networking:
   SubnetId: subnet-abcdef01234567890
 Ssh:
   KeyName: yourkeypair
Scheduling:
 Scheduler: slurm
 SlurmQueues:
 - Name: spot
   ComputeResources:
   - Name: c5xlarge
     InstanceType: c5.xlarge
     MinCount: 1
     MaxCount: 10
   - Name: t2micro
     InstanceType: t2.micro
     MinCount: 1
     MaxCount: 10
   Networking:
     SubnetIds:
     - subnet-abcdef01234567890
 - Name: ondemand
   ComputeResources:
   - Name: c52xlarge
     InstanceType: c5.2xlarge
     MinCount: 0
     MaxCount: 10
   Networking:
     SubnetIds:
     - subnet-021345abcdef6789
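The number of nodes that this configuration should produce can be worked out ahead of time. The following is a minimal sketch, assuming that each compute resource keeps MinCount nodes running as static nodes and launches the remaining MaxCount - MinCount nodes on demand as dynamic nodes. node_counts is a hypothetical helper for illustration, not part of the pcluster CLI.

```bash
# node_counts QUEUE RESOURCE MIN MAX
# Prints how many static and dynamic nodes to expect for one compute resource.
# Assumption: static nodes = MinCount, dynamic nodes = MaxCount - MinCount.
node_counts() {
  queue=$1; resource=$2; min=$3; max=$4
  if [ "$min" -gt "$max" ]; then
    echo "error: MinCount ($min) exceeds MaxCount ($max)" >&2
    return 1
  fi
  echo "${queue}/${resource}: static=${min} dynamic=$((max - min))"
}

# Values copied from the configuration above.
node_counts spot c5xlarge 1 10
node_counts spot t2micro 1 10
node_counts ondemand c52xlarge 0 10
```

These counts can be compared with the sinfo output after the cluster is running.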

Create your cluster

Create a cluster that's named multi-queue-cluster based on your configuration file.


$ pcluster create-cluster --cluster-name multi-queue-cluster --cluster-configuration multi-queue-mode.yaml
{
 "cluster": {
   "clusterName": "multi-queue-cluster",
   "cloudformationStackStatus": "CREATE_IN_PROGRESS",
   "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:123456789012:stack/multi-queue-cluster/1234567-abcd-0123-def0-abcdef0123456",
   "region": "eu-west-1",
   "version": "3.14.0",
   "clusterStatus": "CREATE_IN_PROGRESS"
 }
}

For more information about the pcluster create-cluster command, see pcluster create-cluster.

To check the status of the cluster, run the following command.


$ pcluster list-clusters
{
 "clusters": [
   {
     "clusterName": "multi-queue-cluster",
     "cloudformationStackStatus": "CREATE_IN_PROGRESS",
     "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:123456789012:stack/multi-queue-cluster/1234567-abcd-0123-def0-abcdef0123456",
     "region": "eu-west-1",
     "version": "3.14.0",
     "clusterStatus": "CREATE_IN_PROGRESS"
   }
 ]
}

When the cluster is created, the clusterStatus field shows CREATE_COMPLETE.
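Instead of rerunning the status command by hand, the check can be put in a loop. The following is a minimal sketch, assuming that pcluster describe-cluster (a subcommand that this tutorial does not otherwise use) prints JSON with a clusterStatus field in the same form as the output above. wait_for_cluster and its interval and attempt defaults are hypothetical.

```bash
# wait_for_cluster NAME [INTERVAL_SECONDS] [MAX_ATTEMPTS]
# Polls the cluster status until it reaches CREATE_COMPLETE.
# Assumption: `pcluster describe-cluster` prints JSON that contains a
# "clusterStatus" field like the output shown above.
wait_for_cluster() {
  name=$1
  interval=${2:-30}
  attempts=${3:-60}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Pull the value of "clusterStatus" out of the JSON text.
    status=$(pcluster describe-cluster --cluster-name "$name" |
      sed -n 's/.*"clusterStatus": *"\([^"]*\)".*/\1/p' | head -n 1)
    echo "clusterStatus: ${status:-unknown}"
    case $status in
      CREATE_COMPLETE) return 0 ;;
      CREATE_FAILED) return 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for $name" >&2
  return 1
}
```

For this tutorial, the call would be wait_for_cluster multi-queue-cluster. The sed expression is a shortcut that avoids a JSON parser. A tool such as jq would be more robust if it is available.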

Log in to the head node

Use your private SSH key file to log in to the head node.


$ pcluster ssh --cluster-name multi-queue-cluster -i ~/path/to/yourkeyfile.pem

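For more information about pcluster ssh, see pcluster ssh.

After logging in, run the sinfo command to verify that your scheduler queues are set up and configured.

For more information about sinfo, see sinfo in the Slurm documentation.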


$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
spot*        up   infinite     18  idle~ spot-dy-c5xlarge-[1-9],spot-dy-t2micro-[1-9]
spot*        up   infinite      2  idle  spot-st-c5xlarge-1,spot-st-t2micro-1
ondemand     up   infinite     10  idle~ ondemand-dy-c52xlarge-[1-10]

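The output shows that you have one t2.micro and one c5.xlarge compute node in the idle state in your cluster.

The other nodes are all in the power saving state, which is indicated by the ~ suffix in the node state, with no Amazon EC2 instances backing them. The default queue is indicated by a * suffix after its queue name. spot is your default job queue.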
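The split between nodes that are backed by instances and nodes in the power saving state can be totaled from the sinfo listing. The following is a minimal sketch, assuming the default sinfo column layout shown above, with the node count in column 4 and the state in column 5. summarize_sinfo is a hypothetical helper.

```bash
# summarize_sinfo: reads `sinfo` output on stdin and totals the nodes into
# "active" (no ~ suffix) and "power-saving" (state ends in ~).
# Assumption: default sinfo columns, NODES in column 4 and STATE in column 5.
summarize_sinfo() {
  awk 'NR > 1 {
         if ($5 ~ /~$/) saving += $4; else active += $4
       }
       END { printf "active=%d power-saving=%d\n", active, saving }'
}
```

On the head node, sinfo | summarize_sinfo prints active=2 power-saving=28 for the listing above.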

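Run job in multiple queue mode

Next, try to run a job that sleeps for a while. The job later outputs its own hostname. Make sure that the current user can run this script.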


$ tee <<EOF hellojob.sh
#!/bin/bash
sleep 30
echo "Hello World from \$(hostname)"
EOF

$ chmod +x hellojob.sh
$ ls -l hellojob.sh
-rwxrwxr-x 1 ec2-user ec2-user 57 Sep 23 21:57 hellojob.sh

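Submit the job using the sbatch command. Request two nodes for this job with the -N 2 option, and verify that the job submits successfully. For more information about sbatch, see sbatch in the Slurm documentation.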


$ sbatch -N 2 --wrap "srun hellojob.sh"
Submitted batch job 1

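You can view your queue and check the status of the job with the squeue command. Because you didn't specify a specific queue, the default queue (spot) is used. For more information about squeue, see squeue in the Slurm documentation.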


$ squeue
JOBID PARTITION     NAME     USER  ST      TIME  NODES NODELIST(REASON)
   1      spot     wrap ec2-user  R       0:10      2 spot-st-c5xlarge-1,spot-st-t2micro-1

The output shows that the job is currently in a running state. Wait for the job to finish. This takes about 30 seconds. Then, run squeue again.


$ squeue
JOBID PARTITION     NAME     USER          ST       TIME  NODES NODELIST(REASON)
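Waiting for the job can also be scripted rather than timed by hand. The following is a minimal sketch that polls squeue until the job no longer appears in an active state. It uses the squeue options -h (no header), -j (job ID), and -t (state filter). wait_for_job and its default interval are hypothetical.

```bash
# wait_for_job JOB_ID [INTERVAL_SECONDS]
# Blocks until the job is no longer pending, configuring, running, or
# completing. -h drops the header line so an empty result means "done".
wait_for_job() {
  job_id=$1
  interval=${2:-5}
  while [ -n "$(squeue -h -j "$job_id" \
                  -t PENDING,CONFIGURING,RUNNING,COMPLETING 2>/dev/null)" ]; do
    sleep "$interval"
  done
  echo "job $job_id is no longer in the queue"
}
```

For the job submitted above, the call would be wait_for_job 1.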

Now that the jobs in the queue have all finished, look for the output file that's named slurm-1.out in your current directory.


$ cat slurm-1.out
Hello World from spot-st-t2micro-1
Hello World from spot-st-c5xlarge-1

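The output shows that the job ran successfully on the spot-st-t2micro-1 and spot-st-c5xlarge-1 nodes.

Now submit the same job by specifying constraints for specific instances with the following command.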


$ sbatch -N 3 -p spot -C "[c5.xlarge*1&t2.micro*2]" --wrap "srun hellojob.sh"
Submitted batch job 2

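You used the following parameters for sbatch:

-N 3: requests three nodes.
-p spot: submits the job to the spot queue. You can also submit a job to the ondemand queue by specifying -p ondemand.
-C "[c5.xlarge*1&t2.micro*2]": specifies the specific node constraints for this job. This requests one c5.xlarge node and two t2.micro nodes to be used for this job.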
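The -N value and the constraint string can be assembled from instance type and count pairs. The following is a minimal sketch, assuming that the node total passed to -N should equal the sum of the per-type counts, as it does in the command above (3 = 1 + 2). build_constraint is a hypothetical helper, not an sbatch feature.

```bash
# build_constraint TYPE COUNT [TYPE COUNT ...]
# Prints sbatch options for a multi-type request: -N is the sum of the
# per-type counts, and -C lists each type as TYPE*COUNT joined by &.
build_constraint() {
  total=0
  spec=""
  while [ "$#" -ge 2 ]; do
    spec="${spec:+$spec&}$1*$2"   # append "&" only between entries
    total=$((total + $2))
    shift 2
  done
  printf '%s\n' "-N $total -C \"[$spec]\""
}

# Reproduces the options used in the command above.
build_constraint c5.xlarge 1 t2.micro 2
```

The example call prints -N 3 -C "[c5.xlarge*1&t2.micro*2]".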

Run the sinfo command to view the nodes and queues. Queues in AWS ParallelCluster are called partitions in Slurm.


$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
spot*        up   infinite      1  alloc# spot-dy-t2micro-1
spot*        up   infinite     17  idle~ spot-dy-c5xlarge-[1-9],spot-dy-t2micro-[2-9]
spot*        up   infinite      1  mix   spot-st-c5xlarge-1
spot*        up   infinite      1  alloc spot-st-t2micro-1
ondemand     up   infinite     10  idle~ ondemand-dy-c52xlarge-[1-10]

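The nodes are powering up. This is indicated by the # suffix on the node state. Run the squeue command to view information about the jobs in the cluster.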


$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   2      spot     wrap ec2-user CF       0:04      3 spot-dy-t2micro-1,spot-st-c5xlarge-1,spot-st-t2micro-1

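Your job is in the CF (CONFIGURING) state, waiting for instances to scale up and join the cluster.

After about three minutes, the nodes are available and the job enters the R (RUNNING) state.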


$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   2      spot     wrap ec2-user  R       0:07      3 spot-dy-t2micro-1,spot-st-c5xlarge-1,spot-st-t2micro-1

The job finishes, and all three nodes are in the idle state.


$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
spot*        up   infinite     17  idle~ spot-dy-c5xlarge-[1-9],spot-dy-t2micro-[2-9]
spot*        up   infinite      3  idle  spot-dy-t2micro-1,spot-st-c5xlarge-1,spot-st-t2micro-1
ondemand     up   infinite     10  idle~ ondemand-dy-c52xlarge-[1-10]

Then, after no jobs remain in the queue, check for slurm-2.out in your local directory.


$ cat slurm-2.out 
Hello World from spot-st-t2micro-1
Hello World from spot-dy-t2micro-1
Hello World from spot-st-c5xlarge-1
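To confirm that the job ran on the mix of nodes that the constraint requested, the hostnames in the output file can be grouped by compute resource. The following is a minimal sketch, assuming node names follow the queue-st|dy-resource-index pattern seen in the sinfo output. count_by_resource is a hypothetical helper.

```bash
# count_by_resource: reads "Hello World from <node>" lines on stdin and
# counts the nodes per compute resource.
# Assumption: node names look like <queue>-<st|dy>-<compute resource>-<index>,
# so the compute resource is the second-to-last dash-separated field.
count_by_resource() {
  awk '{ n = split($NF, part, "-"); count[part[n - 1]]++ }
       END { for (r in count) print r, count[r] }' | sort
}
```

Running count_by_resource < slurm-2.out on the output above prints c5xlarge 1 and t2micro 2 on separate lines, which matches the one c5.xlarge node and two t2.micro nodes that were requested.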

This is the final state of the cluster.


$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
spot*        up   infinite     17  idle~ spot-dy-c5xlarge-[1-9],spot-dy-t2micro-[2-9]
spot*        up   infinite      3  idle  spot-dy-t2micro-1,spot-st-c5xlarge-1,spot-st-t2micro-1
ondemand     up   infinite     10  idle~ ondemand-dy-c52xlarge-[1-10]

After logging off of the cluster, you can clean up by running pcluster delete-cluster. For more information, see pcluster list-clusters and pcluster delete-cluster.


$ pcluster list-clusters
{
 "clusters": [
   {
     "clusterName": "multi-queue-cluster",
     "cloudformationStackStatus": "CREATE_COMPLETE",
     "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:123456789012:stack/multi-queue-cluster/1234567-abcd-0123-def0-abcdef0123456",
     "region": "eu-west-1",
     "version": "3.14.0",
     "clusterStatus": "CREATE_COMPLETE"
   }
 ]
}
$ pcluster delete-cluster -n multi-queue-cluster
{
 "cluster": {
   "clusterName": "multi-queue-cluster",
   "cloudformationStackStatus": "DELETE_IN_PROGRESS",
   "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:123456789012:stack/multi-queue-cluster/1234567-abcd-0123-def0-abcdef0123456",
   "region": "eu-west-1",
   "version": "3.14.0",
   "clusterStatus": "DELETE_IN_PROGRESS"
 }
}
