Running jobs in a multiple queue mode cluster
This tutorial covers how to run your first "Hello World" job on AWS ParallelCluster with multiple queue mode.
When using the AWS ParallelCluster command line interface (CLI) or API, you only pay for the AWS resources that are created when you create or update AWS ParallelCluster images and clusters. For more information, see AWS services used by AWS ParallelCluster.
AWS ParallelCluster UI (PCUI) is built on a serverless architecture, and you can use it within the AWS Free Tier category for most use cases. For more information, see PCUI costs.
Prerequisites
- AWS ParallelCluster is installed.
- The AWS CLI is installed and configured.
- You have an Amazon EC2 key pair.
- You have an IAM role with the permissions that are required to run the pcluster CLI.
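If you want to quickly confirm these prerequisites from your shell, checks such as the following can help. This is a minimal sketch; yourkeypair is a placeholder for the name of your own Amazon EC2 key pair.
$ aws --version
$ aws sts get-caller-identity
$ aws ec2 describe-key-pairs --key-names yourkeypair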
Configure your cluster
First, verify that AWS ParallelCluster is correctly installed by running the following command.
$ pcluster version
For more information about pcluster version, see pcluster version.
This command returns the running version of AWS ParallelCluster.
Next, run pcluster configure to generate a basic configuration file. Respond to all of the prompts that this command displays.
$ pcluster configure --config multi-queue-mode.yaml
For more information about the pcluster configure command, see pcluster configure.
After you complete this step, a configuration file named multi-queue-mode.yaml appears. This file contains a basic cluster configuration.
In the next step, you modify your new configuration file and launch a cluster with multiple queues.
Note
Some instances that this tutorial uses aren't free-tier eligible.
For this tutorial, modify your configuration file to match the following configuration. Substitute your own values for the placeholder items, such as the Region ID, the subnet IDs, and the key pair name.
Region: region-id
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-abcdef01234567890
  Ssh:
    KeyName: yourkeypair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: spot
      ComputeResources:
        - Name: c5xlarge
          InstanceType: c5.xlarge
          MinCount: 1
          MaxCount: 10
        - Name: t2micro
          InstanceType: t2.micro
          MinCount: 1
          MaxCount: 10
      Networking:
        SubnetIds:
          - subnet-abcdef01234567890
    - Name: ondemand
      ComputeResources:
        - Name: c52xlarge
          InstanceType: c5.2xlarge
          MinCount: 0
          MaxCount: 10
      Networking:
        SubnetIds:
          - subnet-021345abcdef6789
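Optionally, you can validate this configuration before creating any resources. The pcluster create-cluster command accepts a --dryrun flag that performs request validation only; a minimal sketch of that check, using the cluster name from the next step, is the following.
$ pcluster create-cluster --cluster-name multi-queue-cluster --cluster-configuration multi-queue-mode.yaml --dryrun true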
Create your cluster
Create a cluster that's named multi-queue-cluster based on your configuration file.
$ pcluster create-cluster --cluster-name multi-queue-cluster --cluster-configuration multi-queue-mode.yaml
{ "cluster": { "clusterName": "multi-queue-cluster", "cloudformationStackStatus": "CREATE_IN_PROGRESS", "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:123456789012:stack/multi-queue-cluster/1234567-abcd-0123-def0-abcdef0123456", "region": "eu-west-1", "version": "3.7.0", "clusterStatus": "CREATE_IN_PROGRESS" } }
For more information about the pcluster create-cluster command, see pcluster create-cluster.
To check the status of the cluster, run the following command.
$ pcluster list-clusters
{ "cluster": { "clusterName": "multi-queue-cluster", "cloudformationStackStatus": "CREATE_IN_PROGRESS", "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:123456789012:stack/multi-queue-cluster/1234567-abcd-0123-def0-abcdef0123456", "region": "eu-west-1", "version": "3.7.0", "clusterStatus": "CREATE_IN_PROGRESS" } }
When the cluster is created, the clusterStatus field shows CREATE_COMPLETE.
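If you prefer to wait from the command line instead of rerunning pcluster list-clusters by hand, a minimal polling sketch (assuming the cluster name above) is the following. It calls pcluster describe-cluster until the status string appears in the output.
$ until pcluster describe-cluster --cluster-name multi-queue-cluster | grep -q '"clusterStatus": "CREATE_COMPLETE"'; do sleep 30; done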
Log in to the head node
Use your private SSH key file to log in to the head node.
$ pcluster ssh --cluster-name multi-queue-cluster -i ~/path/to/yourkeyfile.pem
For more information about pcluster ssh, see pcluster ssh.
After you log in, run the sinfo command to verify that your scheduler queues are set up and configured.
For more information about sinfo, see sinfo.
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
spot*        up   infinite     18  idle~ spot-dy-c5xlarge-[1-9],spot-dy-t2micro-[1-9]
spot*        up   infinite      2   idle spot-st-c5xlarge-1,spot-st-t2micro-1
ondemand     up   infinite     10  idle~ ondemand-dy-c52xlarge-[1-10]
The output shows that you have one t2.micro and one c5.xlarge compute node in the idle state that are available in your cluster. The other nodes are all in the power saving state, indicated by the ~ suffix on the node state, with no Amazon EC2 instances backing them. The default queue is indicated by a * suffix after its queue name. spot is your default job queue.
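If you want a per-node view of the same information, sinfo can also report one node per line. For example:
$ sinfo -N -l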
Run a job in multiple queue mode
Next, create a job script that sleeps for a while and then outputs its own hostname. Make sure that the current user can run this script.
$ tee hellojob.sh <<EOF
#!/bin/bash
sleep 30
echo "Hello World from \$(hostname)"
EOF
$ chmod +x hellojob.sh
$ ls -l hellojob.sh
-rwxrwxr-x 1 ec2-user ec2-user 57 Sep 23 21:57 hellojob.sh
Submit the job using the sbatch command. Request two nodes for this job with the -N 2 option, and verify that the job submits successfully. For more information about sbatch, see sbatch.
$ sbatch -N 2 --wrap "srun hellojob.sh"
Submitted batch job 1
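To see more detail about the job that you just submitted, such as the requested node count and the nodes that Slurm selected, the scontrol command can help. The job ID 1 matches the submission above.
$ scontrol show job 1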
You can view your queue and check the status of the job with the squeue command. Because you didn't specify a queue, the default queue (spot) is used. For more information about squeue, see squeue.
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    1      spot     wrap ec2-user  R       0:10      2 spot-st-c5xlarge-1,spot-st-t2micro-1
The output shows that the job is currently in a running state. Wait for the job to finish. This takes about 30 seconds. Then, run squeue again.
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
Now that the jobs in the queue have all finished, look for the output file that's named slurm-1.out in your current directory.
$ cat slurm-1.out
Hello World from spot-st-t2micro-1
Hello World from spot-st-c5xlarge-1
The output shows that the job ran successfully on the spot-st-t2micro-1 and spot-st-c5xlarge-1 nodes.
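By default, Slurm writes batch job output to slurm-<jobid>.out in the directory where you ran sbatch. If you'd rather choose the output file name yourself, the sbatch --output option accepts a pattern, where %j expands to the job ID. For example:
$ sbatch -N 2 --output hello-%j.out --wrap "srun hellojob.sh"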
Now submit the same job again, this time specifying constraints for specific instance types, with the following command.
$ sbatch -N 3 -p spot -C "[c5.xlarge*1&t2.micro*2]" --wrap "srun hellojob.sh"
Submitted batch job 2
You used these parameters for sbatch:
- -N 3 – requests three nodes.
- -p spot – submits the job to the spot queue. You can also submit a job to the ondemand queue by specifying -p ondemand, as shown in the example after this list.
- -C "[c5.xlarge*1&t2.micro*2]" – specifies the node constraints for this job. This requests one c5.xlarge node and two t2.micro nodes to be used for this job.
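For example, to run the same script on the ondemand queue instead, you can submit the following. This starts a c5.2xlarge dynamic node, which isn't free-tier eligible.
$ sbatch -N 1 -p ondemand --wrap "srun hellojob.sh"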
Run the sinfo command to view the nodes and queues. Queues in AWS ParallelCluster are called partitions in Slurm.
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
spot*        up   infinite      1 alloc# spot-dy-t2micro-1
spot*        up   infinite     17  idle~ spot-dy-c5xlarge-[2-10],spot-dy-t2micro-[2-9]
spot*        up   infinite      1    mix spot-st-c5xlarge-1
spot*        up   infinite      1  alloc spot-st-t2micro-1
ondemand     up   infinite     10  idle~ ondemand-dy-c52xlarge-[1-10]
The nodes are powering up. This is indicated by the # suffix on the node state. Run the squeue command to view information about the jobs in the cluster.
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    2      spot     wrap ec2-user CF       0:04      3 spot-dy-c5xlarge-1,spot-dy-t2micro-1,spot-st-t2micro-1
Your job is in the CF (CONFIGURING) state, waiting for instances to scale up and join the cluster.
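If you want to watch this transition without rerunning squeue by hand, and the watch utility is available on the head node (it usually is on Amazon Linux 2), one option is the following.
$ watch -n 10 squeue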
After about three minutes, the nodes are available and the job enters the R (RUNNING) state.
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    2      spot     wrap ec2-user  R       0:07      3 spot-dy-t2micro-1,spot-st-c5xlarge-1,spot-st-t2micro-1
When the job finishes, all three nodes return to the idle state.
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
spot*        up   infinite     17  idle~ spot-dy-c5xlarge-[1-9],spot-dy-t2micro-[2-9]
spot*        up   infinite      3   idle spot-dy-t2micro-1,spot-st-c5xlarge-1,spot-st-t2micro-1
ondemand     up   infinite     10  idle~ ondemand-dy-c52xlarge-[1-10]
Then, after no jobs remain in the queue, check for slurm-2.out in your current directory.
$ cat slurm-2.out
Hello World from spot-st-t2micro-1
Hello World from spot-dy-t2micro-1
Hello World from spot-st-c5xlarge-1
This is the final state of the cluster.
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
spot*        up   infinite     17  idle~ spot-dy-c5xlarge-[1-9],spot-dy-t2micro-[2-9]
spot*        up   infinite      3   idle spot-dy-t2micro-1,spot-st-c5xlarge-1,spot-st-t2micro-1
ondemand     up   infinite     10  idle~ ondemand-dy-c52xlarge-[1-10]
After you log out of the cluster, you can clean up all of your resources by running the pcluster delete-cluster command. For more information, see pcluster list-clusters and pcluster delete-cluster.
$ pcluster list-clusters
{ "clusters": [ { "clusterName": "multi-queue-cluster", "cloudformationStackStatus": "CREATE_COMPLETE", "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:123456789012:stack/multi-queue-cluster/1234567-abcd-0123-def0-abcdef0123456", "region": "eu-west-1", "version": "3.1.4", "clusterStatus": "CREATE_COMPLETE" } ] }
$ pcluster delete-cluster -n multi-queue-cluster
{ "cluster": { "clusterName": "multi-queue-cluster", "cloudformationStackStatus": "DELETE_IN_PROGRESS", "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:123456789012:stack/multi-queue-cluster/1234567-abcd-0123-def0-abcdef0123456", "region": "eu-west-1", "version": "3.1.4", "clusterStatus": "DELETE_IN_PROGRESS" } }