Creating a cluster in AWS Parallel Computing Service
This topic provides an overview of available options and describes what to consider when you create a cluster in AWS Parallel Computing Service (AWS PCS). If this is your first time creating an AWS PCS cluster, we recommend you follow Get started with AWS Parallel Computing Service. The tutorial can help you create a working HPC system without expanding into all the available options and system architectures that are possible.
Prerequisites
-
An existing VPC and subnet that meet AWS PCS Networking requirements. Before you deploy a cluster for production use, we recommend that you have a thorough understanding of the VPC and subnet requirements. To create a VPC and subnet, see Creating a VPC for your AWS PCS cluster.
-
An IAM principal with permissions to create and manage AWS PCS resources. For more information, see Identity and Access Management for AWS Parallel Computing Service.
Create an AWS PCS cluster
You can use the AWS Management Console or AWS CLI to create a cluster.
- AWS Management Console
-
To create a cluster
-
Open the AWS PCS console at https://console.aws.amazon.com/pcs/home#/clusters and choose Create cluster.
-
In the Cluster setup section, enter the following fields:
-
Cluster name – A name for your cluster. The name can contain only alphanumeric characters (case-sensitive) and hyphens. It must start with an alphabetic character and can't be longer than 40 characters. The name must be unique within the AWS Region and AWS account that you're creating the cluster in.
-
Scheduler – Choose a scheduler and version. For more information, see Slurm versions in AWS PCS.
-
Controller size – Choose a size for your controller. This determines how many concurrent jobs and compute nodes can be managed by the AWS PCS cluster. You can only set the controller size when the cluster is created. For more information on sizing, see Cluster size in AWS PCS.
-
In the Networking section, select values for the following fields:
-
VPC – Choose an existing VPC that meets AWS PCS requirements. For more information, see AWS PCS VPC and subnet requirements and considerations. After you create the cluster, you can't change its VPC. If no VPCs are listed, you must create one first.
-
Subnet – All available subnets in the selected VPC are listed. Choose a subnet that meets the AWS PCS subnet requirements. For more information, see AWS PCS VPC and subnet requirements and considerations. We recommend you select a private subnet to avoid exposing your scheduler endpoints to the public internet.
-
Security groups – Specify the security group(s) that you want AWS PCS to associate with the network interfaces it creates for your cluster. You must select at least one security group that allows communication between your cluster and its compute nodes. You can select Quick create a security group to have AWS PCS create one with the necessary configuration in your selected VPC, or select an existing security group. For more information, see Security group requirements and considerations.
-
(Optional) In the Slurm accounting configuration section, you can enable Slurm accounting and set accounting parameters. For more information, see Slurm accounting in AWS PCS.
-
(Optional) In the Slurm configuration section, you can specify Slurm configuration options that override defaults set by AWS PCS:
-
Scale down idle time – This controls how long dynamically-provisioned compute nodes stay active after jobs placed on them complete or terminate. Setting this to a longer value can make it more likely that a subsequent job can run on the node, but may lead to increased costs. A shorter value will decrease costs, but may increase the proportion of time your HPC system spends provisioning nodes as opposed to running jobs on them.
-
Prolog – This is a fully-qualified path to a prolog scripts directory on your compute node group instances. This corresponds to the Prolog setting in Slurm. Note that this must be a directory, not a path to a specific executable.
-
Epilog – This is a fully-qualified path to an epilog scripts directory on your compute node group instances. This corresponds to the Epilog setting in Slurm. Note that this must be a directory, not a path to a specific executable.
-
Select type parameters – This helps control the resource selection algorithm used by Slurm. Setting this value to CR_CPU_Memory activates memory-aware scheduling, while setting it to CR_CPU activates CPU-only scheduling. This parameter corresponds to the SelectTypeParameters setting in Slurm, where SelectType is set to select/cons_tres by AWS PCS.
-
(Optional) Under Tags, add any tags to your AWS PCS cluster.
-
Choose Create cluster. The Status field shows Creating while AWS PCS creates the cluster. This process can take several minutes.
Important
There can only be 1 cluster in a Creating state per AWS Region per AWS account. AWS PCS returns an error if there is already a cluster in a Creating state when you try to create a cluster.
- AWS CLI
-
To create a cluster
-
Create your cluster with the command that follows. Before running the command, make the following replacements:
-
Replace region with the ID of the AWS Region that you want to create your cluster in, such as us-east-1.
-
Replace my-cluster with a name for your cluster. The name can contain only alphanumeric characters (case-sensitive) and hyphens. It must start with an alphabetic character and can't be longer than 40 characters. The name must be unique within the AWS Region and AWS account where you're creating the cluster.
-
Replace 24.11 with any supported version of Slurm.
Note
AWS PCS currently supports Slurm 24.11 and 24.05.
-
Replace SMALL with any supported cluster size. This determines how many concurrent jobs and compute nodes can be managed by the AWS PCS cluster. It can only be set when the cluster is created. For more information on sizing, see Cluster size in AWS PCS.
-
Replace the value for subnetIds with your own. We recommend you select a private subnet to avoid exposing your scheduler endpoints to the public internet.
-
Specify the securityGroupIds that you want AWS PCS to associate with the network interfaces it creates for your cluster. The security groups must be in the same VPC as the cluster. You must select at least one security group that allows communication between your cluster and its compute nodes. For more information, see Security group requirements and considerations.
-
Optionally, you can provide a custom KMS key to encrypt your controller’s data using --kms-key-id kms-key. Replace kms-key with an existing KMS ARN, key ID, or alias. Note that the account used to create the cluster must have kms:Decrypt privileges on the custom KMS key.
aws pcs create-cluster --region region \
    --cluster-name my-cluster \
    --scheduler type=SLURM,version=24.11 \
    --size SMALL \
    --networking subnetIds=subnet-ExampleId1,securityGroupIds=sg-ExampleId1
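The naming rules for my-cluster can be checked locally before the request is sent. The following is a minimal sketch, not part of AWS PCS or the AWS CLI: the function name is_valid_cluster_name is hypothetical, and it checks only the character and length rules stated above. It can't check uniqueness within the Region and account.

```shell
# Hypothetical helper (not an AWS command): returns 0 when the name follows
# the documented rules, 1 otherwise.
is_valid_cluster_name() {
    name=$1
    # The name must have at least 1 and at most 40 characters.
    [ "${#name}" -ge 1 ] && [ "${#name}" -le 40 ] || return 1
    case $name in
        [!A-Za-z]*) return 1 ;;        # must start with an alphabetic character
        *[!A-Za-z0-9-]*) return 1 ;;   # only alphanumeric characters and hyphens
    esac
    return 0
}
```

For example, is_valid_cluster_name my-cluster succeeds, while is_valid_cluster_name 1cluster and is_valid_cluster_name bad_name both fail.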
-
Optionally, you can add the --slurm-configuration option to customize the Slurm behavior and specify Slurm configuration options. The following example sets the scale-down idle time to 60 minutes (3600 seconds), enables Slurm accounting, and specifies slurm.conf settings as the value for slurmCustomSettings. For more information, see Slurm accounting in AWS PCS.
Note
Accounting is supported for Slurm 24.11 or later.
aws pcs create-cluster --region region \
    --cluster-name my-cluster \
    --scheduler type=SLURM,version=24.11 \
    --size SMALL \
    --networking subnetIds=subnet-ExampleId1,securityGroupIds=sg-ExampleId1 \
    --slurm-configuration scaleDownIdleTimeInSeconds=3600,accounting='{mode=STANDARD}',slurmCustomSettings='[{parameterName=SelectTypeParameters,parameterValue=CR_CPU_Memory}]'
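The --slurm-configuration value is easy to mistype, so it can help to assemble it from variables and review it before running the command. The following is a minimal sketch under that assumption. The variable names (idle_minutes, select_type_parameters, slurm_configuration) are illustrative and not defined by AWS PCS. The script only prints the option and doesn't call AWS.

```shell
# Illustrative values; the defaults mirror the example in this topic.
idle_minutes=60
select_type_parameters=CR_CPU_Memory

# scaleDownIdleTimeInSeconds is expressed in seconds.
idle_seconds=$((idle_minutes * 60))

# In the documented command, the single quotes only protect the braces and
# brackets from the shell. Inside double quotes they are not needed, so this
# string is the same argument that the AWS CLI receives.
slurm_configuration="scaleDownIdleTimeInSeconds=${idle_seconds},accounting={mode=STANDARD},slurmCustomSettings=[{parameterName=SelectTypeParameters,parameterValue=${select_type_parameters}}]"

# Print the option for review instead of creating a cluster.
printf '%s\n' "--slurm-configuration ${slurm_configuration}"
```

After you review the output, you can pass the value as --slurm-configuration "$slurm_configuration" at the end of the create-cluster command.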
-
It can take several minutes to provision the cluster. You can query the status of your cluster with the following command. Don’t proceed to creating queues or compute node groups until the cluster’s status field is ACTIVE.
aws pcs get-cluster --region region --cluster-identifier my-cluster
Important
There can only be 1 cluster in a Creating state per AWS Region per AWS account. AWS PCS returns an error if there is already a cluster in a Creating state when you try to create a cluster.
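If you script cluster creation, you might poll the status instead of checking it by hand. The following is a minimal sketch, not an AWS-provided waiter. The function name wait_for_cluster_active and its default attempt count and delay are hypothetical. It also assumes that the get-cluster response exposes the state at cluster.status, which you should confirm against the output of the command above.

```shell
# Hypothetical helper: polls get-cluster until the status is ACTIVE.
# Usage: wait_for_cluster_active REGION CLUSTER [MAX_ATTEMPTS] [DELAY_SECONDS]
wait_for_cluster_active() {
    region=$1
    cluster=$2
    max_attempts=${3:-60}    # illustrative default: 60 checks
    delay_seconds=${4:-30}   # illustrative default: 30 seconds between checks

    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Assumption: the status is available at cluster.status in the response.
        status=$(aws pcs get-cluster --region "$region" \
            --cluster-identifier "$cluster" \
            --query 'cluster.status' --output text) || return 1

        if [ "$status" = "ACTIVE" ]; then
            echo "Cluster $cluster is ACTIVE"
            return 0
        fi

        echo "Cluster $cluster status: $status (check $attempt of $max_attempts)"
        sleep "$delay_seconds"
        attempt=$((attempt + 1))
    done

    echo "Timed out waiting for cluster $cluster to become ACTIVE" >&2
    return 1
}
```

For example, wait_for_cluster_active us-east-1 my-cluster checks every 30 seconds for up to 60 attempts and returns a nonzero exit code if the cluster doesn't become ACTIVE in that time.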
Recommended next steps for your cluster
-
Add compute node groups.
-
Add queues.
-
Enable logging.