Launch instances with Capacity Blocks (CB) - AWS ParallelCluster

AWS ParallelCluster supports On-Demand Capacity Reservations (ODCR) and Capacity Blocks (CB) for Machine Learning. Unlike ODCR, CB can have a future start time and is time-bound. For more information about launching with ODCR, see Launch instances with On-Demand Capacity Reservations (ODCR).

Using CB with AWS ParallelCluster

To configure new or existing clusters to use a CB, you first need a valid CB in your AWS account. You can use the AWS Management Console, AWS Command Line Interface (AWS CLI), or SDK to find and purchase an available CB by following the official documentation. Once you have a valid CB, set the CB Amazon Resource Name (ARN) and related parameters in your AWS ParallelCluster configuration file. For more information, see Find and purchase Capacity Blocks (CB).

CB in the cluster configuration

To use a CB for a specific queue, set the CapacityReservationId parameter to the ID of an existing CB. You can obtain the CB ID from the AWS Management Console, AWS CLI, or SDK that you used to create the CB.

You must set CapacityType = CAPACITY_BLOCK for the queue where you want to use the CB. Set the InstanceType of the compute resource to the same Amazon Elastic Compute Cloud (Amazon EC2) instance type as the CB.

When you specify the CapacityReservationId at the compute resource level, the InstanceType is optional because it will be automatically retrieved from the reservation.

When you use CapacityType = CAPACITY_BLOCK, MaxCount must be equal to MinCount and greater than 0, because all the instances that are part of the CB reservation are managed as static nodes.

At cluster creation time, the head node waits for all the static nodes to be ready before it signals that cluster creation succeeded. However, when you use CapacityType = CAPACITY_BLOCK, the nodes that belong to compute resources associated with CBs are not considered for this check. The cluster is created even if not all the configured CBs are active.
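The readiness rule above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not AWS ParallelCluster code: the function name `resources_gating_creation` is hypothetical, the input is assumed to be the `SlurmQueues` list already loaded into Python dictionaries, and a compute resource is treated as static when its `MinCount` is greater than 0.

```python
from typing import Any, Dict, List, Tuple


def resources_gating_creation(queues: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Return (queue, compute resource) pairs whose static nodes the head node waits for.

    Hypothetical helper: it mirrors the rule described in the text, it is not
    part of AWS ParallelCluster.
    """
    gating: List[Tuple[str, str]] = []
    for queue in queues:
        # Nodes in CAPACITY_BLOCK queues are excluded from the readiness check.
        if queue.get("CapacityType") == "CAPACITY_BLOCK":
            continue
        for resource in queue.get("ComputeResources", []):
            # Assumption: MinCount > 0 means the resource has static nodes.
            if resource.get("MinCount", 0) > 0:
                gating.append((queue["Name"], resource["Name"]))
    return gating
```

With one on-demand queue that has static nodes and one CB queue, only the on-demand compute resource would appear in the returned list.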

The following configuration file snippet shows the parameters required to enable a CB in the AWS ParallelCluster configuration file.

SlurmQueues:
  - Name: string
    CapacityType: CAPACITY_BLOCK
    ComputeResources:
      - Name: string
        InstanceType: String (EC2 Instance type of the CB)
        MinCount: integer (<= total capacity of the CB)
        MaxCount: integer (equal to MinCount)
        CapacityReservationTarget:
          CapacityReservationId: String (CB id)
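The constraints described in this section can be checked before you submit a configuration. The following is a minimal sketch, assuming one `SlurmQueues` entry has already been loaded into a Python dictionary. The function name `check_capacity_block_queue` is hypothetical and is not part of AWS ParallelCluster. For simplicity, the sketch treats a missing `MinCount` or `MaxCount` as 0 and accepts a `CapacityReservationTarget` at either the compute resource or the queue level.

```python
from typing import Any, Dict, List


def check_capacity_block_queue(queue: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one SlurmQueues entry (empty if none)."""
    problems: List[str] = []
    if queue.get("CapacityType") != "CAPACITY_BLOCK":
        return problems  # The rules below apply only to CB queues.

    queue_target = queue.get("CapacityReservationTarget") or {}
    for resource in queue.get("ComputeResources", []):
        name = resource.get("Name", "<unnamed>")
        # Simplification: a missing count is treated as 0.
        min_count = resource.get("MinCount", 0)
        max_count = resource.get("MaxCount", 0)
        target = resource.get("CapacityReservationTarget") or queue_target

        # All CB instances are managed as static nodes.
        if min_count <= 0 or min_count != max_count:
            problems.append(f"{name}: MaxCount must equal MinCount and be greater than 0")
        # The CB is referenced through CapacityReservationId.
        if not target.get("CapacityReservationId"):
            problems.append(f"{name}: CapacityReservationId is required")
    return problems
```

An empty list means the entry satisfies the rules listed above. The sketch does not verify that the reservation exists or that `MinCount` fits within the CB capacity, because both checks require a call to Amazon EC2.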

How AWS ParallelCluster uses Capacity Blocks (CB)

AWS ParallelCluster manages static nodes associated with CBs differently from other static nodes. AWS ParallelCluster creates a cluster even if the CB is not yet active, and instances are launched automatically once the CB is active.

The Slurm nodes that correspond to compute resources associated with CBs that are not yet active are kept in maintenance until the CB start time. These Slurm nodes remain in a reservation/maintenance state and are associated with the slurm admin user. This means they can accept jobs, but the jobs remain pending until the reservation is removed.

AWS ParallelCluster automatically updates Slurm reservations and puts the related CB nodes in maintenance state (corresponding to the CB state). When the CB is active, the Slurm reservation is removed, and the nodes start and become available for pending jobs or for new job submissions.

When the CB end time is reached, the nodes are moved back to a reservation/maintenance state. Users are responsible for resubmitting or requeueing their jobs to a different queue or compute resource when the CB is no longer active and the instances are terminated.
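The lifecycle described in this section can be summarized as a function of the current time and the CB start and end times. The following is a minimal sketch of that mapping, not the logic AWS ParallelCluster actually runs. The function name `expected_node_state` and the state labels are hypothetical, and treating the start time as inclusive and the end time as exclusive is an assumption.

```python
from datetime import datetime, timezone


def expected_node_state(now: datetime, cb_start: datetime, cb_end: datetime) -> str:
    """Return 'available' inside the CB window and 'maintenance' outside it."""
    # Assumption: the window is [cb_start, cb_end).
    if cb_start <= now < cb_end:
        return "available"
    # Before the start time or after the end time, nodes sit in the
    # Slurm reservation/maintenance state and jobs stay pending.
    return "maintenance"


# Example with made-up times for a one-day CB.
start = datetime(2030, 1, 10, 12, 0, tzinfo=timezone.utc)
end = datetime(2030, 1, 11, 12, 0, tzinfo=timezone.utc)
print(expected_node_state(datetime(2030, 1, 10, 18, 0, tzinfo=timezone.utc), start, end))
```

A job submitted while the function would return `maintenance` is accepted by Slurm but stays pending, which matches the behavior described above.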