AWS ParallelCluster Auto Scaling

The auto scaling strategy described here applies to HPC clusters that are deployed with one of the supported traditional job schedulers: Son of Grid Engine (SGE), Slurm Workload Manager (Slurm), or Torque Resource Manager (Torque). When deployed with one of these schedulers, AWS ParallelCluster implements the scaling capabilities by managing the Auto Scaling Group (ASG) of the compute nodes, and then changing the scheduler configuration as needed. For HPC clusters that are based on AWS Batch, AWS ParallelCluster relies on the elastic scaling capabilities provided by the AWS-managed job scheduler. For more information, see What Is Amazon EC2 Auto Scaling in the Amazon EC2 Auto Scaling User Guide.

Clusters deployed with AWS ParallelCluster are elastic in several ways. Setting the initial_queue_size specifies the minimum size value of the ComputeFleet ASG, and also the desired capacity value. Setting the max_queue_size specifies the maximum size value of the ComputeFleet ASG.
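
The relationship between the two settings and the ASG sizes can be sketched as follows. This is only an illustration of the mapping described above, not AWS ParallelCluster code; the function name asg_bounds and the dictionary keys are hypothetical.

```python
def asg_bounds(initial_queue_size: int, max_queue_size: int) -> dict:
    """Map the two cluster settings to ComputeFleet ASG sizes.

    Hypothetical helper for illustration; not part of AWS ParallelCluster.
    """
    if initial_queue_size < 0 or max_queue_size < initial_queue_size:
        raise ValueError("expected 0 <= initial_queue_size <= max_queue_size")
    return {
        "min_size": initial_queue_size,          # minimum size of the ASG
        "desired_capacity": initial_queue_size,  # the cluster starts at this size
        "max_size": max_queue_size,              # upper bound for scaling up
    }


print(asg_bounds(initial_queue_size=2, max_queue_size=10))
```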

   A basic Auto Scaling group.

Scaling Up

Every minute, a process called jobwatcher runs on the master instance. It evaluates the current number of instances required by the pending jobs in the queue. If the total number of busy nodes and requested nodes is greater than the current desired value in the ASG, it adds more instances. If you submit more jobs, the queue is re-evaluated and the ASG is updated, up to the specified max_queue_size.
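
The scale-up decision described above can be sketched as a small function. This is a simplified model under stated assumptions, not the jobwatcher source: the function name and its arguments are hypothetical, and the real process reads these values from the scheduler and the ASG.

```python
def new_desired_capacity(busy_nodes: int, requested_nodes: int,
                         current_desired: int, max_queue_size: int) -> int:
    """Return the desired capacity after one evaluation of the queue.

    Hypothetical model of the rule in the text: scale up only when the
    busy plus requested nodes exceed the current desired value, and
    never go beyond max_queue_size.
    """
    needed = busy_nodes + requested_nodes
    if needed > current_desired:
        return min(needed, max_queue_size)
    return current_desired  # nothing to add; scale-down is handled elsewhere


# 2 busy nodes, pending jobs need 3 more, ASG currently at 2, limit 10
print(new_desired_capacity(2, 3, 2, 10))
```

Capping with min() reflects the statement that the ASG is updated only up to the specified max_queue_size.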

With an SGE scheduler, each job requires a number of slots to run (one slot corresponds to one processing unit, for example, a vCPU). To evaluate the number of instances that are required to serve the currently pending jobs, the jobwatcher divides the total number of requested slots by the capacity of a single compute node. The capacity of a compute node, which corresponds to its number of available vCPUs, depends on the Amazon EC2 instance type that is specified in the cluster configuration.
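
As a worked illustration of the SGE calculation, the following sketch divides the total requested slots by the per-node capacity. Rounding up to a whole instance is an assumption (the text only says "divides"), and the function name is hypothetical.

```python
def sge_nodes_required(requested_slots: list, slots_per_node: int) -> int:
    """Estimate the instances needed for pending SGE jobs.

    requested_slots holds the slot count of each pending job.
    Assumption: a fractional result is rounded up to a whole instance.
    """
    if slots_per_node <= 0:
        raise ValueError("slots_per_node must be positive")
    total = sum(requested_slots)
    # Integer ceiling division avoids floating-point rounding.
    return (total + slots_per_node - 1) // slots_per_node


# Three pending jobs asking for 4, 6, and 10 slots on 8-vCPU nodes:
# 20 slots / 8 slots per node = 2.5, rounded up to 3 instances.
print(sge_nodes_required([4, 6, 10], slots_per_node=8))
```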

With Slurm and Torque schedulers, each job might require both a number of nodes and a number of slots per node, depending on circumstance. For each request, the jobwatcher determines the number of compute nodes that are needed to fulfill the new computational requirements. For example, let's assume a cluster with c5.2xlarge (8 vCPU) as the compute instance type, and three queued pending jobs with the following requirements:

  • job1: 2 nodes / 4 slots each

  • job2: 3 nodes / 2 slots each

  • job3: 1 node / 4 slots each

In this example, the jobwatcher requires three new compute instances in the ASG to serve the three jobs.
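
One way to arrive at the count in this example is a first-fit placement, sketched below. The text does not specify the algorithm that the jobwatcher uses, so the first-fit strategy and the function name are assumptions; the sketch only shows that the three jobs fit on three 8-vCPU nodes when each job's per-node requests land on distinct nodes.

```python
def nodes_required(jobs: list, slots_per_node: int) -> int:
    """Count nodes needed for jobs given as (node_count, slots_each) pairs.

    Assumed first-fit packing; not the actual jobwatcher implementation.
    """
    free = []  # free slots left on each node planned so far
    for node_count, slots_each in jobs:
        if slots_each > slots_per_node:
            raise ValueError("a request exceeds the capacity of one node")
        used = set()  # one job must occupy distinct nodes
        for _ in range(node_count):
            for i, available in enumerate(free):
                if i not in used and available >= slots_each:
                    free[i] -= slots_each
                    used.add(i)
                    break
            else:
                # No planned node has room, so plan a new one.
                free.append(slots_per_node - slots_each)
                used.add(len(free) - 1)
    return len(free)


# job1: 2 nodes / 4 slots, job2: 3 nodes / 2 slots, job3: 1 node / 4 slots
jobs = [(2, 4), (3, 2), (1, 4)]
print(nodes_required(jobs, slots_per_node=8))
```

With these inputs, job1 opens two nodes, job2 reuses both and opens a third, and job3 fits in the remaining space of the third node.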

Current limitation: the auto scale-up logic does not consider partially loaded busy nodes. That is, a node that is running a job is considered busy even if it has empty slots.

Scaling Down

On each compute node, a process called nodewatcher runs and evaluates the idle time of the node. An instance is terminated when both of the following conditions are met:

  • An instance has no jobs for a period of time longer than the scaledown_idletime (the default setting is 10 minutes)

  • There are no pending jobs in the cluster

To terminate an instance, nodewatcher calls the TerminateInstanceInAutoScalingGroup API operation, which removes the instance only if doing so does not take the ASG below its minimum size. This process scales down a cluster without affecting running jobs. It also enables an elastic cluster with a fixed base number of instances.
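
The two scale-down conditions can be expressed as a single check. This is a minimal sketch, not the nodewatcher source: the function name and arguments are hypothetical, and the actual termination is performed through the Auto Scaling API rather than in this function.

```python
def should_terminate(idle_seconds: float, pending_jobs: int,
                     scaledown_idletime_minutes: int = 10) -> bool:
    """Return True when a compute node meets both scale-down conditions.

    Hypothetical helper: the node must have been idle longer than
    scaledown_idletime (10 minutes by default), and the cluster must
    have no pending jobs.
    """
    idle_long_enough = idle_seconds > scaledown_idletime_minutes * 60
    queue_is_empty = pending_jobs == 0
    return idle_long_enough and queue_is_empty


print(should_terminate(idle_seconds=15 * 60, pending_jobs=0))  # both conditions met
print(should_terminate(idle_seconds=15 * 60, pending_jobs=2))  # jobs still pending
```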

Static Cluster

The value of auto scaling is the same for HPC as for any other workload. The only difference is that AWS ParallelCluster includes code that makes auto scaling interact more intelligently with the job scheduler. For example, if a static cluster is required, you set the initial_queue_size and max_queue_size parameters to the exact size of the cluster that is required, and then you set the maintain_initial_size parameter to true. This causes the ComputeFleet ASG to have the same value for minimum, maximum, and desired capacity.
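
The following sketch generates the three settings for a static cluster as an INI-style fragment. The parameter names come from the text; the section name "cluster default" and the helper function are assumptions for illustration, so check them against your own cluster configuration file.

```python
import configparser
import io


def static_cluster_section(size: int) -> str:
    """Render the settings for a fixed-size cluster as INI text.

    Hypothetical helper; the section name "cluster default" is an assumption.
    """
    config = configparser.ConfigParser()
    config["cluster default"] = {
        "initial_queue_size": str(size),   # minimum and desired capacity
        "max_queue_size": str(size),       # maximum capacity, same value
        "maintain_initial_size": "true",   # keep the ASG at the initial size
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(static_cluster_section(8))
```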