Traditional Cluster Environment

Many users begin their cloud journey with an environment that resembles a traditional HPC environment: a login node with a scheduler used to launch jobs.

A common approach to traditional cluster provisioning is based on an AWS CloudFormation template for a compute cluster combined with customization for a user's specific tasks. AWS ParallelCluster is an example of an end-to-end cluster provisioning capability based on AWS CloudFormation. Although the template hides the complexity of the underlying architecture, typical configuration options let the user select the instance type, scheduler, and bootstrap actions, such as installing applications or synchronizing data. Executing the template provides an HPC environment with the "look and feel" of a conventional HPC cluster, but with the added benefit of scalability. The login node maintains the scheduler, shared file system, and running environment. Meanwhile, an automatic scaling mechanism spins up additional instances as jobs are submitted to the job queue and terminates instances as they become idle.
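
To make those options concrete, the following is a minimal sketch that writes a ParallelCluster configuration file and creates a cluster from it. It assumes the ParallelCluster 2.x configuration format; the key pair, instance types, bucket, VPC, and subnet values are all placeholders.

    # Minimal sketch (ParallelCluster 2.x-style config); all IDs are placeholders.
    cat > hpc-config.ini <<'EOF'
    [global]
    cluster_template = default

    [aws]
    aws_region_name = us-east-1

    [cluster default]
    # Existing EC2 key pair used for SSH access to the login node
    key_name = my-keypair
    # Scheduler choices include slurm, sge, and torque
    scheduler = slurm
    master_instance_type = c5.xlarge
    compute_instance_type = c5.18xlarge
    # Start with an empty compute fleet and scale up to 10 instances
    initial_queue_size = 0
    max_queue_size = 10
    # Bootstrap action: install applications or synchronize data
    post_install = s3://my-bucket/bootstrap.sh
    vpc_settings = public

    [vpc public]
    vpc_id = vpc-0123456789abcdef0
    master_subnet_id = subnet-0123456789abcdef0
    EOF

    # Provision the cluster described above
    pcluster create -c hpc-config.ini my-cluster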

A cluster can be deployed in a persistent configuration or treated as an ephemeral resource. Persistent clusters are deployed with a login instance and a compute fleet that is either fixed in size or tied to an Auto Scaling group, which grows and shrinks the compute fleet with the number of submitted jobs. Persistent clusters always have some infrastructure running. Alternatively, clusters can be treated as ephemeral, where each workload runs on its own cluster. Ephemeral clusters are enabled by automation: for example, a bash script combined with the AWS CLI, or a Python script using the AWS SDK, can provide end-to-end case automation. For each case, resources are provisioned and launched, data is placed on the nodes, jobs are run across multiple nodes, and the case output is either retrieved automatically or sent to Amazon S3. Upon completion of the job, the infrastructure is terminated. These clusters treat infrastructure as code, optimize costs, and allow for complete version control of infrastructure changes.
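
The following bash script is a minimal sketch of that ephemeral pattern using the ParallelCluster CLI. The configuration file, key pair, job script, and S3 bucket names are hypothetical, and the sbatch --wait call assumes a Slurm scheduler.

    #!/usr/bin/env bash
    # Sketch of an ephemeral cluster run: provision, compute, retrieve, terminate.
    # hpc-config.ini, case.sbatch, my-keypair, and my-bucket are placeholders.
    set -euo pipefail

    CLUSTER="case-$(date +%Y%m%d-%H%M%S)"

    # Provision and launch the cluster resources
    pcluster create -c hpc-config.ini "$CLUSTER"

    # Place data on the nodes and run the job (sbatch --wait blocks until done)
    pcluster ssh "$CLUSTER" -i ~/.ssh/my-keypair.pem \
      "aws s3 cp s3://my-bucket/input/ case/ --recursive && sbatch --wait case/case.sbatch"

    # Send the case output to Amazon S3
    pcluster ssh "$CLUSTER" -i ~/.ssh/my-keypair.pem \
      "aws s3 cp case/output/ s3://my-bucket/output/$CLUSTER/ --recursive"

    # Terminate the infrastructure upon completion of the job
    pcluster delete "$CLUSTER"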

Traditional cluster architectures can be used for both loosely and tightly coupled workloads. For best performance, tightly coupled workloads must use a compute fleet placed in a cluster placement group with homogeneous instance types.
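
In a ParallelCluster 2.x-style configuration, this can be expressed with the placement settings sketched below: DYNAMIC asks ParallelCluster to create a placement group for the lifetime of the cluster, and a single compute instance type keeps the fleet homogeneous.

    # Additions to the [cluster] section for tightly coupled workloads
    # (sketch; ParallelCluster 2.x-style settings)
    placement_group = DYNAMIC
    # Apply the placement group to the compute fleet
    placement = compute
    # One instance type keeps the compute fleet homogeneous
    compute_instance_type = c5n.18xlarge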

Reference Architecture

Figure 1: Traditional cluster deployed with AWS ParallelCluster

Workflow steps:

  1. The user initiates the creation of a cluster through the AWS ParallelCluster CLI, using the settings specified in a configuration file (a sample session is sketched after this list).

  2. AWS CloudFormation builds the cluster architecture described in the cluster template, to which the user contributed a few custom settings (for example, by editing a configuration file or using a web interface).

  3. AWS CloudFormation deploys the infrastructure using EBS snapshots preloaded with customized HPC software and applications, which cluster instances access through an NFS export.

  4. The user logs into the login instance and submits jobs to the scheduler (for example, SGE, Slurm).

  5. The login instance emits metrics to CloudWatch based on the job queue size.

  6. CloudWatch triggers Auto Scaling events to increase the number of compute instances if the job queue size exceeds a threshold.

  7. Scheduled jobs are processed by the compute fleet.

  8. [Optional] The user initiates cluster deletion to terminate all resources.
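
Steps 1, 4, and 8 map onto a short CLI session. The sketch below assumes the ParallelCluster 2.x CLI, an SGE scheduler, and placeholder file and cluster names:

    # Step 1: create the cluster from the configuration file
    pcluster create -c hpc-config.ini my-cluster

    # Step 4: log into the login instance and submit a job to the scheduler
    pcluster ssh my-cluster -i ~/.ssh/my-keypair.pem
    qsub run_case.sh     # SGE submission; use sbatch for Slurm
    qstat                # watch the queue; steps 5-7 proceed automatically

    # Step 8 (optional): delete the cluster and terminate all resources
    exit
    pcluster delete my-cluster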