Traditional Cluster Environment
Many users begin their cloud journey with an environment similar to a traditional HPC environment: a login node with a scheduler from which jobs are launched.
A common approach to traditional cluster provisioning is based on an AWS CloudFormation template for a compute cluster, combined with customization for a user's specific tasks.
AWS ParallelCluster
A cluster can be deployed in a persistent configuration or treated as an ephemeral resource. Persistent clusters are deployed with a login instance and a compute fleet that is either fixed in size or tied to an Auto Scaling group, which increases and decreases the compute fleet depending on the number of submitted jobs. Persistent clusters always have some infrastructure running.

Alternatively, clusters can be treated as ephemeral, where each workload runs on its own cluster. Ephemeral clusters are enabled by automation: for example, a bash script combined with the AWS CLI, or a Python script using the AWS SDK, can provide end-to-end automation of a case. For each case, resources are provisioned and launched, data is placed on the nodes, jobs are run across multiple nodes, and the case output is either retrieved automatically or sent to Amazon S3. Upon completion of the job, the infrastructure is terminated. These clusters treat infrastructure as code, optimize costs, and allow for complete version control of infrastructure changes.
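The ephemeral-cluster lifecycle can be sketched as a small driver script. This is a minimal sketch, not a complete pipeline: the cluster name, configuration path, job script, and S3 URI are illustrative assumptions, and the `pcluster` subcommands shown are ParallelCluster v2-style (v3 renamed them, for example to `pcluster create-cluster`).

```python
import subprocess

def ephemeral_cluster_run(cluster_name, config_path, job_script,
                          s3_results_uri, dry_run=False):
    """Provision a cluster, run one case, ship results to S3, tear down.

    All names and paths are hypothetical; `pcluster` subcommands are
    v2-style and differ in ParallelCluster v3.
    """
    steps = [
        # 1. Provision and launch the cluster infrastructure.
        ["pcluster", "create", cluster_name, "--config", config_path],
        # 2. Submit the case to the scheduler on the login instance.
        ["pcluster", "ssh", cluster_name, f"qsub {job_script}"],
        # 3. Send the case output to Amazon S3.
        ["pcluster", "ssh", cluster_name,
         f"aws s3 sync ./results {s3_results_uri}"],
        # 4. Terminate all infrastructure once the job completes.
        ["pcluster", "delete", cluster_name, "--config", config_path],
    ]
    if dry_run:
        return steps  # inspect the plan without touching AWS
    for cmd in steps:
        subprocess.run(cmd, check=True)  # abort the pipeline on any failure
    return steps
```

In a real pipeline, step 2 would also poll the scheduler (or use a blocking submit) so that steps 3 and 4 only run after the job finishes; keeping each step as plain data makes the whole run easy to version-control alongside the cluster configuration.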
Traditional cluster architectures can be used for both loosely and tightly coupled workloads. For best performance, tightly coupled workloads should use a compute fleet of homogeneous instance types launched in a cluster placement group.
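In an AWS ParallelCluster (v2-style) configuration file, that recommendation translates into settings like the following; the instance type, queue sizes, and section name are illustrative assumptions, not required values:

```ini
[cluster default]
scheduler             = sge
# Homogeneous fleet: a single instance type for every compute node
compute_instance_type = c5n.18xlarge
initial_queue_size    = 0
max_queue_size        = 16
# Launch compute nodes into a cluster placement group for the
# low-latency, high-bandwidth networking tightly coupled jobs need
placement_group       = DYNAMIC
placement             = compute
```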
Reference Architecture
Figure 1: Traditional cluster deployed with AWS ParallelCluster
Workflow steps:
- User initiates the creation of a cluster through the AWS ParallelCluster CLI, using the specification in the configuration file.
- AWS CloudFormation builds the cluster architecture as described in the cluster template file, to which the user contributed a few custom settings (for example, by editing a configuration file or using a web interface).
- AWS CloudFormation deploys the infrastructure from EBS snapshots created with customized HPC software/applications that cluster instances can access through an NFS export.
- The user logs into the login instance and submits jobs to the scheduler (for example, SGE or Slurm).
- The login instance emits metrics to CloudWatch based on the job queue size.
- CloudWatch triggers Auto Scaling events to increase the number of compute instances if the job queue size exceeds a threshold.
- Scheduled jobs are processed by the compute fleet.
- [Optional] User initiates cluster deletion and termination of all resources.
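The queue-driven scaling in the middle of this workflow can be modeled with a simple decision function. This is a toy model only: the slots-per-instance count and fleet bounds are illustrative assumptions, not ParallelCluster's actual scaling policy.

```python
def desired_fleet_size(queue_size, slots_per_instance=36,
                       min_size=0, max_size=16):
    """Map the job-queue depth reported to CloudWatch to a fleet size.

    Assumes each instance offers `slots_per_instance` scheduler slots;
    the Auto Scaling group is then driven toward enough instances to
    serve the queue, clamped to the configured fleet bounds.
    """
    needed = -(-queue_size // slots_per_instance)  # ceiling division
    return max(min_size, min(max_size, needed))
```

In practice, the login instance publishes the queue metric, a CloudWatch alarm fires when it crosses a threshold, and the alarm adjusts the Auto Scaling group's desired capacity; this function only illustrates the mapping from queue depth to fleet size, including scale-in to the minimum when the queue drains.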