
This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.

CFD approaches on AWS

Most CFD solvers rely on data locality and sparse matrix solvers. Once the data is properly organized (an application-dependent step), a well-configured job exhibits good strong and weak scaling on simple AWS cluster architectures. “Structured” and “unstructured” codes are commonly run on AWS. Spectral and pseudo-spectral methods involve fast Fourier transforms (FFTs), and while less common than traditional CFD algorithms, they also scale well on AWS. Your architectural decisions have tradeoffs, and AWS makes it quick and easy to try different architectures to optimize for cost and performance.

Architectures

There are two primary design patterns to consider when choosing an AWS architecture for CFD applications: the traditional cluster and the cloud native cluster. Customers choose their preferred architecture based on the use case and the CFD users’ needs. For example, users whose cases require frequent interaction and monitoring, such as starting and stopping a CFD case several times on the way to convergence, often prefer a traditional-style cluster.

Conversely, users with cases that are easily automated often prefer a cloud native setup, which enables you to easily submit large numbers of cases simultaneously or automate a run for a complete end-to-end solution. A cloud native setup is also useful when cases require special pre- or post-processing steps that benefit from automation. Whether you choose a traditional or a cloud native architecture, the cloud offers the advantage of elastic scalability: you only consume and pay for resources when you need them.

Traditional cluster environments

In the cloud, a traditional cluster is also referred to as a persistent cluster because a minimal set of AWS infrastructure persists to preserve the cluster environment, for example, a node that runs the scheduler and hosts data even after a run campaign completes. The persistent cluster mimics a traditional on-premises cluster or supercomputer experience. Clusters include a login instance with a scheduler that allows multiple users to submit jobs. The compute node fleet can be a fixed size or a dynamic group that increases and decreases the number of compute instances depending on the jobs submitted.

AWS ParallelCluster is an example of a persistent cluster that simplifies the deployment and management of HPC clusters in the AWS Cloud. It enables you to quickly launch and terminate an HPC compute environment in AWS as needed. AWS ParallelCluster orchestrates the creation of the required resources (for example, compute nodes and shared filesystems) and provides an automatic scaling mechanism to adjust the size of the cluster to match the submitted workload. You can use AWS ParallelCluster with a variety of batch schedulers, including Slurm and AWS Batch.

Example AWS ParallelCluster architecture

An AWS ParallelCluster architecture enables the following workflow:

  • Creating a desired configuration through a text file (a minimal example follows this list)

  • Launching a cluster through the AWS ParallelCluster command line interface (CLI)

  • Orchestrating AWS services automatically through AWS CloudFormation

  • Accessing the cluster through the command line with Secure Shell Protocol (SSH) or graphically with NICE DCV
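
The following is a minimal sketch of this workflow, assuming AWS ParallelCluster version 3 syntax. The subnet ID, key pair name, and instance types are placeholders to replace with your own values.

    # cluster-config.yaml - minimal AWS ParallelCluster 3 configuration
    Region: us-east-1
    Image:
      Os: alinux2
    HeadNode:
      InstanceType: c5.xlarge
      Networking:
        SubnetId: subnet-0123456789abcdef0   # placeholder
      Ssh:
        KeyName: my-keypair                  # placeholder
    Scheduling:
      Scheduler: slurm
      SlurmQueues:
        - Name: cfd
          ComputeResources:
            - Name: compute
              InstanceType: c5n.18xlarge
              MinCount: 0
              MaxCount: 16
          Networking:
            SubnetIds:
              - subnet-0123456789abcdef0     # placeholder

    # Launch the cluster, then connect to the head node
    pcluster create-cluster --cluster-name cfd-cluster \
        --cluster-configuration cluster-config.yaml
    pcluster ssh --cluster-name cfd-cluster -i ~/.ssh/my-keypair.pem

With this configuration, Slurm scales the cfd queue between zero and 16 instances as jobs are submitted, so you pay for compute nodes only while jobs run.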

Cloud native environments

A cloud native cluster is also called an ephemeral cluster due to its relatively short lifetime. A cloud native approach to tightly coupled HPC ties each run, or sweep of runs, to its own cluster. For each case, resources are provisioned and launched, data is placed on the instances, jobs run across multiple instances, and case output is retrieved automatically or sent to Amazon S3. Upon job completion, the infrastructure is terminated. Clusters designed this way are ephemeral, treat infrastructure as code, and allow for complete version control of infrastructure changes. Login nodes and job schedulers are less critical, and often not used at all, with an ephemeral cluster. The following are a few frequently used methods to implement such a design:

  • Scripted approach — A common quick-start approach for CFD users getting started with AWS is to combine a custom Amazon Machine Image (AMI) with the AWS CLI and a bash script. After launching an Amazon EC2 instance, software can be added to the instance and an AMI is created to be used as the starting point for all compute nodes. It is typical to set up the SSH files and the .bashrc file before creating the custom AMI, or “golden image.” Although many CFD solvers do not require a shared file location, one can easily be created with an exported Network File System (NFS) volume, or with Amazon FSx for Lustre, an AWS managed Lustre file system. A launch-and-teardown sketch in bash follows this list.

  • API based — If preferred, an automated deployment can be developed with one of the AWS Software Development Kits (SDKs), such as the SDK for Python (Boto3), to program an end-to-end solution. An SDK sketch follows this list.

  • CloudFormation templates — AWS CloudFormation is a cloud native approach to provisioning AWS resources based on a JSON or YAML template. AWS CloudFormation offers an easily version-controlled cluster provisioning capability. A minimal template sketch follows this list.

  • AWS Batch — AWS Batch is a cloud native, container-based approach that enables CFD users to efficiently run hundreds of thousands of batch computing jobs in containers on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (for example, compute-optimized or memory-optimized instances) based on the volume and specific resource requirements of the batch jobs submitted.

    With AWS Batch, there is no need to install and manage batch computing software or server infrastructure to run your jobs, enabling you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and runs your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2 and Spot Instances. AWS Batch can also be used as a job scheduler with AWS ParallelCluster. A job-submission sketch follows this list.
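
The scripted approach can be sketched with a few AWS CLI calls in bash. This is a minimal sketch, not a complete solution; the AMI, key pair, and security group IDs are placeholders, and the solver run itself is elided.

    #!/bin/bash
    # Launch an ephemeral fleet from a pre-built "golden image" (placeholder IDs)
    INSTANCE_IDS=$(aws ec2 run-instances \
        --image-id ami-0123456789abcdef0 \
        --count 4 \
        --instance-type c5n.18xlarge \
        --key-name my-keypair \
        --security-group-ids sg-0123456789abcdef0 \
        --query 'Instances[].InstanceId' \
        --output text)

    # Wait until the fleet is running before placing data and starting the job
    aws ec2 wait instance-running --instance-ids $INSTANCE_IDS

    # ... stage input data, run the solver across the fleet, copy results to Amazon S3 ...

    # Tear the fleet down when the case completes
    aws ec2 terminate-instances --instance-ids $INSTANCE_IDS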
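
The API-based approach expresses the same lifecycle in code. A minimal sketch with the AWS SDK for Python (Boto3), again with placeholder values:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Provision compute nodes from the golden AMI (placeholder ID)
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="c5n.18xlarge",
        MinCount=4,
        MaxCount=4,
        KeyName="my-keypair",
    )
    instance_ids = [i["InstanceId"] for i in response["Instances"]]

    # Block until the instances are running
    ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)

    # ... stage input data, run the solver, retrieve the case output ...

    # End the infrastructure once the output is safely stored
    ec2.terminate_instances(InstanceIds=instance_ids)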
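
For the template-based approach, a deliberately minimal CloudFormation template might declare only a cluster placement group and one compute node; a real template would add networking, storage, and the remaining instances. The AMI ID below is a placeholder.

    AWSTemplateFormatVersion: '2010-09-09'
    Description: Minimal ephemeral CFD cluster sketch
    Resources:
      CfdPlacementGroup:
        Type: AWS::EC2::PlacementGroup
        Properties:
          Strategy: cluster
      CfdComputeNode:
        Type: AWS::EC2::Instance
        Properties:
          ImageId: ami-0123456789abcdef0   # placeholder golden AMI
          InstanceType: c5n.18xlarge
          PlacementGroupName: !Ref CfdPlacementGroup

Because the cluster is defined entirely in the template, changes are version controlled, and a stack can be created per case and deleted when the case completes (for example, with aws cloudformation create-stack and delete-stack).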
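
Finally, submitting a containerized case to AWS Batch is a single API call once a job queue and job definition exist. The queue and job definition names below are hypothetical.

    import boto3

    batch = boto3.client("batch", region_name="us-east-1")

    # Submit one containerized CFD case to a pre-created queue and job definition
    job = batch.submit_job(
        jobName="cfd-case-001",
        jobQueue="cfd-queue",            # placeholder queue name
        jobDefinition="cfd-solver:1",    # placeholder job definition
        containerOverrides={
            "command": ["run_case.sh", "case-001"],
        },
    )
    print("Submitted job:", job["jobId"])

Hundreds of such calls can be made in a loop to sweep a parameter space, with AWS Batch provisioning and scaling the underlying compute automatically.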