General Design Principles

In traditional computing environments, architectural decisions are often implemented as static, one-time events, sometimes with no major software or hardware upgrades during a computing system’s lifetime. As a project and its context evolve, these initial decisions may hinder the system’s ability to meet changing business requirements.

It’s different in the cloud. A cloud infrastructure can grow as the project grows, allowing for a continuously optimized capability. In the cloud, the capability to automate and test on demand lowers the risk of impact from infrastructure design changes. This allows systems to evolve over time so that projects can take advantage of innovations as a standard practice.

The Well-Architected Framework proposes a set of general design principles to facilitate good design in the cloud with high-performance computing:

  • Dynamic architectures: Avoid frozen, static architectures and cost estimates that use a steady-state model. Your architecture must be dynamic, growing and shrinking to match your demand for HPC over time. Match your architecture design and cost analysis explicitly to the natural cycles of HPC activity. For example, a period of intense simulation effort might be followed by a reduction in demand as the work moves from the design phase to the lab. Or, a long and steady data accumulation phase might be followed by a large-scale analysis and data reduction phase. Unlike many traditional supercomputing centers, the AWS Cloud helps you avoid long queues, lengthy quota applications, and restrictions on customization and software installation. Many HPC endeavors are intrinsically bursty and well matched to the cloud’s elasticity and pay-as-you-go model, which eliminates the painful choice between oversubscribed systems (waiting in queues) and idle systems (wasted money). Environments, such as compute clusters, can be “right-sized” for a given need at any given time, as the sketch below illustrates.
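
As an illustration, the following minimal sketch grows and shrinks a compute fleet with boto3, assuming the cluster is backed by an existing EC2 Auto Scaling group; the group name, region, and sizes are hypothetical placeholders.

```python
# Minimal sketch: right-size a compute fleet as HPC demand cycles change.
# Assumes an existing EC2 Auto Scaling group whose MinSize/MaxSize permit
# the requested capacities; "hpc-compute-fleet" is a hypothetical name.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

def resize_fleet(group_name: str, desired: int) -> None:
    """Grow or shrink the fleet to match the current phase of work."""
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=desired,
        HonorCooldown=False,  # apply the change immediately
    )

# Scale up for an intense simulation campaign...
resize_fleet("hpc-compute-fleet", 64)
# ...and back down to zero when the work moves to the lab.
resize_fleet("hpc-compute-fleet", 0)
```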

  • Align the procurement model to the workload: AWS makes a range of compute procurement models available for the various HPC usage patterns. Selecting the correct model ensures that you pay only for what you need. For example, a research institute might run the same weather forecast application in different ways (a brief code sketch follows these examples):

    • An academic research project investigates the role of a weather variable with a large number of parameter sweeps and ensembles. These simulations are not urgent, and cost is a primary concern. They are a great match for Amazon EC2 Spot Instances. Spot Instances let you take advantage of Amazon EC2 unused capacity and are available at up to a 90% discount compared to On-Demand prices.

    • During the wildfire season, up-to-the-minute local wind forecasts ensure the safety of firefighters. Every minute of delay in the simulations decreases their chance of safe evacuation. On-Demand Instances must be used for these simulations to allow for the bursting of analyses and ensure that results are obtained without interruption.

    • Every morning, weather forecasts are run for television broadcasts in the afternoon. Scheduled Reserved Instances can be used to make sure that the needed capacity is available every day at the right time. Use of this pricing model provides a discount compared with On-Demand Instances.
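
For the first scenario, a Spot request takes only a few lines of boto3, as sketched below; the AMI ID, instance type, and counts are hypothetical placeholders.

```python
# Hedged sketch: request Spot capacity for a non-urgent parameter sweep.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical AMI with the solver installed
    InstanceType="c5.18xlarge",
    MinCount=1,
    MaxCount=10,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            # If capacity is reclaimed, the instance terminates and the
            # affected sweep members are simply re-queued.
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print([i["InstanceId"] for i in response["Instances"]])
```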

  • Start from the data: Before you begin designing your architecture, you must have a clear picture of the data. Consider data origin, size, velocity, and updates. A holistic optimization of performance and cost looks beyond compute alone and includes these data considerations. AWS has a strong offering of data and related services, including data visualization, which enables you to extract the most value from your data.
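
For example, a quick inventory of an existing dataset can inform the design. The sketch below, which assumes the inputs already live in an S3 bucket (the bucket and prefix names are hypothetical), totals object count and size with boto3.

```python
# Minimal sketch: measure dataset size and object count before sizing
# compute and storage. Bucket and prefix are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total_bytes = 0
total_objects = 0
for page in paginator.paginate(Bucket="hpc-input-data", Prefix="simulations/"):
    for obj in page.get("Contents", []):
        total_bytes += obj["Size"]
        total_objects += 1

print(f"{total_objects} objects, {total_bytes / 1e12:.2f} TB")
```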

  • Automate to simplify architectural experimentation: Automation through code allows you to create and replicate your systems at low cost and avoid the expense of manual effort. You can track changes to your code, audit their impact, and revert to previous versions when necessary. The ability to easily experiment with infrastructure allows you to optimize the architecture for performance and cost. AWS offers tools, such as AWS ParallelCluster, that help you get started with treating your HPC cloud infrastructure as code.
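
As a sketch of this principle, the snippet below creates and later deletes an entire cluster from a version-controlled configuration file. It assumes the AWS ParallelCluster v3 CLI is installed; the cluster name and configuration path are hypothetical.

```python
# Sketch: treat cluster infrastructure as disposable code. Assumes the
# AWS ParallelCluster v3 CLI ("pcluster") is installed and configured.
import subprocess

def create_cluster(name: str, config_path: str) -> None:
    # Launch a cluster whose entire definition lives in version control.
    # (The CLI call returns once creation has started; it is asynchronous.)
    subprocess.run(
        ["pcluster", "create-cluster",
         "--cluster-name", name,
         "--cluster-configuration", config_path],
        check=True,
    )

def delete_cluster(name: str) -> None:
    # Tear the cluster down when the experiment ends; recreate it later.
    subprocess.run(
        ["pcluster", "delete-cluster", "--cluster-name", name],
        check=True,
    )

create_cluster("experiment-42", "cluster-config.yaml")
# ... run the workload, gather results ...
delete_cluster("experiment-42")
```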

  • Enable collaboration: HPC work often occurs in a collaborative context, sometimes spanning many countries around the world. Beyond immediate collaboration, methods and results are often shared with the wider HPC and scientific community. It’s important to consider in advance which tools, code, and data may be shared, and with whom. The delivery methods should be part of this design process. For example, workflows can be shared in many ways on AWS: you can use Amazon Machine Images (AMIs), Amazon Elastic Block Store (Amazon EBS) snapshots, Amazon Simple Storage Service (Amazon S3) buckets, AWS CloudFormation templates, AWS ParallelCluster configuration files, AWS Marketplace products, and scripts. Take full advantage of the AWS security and collaboration features that make AWS an excellent environment for you and your collaborators to solve your HPC problems. This helps your computing solutions and datasets achieve a greater impact by securely sharing within a selective group or publicly sharing with the broader community.
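
As one hedged example of these sharing mechanisms, the boto3 calls below grant a collaborator’s AWS account access to an AMI and an EBS snapshot; the resource IDs and the account number are hypothetical placeholders.

```python
# Sketch: share a packaged workflow with a specific collaborator account.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Grant launch permission on an AMI that packages the shared workflow.
ec2.modify_image_attribute(
    ImageId="ami-0123456789abcdef0",
    LaunchPermission={"Add": [{"UserId": "123456789012"}]},  # collaborator
)

# Share an EBS snapshot containing reference data with the same account.
ec2.modify_snapshot_attribute(
    SnapshotId="snap-0123456789abcdef0",
    Attribute="createVolumePermission",
    OperationType="add",
    UserIds=["123456789012"],
)
```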

  • Use cloud-native designs: It is usually unnecessary and suboptimal to replicate your on-premises environment when you migrate workloads to AWS. The breadth and depth of AWS services enables HPC workloads to run in new ways using new design patterns and cloud-native solutions. For example, each user or group can use a separate cluster, which can independently scale depending on the load. Users can rely on a managed service, like AWS Batch, or serverless computing, like AWS Lambda, to manage the underlying infrastructure. Consider whether your workload requires a traditional cluster scheduler at all, and use one only when it does. In the cloud, HPC clusters do not require permanence and can be ephemeral resources. When you automate your cluster deployment, you can terminate one cluster and quickly launch a new one with the same or different parameters. This approach creates environments as they are needed.
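
A minimal sketch of that pattern with boto3: hand scheduling and scaling to AWS Batch instead of running your own scheduler. The job queue and job definition names are hypothetical and must already exist in your account.

```python
# Sketch: let AWS Batch provision and schedule capacity; no cluster
# scheduler is run or maintained by the user. Names are placeholders.
import boto3

batch = boto3.client("batch", region_name="us-east-1")

# Submit one job per input shard; Batch scales compute as needed.
for shard in range(100):
    batch.submit_job(
        jobName=f"simulation-shard-{shard}",
        jobQueue="hpc-spot-queue",
        jobDefinition="cfd-solver:3",
        containerOverrides={
            "environment": [{"name": "SHARD", "value": str(shard)}],
        },
    )
```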

  • Test real-world workloads: The only way to know how your production workload will perform in the cloud is to test it on the cloud. Most HPC applications are complex, and their memory, CPU, and network patterns often can’t be reduced to a simple test. Also, application requirements for infrastructure vary based on which application solvers (mathematical methods or algorithms) your models use, as well as the size and complexity of the models. For this reason, generic benchmarks aren’t reliable predictors of actual HPC production performance. Similarly, there is little value in testing an application with a small benchmark set or “toy problem.” With AWS, you only pay for what you actually use; therefore, it is feasible to do a realistic proof of concept with your own representative models. A major advantage of a cloud-based platform is that a realistic, full-scale test can be done before migration.

  • Balance time-to-results and cost reduction: Analyze performance using the most meaningful parameters: time and cost. Workloads that are not time-sensitive should be optimized for cost, and Spot Instances are usually the least expensive option for them. For example, if a researcher has a large number of lab measurements that must be analyzed sometime before next year’s conference, Spot Instances can help analyze the largest possible number of measurements within the fixed research budget. Conversely, for time-critical workloads, such as emergency response modeling, cost optimization can be traded for performance, and the instance type, procurement model, and cluster size should be chosen for the shortest execution time. When comparing platforms, take the entire time-to-solution into account, including non-compute aspects such as provisioning resources, staging data, or, in more traditional environments, time spent in job queues. A back-of-the-envelope sketch of the trade-off follows below.
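
To make the trade-off concrete, here is a simple cost-versus-time calculation; the prices, fleet sizes, and runtimes are illustrative placeholders, not actual AWS quotes.

```python
# Back-of-the-envelope comparison: same total core-hours, different
# procurement models. All numbers are illustrative placeholders.
def campaign_cost(instances: int, hours_per_instance: float,
                  price_per_hour: float) -> float:
    return instances * hours_per_instance * price_per_hour

# Non-urgent sweep on Spot: deep discount, interruption tolerated.
spot = campaign_cost(instances=100, hours_per_instance=10, price_per_hour=0.90)

# Urgent run on On-Demand: larger fleet, full price, shortest wall-clock time.
on_demand = campaign_cost(instances=400, hours_per_instance=2.5, price_per_hour=3.00)

print(f"Spot campaign:      ${spot:,.0f}")       # lowest cost, slower results
print(f"On-Demand campaign: ${on_demand:,.0f}")  # fastest results, higher cost
```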