Design Principles
In the cloud, a number of principles help you increase reliability. In particular, the following are emphasized for HPC workloads. For more information, refer to the design principles in the AWS Well-Architected Framework whitepaper.
-
Scale horizontally to increase aggregate system availability: It is important to consider horizontal scaling options that might reduce the impact of a single failure on the overall system. For example, rather than having one large, shared HPC cluster running multiple jobs, consider creating multiple clusters across the Amazon infrastructure to further isolate your risk of potential failures. Since infrastructure can be treated as code, you can horizontally scale resources inside a single cluster, and you can horizontally scale the number of clusters running individual cases.
-
Stop guessing capacity: A set of HPC clusters can be provisioned to meet current needs and scaled manually or automatically to meet increases or decreases in demand. For example, terminate idle compute nodes when not in use, and run concurrent clusters for processing multiple computations rather than waiting in a queue.
-
Manage change in automation: Automating changes to your infrastructure allows you to place a cluster infrastructure under version control and make exact duplicates of a previously created cluster. Automation changes must be managed.