Operate - High Performance Computing Lens


Operations must be standardized and managed routinely. Focus on automation, small frequent changes, regular quality assurance testing, and defined mechanisms to track, audit, roll back, and review changes. Changes should not be large and infrequent, should not require scheduled downtime, and should not require manual execution. A wide range of logs and metrics based on key operational indicators for a workload must be collected and reviewed to ensure continuous operations.

AWS provides additional tools for handling HPC operations, ranging from monitoring to deployment automation. For example, you can have Auto Scaling restart failed instances, use CloudWatch to monitor your cluster’s load metrics, configure notifications for when jobs finish, or use a managed service (such as AWS Batch) to implement retry rules for failed jobs. Cloud-native tools can greatly improve your application deployment and change management.
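As an illustration of retry rules for failed jobs, the following is a minimal sketch of an AWS Batch job definition payload with a retry strategy. The function name build_job_definition, the job name, the container image, and the resource sizes are hypothetical placeholders. The sketch assumes you want to retry only when the underlying host was lost (a status reason beginning with "Host EC2") and to stop on any other failure; adjust the rules to your own workload.

```python
def build_job_definition(name, image, command, attempts=3):
    """Build keyword arguments for the AWS Batch RegisterJobDefinition call.

    All values passed in are illustrative; nothing here contacts AWS.
    """
    # AWS Batch accepts between 1 and 10 attempts per job.
    if not 1 <= attempts <= 10:
        raise ValueError("attempts must be between 1 and 10")

    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image,
            "command": list(command),
            # Hypothetical sizing: 4 vCPUs and 8 GiB of memory.
            "resourceRequirements": [
                {"type": "VCPU", "value": "4"},
                {"type": "MEMORY", "value": "8192"},
            ],
        },
        "retryStrategy": {
            "attempts": attempts,
            # Rules are evaluated in order; the first match decides.
            "evaluateOnExit": [
                # Retry when the instance running the job was terminated.
                {"onStatusReason": "Host EC2*", "action": "RETRY"},
                # Treat every other failure as final.
                {"onReason": "*", "action": "EXIT"},
            ],
        },
    }


if __name__ == "__main__":
    payload = build_job_definition(
        "example-solver", "example/solver:1.0", ["./run.sh"], attempts=5
    )
    print(payload["retryStrategy"])
```

With credentials configured, a payload like this could be passed to boto3 as boto3.client("batch").register_job_definition(**payload). Keeping the payload construction separate from the API call lets you review and version the retry rules alongside the rest of your configuration.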

Release management processes, whether manual or automated, must be based on small incremental changes and tracked versions. You must be able to revert releases that introduce issues without causing operational impact. Use continuous integration and continuous deployment tools such as AWS CodePipeline and AWS CodeDeploy to automate change deployment. Track source code changes with version control tools, such as AWS CodeCommit, and infrastructure configurations with automation tools, such as AWS CloudFormation templates.

HPCOPS 3: How are you evolving your workload while minimizing the impact of change?
HPCOPS 4: How do you monitor your workload to ensure that it is operating as expected?

Using the cloud for HPC introduces new operational considerations. While on-premises clusters are fixed in size, cloud clusters can scale to meet demand. Cloud-native architectures for HPC also operate differently from on-premises architectures. For example, they use different mechanisms for submitting jobs and for provisioning On-Demand Instance resources as jobs arrive. You must adopt operational procedures that accommodate the elasticity of the cloud and the dynamic nature of cloud-native architectures.
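To make the contrast with a fixed-size cluster concrete, the following is a minimal sketch of how an elastic cluster might size itself from queued demand. It is not an AWS API. The function desired_instances and its parameters are hypothetical, and it assumes demand is measured in pending vCPUs and that all instances are the same size.

```python
import math


def desired_instances(pending_vcpus, vcpus_per_instance,
                      min_instances=0, max_instances=64):
    """Return how many instances to run for the queued demand.

    The result is the smallest instance count that covers the pending
    vCPUs, clamped to the [min_instances, max_instances] range.
    """
    if vcpus_per_instance <= 0:
        raise ValueError("vcpus_per_instance must be positive")
    if min_instances > max_instances:
        raise ValueError("min_instances cannot exceed max_instances")

    # Round up so that a partially filled instance is still provisioned.
    needed = math.ceil(max(pending_vcpus, 0) / vcpus_per_instance)
    return max(min_instances, min(needed, max_instances))


if __name__ == "__main__":
    # An empty queue scales the cluster down to its minimum size.
    print(desired_instances(0, 16))
    # 100 queued vCPUs on 16-vCPU instances need 7 instances.
    print(desired_instances(100, 16))
```

Operational procedures such as capacity alarms and cost reviews then need to account for a cluster whose size follows the queue rather than staying constant.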