Prepare
| HPCOPS02: How do you plan to schedule and run your batch jobs in the cloud? |
|---|
In a traditional HPC environment, there is typically a single scheduling system that is used to handle batch jobs with a variety of characteristics. In the cloud there are a number of options for job scheduling and orchestration, which address different requirements you may have.
- Are your users comfortable with a particular scheduler already, and would they like to continue using this consolidated scheduling approach?
- Do you need to integrate an on-premises environment with your cloud solution?
- Do the run characteristics of your workloads impact the way they are scheduled?
HPCOPS02-BP01 Evaluate options for scheduling jobs in your cloud environment
If you have an existing system, consider how you currently schedule and manage jobs and whether this meets your current requirements. Decide whether you want to complement, augment, or replace your current system with your cloud system, and determine the level of integration needed between your hybrid environments. Determine whether you want to integrate a traditional scheduler with flexible cloud provisioning, use a cloud-native scheduling mechanism such as AWS Batch, or combine both approaches.
Implementation guidance
If you have a simple workflow where you need to run a single job or a small set of jobs without the overhead of a scheduler, implement an event-driven pattern that can run your job directly and tear down the resources automatically.
In a case where batch jobs are not running continuously and there is significant time where your cloud cluster is unused, it may be worth considering additional operations of tearing down your cluster and recreating it either on a fixed schedule or on-demand. This may increase the latency with which the first job begins running and forces any environment customizations to be scripted for repeatability, but can optimize costs. It is important to separate the compute and storage requirements in such a scenario, so that the compute cluster can be deleted and recreated without affecting files on shared file systems. To carry this process even further, you may consider mounting multiple file systems into your cluster and persisting some of them but deleting others.
See the infrastructure as code example on GitHub: Event-driven weather forecasts.
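To decide whether tearing down an idle cluster is worth the added operational steps, you can compare the cost of leaving it running against the overhead of recreating it. The following sketch is illustrative only; the function name and cost model are assumptions, not part of any AWS tool:

```python
def teardown_saves_money(idle_hours: float,
                         cluster_cost_per_hour: float,
                         recreate_cost: float) -> bool:
    """Return True if deleting the cluster during the idle window
    costs less than keeping it running.

    idle_hours: expected idle time between batches of jobs.
    cluster_cost_per_hour: cost of the idle head node and any
        persistent compute (shared file systems excluded, since
        they persist either way when kept separate).
    recreate_cost: one-off overhead of recreating the cluster,
        e.g. bootstrap time, in the same currency.
    """
    idle_cost = idle_hours * cluster_cost_per_hour
    return idle_cost > recreate_cost


# Example: a cluster idling over a weekend (60 hours) at $0.50/hour
# versus an estimated $10 recreation overhead.
print(teardown_saves_money(60, 0.50, 10.0))  # → True (idle cost 30.0 > 10.0)
```

The same comparison also shows why separating storage matters: if file system costs were included in `cluster_cost_per_hour`, the model would overstate the savings, since persistent file systems keep accruing cost after the compute cluster is deleted.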
- If using a scheduler, evaluate cloud-native schedulers and traditional HPC schedulers with cloud integrations, considering the level of operations management each requires.
A traditional HPC scheduler can offer benefits such as familiarity for your system end-users, and minimal or no changes to your existing job scripts. Implementations such as AWS ParallelCluster enable you to leverage these traditional schedulers while still taking advantage of cloud benefits, scaling compute capacity up when jobs are submitted to the scheduler and down when no jobs remain, to optimize cost. Managed implementations such as AWS Parallel Computing Service can further reduce the operational overhead of hosting the scheduler itself.
Meanwhile, cloud-native schedulers can offer reduced operational overhead, and workflow-level integrations that abstract away concepts such as head nodes and compute pools from end-users. They can be a great choice when running standardized workflows and pipelines of tasks; AWS Batch is an example of such a service.
You may choose to implement multiple types of scheduling solutions to suit differing applications, user needs, and job profiles. Alternatively, you might choose a traditional scheduler to meet current user expectations and modernize their workflow in phases. This is often an attractive choice in large organizations with multiple research departments with well-established workflows.
| HPCOPS03: Does your use case require data movement between separated environments, and how is this handled? |
|---|
Will your users be moving data between separated environments, such as an on-premises cluster and the cloud, and if so, do you know the predicted amount and movement patterns? Do you want to enable a seamless data management workflow for movement and archiving for ease of use, or do you want users to make an intentional choice of where they run their workloads at submission time? Have you considered alternative options to minimize the required data movement, such as remote visualization solutions? See Scenarios for additional considerations.
HPCOPS03-BP01 Understand your data movement requirements
In general, moving data back and forth regularly should be avoided unless absolutely necessary. However, identify early whether data needs to be loaded to a location before a job starts, whether results need to be copied out on job completion, and whether this can occur asynchronously. Identify how users are interacting with each of your environments, and whether they are taking advantage of the benefits of each one as far as reasonable.
For example, some environments may offer latency benefits to large datasets and some may offer more flexible hardware choices, and jobs should be distributed accordingly. Being intentional about these trade-offs at job submission time helps you verify that you are only moving data when it is beneficial to do so.
Consider extending the HPC system to include virtual desktop infrastructure (VDI) solutions and/or automated data processing steps. This avoids the need to move data for pre-processing and post-processing, centralizes access control, reduces the security exposure of your files, and reduces the operational burden of managing regular data movement.
Implementation guidance
- Implement remote visualization solutions to minimize data movement, saving movement time and cost, and centralizing file access controls.
Start by implementing AWS Research and Engineering Studio to streamline your VDI requirements while addressing other HPC management needs such as project budgeting. Another option, if you are using a traditional HPC scheduler, is to implement visualization queues, as demonstrated in the AWS ParallelCluster blog post: Elastic visualization queues with Amazon DCV in AWS ParallelCluster.
- Schedule jobs to run near existing data.
Configure automatic job scheduling to run computations close to data sources, eliminating separate data transfer operations. Alternatively, if user-controlled job placement is preferred, ensure data location visibility to help users select optimal computing environments. For data movement and hybrid storage considerations, see the Performance Efficiency pillar.
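The placement trade-off above can be made explicit in job submission tooling. The sketch below picks the environment that minimizes the data that must be moved for a job; all names and the simple cost model are hypothetical, not part of any scheduler:

```python
def choose_environment(input_data_location: str,
                       input_size_gb: float,
                       environments: dict) -> str:
    """Pick the environment requiring the least data movement.

    environments maps an environment name to the set of data
    locations it can reach without a transfer, e.g.
    {"on-prem": {"on-prem-nfs"}, "cloud": {"s3-bucket"}}.
    """
    def transfer_gb(env_locations):
        # Data already local to the environment costs nothing to move.
        return 0.0 if input_data_location in env_locations else input_size_gb

    return min(environments, key=lambda e: transfer_gb(environments[e]))


envs = {"on-prem": {"on-prem-nfs"}, "cloud": {"s3-bucket"}}
print(choose_environment("s3-bucket", 500.0, envs))  # → cloud
```

A real implementation would weigh more than locality, for example hardware availability and queue depth, but surfacing even this simple signal to users at submission time supports the intentional placement choices described above.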
| HPCOPS04: How will you handle future environment updates with minimal user impact? |
|---|
With constantly improving hardware, service, and product offerings and patches, it is worthwhile considering how you can design your system upfront to allow for easy replacement of modules and phased migrations between environments with minimal effort and automated testing.
HPCOPS04-BP01 Minimize impact when migrating users and their jobs between HPC environments
Standardize access across HPC environments, and retain and migrate data in the case of environment upgrades or migrations. If you are using a scheduling mechanism, understand how it can be migrated to different environments, and the impact on running jobs. Some HPC cluster environments have long-running jobs that span time periods that might otherwise be used as maintenance windows, such as weekends. In such cases, you may consider a longer period for blue/green migrations, where new jobs are submitted to the new cluster and the old cluster is given multiple days to complete all remaining jobs before being deleted.
Separate your file system from the lifecycle of your HPC environment, and implement regular backups. For example, while a tool like AWS ParallelCluster is able to create an FSx for Lustre file system as part of the cluster deployment, creating the file system separately and mounting it into the cluster allows your data to persist when the cluster is deleted.
Consider decoupling cluster access from user access, for example with an Application Load Balancer (ALB) or Elastic IP addresses, or by abstracting submission to the scheduler with a user-facing submission portal that can be connected to different schedulers, such as Open OnDemand.
Implementation guidance
- Manage your data operations separately from the lifecycle of your compute environment.
When migrating between compute environments, users' data should be preserved as far as possible. Create file systems separately from the infrastructure as code stacks that define the compute environments, and reference the file systems to import them into your cluster where possible. If using AWS ParallelCluster, for example, mount existing file systems in the SharedStorage section of the cluster configuration file. You can then handle the operations of different compute environments flexibly, for example integrating different compute orchestration services, while providing a single location for end-users to store their data.
Educate users if you intend to treat particular file systems as ephemeral or scratch, so that they know any data stored on these file systems will be lost between environment changes. This may be desirable for use cases where temporary data is created during job runs and not automatically deleted; in such cases an intentional choice can be made to not carry this data between clusters. This also allows you to handle the operations of your data in a more tailored way, for example performing automated backups of your persistent file systems while optimizing costs on your scratch file systems.
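As an illustration, an AWS ParallelCluster configuration can mix an externally managed (persistent) file system with a cluster-managed (ephemeral) one in its SharedStorage section. The mount paths, names, and file system ID below are placeholders:

```yaml
# Fragment of an AWS ParallelCluster 3 configuration file.
SharedStorage:
  # Persistent: created outside the cluster stack and imported by ID,
  # so it survives cluster deletion and recreation.
  - Name: persistent-projects
    MountDir: /projects
    StorageType: FsxLustre
    FsxLustreSettings:
      FileSystemId: fs-0123456789abcdef0   # placeholder ID
  # Ephemeral scratch: created and deleted with the cluster.
  - Name: scratch
    MountDir: /scratch
    StorageType: Ebs
    EbsSettings:
      Size: 500
```

Because the persistent file system is referenced by ID rather than defined by the cluster, deleting and recreating the cluster leaves `/projects` intact, while anything on `/scratch` is lost with the cluster, matching the ephemeral contract communicated to users.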
HPCOPS04-BP02 Implement your environments with infrastructure as code and version control your deployments
Implementing your environment with infrastructure as code (IaC) as far as feasible, and complementing it with clearly documented steps for components that cannot be automated, allows you to automate your deployments. These templates can then be put under version control, which allows you to track changes between deployment versions and provides a centralized location for different stakeholders in your organization to observe and approve operational changes. This also gives you the ability to reproduce environments for use cases such as results verification and reproducibility, and to fail back to old versions in the case of regressions.
See the Operational Excellence pillar of the AWS Well-Architected Framework whitepaper for guidance on using infrastructure as code for your deployments. There are a few common customizations for HPC environments. One aspect is that HPC codes are often compiled such that a shared POSIX-compatible file system is required, and these compilations can be lengthy. Therefore, it often makes sense to leave the shared file system that stores the applications running even while the rest of the environment scales up and down elastically.
Another aspect is that HPC instance capacity may be deployed in specific Availability Zones (AZs). If you use this capacity, parametrize or create a mapping of desired Availability Zones in your IaC templates to keep them flexible across Regions. Similarly, if you have deployed file systems in a particular Availability Zone that need to be mounted to your environment, deploy your cluster in that same Availability Zone to avoid data transfer between Availability Zones.
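One way to keep the Availability Zone choice flexible is a CloudFormation parameter of the AZ-specific type, referenced wherever zonal resources are defined. The resource names and CIDR below are placeholders, and the fragment assumes a VPC defined elsewhere in the template:

```yaml
# CloudFormation fragment: parametrize the Availability Zone so the
# same template works across Regions and capacity placements.
Parameters:
  ClusterAZ:
    Type: AWS::EC2::AvailabilityZone::Name
    Description: AZ holding both the HPC instance capacity and the
      cluster file system, avoiding cross-AZ data transfer.

Resources:
  ClusterSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      AvailabilityZone: !Ref ClusterAZ
      CidrBlock: 10.0.0.0/24        # placeholder CIDR
      VpcId: !Ref ClusterVpc        # assumes a VPC defined elsewhere
```

Passing the zone as a parameter at deploy time keeps the template itself Region-agnostic, while still pinning the subnet, capacity, and file system to a single Availability Zone.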
Implementation guidance
- Utilize tools such as AWS CloudFormation and bootstrap scripts to define your HPC environments with code.
Tools such as AWS ParallelCluster are themselves forms of IaC, but can also sit within broader AWS CloudFormation IaC scripts, as detailed in the tutorial: Creating a cluster with AWS CloudFormation. This allows you to provision full-stack deployments from these scripts, and examples of such environments and helper scripts for further customization are documented in the HPC Recipes for AWS repository.
| HPCOPS05: How will your system respond to failures and anomalies? |
|---|
- Have you designed your architecture to mitigate predictable failure modes of your system and user jobs?
- How easy will it be to diagnose and correct various error sources, and are there opportunities to automate responses?
While we will test the system and implement recovery strategies in the Reliability pillar of the lens, planning for predictable failure modes and working backwards to architect solutions will have implications for your operational decisions.
HPCOPS05-BP01 Predict how your system will respond to failures and design your operational management to mitigate these
For some of the potential failure modes you identify, you may be able to mitigate them entirely by considering alternative services and products early that reduce your operational burden. For others you may have to implement automated responses or documented runbooks as part of your reliability planning.
Specifically, for HPC environments, there are a number of operational procedures you can modify when considering failure modes. Determine and configure the behavior of your scheduler in the case of compute node failures, for example if it resubmits jobs and/or notifies users. If the head node is self-hosted, consider designing procedures to handle its failure. For example, you may choose to implement an alerting operation so you can manually intervene or opt to add an active failover head node to avoid interruption in cluster operations.
For tightly coupled HPC jobs, architecting for job-level resiliency at runtime may come at the expense of job performance, or may not be possible at all (that is, any compute node failure results in total job failure). Alternatives such as checkpointing state for long-running jobs, and automatically resubmitting jobs in the case of infrastructure failure, should be implemented where possible.
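The checkpoint-and-resubmit pattern can be sketched as follows. The checkpoint file name, format, and step granularity are illustrative assumptions; in a real solver the saved state would be the simulation fields, written to a shared, persistent file system:

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # lives on a shared, persistent file system

def load_checkpoint():
    """Resume from the last saved step, or start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"step": 0, "value": 0.0}

def save_checkpoint(state):
    # Write-then-rename so a failure mid-write never corrupts
    # the previous checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def run(total_steps=100, checkpoint_every=10):
    state = load_checkpoint()
    while state["step"] < total_steps:
        state["value"] += 1.0          # stand-in for one solver step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state)
    return state

# If the job is killed and resubmitted, run() resumes from the most
# recent checkpoint instead of restarting from step 0.
final = run()
print(final["step"])  # → 100
```

The periodic save bounds the amount of lost work to one checkpoint interval, so the scheduler's automatic resubmission (or an AWS Batch retry) repeats at most `checkpoint_every` steps rather than the whole job.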
Implementation guidance
- Minimize your operational burden by choosing managed services where possible and configuring your environment to automate recovery.
Choosing managed services and features for your storage and scheduling systems reduces your operational burden; AWS Parallel Computing Service, for example, manages the scheduler infrastructure for you. As another example, if using AWS Batch you can implement automated job retry strategies to take action based on the reason for failure.
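As an illustration, an AWS Batch job definition can retry only on infrastructure-level failures, such as the host instance being reclaimed, while failing fast on application errors. The attempt count below is an illustrative choice:

```json
{
  "retryStrategy": {
    "attempts": 3,
    "evaluateOnExit": [
      {
        "onStatusReason": "Host EC2*",
        "action": "RETRY"
      },
      {
        "onReason": "*",
        "action": "EXIT"
      }
    ]
  }
}
```

The first rule retries when the status reason indicates a host-level failure (AWS Batch sets status reasons beginning with "Host EC2" when the underlying instance is terminated), and the catch-all second rule exits immediately on any other failure, so application bugs are not retried pointlessly.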