This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.

Migration approaches, patterns, and anti‑patterns

Many financial services organizations already have some form of HPC environment hosted in an on-premises data center. If you’re migrating from such an implementation, it’s important to consider the best method to complete the migration. The optimal approach depends on the desired outcome, risk appetite, and timescale, but typically begins with one of the six Rs: Rehosting, Replatforming, Repurchasing, Refactoring/Re-architecting, and, to a lesser degree, Retiring or Retaining (revisiting).

HPC cloud migrations typically progress through the following stages. They begin on premises; customers then burst to the cloud to augment capacity; next, whole clusters are ‘lifted and shifted’ to the cloud; optimizations are then applied to make the most of new provisioning models; and finally, customers explore ‘cloud native’ solutions. The nuances and timing of each stage depend on the individual business.

The first stage customers explore is bursting capacity. In this mode, very little changes in the existing on-premises HPC environment. However, at times of peak demand, Amazon EC2 instances can be created and added to the system to provide additional capacity. The trigger for the creation of these instances is usually one of the following:

  • Scheduled – If workloads are predictable in terms of timing and scale, then a simple schedule to add and remove a fixed number of hosts at predefined times can be effective. The schedule can be managed by an on-premises system, or with Amazon EventBridge rules.

  • Demand based – In this mode, a component monitors the performance of workloads and adds or removes capacity based on demand. If a task queue starts to grow, additional instances can be requested through the AWS API; if the queue shrinks, instances can be removed. A minimal sketch of this pattern follows this list.

  • Predictive – In some cases, especially when the startup time for a new instance is long (perhaps because of very large package dependencies or complex OS builds), it might be desirable to use a simple machine learning model to analyze historic demand and determine when to bring capacity online. This approach is rare, but can work well when combined with a demand-based approach.
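As an illustration of the demand-based trigger, here is a minimal Python sketch using boto3. It assumes a pre-existing EC2 launch template (named hpc-burst-worker here) that defines the AMI, instance type, and networking, plus a hypothetical get_queue_depth() helper wrapping whatever pending-task metric the grid scheduler exposes; the tag, thresholds, and batch size are illustrative, not prescriptive.

    # demand_based_bursting.py - a sketch of demand-based burst capacity.
    import boto3

    ec2 = boto3.client("ec2")

    SCALE_OUT_THRESHOLD = 500  # pending tasks before adding capacity (assumed)
    SCALE_IN_THRESHOLD = 50    # pending tasks below which capacity is removed
    BATCH_SIZE = 10            # instances added or removed per adjustment

    def get_queue_depth() -> int:
        """Placeholder: ask the grid scheduler for its pending-task count."""
        raise NotImplementedError("wire this to your scheduler's API")

    def burst_instance_ids() -> list[str]:
        """Find running burst instances previously tagged by this component."""
        reservations = ec2.describe_instances(
            Filters=[
                {"Name": "tag:role", "Values": ["hpc-burst"]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )["Reservations"]
        return [i["InstanceId"] for r in reservations for i in r["Instances"]]

    def adjust_capacity() -> None:
        depth = get_queue_depth()
        if depth > SCALE_OUT_THRESHOLD:
            # Launch a batch of workers from the prebuilt launch template.
            ec2.run_instances(
                LaunchTemplate={"LaunchTemplateName": "hpc-burst-worker"},
                MinCount=1,
                MaxCount=BATCH_SIZE,
                TagSpecifications=[{
                    "ResourceType": "instance",
                    "Tags": [{"Key": "role", "Value": "hpc-burst"}],
                }],
            )
        elif depth < SCALE_IN_THRESHOLD:
            # Remove a batch of burst workers once they have drained.
            ids = burst_instance_ids()[:BATCH_SIZE]
            if ids:
                ec2.terminate_instances(InstanceIds=ids)

Run adjust_capacity() on a fixed interval (for example, from a cron job or an Amazon EventBridge scheduled rule targeting a small function) and the burst pool tracks queue depth; the same loop also covers the scheduled trigger if the thresholds are replaced with a timetable.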

As customers build confidence in their ability to supplement existing capacity with cloud-based instances, they often decide to complete a migration. However, with existing on-premises hardware still available, customers want to extract the remaining value from that infrastructure before it’s decommissioned. In this case, it can make sense to provision a new strategic grid in the cloud, with all of the same scheduler components, and retain the existing on-premises grid. It’s then left to the upstream clients to direct workloads accordingly, switching to the cloud-based grid as the on-premises capacity is gradually retired.

When you have completed the migration and are running all of your HPC workloads in the cloud, the on-premises infrastructure can be removed. At this point, you have completed a Rehosting approach. With your infrastructure in the cloud, you then have the flexibility to look at Replatforming or Refactoring your environment. The ability to build entirely new architectures in the cloud alongside existing production systems means that new approaches can be fully tested before they’re put into production.

One anti-pattern that’s occasionally proposed by customers involves platform stacking. In this approach, solutions such as virtualization and/or container platforms are placed under the HPC platform to try to create portability or parity between cloud-based systems and on-premises systems. This approach can have some disadvantages:

  • Computational inefficiency – By adding more layers between the analytics binaries and the CPUs, computational efficiency is inevitably degraded as CPU cycles are consumed by the abstraction layers.

  • Licensing costs – HPC environments are large and continue to grow. Though enterprise licenses can keep the upfront costs of using these technologies very low, the large number of CPU cores involved in HPC workloads can mean significant additional costs when the licenses are due for renewal.

  • Management overhead – In the simplest approach, an Amazon EC2 instance can be created on demand using an Amazon Linux 2 AMI. This AMI is patched and up to date, and because the instance exists for just a few hours, it requires no further management (this simple approach is sketched after this list). However, by building HPC stacks on top of other abstractions, those long-running layers need patching and upgrading, and when multiple layers are involved, the scope for disruption through planned maintenance or an unplanned outage increases significantly.

  • Scaling challenges – Amazon EC2 instances can be available very quickly on demand. If scaling out involves the creation of a complex stack before processes can run, this adds to the billing time of the instance before useful work can be done. In worst-case scenarios, there can be a temptation to leave large numbers of instances running so that they’re available if additional workloads arise.

  • Optimization challenges – HPC systems are already complex, especially when supporting huge volumes of variable workloads with different CPU and memory requirements. Knowing where CPU and memory resources are consumed is vital to identifying bottlenecks or debugging failures. If an HPC platform is based on a series of abstraction layers, this can introduce additional variables that make it difficult to see where inefficiencies exist, and as a result they might never be found.

  • Security challenges – Securing a more complex stack can be challenging because there are more components to configure, monitor, and maintain to ensure the integrity of the system.
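For contrast with the management overhead point above, the following hedged sketch shows the simple approach: resolve the latest Amazon Linux 2 AMI from its public SSM parameter and launch a short-lived worker directly from it. The instance type is an illustrative choice, not a recommendation.

    # ephemeral_worker.py - launch a short-lived worker from the latest
    # Amazon Linux 2 AMI, leaving no long-running layer to patch.
    import boto3

    ssm = boto3.client("ssm")
    ec2 = boto3.client("ec2")

    # AWS publishes the latest Amazon Linux 2 AMI ID as a public SSM
    # parameter, so every launch picks up a current, patched image.
    ami_id = ssm.get_parameter(
        Name="/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2"
    )["Parameter"]["Value"]

    ec2.run_instances(
        ImageId=ami_id,
        InstanceType="c5.4xlarge",  # illustrative compute-optimized choice
        MinCount=1,
        MaxCount=1,
        # Keep the instance truly ephemeral: when the worker finishes and
        # shuts down the OS, EC2 terminates the instance.
        InstanceInitiatedShutdownBehavior="terminate",
    )

Because the instance is created, used, and terminated within hours, there is nothing left behind to patch, upgrade, or monitor between runs.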

By defining portability in terms of a virtual machine image or a Docker image, you can strike a good balance, retaining portability while offsetting some of these disadvantages through cloud-native virtualization with Amazon EC2 and/or container management solutions such as Amazon ECS and Amazon EKS, especially when combined with AWS Fargate.
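As one hedged illustration of the container route, the sketch below registers an ECS task definition for a hypothetical containerized pricing binary and runs it on AWS Fargate. The cluster name, image URI, IAM role, subnets, and resource sizes are all placeholders you would replace with your own.

    # fargate_task.py - run a containerized analytics task on AWS Fargate.
    import boto3

    ecs = boto3.client("ecs")

    # Register a task definition for a (hypothetical) containerized pricer.
    task_def = ecs.register_task_definition(
        family="risk-pricer",
        requiresCompatibilities=["FARGATE"],
        networkMode="awsvpc",
        cpu="4096",     # 4 vCPU (a valid Fargate task size; illustrative)
        memory="8192",  # 8 GiB
        executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
        containerDefinitions=[{
            "name": "pricer",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/pricer:latest",
            "essential": True,
        }],
    )

    # Run one task on Fargate; AWS manages the underlying hosts.
    ecs.run_task(
        cluster="hpc-cluster",
        launchType="FARGATE",
        taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],
                "securityGroups": ["sg-0123456789abcdef0"],
                "assignPublicIp": "DISABLED",
            }
        },
    )

Because Fargate provisions and patches the underlying hosts, the container image remains the single portable artifact without adding a self-managed orchestration layer to the stack.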

Keeping HPC systems as simple as possible provides the best performance at the lowest cost. Most HPC solutions are already platforms by design and offer portability through simple deployment patterns to standard operating systems.

Once fully in the cloud, the next logical step for an HPC team is to focus on optimization. Ultimately, getting the calculations done at the right time and at the lowest cost is a key measure of success, and the cloud offers more levers to control how this is achieved, particularly in relation to the supply of compute. The process of optimization could be triggered by a new customer workload pattern, the availability of a new EC2 instance type, or a price reduction. Fortunately, the elastic nature of the cloud means that it’s easy to experiment, review the results, and make changes without the risk of a costly mistake.
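As one small, concrete example of such a lever, the sketch below surveys recent Spot prices for a shortlist of candidate instance types before placing interruption-tolerant work. The shortlist is an assumption, and a real optimization loop would weigh many more inputs (performance per core, availability, Savings Plans, and so on).

    # spot_price_survey.py - compare recent Spot prices across candidates.
    from datetime import datetime, timedelta, timezone
    import boto3

    ec2 = boto3.client("ec2")

    CANDIDATES = ["c5.4xlarge", "c5a.4xlarge", "c6i.4xlarge"]  # assumed list

    def latest_spot_prices() -> dict[str, float]:
        """Return the most recent Linux Spot price observed per type."""
        since = datetime.now(timezone.utc) - timedelta(hours=1)
        history = ec2.describe_spot_price_history(
            InstanceTypes=CANDIDATES,
            ProductDescriptions=["Linux/UNIX"],
            StartTime=since,
        )["SpotPriceHistory"]
        prices: dict[str, float] = {}
        for record in history:  # records are returned newest-first
            prices.setdefault(record["InstanceType"], float(record["SpotPrice"]))
        return prices

    if __name__ == "__main__":
        for itype, price in sorted(latest_spot_prices().items(),
                                   key=lambda kv: kv[1]):
            print(f"{itype}: ${price:.4f}/hour")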

Once the team is confident in optimizing a shared HPC cluster, it’s common to then review the multi-tenant approach entirely. Multi-tenancy was instrumental in driving up the overall utilization of large on-premises HPC clusters, allowing any user group with pending tasks to make use of idle resources. However, the cloud’s ability to right-size capacity according to the demands of the day means that there should be very little idle capacity available. This has led to a trend where an organization’s consumers of HPC have their own individual clusters, each dynamically sized and optimized to their individual needs without any increase in aggregate costs.

In such federated models, individual consumer groups (such as a trading desk or a particular business line) typically have their own AWS accounts, within which they are free to optimize for their individual workloads while benefitting from standardized tooling and expert guidance from the central HPC team. Separating consumers increases agility, as changes need only be tested within a more limited scope. Additionally, the separate accounts make it easier to understand the costs associated with an instance of the HPC environment and the cost impact of any changes made.

Lastly, customers are increasingly looking to the future as they make longer-term plans for HPC. The overall trend is towards serverless technologies, which can abstract away many of the concerns of availability and capacity.