Failure Management

Any complex system can expect failures to occur occasionally, and it is critical to detect these failures, respond to them, and prevent them from happening again. Failure scenarios include the failure of a cluster to start up or the failure of a specific workload.

HPCREL 2: How does your application use checkpointing to recover from failures?

Failure tolerance can be improved in multiple ways. For long-running cases, incorporating regular checkpoints in your code allows you to resume from a partial state after a failure. Checkpointing is a common feature of application-level failure management and is already built into many HPC applications. The most common approach is for applications to periodically write out intermediate results. The intermediate results offer potential insight into application errors and allow a case to be restarted as needed, losing only the work performed since the last checkpoint.
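
The following is a minimal Python sketch of application-level checkpointing, assuming an iterative workload whose state can be serialized. The file name, checkpoint interval, and solver placeholders are illustrative, not part of any particular HPC application; in practice the checkpoint should be written to durable storage.

```python
import os
import pickle

CHECKPOINT_FILE = "checkpoint.pkl"   # illustrative path; use durable storage in practice
CHECKPOINT_INTERVAL = 100            # iterations between checkpoints (tune per workload)

def initial_state():
    return {"values": [0.0] * 1000}  # placeholder for real solver state

def advance(state):
    return state                     # placeholder for one solver step

def load_checkpoint():
    """Resume from the most recent checkpoint if one exists."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0, "state": initial_state()}

def save_checkpoint(snapshot):
    """Write intermediate results atomically so a crash mid-write cannot corrupt the file."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(snapshot, f)
    os.replace(tmp, CHECKPOINT_FILE)

snapshot = load_checkpoint()
for i in range(snapshot["iteration"], 10_000):
    snapshot["state"] = advance(snapshot["state"])
    snapshot["iteration"] = i + 1
    if snapshot["iteration"] % CHECKPOINT_INTERVAL == 0:
        save_checkpoint(snapshot)
```

On restart after a failure, the same script picks up from the last saved iteration rather than from the beginning of the case.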

Checkpointing is particularly useful on Spot Instances, which are highly cost-effective but can be interrupted. In addition, some applications may benefit from changing the default Spot interruption behavior (for example, stopping or hibernating the instance rather than terminating it). It is important to consider the durability of the storage option when relying on checkpointing for failure management.
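
The following is a minimal boto3 sketch of requesting a Spot Instance with a non-default interruption behavior, assuming instances are launched directly with run_instances rather than through a scheduler or AWS ParallelCluster. The AMI, instance type, and subnet IDs are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Placeholder values: substitute your own AMI, instance type, and subnet.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="c5n.18xlarge",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            # "stop" and "hibernate" require a persistent Spot request;
            # the default interruption behavior is "terminate".
            "SpotInstanceType": "persistent",
            "InstanceInterruptionBehavior": "stop",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```

With the stop behavior, checkpoint data on the instance's EBS volumes is preserved across an interruption and the workload can resume when capacity returns.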

HPCREL 3: How have you planned for failure tolerance in your architecture?

Failure tolerance can be improved by deploying to multiple Availability Zones. The low-latency requirements of tightly coupled HPC applications require that each individual case reside within a single cluster placement group and Availability Zone. In contrast, loosely coupled applications do not have such low-latency requirements and can improve failure tolerance by deploying across several Availability Zones.
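
The following boto3 sketch illustrates both patterns. The placement group name, AMI, instance types, and subnet IDs are placeholders, and most HPC deployments would drive this through a scheduler or AWS ParallelCluster rather than raw EC2 calls.

```python
import boto3

ec2 = boto3.client("ec2")

# Tightly coupled: keep all instances of one case in a single cluster placement
# group, which places them in a single Availability Zone for low latency.
ec2.create_placement_group(GroupName="hpc-case-pg", Strategy="cluster")
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",       # placeholder AMI
    InstanceType="c5n.18xlarge",
    MinCount=16,
    MaxCount=16,
    Placement={"GroupName": "hpc-case-pg"},
)

# Loosely coupled: spread independent workers across subnets in different
# Availability Zones so one zone failure does not stop the whole campaign.
for subnet_id in ["subnet-aaaa1111", "subnet-bbbb2222"]:  # placeholders, one per AZ
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="c5.4xlarge",
        MinCount=8,
        MaxCount=8,
        SubnetId=subnet_id,
    )
```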

Consider the tradeoff between the reliability and cost pillars when making this design decision. Duplication of compute and storage infrastructure (for example, a head node and attached storage) incurs additional cost, and there may be data transfer charges for moving data to another Availability Zone or to another AWS Region. For non-urgent use cases, it may be preferable to move into another Availability Zone only as part of a disaster recovery (DR) event.