Compute - High Performance Computing Lens

Compute

HPCPERF 1: How do you select your compute solution?

The optimal compute solution for a particular HPC architecture depends on the workload deployment method, degree of automation, usage patterns, and configuration. Different compute solutions may be chosen for each step of a process. Selecting the wrong compute solutions for an architecture can lead to lower performance efficiency.

Instances are virtualized servers and come in different families and sizes to offer a wide variety of capabilities. Some instance families target specific workloads, for example, compute-, memory-, or GPU-intensive workloads. Other instances are general purpose.

Both the targeted-workload and general-purpose instance families are useful for HPC applications. Instances of particular interest to HPC include the compute-optimized family and accelerated instance types, such as those with GPUs and FPGAs.

Some instance families provide variants within the family for additional capabilities. For example, an instance family may have a variant with local storage, greater networking capabilities, or a different processor. These variants can be viewed in the Instance Type Matrix and may improve the performance of some HPC workloads.

Within each instance family, one or more instance sizes allow vertical scaling of resources. Some applications require a larger instance type (for example, 24xlarge) while others run on smaller types (for example, large), depending on the number of processes supported by the application. For a tightly coupled workload, the optimum performance is obtained with the largest instance type.

The T-series instance family is designed for applications with moderate CPU usage that can benefit from bursting beyond a baseline level of CPU performance. Most HPC applications are compute-intensive and suffer a performance decline with the T-series instance family.

Applications vary in their requirements (for example, desired cores, processor speed, memory requirements, storage needs, and networking specifications). When selecting an instance family and type, begin with the specific needs of the application. Instance types can be mixed and matched for applications requiring targeted instances for specific application components.
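The selection process described above can be sketched as a simple filter over candidate instance shapes. This is a minimal illustration only: the `InstanceShape` type, the `pick_instance` helper, and the catalog entries are hypothetical and do not correspond to real instance specifications. A real catalog would be built from the Instance Type Matrix.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class InstanceShape:
    """Hypothetical description of one instance size."""
    name: str
    cores: int
    memory_gib: float
    network_gbps: float


def pick_instance(
    catalog: Sequence[InstanceShape],
    cores: int,
    memory_gib: float,
    network_gbps: float = 0.0,
) -> Optional[InstanceShape]:
    """Return the smallest shape that meets every stated requirement."""
    candidates = [
        shape
        for shape in catalog
        if shape.cores >= cores
        and shape.memory_gib >= memory_gib
        and shape.network_gbps >= network_gbps
    ]
    if not candidates:
        return None
    # Prefer the fewest cores, then the least memory, among valid shapes.
    return min(candidates, key=lambda s: (s.cores, s.memory_gib))


# Made-up catalog entries, for illustration only.
CATALOG = [
    InstanceShape("example.large", cores=1, memory_gib=4, network_gbps=10),
    InstanceShape("example.4xlarge", cores=8, memory_gib=32, network_gbps=10),
    InstanceShape("example.24xlarge", cores=48, memory_gib=192, network_gbps=25),
]

choice = pick_instance(CATALOG, cores=6, memory_gib=16)
print(choice.name if choice else "no matching shape")
```

Starting from the application's needs, rather than from a preferred instance family, keeps the choice testable: the requirements can be changed and the selection rerun for each component of the application.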

Containers are a method of operating system virtualization that is attractive for many HPC workloads, particularly if the applications have already been containerized. AWS services such as AWS Batch, Amazon Elastic Container Service (ECS), and Amazon Elastic Kubernetes Service (EKS) help deploy containerized applications.

Functions abstract the execution environment. AWS Lambda allows you to execute code without deploying, running, or maintaining an instance. Many AWS services emit events based on activity inside the service, and a Lambda function can often be triggered by these service events. For example, a Lambda function can be executed after an object is uploaded to Amazon S3. Many HPC users use Lambda to automatically execute code as part of their workflow.
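As an illustration of the S3 example, the following is a minimal sketch of a Python Lambda handler that reads the bucket and key from an S3 upload notification. The `extract_uploaded_objects` helper and the return value are assumptions made for this sketch. What happens to each object (for example, submitting a job) depends on the workflow and is left as a print statement.

```python
from urllib.parse import unquote_plus


def extract_uploaded_objects(event):
    """Collect (bucket, key) pairs from an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")
        if bucket and key:
            # Object keys arrive URL-encoded in S3 notifications.
            objects.append((bucket, unquote_plus(key)))
    return objects


def lambda_handler(event, context):
    """Entry point invoked by Lambda for each S3 notification."""
    uploaded = extract_uploaded_objects(event)
    for bucket, key in uploaded:
        # Placeholder for workflow-specific work on the new object.
        print(f"New input available: s3://{bucket}/{key}")
    return {"processed": len(uploaded)}
```

Keeping the event parsing in its own function allows it to be tested locally with a sample event, without deploying the function.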

There are several choices to make when launching your selected compute instance:

  • Operating system: A current operating system is critical to achieving the best performance and ensuring access to the most up-to-date libraries.

  • Virtualization type: New-generation EC2 instances run on the AWS Nitro System. The Nitro System delivers all the host hardware’s compute and memory resources to your instances, resulting in better overall performance. Dedicated Nitro Cards enable high-speed networking, high-speed EBS, and I/O acceleration. Instances do not hold back resources for management software.

The Nitro Hypervisor is a lightweight hypervisor that manages memory and CPU allocation and delivers performance that is indistinguishable from bare metal. The Nitro System also makes bare metal instances available to run without the Nitro Hypervisor. Launching a bare metal instance boots the underlying server, which includes verifying all hardware and firmware components. This means it can take longer before the bare metal instance becomes available to start your workload, as compared to a virtualized instance. The additional initialization time must be considered when operating in a dynamic HPC environment where resources launch and terminate based on demand.

HPCPERF 2: How do you optimize the compute environment for your application?

Underlying hardware features: In addition to choosing an AMI, you can further optimize your environment by taking advantage of the hardware features of the underlying Intel processors. There are four primary methods to consider when optimizing the underlying hardware:

  1. Advanced processor features

  2. Intel Hyper-Threading Technology

  3. Processor affinity

  4. Processor state control

HPC applications can benefit from these advanced processor features (for example, Advanced Vector Extensions) and can increase their calculation speeds by compiling the software for the Intel architecture. The compiler options for architecture-specific instructions vary by compiler (check the usage guide for your compiler).
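Before compiling for architecture-specific instructions, it can help to confirm which vector extensions the instance's processor reports. The sketch below assumes a Linux instance, where `/proc/cpuinfo` lists processor capabilities on a `flags` line. The helper names and the list of extensions checked are choices made for this illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            return set(value.split())
    return set()


def supported_vector_extensions(flags):
    """List the AVX-family extensions present in a set of CPU flags."""
    candidates = ["avx", "avx2", "avx512f"]
    return [name for name in candidates if name in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = ""  # Not on Linux, or the file is unavailable.
    print(supported_vector_extensions(cpu_flags(text)))
```

The reported extensions indicate which architecture-specific compiler options are worth enabling. The option names themselves are compiler-specific and are documented in the compiler's usage guide.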

AWS enables Intel Hyper-Threading Technology, commonly referred to as “hyperthreading,” by default. Hyperthreading improves performance for some applications by allowing one process per hyperthread (two processes per core). Most HPC applications, however, benefit from disabling hyperthreading, so an environment with hyperthreading disabled tends to be preferred for HPC. Hyperthreading is easily disabled in Amazon EC2. Unless an application has been tested with hyperthreading enabled, it is recommended that you disable hyperthreading and launch processes individually pinned to cores when running HPC applications. CPU (processor) affinity makes this process pinning straightforward.
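One way to disable hyperthreading at launch is the `CpuOptions` parameter of the EC2 `RunInstances` API, requesting one thread per core. The sketch below only builds the request parameters and does not call AWS. The `build_launch_params` helper is hypothetical, the AMI ID is a placeholder, and the default vCPU count for the chosen instance type is supplied by the caller.

```python
def build_launch_params(image_id, instance_type, default_vcpus):
    """Build RunInstances parameters that request one thread per core.

    default_vcpus is the instance type's vCPU count with hyperthreading
    enabled, assumed here to be two threads per physical core.
    """
    if default_vcpus < 2 or default_vcpus % 2:
        raise ValueError("expected an even vCPU count (two threads per core)")
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "CpuOptions": {
            # Keep every physical core, but expose a single thread on each.
            "CoreCount": default_vcpus // 2,
            "ThreadsPerCore": 1,
        },
    }


# Placeholder values for illustration; substitute real ones before use.
params = build_launch_params("ami-EXAMPLE", "c5.18xlarge", default_vcpus=72)
print(params["CpuOptions"])
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("ec2").run_instances(**params)`. That call is left out here so the sketch runs without an AWS account.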

Processor affinity can be controlled in a variety of ways. For example, it can be configured at the operating system level (available in both Windows and Linux), set as a compiler flag within the threading library, or specified as an MPI flag during execution. The chosen method of controlling processor affinity depends on your workload and application.
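As an example of the operating-system-level option on Linux, the following sketch pins the current process to a single CPU using Python's `os.sched_setaffinity`. The helper names are hypothetical. On platforms where the call is unavailable, the helpers return an empty list or `None` rather than failing.

```python
import os


def allowed_cpus():
    """Return the sorted CPU IDs this process may run on (Linux only)."""
    if not hasattr(os, "sched_getaffinity"):
        return []
    return sorted(os.sched_getaffinity(0))


def pin_current_process(cpu_id):
    """Pin the calling process to one CPU and return the new affinity set."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # Affinity control is not exposed on this platform.
    if cpu_id not in os.sched_getaffinity(0):
        raise ValueError(f"CPU {cpu_id} is not available to this process")
    os.sched_setaffinity(0, {cpu_id})  # pid 0 means the calling process.
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    cpus = allowed_cpus()
    if cpus:
        print("pinned to", pin_current_process(cpus[0]))
```

In an MPI job, the launcher's own binding flags are usually the more convenient place to set affinity, because they place every rank consistently. The operating-system call is useful for single processes or custom launch scripts.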

AWS enables you to tune the processor state control on certain instance types. You may consider altering the C-state (idle states) and P-state (operational states) settings to optimize your performance. The default C-state and P-state settings provide maximum performance, which is optimal for most workloads. However, if your application would benefit from reduced latency in exchange for giving up higher single- or dual-core frequencies, or from consistent performance at lower frequencies as opposed to spiky Turbo Boost frequencies, experiment with the C-state or P-state settings available on select instances.

There are many compute options available to optimize a compute environment. Cloud deployment allows experimentation at every level, from operating system to instance type to bare-metal deployments. Because static clusters are tuned before deployment, time spent experimenting with cloud-based clusters is vital to achieving the desired performance.