HPCPERF 4: How do you select your network solution? |
---|
The optimal network solution for an HPC workload varies based on latency, bandwidth, and throughput requirements. Tightly coupled HPC applications often require the lowest latency possible for network connections between compute nodes. For moderately sized, tightly coupled workloads, it is possible to select a large instance type with a large number of cores so that the application fits entirely within the instance without crossing the network at all.
Alternatively, some applications are network bound and require high network performance. For these applications, select instances with higher network performance; within a family, the largest instance type generally provides the highest network performance. Refer to the instance type matrix for more details [7].
Multiple instances with low latency between the instances are required for large tightly coupled applications. On AWS, this is achieved by launching compute nodes into a cluster placement group, which is a logical grouping of instances within an Availability Zone. A cluster placement group provides non-blocking and non-oversubscribed connectivity, including full bisection bandwidth between instances. Use cluster placement groups for latency sensitive tightly coupled applications spanning multiple instances.
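As a minimal sketch of the launch pattern described above, the following Python builds the request parameters for a cluster placement group and for launching compute nodes into it. The parameter names follow the EC2 `CreatePlacementGroup` and `RunInstances` APIs as exposed by boto3. The group name, AMI ID, instance type, and node count are hypothetical placeholders, and the API calls themselves are left as comments so the sketch runs without AWS credentials.

```python
def placement_group_request(group_name):
    """Parameters for ec2.create_placement_group using the cluster strategy."""
    return {"GroupName": group_name, "Strategy": "cluster"}


def launch_request(group_name, image_id, instance_type, count):
    """Parameters for ec2.run_instances that place every node in the group."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": count,  # MinCount == MaxCount: launch all nodes or none
        "MaxCount": count,
        "Placement": {"GroupName": group_name},
    }


# Hypothetical values, for illustration only.
pg = placement_group_request("hpc-cluster-pg")
nodes = launch_request("hpc-cluster-pg", "ami-EXAMPLE", "c5n.18xlarge", 4)

# With boto3 installed and credentials configured, these could be passed as:
#   ec2 = boto3.client("ec2")
#   ec2.create_placement_group(**pg)
#   ec2.run_instances(**nodes)
print(pg)
print(nodes)
```

Requesting all nodes in a single call with `MinCount` equal to `MaxCount` is one way to avoid ending up with a partially launched cluster.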
In addition to cluster placement groups, tightly coupled applications benefit from an Elastic Fabric Adapter (EFA), a network device that can attach to your Amazon EC2 instance. EFA provides lower and more consistent latency and higher throughput than the TCP transport traditionally used in cloud-based HPC systems. It enables an OS-bypass access model through the Libfabric API that allows HPC applications to communicate directly with the network interface hardware. EFA enhances the performance of interinstance communication, is optimized to work on the existing AWS network infrastructure, and is critical for scaling tightly coupled applications [13].
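To illustrate how an EFA might be requested at launch, the sketch below builds `RunInstances` parameters that attach one EFA-type network interface and place the nodes in a cluster placement group. The `InterfaceType` value `"efa"` and the other parameter names follow the EC2 API as exposed by boto3. The helper names and all IDs are hypothetical, the instance type is assumed to support EFA, and the API call is left as a comment.

```python
def efa_network_interface(subnet_id, security_group_ids):
    """One EFA-type entry for the NetworkInterfaces list of run_instances."""
    return {
        "DeviceIndex": 0,
        "InterfaceType": "efa",
        # When NetworkInterfaces is used, the subnet and security groups
        # are set on the interface rather than at the top level.
        "SubnetId": subnet_id,
        "Groups": list(security_group_ids),
    }


def efa_launch_request(image_id, instance_type, count, group_name, interface):
    """Parameters for ec2.run_instances with EFA and a placement group."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
        "Placement": {"GroupName": group_name},
        "NetworkInterfaces": [interface],
    }


# Hypothetical values, for illustration only.
iface = efa_network_interface("subnet-EXAMPLE", ["sg-EXAMPLE"])
request = efa_launch_request(
    "ami-EXAMPLE", "c5n.18xlarge", 2, "hpc-cluster-pg", iface
)

# With boto3 installed and credentials configured:
#   boto3.client("ec2").run_instances(**request)
print(request)
```

This sketch assumes the AMI already has the EFA driver and Libfabric installed. On such a node, running `fi_info -p efa` is a common way to confirm that the EFA provider is visible to Libfabric.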
If an application cannot take advantage of EFA’s OS-bypass functionality, or an instance type does not support EFA, optimal network performance can be obtained by selecting an instance type that supports enhanced networking. Enhanced networking provides EC2 instances with higher networking performance and lower CPU utilization through the use of pass-through rather than hardware-emulated devices. This method allows EC2 instances to achieve higher bandwidth, higher packet-per-second processing, and lower interinstance latency compared to traditional device virtualization.
Enhanced networking is available on all current-generation instance types and requires an AMI with supported drivers. Although most current AMIs contain supported drivers, custom AMIs may require updated drivers. For more information on enabling enhanced networking and instance support, refer to the enhanced networking documentation.
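One way to check whether a Linux instance is actually using an enhanced networking driver is to inspect the output of `ethtool -i <interface>`. The sketch below parses that output and checks the driver name against `ena` (Elastic Network Adapter) and `ixgbevf` (Intel 82599 Virtual Function), the two enhanced networking drivers. The function names and the sample strings are illustrative assumptions, not captured output.

```python
ENHANCED_NETWORKING_DRIVERS = {"ena", "ixgbevf"}


def parse_driver(ethtool_output):
    """Return the value of the 'driver:' line from `ethtool -i`, or None."""
    for line in ethtool_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "driver":
            return value.strip()
    return None


def uses_enhanced_networking(ethtool_output):
    """True if the reported driver is an enhanced networking driver."""
    return parse_driver(ethtool_output) in ENHANCED_NETWORKING_DRIVERS


# Illustrative samples in the shape of `ethtool -i eth0` output.
sample_enhanced = "driver: ena\nversion: 2.2.10\nbus-info: 0000:00:05.0\n"
sample_legacy = "driver: vif\nversion: \nbus-info: vif-0\n"

print(uses_enhanced_networking(sample_enhanced))
print(uses_enhanced_networking(sample_legacy))
```

On a running instance, the input string could come from `subprocess.run(["ethtool", "-i", "eth0"], capture_output=True, text=True).stdout`, assuming `ethtool` is installed and the interface is named `eth0`.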
Loosely coupled workloads are generally not sensitive to very low-latency networking. They do not require a cluster placement group, and their instances do not need to be kept in the same Availability Zone or Region.
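The guidance in this section can be summarized as a small decision helper. This is a simplified sketch rather than a definitive rule set: the function name, argument names, and returned labels are hypothetical, and real selection also depends on measured latency, bandwidth, and throughput requirements.

```python
def recommend_network_setup(tightly_coupled,
                            fits_in_one_instance=False,
                            efa_usable=False):
    """Return a list of recommended networking choices for a workload.

    tightly_coupled:      the application needs low-latency node-to-node links
    fits_in_one_instance: the whole job fits on a single large instance
    efa_usable:           the instance type supports EFA and the application
                          can use its OS-bypass functionality
    """
    if not tightly_coupled:
        # Loosely coupled: no placement group or same-AZ requirement.
        return ["no placement constraints"]
    if fits_in_one_instance:
        # Moderately sized job: avoid crossing the network at all.
        return ["single large instance"]
    choices = ["cluster placement group"]
    # Prefer EFA; otherwise fall back to enhanced networking.
    choices.append("EFA" if efa_usable else "enhanced networking")
    return choices


print(recommend_network_setup(tightly_coupled=False))
print(recommend_network_setup(True, fits_in_one_instance=True))
print(recommend_network_setup(True, efa_usable=True))
print(recommend_network_setup(True, efa_usable=False))
```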