
Overview of Performance and Optimization Options

Single-Root Input/Output Virtualization (SR-IOV) is a mechanism that virtualizes a single PCIe Ethernet controller to make it appear as multiple PCIe devices. Telecom providers have been deploying SR-IOV for their virtualized Evolved Packet Core (vEPC) VNFs to obtain the required performance from their applications and to share a physical NIC among multiple VMs. One of the biggest drawbacks of using SR-IOV is the lack of support for live migration.

Figure 3 – Illustration of SR-IOV
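
On a Linux host, you can check how many virtual functions (VFs) a physical NIC currently exposes by reading its sriov_numvfs attribute in sysfs. The following minimal sketch illustrates this; the interface name eth0 is a placeholder for your environment:

    /* Minimal sketch: query how many SR-IOV virtual functions (VFs)
     * a Linux host exposes for a given NIC via sysfs. The interface
     * name "eth0" is a placeholder; adjust for your environment. */
    #include <stdio.h>

    int main(void) {
        const char *path = "/sys/class/net/eth0/device/sriov_numvfs";
        FILE *f = fopen(path, "r");
        if (!f) {
            perror("fopen");    /* NIC may not support SR-IOV */
            return 1;
        }
        int numvfs = 0;
        if (fscanf(f, "%d", &numvfs) == 1)
            printf("eth0 currently exposes %d virtual function(s)\n", numvfs);
        fclose(f);
        return 0;
    }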

AWS enhanced networking uses SR-IOV to provide high-performance networking capabilities on supported instance types. Support for additional technologies, such as DPDK, is described in Amazon EC2 Performance Evolution and Implementation.
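
One practical way to confirm that an interface is backed by an enhanced networking driver is to query the driver name through the standard Linux ethtool ioctl, as in the following sketch. On ENA-backed instances this reports ena, and on Intel-based enhanced networking instances it reports ixgbevf; the default interface name eth0 is an assumption:

    /* Sketch: identify which NIC driver backs an interface, to
     * confirm enhanced networking (ena or ixgbevf) is in use. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    int main(int argc, char **argv) {
        const char *ifname = (argc > 1) ? argv[1] : "eth0";
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct ethtool_drvinfo drv = { .cmd = ETHTOOL_GDRVINFO };
        struct ifreq ifr;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&drv;

        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
            perror("ioctl");
            close(fd);
            return 1;
        }
        /* "ena" => Elastic Network Adapter, "ixgbevf" => Intel 82599 VF */
        printf("%s driver: %s (version %s)\n", ifname, drv.driver, drv.version);
        close(fd);
        return 0;
    }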

The Data Plane Development Kit (DPDK) consists of a set of libraries and user-space drivers that accelerate packet processing on a wide variety of CPU architectures. Designed to run in user space, DPDK enables applications to perform packet processing directly to and from the NIC. This fast packet processing makes it possible for telecom providers to move performance-sensitive applications, such as the virtualized mobile packet core and voice, to the cloud. DPDK was also identified by ETSI as a key enabling technology for network functions virtualization (NFV). Its main benefits are lower latency due to kernel and TCP stack bypass, more control over packet processing, and lower CPU overhead. The DPDK libraries provide only minimal packet operations within the application, but enable packets to be received and sent with a minimum number of CPU cycles. DPDK does not provide a networking stack; instead, it bypasses the kernel network stack to deliver high performance.
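
The following condensed sketch shows the canonical DPDK receive/transmit pattern: initialize the Environment Abstraction Layer (EAL), set up one receive and one transmit queue on the first DPDK-bound port, and then busy-poll the NIC in bursts from user space. The port number, queue depths, and pool sizes are illustrative, and error handling is trimmed for brevity:

    /* Condensed DPDK user-space forwarding loop: EAL init, one RX/TX
     * queue on port 0, then busy-poll packets in bursts. */
    #include <stdlib.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    int main(int argc, char **argv) {
        if (rte_eal_init(argc, argv) < 0)
            rte_exit(EXIT_FAILURE, "EAL init failed\n");

        struct rte_mempool *pool = rte_pktmbuf_pool_create(
            "mbufs", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

        struct rte_eth_conf conf = {0};
        uint16_t port = 0;                    /* first DPDK-bound NIC */
        rte_eth_dev_configure(port, 1, 1, &conf);
        rte_eth_rx_queue_setup(port, 0, 1024, rte_socket_id(), NULL, pool);
        rte_eth_tx_queue_setup(port, 0, 1024, rte_socket_id(), NULL);
        rte_eth_dev_start(port);

        for (;;) {
            struct rte_mbuf *bufs[BURST];
            /* Poll the NIC directly from user space: no kernel path */
            uint16_t rx = rte_eth_rx_burst(port, 0, bufs, BURST);
            uint16_t tx = rte_eth_tx_burst(port, 0, bufs, rx);
            while (tx < rx)
                rte_pktmbuf_free(bufs[tx++]); /* drop unsent packets */
        }
        return 0;
    }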

When it comes to EC2 instance support, DPDK is supported on enhanced networking instances, with both the Intel-based ixgbevf interface and the AWS Elastic Network Adapter (ENA). All Nitro-based instances, such as C5, M5, I3, and T3, as well as Intel-based instances, such as C4, M4, and T2, provide DPDK support. The Amazon drivers, including the DPDK driver for ENA, are available on GitHub. DPDK support for ENA has been available since DPDK 16.04. The ENA Poll Mode Driver (PMD) is a DPDK poll-mode driver for the ENA family of adapters. The ENA driver exposes a lightweight management interface with a minimal set of memory-mapped registers and an extendable command set through an admin queue.
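
Once the EAL is initialized, an application can confirm which poll-mode driver backs each available port; on an ENA-equipped instance this is expected to report the ENA PMD (the driver name string, such as net_ena, depends on the DPDK release). A short sketch:

    /* Sketch: after rte_eal_init(), print the poll-mode driver
     * backing each DPDK port. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>

    int main(int argc, char **argv) {
        if (rte_eal_init(argc, argv) < 0)
            rte_exit(EXIT_FAILURE, "EAL init failed\n");

        uint16_t port;
        RTE_ETH_FOREACH_DEV(port) {
            struct rte_eth_dev_info info;
            rte_eth_dev_info_get(port, &info);
            printf("port %u: driver %s\n", port, info.driver_name);
        }
        return 0;
    }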

DPDK and SR-IOV are not mutually exclusive and can be used together. An SR-IOV NIC writes packets directly into the memory of the VM that hosts the virtual function, where they are then consumed by a DPDK-based application. The following figure illustrates the difference in packet flow between a non-DPDK and a DPDK-optimized application:

Figure 4 – Non-DPDK vs DPDK packet path

Non-Uniform Memory Access (NUMA) is a shared memory architecture in which a cluster of microprocessors in a multiprocessing system is configured so that memory can be shared locally, improving both performance and the system's ability to be expanded. Memory access time varies with the location of the data being accessed: if the data resides in local memory, access is fast; if it resides in remote memory, access is slower. The advantage of the NUMA architecture, as a hierarchical shared memory scheme, is its potential to improve average-case access time through the introduction of fast, local memory. For more information, see Optimizing Applications for NUMA.
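
The following sketch uses libnuma (link with -lnuma) to illustrate the local-versus-remote distinction: it allocates one buffer on the calling thread's local node and one on the highest-numbered node, which may be remote on a multi-socket instance. The buffer size is arbitrary:

    /* Sketch using libnuma: allocate memory on the local NUMA node
     * of the calling thread vs. another node; placing hot data
     * locally is what NUMA-aware applications exploit. */
    #include <stdio.h>
    #include <numa.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA not supported on this system\n");
            return 1;
        }
        int nodes = numa_max_node() + 1;
        printf("%d NUMA node(s)\n", nodes);

        size_t len = 1 << 20;
        void *local  = numa_alloc_local(len);             /* fast: same node */
        void *remote = numa_alloc_onnode(len, nodes - 1); /* possibly remote */

        /* ... place latency-sensitive data structures in 'local' ... */

        numa_free(local, len);
        numa_free(remote, len);
        return 0;
    }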

In Amazon EC2, all instances that span more than one underlying CPU socket also support NUMA. These include i3.8xlarge, r5.8xlarge, c5.9xlarge, and larger sizes.

Huge pages can improve performance for memory-intensive workloads. This Linux kernel feature enables processes to allocate memory pages of 2 MB or 1 GB (instead of the default 4 KB). In addition, memory allocated using huge pages is pinned in physical memory and cannot be swapped out. Huge page support is configurable on supported instance types. Note that while huge pages make memory access faster, the pinned memory cannot be overcommitted.
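
On Linux, an application can request huge-page-backed memory explicitly with mmap and the MAP_HUGETLB flag, as the following sketch shows. It assumes 2 MB huge pages have already been reserved (for example, through /proc/sys/vm/nr_hugepages); the mapping fails otherwise:

    /* Sketch: back a buffer with a 2 MB huge page via
     * mmap(MAP_HUGETLB). Assumes huge pages are already reserved. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define LEN (2 * 1024 * 1024)   /* one 2 MB huge page */

    int main(void) {
        void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");          /* no huge pages reserved? */
            return 1;
        }
        /* Memory is pinned: never swapped, and fewer TLB misses */
        munmap(buf, LEN);
        return 0;
    }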

CPU Pinning (CPU Affinity)

CPU pinning is a technique that enables the binding and unbinding of a process or thread to a CPU, or a range of CPUs, so that the process or thread executes only on the designated CPU or CPUs rather than on any CPU. This is useful when you want to dedicate vCPUs to a VNF and avoid CPU sharing and dynamic rescheduling.
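
On Linux, this binding is commonly done with sched_setaffinity, the primitive underlying tools such as taskset, as in the following sketch; the CPU number is an arbitrary example:

    /* Sketch: pin the calling thread to CPU 2 with
     * sched_setaffinity(). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);            /* allow execution on CPU 2 only */

        /* pid 0 == the calling thread */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* From here on, the scheduler will not migrate this thread */
        return 0;
    }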

AWS provides this functionality through placement groups. Placement groups determine how instances are placed on the underlying hardware, and they come in two flavors:

  • Cluster – instances can be clustered into a low latency group in a single Availability Zone. This strategy enables workloads to achieve the low-latency network performance necessary for tightly coupled node-to-node communication that is typical of high performance computing applications and latency sensitive VNFs.

  • Spread – instances can be spread across the underlying hardware to reduce correlated failures.

For more information, see Amazon EC2 Placement Groups.

Finally, to make AWS performance and networking capabilities easier to understand, the following diagram provides a high-level translation of key concepts between OpenStack terms and their equivalents in the AWS environment:

Figure 5 – OpenStack and AWS terminology comparison