Overview of performance and optimization options - Amazon EC2 Overview and Networking Introduction for Telecom Companies


Single-Root I/O Virtualization (SR-IOV)

Single-Root I/O Virtualization (SR-IOV) is a mechanism that virtualizes a single PCIe Ethernet controller to make it appear as multiple PCIe devices. Telecom providers have been deploying SR-IOV for their virtualized 5G Packet Core VNFs and CNFs to obtain the required performance from their applications and to share a physical NIC among multiple VMs.


Figure 3 – Illustration of SR-IOV

AWS enhanced networking uses SR-IOV to provide high-performance networking capabilities on all current-generation EC2 instance types, except for T2 instances.

The following methods are available for enabling enhanced networking.

Elastic Network Adapter (ENA):

Supports network speeds of up to 100 Gbps. All current generation instance types use ENA for enhanced networking, except C4, D2, and M4 instances smaller than m4.16xlarge.

Intel 82599 Virtual Function (VF) interface:

Supports network speeds of up to 10 Gbps for instance types C3, C4, D2, I2, M4 (excluding m4.16xlarge), and R3.
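As a quick check, the network driver in use reveals whether enhanced networking is active on an instance. The commands below are a sketch; the interface name and instance ID are placeholders to substitute with your own:

```shell
# Check which driver backs the interface; "ena" or "ixgbevf"
# indicates enhanced networking is in use (eth0 is a placeholder)
ethtool -i eth0

# Verify that the ENA module is available in the running kernel
modinfo ena

# From the AWS CLI, confirm the enaSupport attribute on the instance
# (the instance ID is a placeholder)
aws ec2 describe-instance-attribute \
    --instance-id i-0123456789abcdef0 \
    --attribute enaSupport
```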

Data Plane Development Kit (DPDK)

DPDK consists of a set of libraries and user-space drivers that accelerate packet processing on any CPU. Designed to run in user space, DPDK enables applications to perform their own packet processing directly to and from the NIC. By enabling fast packet processing, DPDK makes it possible for telecom providers to move performance-sensitive applications, such as virtualized mobile packet core and voice, to the cloud. DPDK was also identified by ETSI as a key enabling technology for network functions virtualization (NFV). The main benefits of DPDK are lower latency due to kernel and TCP stack bypass, more control over packet processing, and lower CPU overhead. The DPDK libraries provide only minimal packet operations within the application, but enable receiving and sending packets with a minimum number of CPU cycles. DPDK does not provide a networking stack; instead, it bypasses the kernel network stack to deliver high performance.

When it comes to EC2 instance support, DPDK is supported on enhanced networking instances, both the Intel-based ixgbevf interface and the AWS Elastic Network Adapter (ENA). All Nitro-based instances, such as C5, M5, I3, and T3, as well as Intel-based instances, such as C4, M4, and T2, provide DPDK support. The Amazon drivers, including the DPDK driver for ENA, are available on GitHub. DPDK support for ENA has been available since DPDK version 16.04. The ENA Poll Mode Driver (PMD) is a DPDK poll-mode driver for the ENA family. The ENA driver exposes a lightweight management interface with a minimal set of memory-mapped registers and an extendable command set through an admin queue.
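As an illustration, handing an ENA device over to a DPDK application typically follows the pattern below. The PCI address is an example, and dpdk-devbind.py is the helper script that ships with DPDK:

```shell
# Load the vfio-pci kernel module that DPDK uses for user-space
# device access
sudo modprobe vfio-pci

# List network devices and the drivers currently bound to them
dpdk-devbind.py --status

# Bind the ENA device at the given PCI address to vfio-pci so a DPDK
# application can poll it directly (the address is an example; use
# the one reported by --status)
sudo dpdk-devbind.py --bind=vfio-pci 0000:00:06.0
```

Once bound, the device is no longer visible to the kernel network stack; the DPDK application's ENA PMD takes over packet reception and transmission.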

AWS Graviton processors are Amazon-built, ARM-based custom CPUs designed to deliver the best price performance. While Graviton2 delivered a major leap in performance and capabilities over Graviton1, Graviton3-based C7g EC2 instances deliver up to 25% better compute performance than their Graviton2-based predecessors. C7g instances are a great fit for telecom workloads with extensive compute requirements, such as data plane functions, billing, Network Data Analytics Functions (NWDAF), and machine learning workloads. For ML workloads, Graviton3 processors deliver up to three times better performance compared to Graviton2. Graviton3 also supports DDR5 memory, which increases memory bandwidth by 50% compared to DDR4.

DPDK and SR-IOV are not mutually exclusive and can be used together. An SR-IOV NIC writes data directly to the memory of the VM that hosts the corresponding virtual function, where it is then consumed by a DPDK-based application. The following figure illustrates the difference in packet flow between a non-DPDK and a DPDK-optimized application:


Figure 4 – Non-DPDK versus DPDK packet path

Non-Uniform Memory Access (NUMA)

Multiple factors can affect the performance of VNFs and CNFs hosted on an EC2 instance, including CPU over-utilization, memory use, EBS volume performance, network statistics, and whether the application is non-uniform memory access (NUMA) aware. In a NUMA architecture, each CPU has access to its own assigned memory, known as local memory. Each CPU can also access memory assigned to other CPUs, known as foreign memory. If the applications hosted on your instances are not NUMA aware, accessing foreign memory incurs additional cost and might affect performance.


Figure 5 – NUMA architecture

The memory access time varies with the location of the data to be accessed. If the data resides in local memory, access is fast. If the data resides in remote memory, access is slower. The advantage of the NUMA architecture as a hierarchical shared memory scheme is its potential to improve average case access time through the introduction of fast, local memory. For more information, see Optimizing Applications for NUMA.

All EC2 instances that support more than one CPU also support NUMA. These include i4i.8xlarge, r6g.8xlarge, c6g.8xlarge, m6.8xlarge, m6i.8xlarge, and above.

Running the following command on a NUMA supported instance will provide detailed information that can be used by the VNF, CNF ISV, or both.

sudo numactl -H
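Beyond inspecting the topology, numactl can also constrain where a workload runs and allocates its memory, keeping a NUMA-unaware application on a single node. A minimal sketch, where ./my-vnf is a placeholder binary:

```shell
# Run an application with both its CPUs and its memory bound to NUMA
# node 0, so all memory accesses stay local to that node
numactl --cpunodebind=0 --membind=0 ./my-vnf

# Show the NUMA policy and allowed nodes of the current shell
numactl --show
```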

Huge pages

Huge pages can improve performance for workloads that perform large amounts of memory access. This Linux kernel feature enables processes to allocate memory pages of 2 MB or 1 GB (instead of the default 4 KB). Additionally, memory allocated using huge pages is pinned in physical memory and cannot be swapped out. Huge page support is configurable on supported instance types. Note that huge pages make memory access faster; however, memory backed by huge pages cannot be overcommitted.

Running the following command on an EC2 instance will provide detailed information that can be used by the VNF, CNF ISV, or both.

sudo grep Huge /proc/meminfo
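To reserve and expose huge pages, a typical sequence looks like the following. The pool size of 1024 2 MB pages is an example value, not a recommendation; size the pool for your workload:

```shell
# Reserve 1024 huge pages (of the default 2 MB size) at runtime
sudo sysctl -w vm.nr_hugepages=1024

# Make the reservation persistent across reboots
echo "vm.nr_hugepages=1024" | sudo tee -a /etc/sysctl.conf

# Mount a hugetlbfs filesystem so applications such as DPDK can map
# huge pages (many distributions mount this automatically)
sudo mkdir -p /dev/hugepages
sudo mount -t hugetlbfs hugetlbfs /dev/hugepages

# Confirm the pool size and how many pages are free
grep -E 'HugePages_(Total|Free)' /proc/meminfo
```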

CPU pinning (CPU affinity)

CPU pinning is a technique that binds a process or a thread to a CPU, or a range of CPUs, so that the process or thread executes only on the designated CPU or CPUs rather than on any CPU. This is useful when you want to dedicate vCPUs to a CNF and avoid sharing and dynamic rescheduling of CPUs.
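On Linux, CPU pinning can be applied with the standard taskset utility. A minimal sketch, where the PID 1234 and the ./my-cnf binary are placeholders:

```shell
# Launch a process restricted to CPUs 4-7 from the start
taskset -c 4-7 ./my-cnf

# Pin an already-running process (PID is a placeholder) to CPUs 2-3
sudo taskset -cp 2-3 1234

# Verify the resulting CPU affinity of the process
taskset -cp 1234
```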

Amazon Elastic Kubernetes Service (Amazon EKS) supports Kubernetes pod workloads where CPU cache affinity and scheduling latency significantly affect performance. It does so by allowing alternative CPU management policies that determine placement preferences on Amazon EKS self-managed worker nodes.
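On self-managed worker nodes, the alternative CPU management policy is enabled through kubelet flags. The invocation below is a sketch under the assumption of the standard EKS-optimized AMI and its bootstrap script; the cluster name is a placeholder:

```shell
# Enable the static CPU manager policy on a self-managed EKS worker
# node by passing extra kubelet arguments through the bootstrap script
# (my-cluster is a placeholder cluster name)
/etc/eks/bootstrap.sh my-cluster \
    --kubelet-extra-args '--cpu-manager-policy=static'
```

With the static policy, only pods in the Guaranteed QoS class that request whole, integer CPUs receive exclusive cores; all other pods share the remaining CPU pool.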