Optimizing for performance - AWS Graviton2 for Independent Software Vendors

Optimizing for performance

This section offers guidance for optimizing your source code to run on AWS Graviton2 instances. Due to the difference in CPU architecture, running code that is not optimized for the AWS Graviton2 processor may result in suboptimal performance. At the end of this section, you should be able to identify these differences and understand how to remediate them so you can make the most of AWS Graviton2.

Understand key advantages

One of the major differences between AWS Graviton2 instance types and other instance types is their vCPU to physical processor core mapping. Every vCPU on a Graviton2 processor is a physical core. This means there is no Simultaneous Multi-Threading (SMT) and more isolation between vCPUs. By contrast, every vCPU on a 5th generation instance type with an Intel processor (such as M5, C5, and R5) is a hyper-thread. This means two vCPUs share the resources of one physical core, so there is less isolation than in the case of Graviton2.

Key advantages of the AWS Graviton2 processor:

  • Feature sets optimized for cloud workloads, reducing the overheads of interrupts and context switching.

  • Large L1 and L2 caches for every vCPU, which means a large portion of your workload will fit in cache without having to go to memory.

  • Every vCPU is a physical core, meaning more isolation between vCPUs and no resource sharing except the last level cache and memory system.

  • Cores connected together in a mesh with ~2TB/s of bisection bandwidth, allowing applications to move very quickly from core to core when sharing data.

  • No NUMA concerns, meaning every core sees the same latency to every other core and to DRAM.
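One way to confirm the vCPU to physical core mapping described above is to look at the "Thread(s) per core" line that lscpu prints on Linux. The following is a minimal sketch, not an official tool: the function names and the two sample texts are hypothetical and hand-written for illustration, not captured from real instances.

```python
import re

def threads_per_core(lscpu_text: str) -> int:
    """Return the 'Thread(s) per core' value found in lscpu output text."""
    match = re.search(r"^Thread\(s\) per core:\s*(\d+)\s*$", lscpu_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Thread(s) per core' line found")
    return int(match.group(1))

def physical_cores(vcpus: int, lscpu_text: str) -> int:
    """Estimate how many physical cores back the given number of vCPUs."""
    return vcpus // threads_per_core(lscpu_text)

# Hand-written, abbreviated samples (assumptions for illustration only).
SAMPLE_NO_SMT = "Architecture: aarch64\nCPU(s): 4\nThread(s) per core: 1\n"
SAMPLE_SMT = "Architecture: x86_64\nCPU(s): 4\nThread(s) per core: 2\n"

if __name__ == "__main__":
    print(physical_cores(4, SAMPLE_NO_SMT))  # one thread per core: 4 vCPUs, 4 cores
    print(physical_cores(4, SAMPLE_SMT))     # two threads per core: 4 vCPUs, 2 cores
```

On a live instance, the text could instead come from running lscpu, for example through subprocess.run(["lscpu"], capture_output=True, text=True).stdout. A value of 1 is what you would expect to see on an instance without SMT.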

Use optimized compiler flags

When targeting modern processors, the right compiler flags can lead to considerably better performance. Newer compiler versions will eventually emit instructions specific to the CPU architecture by default. Until they do, specific compiler flags let you enable new CPU features to obtain higher performance for tasks handled by the CPU, such as managing a mutex. For example, AWS Graviton2 processors implement new atomic instructions (referred to as Large System Extensions, or LSE) that support scalable performance on larger instance sizes for applications that implement synchronization locks (for example, databases). In this specific case, compiling with the -moutline-atomics flag yields almost the same benefit as targeting LSE directly, but results in code that is backward compatible (that is, it will also run on older ARM CPUs such as the Graviton1-based Amazon EC2 A1 instances). For more information on using compiler flags, see the AWS Graviton Getting Started Guide.
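As a rough illustration of the trade-off above, the sketch below checks whether a CPU reports LSE support and picks a matching flag. It assumes a Linux system, where the "Features" line of /proc/cpuinfo lists "atomics" when LSE is available, and it assumes GCC or Clang, where the outline-atomics option is spelled -moutline-atomics and -march=armv8.2-a targets LSE directly. The function names and the abbreviated sample texts are hypothetical.

```python
from typing import List

def has_lse(cpuinfo_text: str) -> bool:
    """Return True if a 'Features' line in /proc/cpuinfo text lists 'atomics'."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features" and "atomics" in value.split():
            return True
    return False

def atomics_cflags(need_older_arm_support: bool) -> List[str]:
    """Pick a compiler flag for atomics (assumes GCC or Clang on 64-bit ARM)."""
    if need_older_arm_support:
        # Chooses LSE or the older atomics at run time, so one binary
        # can serve both Graviton1 and Graviton2.
        return ["-moutline-atomics"]
    # Lets the compiler inline LSE instructions; the resulting binary is
    # not expected to run on CPUs without LSE, such as Graviton1.
    return ["-march=armv8.2-a"]

# Hand-written, abbreviated samples (assumptions for illustration only).
SAMPLE_WITH_LSE = "processor\t: 0\nFeatures\t: fp asimd aes crc32 atomics cpuid\n"
SAMPLE_WITHOUT_LSE = "processor\t: 0\nFeatures\t: fp asimd aes crc32 cpuid\n"

if __name__ == "__main__":
    print(has_lse(SAMPLE_WITH_LSE), has_lse(SAMPLE_WITHOUT_LSE))
    print(atomics_cflags(need_older_arm_support=True))
```

On a real instance you could pass the contents of /proc/cpuinfo to has_lse. Treat the flag choice as a starting point and check the AWS Graviton Getting Started Guide for the flags recommended for your compiler version.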

Upgrade operating systems

Use the most recent 64-bit ARM release of your operating system (such as Amazon Linux 2 or Ubuntu 20.04) whenever possible. The latest release of your operating system offers libraries that were built with optimized compiler flags, which, as previously mentioned, can lead to considerable differences in performance. For example, the latest 64-bit ARM release of Amazon Linux 2 ships with libraries such as libc that are optimized for AWS Graviton2. This leads to better performance in more scenarios.

Tune low-level code

Identify and tune any low-level code that includes architecture-specific CPU instructions. While uncommon in application programming, some source code or libraries use highly optimized inline assembly code to achieve maximum performance from a particular CPU architecture. Because different CPU architectures implement different instruction sets, source code that offers only one optimized implementation (such as one for x86) cannot use that implementation on Graviton2, which is ARM-based. The application may instead fall back to a generic, slower implementation, which means you will not see the full performance potential of AWS Graviton2 instances. Identifying such source code highlights performance-critical routines that should also be implemented for AWS Graviton2.
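A simple first pass at identifying such code is to search C or C++ sources for x86-specific markers that have no ARM counterpart in the same file. The sketch below is a heuristic under stated assumptions, not a complete audit: the marker lists cover only a few well-known names (the x86 intrinsics headers and _mm intrinsics, the arm_neon.h header, and the compiler-defined architecture macros), and the function name and sample files are hypothetical.

```python
import re
from typing import Dict, List

# A few common signs of x86-specific code (deliberately incomplete).
X86_MARKERS = re.compile(
    r"immintrin\.h|emmintrin\.h|xmmintrin\.h|\b_mm(?:256|512)?_\w+|__x86_64__|__i386__"
)
# Signs that a 64-bit ARM code path exists as well.
ARM_MARKERS = re.compile(r"arm_neon\.h|__aarch64__|__ARM_NEON")

def flag_x86_only(sources: Dict[str, str]) -> List[str]:
    """Return names of files that contain x86 markers but no ARM markers."""
    flagged = []
    for name, text in sources.items():
        if X86_MARKERS.search(text) and not ARM_MARKERS.search(text):
            flagged.append(name)
    return sorted(flagged)

# Hand-written sample files (assumptions for illustration only).
SAMPLE_SOURCES = {
    "dot_x86.c": "#include <immintrin.h>\n__m128 s = _mm_add_ps(a, b);\n",
    "dot_portable.c": (
        "#if defined(__x86_64__)\n#include <immintrin.h>\n"
        "#elif defined(__aarch64__)\n#include <arm_neon.h>\n#endif\n"
    ),
    "util.c": "int add(int a, int b) { return a + b; }\n",
}

if __name__ == "__main__":
    print(flag_x86_only(SAMPLE_SOURCES))
```

Files flagged this way are candidates for review rather than confirmed problems. A flagged file may sit on a cold path, and an unflagged file may still contain inline assembly that this pattern list does not recognize.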

Test performance on multiple instance sizes

When benchmarking performance, evaluate both ends of the instance size spectrum to detect bottlenecks that occur only on the smallest or the largest sizes of an instance family. For example, your software may perform well on the smaller sizes of a family while bottlenecks appear only on the larger sizes. Benchmark multiple instance sizes in a systematic fashion to detect such bottlenecks, and provide sizing guidance to your cloud operations team or end customers to help guide their instance size selection.
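One way to make such a comparison systematic is to normalize throughput per vCPU against the smallest size tested and flag sizes that fall below a chosen threshold. The sketch below assumes you already have one throughput figure per instance size. The function names, the 0.8 threshold, and the sample numbers are all hypothetical and are not measurements.

```python
from typing import Dict, List

def scaling_efficiency(results: Dict[int, float]) -> Dict[int, float]:
    """Map vCPU count to per-vCPU throughput relative to the smallest size."""
    base_vcpus = min(results)
    base_per_vcpu = results[base_vcpus] / base_vcpus
    return {
        vcpus: (throughput / vcpus) / base_per_vcpu
        for vcpus, throughput in sorted(results.items())
    }

def bottlenecked_sizes(results: Dict[int, float], threshold: float = 0.8) -> List[int]:
    """Return the vCPU counts whose scaling efficiency is below the threshold."""
    return [v for v, eff in scaling_efficiency(results).items() if eff < threshold]

# Made-up requests-per-second figures keyed by vCPU count (not real data).
SAMPLE_RESULTS = {2: 1000.0, 8: 3900.0, 32: 14000.0, 64: 18000.0}

if __name__ == "__main__":
    for vcpus, eff in scaling_efficiency(SAMPLE_RESULTS).items():
        print(f"{vcpus:>3} vCPUs: efficiency {eff:.2f}")
    print("Sizes to investigate:", bottlenecked_sizes(SAMPLE_RESULTS))
```

With the made-up figures above, only the 64 vCPU result falls below the threshold, which is the kind of large-size bottleneck that testing a single small instance would miss.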