
Monitoring and tuning

Before we wrap up, let's spend some time talking about monitoring and performance tuning.

Monitoring cache efficiency

To begin, see the Monitoring Use with CloudWatch topic for Redis and Memcached, as well as the Which Metrics Should I Monitor? topic for Redis and Memcached in the Amazon ElastiCache User Guide. Both topics are excellent resources for understanding how to measure the health of your ElastiCache cluster using the metrics that ElastiCache publishes to Amazon CloudWatch. Most importantly, watch CPU usage. A consistently high CPU usage indicates that a node is overtaxed, either by too many concurrent requests, or by performing dataset operations in the case of Redis.

For Redis, ElastiCache provides two different metrics for monitoring CPU usage: CPUUtilization and EngineCPUUtilization. CPUUtilization is a host-level metric averaged across all cores; because Redis is single-threaded, multiply the reported percentage by the number of cores to estimate how busy the Redis process actually is. For smaller node types with one or two vCPUs, use the CPUUtilization metric to monitor your workload. For larger node types with four or more vCPUs, we recommend monitoring the EngineCPUUtilization metric, which reports the percentage of usage on the Redis engine core.
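
To pull this metric from a script, here is a minimal sketch using boto3 to read the last hour of EngineCPUUtilization from CloudWatch; the cluster and node IDs are hypothetical placeholders for your own.

    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Hypothetical identifiers; replace with your own cluster and node IDs.
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElastiCache",
        MetricName="EngineCPUUtilization",
        Dimensions=[
            {"Name": "CacheClusterId", "Value": "my-redis-001"},
            {"Name": "CacheNodeId", "Value": "0001"},
        ],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Average"],
    )

    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], round(point["Average"], 1))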

After Redis maxes out a single CPU core, that node is fully utilized, and further scaling is needed. If your main workload is from read requests, add more replicas to distribute the read workloads across the replicas and reader endpoints. If your main workload is from write requests, add more shards to distribute the write workload across more primary nodes.
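
For the read-scaling case, the reader endpoint spreads connections across the replicas while writes continue to go to the primary endpoint. A minimal redis-py sketch, with made-up hostnames:

    import redis

    # Hypothetical ElastiCache endpoints; substitute your cluster's primary and reader endpoints.
    primary = redis.Redis(host="my-redis.xxxxxx.ng.0001.use1.cache.amazonaws.com", port=6379)
    reader = redis.Redis(host="my-redis-ro.xxxxxx.ng.0001.use1.cache.amazonaws.com", port=6379)

    primary.set("user:42:profile", "cached-profile-json", ex=300)  # writes go to the primary
    profile = reader.get("user:42:profile")                        # reads are spread across replicas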

In addition to CPU, here is further guidance for monitoring cache memory utilization. Each of these metrics is available in CloudWatch for your ElastiCache cluster:

  • Evictions—both Memcached and Redis manage cache memory internally, and when memory starts to fill up they evict (delete) unused cache keys to free space. A small number of evictions shouldn't alarm you, but a large number means that your cache is running out of space.

  • CacheMisses—the number of times a key was requested but not found in the cache. This number can be fairly large if you're using lazy population as your main strategy. If this number remains steady, it's likely nothing to worry about. However, a large number of cache misses combined with a large eviction number can indicate that your cache is thrashing due to lack of memory.

  • BytesUsedForCacheItems—this value is the actual amount of cache memory that Memcached or Redis is using. Both Memcached and Redis attempt to allocate as much system memory as possible, even if it's not used by actual cache keys. Thus, monitoring the system memory usage on a cache node doesn't tell you how full your cache actually is.

  • SwapUsage—in normal usage, neither Memcached nor Redis should be performing swaps.

  • CurrConnections—this is a cache engine metric representing the number of clients connected to the engine. We recommend that you determine your own alarm threshold for this metric based on your application needs. An increasing number of CurrConnections might indicate a problem with your application—you'll need to investigate the application's behavior to address this issue. (A sketch of setting an alarm on these metrics follows this list.)
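
As one way to act on these metrics, the following sketch creates a CloudWatch alarm on Evictions using boto3. The cluster ID, threshold, and SNS topic are illustrative values, not recommendations.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="my-memcached-evictions",          # hypothetical alarm name
        AlarmDescription="Cache is evicting keys; consider adding memory",
        Namespace="AWS/ElastiCache",
        MetricName="Evictions",
        Dimensions=[{"Name": "CacheClusterId", "Value": "my-memcached-001"}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=3,
        Threshold=1000,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:cache-alerts"],  # hypothetical topic
    )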

On a well-tuned cache node, the number of cache bytes used will be almost equal to the maxmemory parameter in Redis, or the max_cache_memory parameter in Memcached. In steady state, most cache counters will increase, with cache hits increasing faster than misses. You also will probably see a low number of evictions. However, a rising number of evictions indicates that cache keys are getting pushed out of memory, which means you can benefit from larger cache nodes with more memory.
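
If you want to spot-check these numbers from the engine side rather than CloudWatch, the Redis INFO command reports them directly. A small redis-py sketch, with a made-up endpoint:

    import redis

    r = redis.Redis(host="my-redis.xxxxxx.ng.0001.use1.cache.amazonaws.com", port=6379)  # hypothetical

    memory = r.info("memory")
    stats = r.info("stats")

    used, limit = memory["used_memory"], memory["maxmemory"]
    if limit:
        print(f"cache fill: {used / limit:.1%}")
    print("hits:", stats["keyspace_hits"],
          "misses:", stats["keyspace_misses"],
          "evictions:", stats["evicted_keys"])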

The one exception to the evictions rule is if you follow a strict definition of Russian doll caching, which says that you should never cause cache items to expire, but instead let Memcached and Redis evict unused keys as needed. If you follow this approach, keep a close watch on cache misses and bytes used to detect potential problems.

Watching for hot spots

In general, if you are using consistent hashing to distribute cache keys across your cache nodes, your access patterns should be fairly even across nodes. However, you still need to watch out for hot spots, which are nodes in your cache that receive higher load than other nodes. This pattern is caused by hot keys, which are cache keys that are accessed more frequently than others. Think of a social website where some users might be 10,000 times more popular than the average user. Those users' cache keys will be accessed much more often, which can put an uneven load onto the cache nodes that house those keys.

If you see uneven CPU usage among your cache nodes, you might have a hot spot. This pattern often appears as one cache node having a significantly higher operation count than other nodes. One way to confirm this is to keep a counter in your application of cache key gets and puts. You can push these counts as custom metrics into CloudWatch or another monitoring service. Don't do this unless you suspect a hot spot, however, because logging every key access will decrease the overall performance of your application.
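
Here is a minimal sketch of that counting approach, assuming a hypothetical tracked_get wrapper around your cache reads and a periodic flush of the hottest keys to CloudWatch custom metrics:

    from collections import Counter

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    key_hits = Counter()

    def tracked_get(cache, key):
        """Wrap cache reads so unusually hot keys stand out."""
        key_hits[key] += 1
        return cache.get(key)

    def flush_hot_keys(namespace="MyApp/Cache", top_n=5):
        """Publish the most-accessed keys as custom metrics, then reset the counters."""
        metric_data = [
            {
                "MetricName": "KeyAccessCount",
                "Dimensions": [{"Name": "CacheKey", "Value": str(key)}],
                "Value": float(count),
            }
            for key, count in key_hits.most_common(top_n)
        ]
        if metric_data:
            cloudwatch.put_metric_data(Namespace=namespace, MetricData=metric_data)
        key_hits.clear()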

In the most common case, a few hot keys will not necessarily create any significant hot spot issues. If you have a few hot keys on each of your cache nodes, then those hot keys are themselves evenly distributed, and are producing an even load on your cache nodes. If you have three cache nodes and each of them has a few hot keys, then you can continue sizing your cache cluster as if those hot keys did not exist. In practice, even a well-designed application will have some degree of unevenness in cache key access.

In extreme cases, a single hot cache key can create a hot spot that overwhelms a single cache node. In this case, having good metrics about your cache, especially your most popular cache keys, is crucial to designing a solution. One solution is to create a mapping table that remaps very hot keys to a separate set of cache nodes. Although this approach provides a quick fix, you will still face the challenge of scaling those new cache nodes. Another solution is to add a secondary layer of smaller caches in front of your main nodes, to act as a buffer. This approach gives you more flexibility, but introduces additional latency into your caching tier.
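
The mapping-table approach can be as simple as a lookup that your application consults before choosing a cache connection. A sketch with hypothetical key names and client pools:

    # Keys identified as very hot (for example, via the custom metrics above);
    # these names are purely illustrative.
    HOT_KEYS = {"user:1:timeline", "user:1:profile"}

    def client_for(key, default_pool, hot_pool):
        """Route known hot keys to a dedicated set of cache nodes."""
        return hot_pool if key in HOT_KEYS else default_pool

    def cached_get(key, default_pool, hot_pool):
        return client_for(key, default_pool, hot_pool).get(key)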

The good news is that these concerns only hit applications of a significant scale. We recommend being aware of this potential issue and monitoring for it, but not spending time trying to engineer around it up front. Hot spots are a fast-moving area of computer science research, and there is no one-size-fits-all solution. As always, our team of Solutions Architects is available to work with you to address these issues if you encounter them. For more research on this topic, refer to papers such as Relieving Hot Spots on the World Wide Web and Characterizing Load Imbalance in Real-World Networked Caches.

Memcached memory optimization

Memcached uses a slab allocator, which means that it allocates memory in fixed chunks, and then manages those chunks internally. Using this approach, Memcached can be more efficient and predictable in its memory access patterns than if it used the system malloc(). The downside of the Memcached slab allocator is that memory chunks are rigidly allocated once and cannot be changed later. This approach means that if the configured chunk sizes don't match your item sizes, you might run out of appropriately sized chunks while still having plenty of system memory available.

When you launch an ElastiCache cluster, the max_cache_memory parameter is set for you automatically, along with several other parameters. For a list of default values, see Memcached specific parameters in the Amazon ElastiCache for Memcached User Guide. The key parameters to keep in mind are chunk_size and chunk_size_growth_factor, which work together to control how memory chunks are allocated.
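
If you suspect chunk sizing is the problem, these parameters live in your cluster's parameter group. A hedged boto3 sketch that inspects them and sets an illustrative growth factor on a hypothetical custom parameter group:

    import boto3

    elasticache = boto3.client("elasticache")

    # Inspect the current chunk sizing parameters (hypothetical parameter group name).
    params = elasticache.describe_cache_parameters(
        CacheParameterGroupName="my-memcached-params")
    for p in params["Parameters"]:
        if p["ParameterName"] in ("chunk_size", "chunk_size_growth_factor"):
            print(p["ParameterName"], p.get("ParameterValue"))

    # A smaller growth factor wastes less memory per chunk when items are similar
    # in size; 1.10 is an illustrative value, not a recommendation.
    elasticache.modify_cache_parameter_group(
        CacheParameterGroupName="my-memcached-params",
        ParameterNameValues=[
            {"ParameterName": "chunk_size_growth_factor", "ParameterValue": "1.10"},
        ],
    )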

Redis memory optimization

Redis has a good write-up on memory optimization that can come in handy for advanced use cases. It exposes a number of configuration directives that affect how Redis balances CPU and memory for a given dataset. These directives can be used with ElastiCache for Redis as well.
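
One of the most common techniques from that write-up is packing related fields into small hashes so that Redis can use its compact encoding (controlled by directives such as hash-max-ziplist-entries). A small redis-py sketch; the endpoint and key names are made up:

    import redis

    r = redis.Redis(host="my-redis.xxxxxx.ng.0001.use1.cache.amazonaws.com", port=6379)  # hypothetical

    # Storing related fields in one small hash is generally far more memory-efficient
    # than creating a separate top-level string key per field.
    r.hset("user:42", mapping={"name": "Ada", "plan": "pro", "visits": 17})
    print(r.object("encoding", "user:42"))  # typically a compact encoding such as b"ziplist"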

Redis backup and restore

Redis clusters support persistence by using backup and restore. When Redis backup and restore is enabled, ElastiCache can automatically take snapshots of your Redis cluster and save them to Amazon Simple Storage Service (Amazon S3). The Amazon ElastiCache User Guide includes excellent coverage of this function in the topic Backup and restore for ElastiCache for Redis.

Because of the way Redis backups are implemented in the Redis engine itself, you need to have more memory available than your dataset consumes. This requirement exists because Redis forks a background process that writes the backup data. To do so, it makes a copy of your data using Linux copy-on-write semantics. If your data is changing rapidly, those changed data segments will be copied, consuming additional memory.

For production use, we strongly recommend that you always enable Redis backups, and retain them for a minimum of 7 days. In practice, retaining them for 14 or 30 days will provide better safety in the event of an application bug that ends up corrupting data.

Even if you plan to use Redis primarily as a performance optimization or caching layer, persisting the data means you can prewarm a new Redis node, which avoids the thundering herd issue that we discussed earlier. To create a new Redis cluster from a backup snapshot, refer to Seeding a new cluster with an externally created backup in the Amazon ElastiCache for Redis User Guide.

You can also use a Redis snapshot to scale up to a larger Amazon EC2 instance type. To do so, follow this process:

  1. Suspend writes to your existing ElastiCache cluster. Your application can continue to do reads.

  2. Take a snapshot by following the procedure in the Making manual backups section in the Amazon ElastiCache for Redis User Guide. Give it a distinctive name that you will remember.

  3. Create a new ElastiCache Redis cluster, and specify the snapshot you just took to seed it (see the sketch following these steps).

  4. Once the new ElastiCache cluster is online, reconfigure your application to start writing to the new cluster.
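
Steps 2 and 3 can be scripted with boto3. A minimal sketch, with hypothetical cluster names and an illustrative target node type:

    import boto3

    elasticache = boto3.client("elasticache")

    # Step 2: take a manual snapshot of the existing cluster (hypothetical IDs).
    elasticache.create_snapshot(
        ReplicationGroupId="my-redis",
        SnapshotName="my-redis-pre-resize",
    )
    # Poll describe_snapshots until SnapshotStatus reports "available" before continuing.

    # Step 3: seed a new, larger cluster from that snapshot.
    elasticache.create_replication_group(
        ReplicationGroupId="my-redis-xl",
        ReplicationGroupDescription="Resized copy of my-redis",
        SnapshotName="my-redis-pre-resize",
        Engine="redis",
        CacheNodeType="cache.r6g.xlarge",  # illustrative larger node type
    )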

Currently, this process will interrupt your application's ability to write data into Redis. If you have writes that are only going into Redis and that cannot be suspended, you can put those into Amazon SQS while you are resizing your ElastiCache cluster. Then, once your new ElastiCache Redis cluster is ready, you can run a script that pulls those records off Amazon SQS and writes them to your new Redis cluster.
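
A minimal sketch of that buffering pattern, with a hypothetical queue URL and new cluster endpoint:

    import json

    import boto3
    import redis

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/redis-write-buffer"  # hypothetical
    new_redis = redis.Redis(host="my-redis-xl.xxxxxx.ng.0001.use1.cache.amazonaws.com", port=6379)  # hypothetical

    def buffered_set(key, value):
        """While the resize is in progress, enqueue writes instead of sending them to Redis."""
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"key": key, "value": value}))

    def drain_queue():
        """Once the new cluster is ready, replay the buffered writes into it."""
        while True:
            resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=2)
            messages = resp.get("Messages", [])
            if not messages:
                return
            for msg in messages:
                body = json.loads(msg["Body"])
                new_redis.set(body["key"], body["value"])
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])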