Amazon ElastiCache
User Guide (API Version 2015-02-02)

Mitigating Failures

When planning your Amazon ElastiCache implementation, you should plan so that failures have a minimal impact upon your application and data. The topics in this section cover approaches you can take to protect your application and data from failures.

Mitigating Failures when Running Memcached

When running the Memcached engine, you have the following options for minimizing the impact of a failure. There are two types of failures to address in your failure mitigation plans: node failure and availability zone failure.

Mitigating Node Failures

To mitigate the impact of a node failure, spread your cached data over more nodes. Because Memcached does not support replication, a node failure will always result in some data loss from your cluster.

When you create your Memcached cluster, you can create it with 1 to 20 nodes, or more by special request. Partitioning your data across a greater number of nodes means you'll lose less data if a node fails. For example, if you partition your data across 10 nodes, any single node stores approximately 10% of your cached data. In this case, a node failure loses approximately 10% of your cache, which must be repopulated after a replacement node is created and provisioned. If the same data were cached in 3 larger nodes, the failure of a node would lose approximately 33% of your cached data.
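The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes keys are spread evenly across nodes, which real key distributions only approximate.

```python
def fraction_lost_per_node_failure(node_count: int) -> float:
    """Approximate fraction of cached data lost when one node fails.

    Assumption: keys are partitioned evenly across all nodes.
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return 1.0 / node_count


# Compare a few cluster sizes, including the 3-node and 10-node cases above.
for nodes in (3, 10, 20):
    lost = fraction_lost_per_node_failure(nodes)
    print(f"{nodes:>2} nodes: ~{lost:.0%} of the cache lost per node failure")
```

The same total capacity split across more, smaller nodes lowers the per-failure loss, at the cost of managing more nodes.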

If you need more than 20 nodes in a Memcached cluster, or more than 100 nodes total in a region, please fill out the ElastiCache Limit Increase Request form at https://aws.amazon.com/contact-us/elasticache-node-limit-request/.

For information on specifying the number of nodes in a Memcached cluster, go to Creating a Cluster (Console): Memcached.

Mitigating Availability Zone Failures

To mitigate the impact of an availability zone failure, locate your nodes in as many availability zones as possible. In the unlikely event of an AZ failure, you will lose only the data cached in that AZ, not the data cached in the other AZs.

Why so many nodes?

If my region has only 3 availability zones, why do I need more than 3 nodes? If an AZ fails, I lose approximately one-third of my data either way.

This is an excellent question. Remember that we’re attempting to mitigate two distinct types of failures: node failure and availability zone failure. You’re right that if your data is spread evenly across availability zones and one of the zones fails, you lose only the data cached in that AZ, irrespective of the number of nodes you have. However, if a single node fails, having more nodes reduces the proportion of cache data lost.

There is no "magic formula" for determining how many nodes to have in your cluster. You must weigh the impact of data loss against the likelihood of a failure and come to your own conclusion.
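To make the trade-off concrete, the following sketch compares the two failure types for a given layout. The function name and the AZ labels are hypothetical, and the sketch assumes data is spread evenly across all nodes.

```python
def loss_fractions(nodes_per_az: dict[str, int]) -> tuple[float, float]:
    """Return (single-node loss, worst single-AZ loss) as fractions of the cache.

    Assumption: every node holds an equal share of the cached data.
    """
    total = sum(nodes_per_az.values())
    if total == 0:
        raise ValueError("cluster has no nodes")
    node_loss = 1 / total                          # one node fails
    az_loss = max(nodes_per_az.values()) / total   # the most populated AZ fails
    return node_loss, az_loss


three_nodes = {"az-a": 1, "az-b": 1, "az-c": 1}
nine_nodes = {"az-a": 3, "az-b": 3, "az-c": 3}
for label, layout in (("3 nodes", three_nodes), ("9 nodes", nine_nodes)):
    node_loss, az_loss = loss_fractions(layout)
    print(f"{label}: node failure ~{node_loss:.0%}, AZ failure ~{az_loss:.0%}")
```

With three AZs, the AZ-failure loss stays at about one-third in both layouts, while the node-failure loss drops from about 33% to about 11% when going from 3 to 9 nodes.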

For information on specifying the number of nodes in a Memcached cluster, go to Creating a Cluster (Console): Memcached.

For more information on regions and availability zones, go to Selecting Regions and Availability Zones.

Mitigating Failures when Running Redis

When running the Redis engine, you have the following options for minimizing the impact of a cluster or availability zone failure.

Mitigating Cluster Failures

To mitigate the impact of Redis cluster failures, you have the following options:

Mitigating Cluster Failures: Redis Append Only Files (AOF)

When AOF is enabled for Redis, whenever data is written to your Redis cluster, a corresponding transaction record is written to a Redis append only file (AOF). If your Redis process restarts, ElastiCache creates a replacement cluster and provisions it. You can then run the AOF against the cluster to repopulate it with data.

Some of the shortcomings of using Redis AOF to mitigate cluster failures are:

  • It is time consuming.

    Creating and provisioning a cluster can take several minutes. Depending upon the size of the AOF, running it against the cluster will add even more time during which your application cannot access your cluster for data, forcing it to hit the database directly.

     

  • The AOF can get big.

    Because every write to your cluster is written to a transaction record, AOFs can become very large, larger than the .rdb file for the dataset in question. Because ElastiCache relies on the local instance store, which is limited in size, enabling AOF can cause out-of-disk-space issues. You can avoid out-of-disk-space issues by using a replication group with Multi-AZ enabled.

     

  • Using AOF cannot protect you from all failure scenarios.

    For example, if a cluster fails due to a hardware fault in an underlying physical server, ElastiCache will provision a new cluster on a different server. In this case, the AOF is not available and cannot be used to recover the data, leaving Redis to start with a cold cache.

For more information, see Redis Append Only Files (AOF).

Mitigating Cluster Failures: Redis Replication Groups

A Redis replication group consists of a single primary cluster, which your application can both read from and write to, and from 1 to 5 read-only replica clusters. Whenever data is written to the primary cluster, it is also asynchronously propagated to the read replica clusters.

When a read replica fails

  1. ElastiCache detects the failed read replica.

  2. ElastiCache takes the failed cluster offline.

  3. ElastiCache launches and provisions a replacement cluster in the same AZ.

  4. The new cluster synchronizes with the Primary cluster.

During this time your application can continue reading and writing using the other clusters.

Redis Multi-AZ with Automatic Failover

You can enable Multi-AZ with automatic failover on your Redis replication groups. Whether or not you enable Multi-AZ with automatic failover, a failed Primary is detected and replaced automatically. How this takes place depends on whether Multi-AZ is enabled.

When Multi-AZ with Auto Failover is enabled

  1. ElastiCache detects the Primary failure.

  2. ElastiCache promotes the read replica with the least replication lag to primary.

  3. The other replicas sync with the new primary.

  4. ElastiCache spins up a read replica in the failed primary's AZ.

  5. The new cluster syncs with the newly promoted primary.

Failing over to a replica cluster is generally faster than creating and provisioning a new cluster. This means your application can resume writing to your cluster sooner than if Multi-AZ were not enabled.
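The promotion rule in step 2 above (pick the read replica with the least replication lag) can be sketched as follows. This is an illustrative model only, not ElastiCache's internal implementation; the `Replica` class, its fields, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    # Hypothetical model of a read replica, for illustration only.
    name: str
    availability_zone: str
    replication_lag_seconds: float


def choose_new_primary(replicas: list[Replica]) -> Replica:
    """Pick the replica with the least replication lag, as in step 2."""
    if not replicas:
        raise ValueError("no replica available to promote")
    return min(replicas, key=lambda r: r.replication_lag_seconds)


replicas = [
    Replica("replica-1", "us-east-1b", 0.8),
    Replica("replica-2", "us-east-1c", 0.2),
]
promoted = choose_new_primary(replicas)
print(f"promote {promoted.name} in {promoted.availability_zone}")
```

Choosing the least-lagged replica keeps the amount of not-yet-replicated data lost at failover as small as possible, since replication is asynchronous.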

For more information, see Replication: Multi-AZ with Automatic Failover (Redis).

When Multi-AZ with Auto Failover is disabled

  1. ElastiCache detects the Primary failure.

  2. ElastiCache takes the Primary offline.

  3. ElastiCache creates and provisions a new Primary node to replace the failed Primary.

  4. ElastiCache syncs the new Primary with one of the existing replicas.

  5. When the sync is finished, the new node functions as the cluster's Primary.

During steps 1 through 4 of this process, your application cannot write to the Primary cluster. However, your application can continue reading from your replica clusters.

For added protection, we recommend that you launch the clusters in your replication group in different availability zones (AZs). If you do this, an AZ failure will only impact the clusters in that AZ and not the others.

For more information, see ElastiCache Replication (Redis).

Mitigating Availability Zone Failures

To mitigate the impact of an availability zone failure, locate your clusters in as many availability zones as possible.

No matter how many clusters you have, if they are all located in the same availability zone, a catastrophic failure of that AZ results in your losing all your cache data. However, if you locate your clusters in multiple AZs, a failure of any AZ results in your losing only the clusters in that AZ.

Any time you lose a cluster you can experience a performance degradation since read operations are now shared by fewer clusters. This performance degradation will continue until the clusters are replaced. Because your data is not partitioned across Redis clusters, you risk some data loss only when the primary cluster is lost.

For information on specifying the availability zones for Redis clusters, go to Creating a Redis (cluster mode disabled) Cluster (Console).

For more information on regions and availability zones, go to Selecting Regions and Availability Zones.

Recommendations

There are two types of failures you need to plan for: individual node or cluster failures, and broad availability zone failures. The best failure mitigation plan addresses both kinds of failures.

Minimizing the Impact of Node and Cluster Failures

To minimize the impact of a node or cluster failure, we recommend that your implementation use multiple nodes or clusters.

If you're running Memcached and partitioning your data across nodes, the more nodes you use the smaller the data loss if any one node fails.

If you’re running Redis, we also recommend that you enable Multi-AZ on your replication group so that ElastiCache will automatically fail over to a replica if the primary cluster fails.
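As a rough sketch of what this recommendation might look like in practice, the following helper builds a request for the boto3 `create_replication_group` call with automatic failover turned on and one cluster per availability zone. The helper name, group ID, AZ names, and node type are placeholders chosen for illustration, and the set of parameters your account needs may differ; check the current API reference before relying on it.

```python
def build_replication_group_request(group_id: str, azs: list[str],
                                    node_type: str = "cache.m5.large") -> dict:
    """Build keyword arguments for a Redis replication group with failover.

    Assumptions: one cache cluster per listed AZ, with the first AZ hosting
    the primary. The default node type is only a placeholder.
    """
    if len(azs) < 2:
        raise ValueError("automatic failover needs a primary and at least one replica")
    return {
        "ReplicationGroupId": group_id,
        "ReplicationGroupDescription": "Redis replication group with automatic failover",
        "Engine": "redis",
        "CacheNodeType": node_type,
        "NumCacheClusters": len(azs),          # primary plus replicas
        "PreferredCacheClusterAZs": list(azs), # must match NumCacheClusters in length
        "AutomaticFailoverEnabled": True,
    }


request = build_replication_group_request("my-redis-group", ["us-east-1a", "us-east-1b"])
print(request)
# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("elasticache").create_replication_group(**request)
```

Keeping the request construction separate from the API call makes it easy to review or test the configuration without touching a live account.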

Minimizing the Impact of Availability Zone Failures

To minimize the impact of an availability zone failure, we recommend launching your nodes or clusters in as many different availability zones as are available. Spreading your nodes or clusters evenly across AZs will minimize the impact in the unlikely event of an AZ failure.
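One simple way to spread nodes evenly is to divide them across the available zones and hand any remainder to the first few zones. The sketch below assumes this approach; the function name and AZ names are hypothetical.

```python
def spread_across_azs(node_count: int, azs: list[str]) -> dict[str, int]:
    """Distribute node_count nodes as evenly as possible across the given AZs."""
    if not azs:
        raise ValueError("at least one availability zone is required")
    if node_count < 0:
        raise ValueError("node_count cannot be negative")
    base, extra = divmod(node_count, len(azs))
    # The first `extra` zones each receive one additional node.
    return {az: base + (1 if i < extra else 0) for i, az in enumerate(azs)}


layout = spread_across_azs(10, ["us-east-1a", "us-east-1b", "us-east-1c"])
print(layout)
```

With 10 nodes and 3 zones, no zone holds more than 4 nodes, so an AZ failure affects at most 40% of the nodes rather than all of them.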

Other precautions

If you're running Redis, then in addition to the above, we recommend that you schedule regular backups of your cluster. Backups (snapshots) create a .rdb file you can use to restore your cluster in case of failure or corruption. For more information, see ElastiCache Backup & Restore (Redis).