Amazon ElastiCache
User Guide (API Version 2015-02-02)

Replication: Multi-AZ with Automatic Failover (Redis)

Enabling Amazon ElastiCache's Multi-AZ with Automatic Failover functionality on your Redis cluster (API/CLI: replication group) improves fault tolerance when the cluster's read/write primary node becomes unreachable or fails for any reason.

Automatic Failover Overview

An ElastiCache cluster consists of a primary node and up to five read replica nodes. During certain types of planned maintenance, or in the unlikely event of a primary node or Availability Zone failure, if your cluster is Multi-AZ enabled, ElastiCache automatically detects the primary node's failure, selects a read replica node, and promotes it to primary so that you can resume writing to the new primary as soon as promotion is complete. ElastiCache also propagates the DNS of the promoted replica, so if your application writes to the primary endpoint, no endpoint change is required in your application. However, because you read from individual node endpoints, you need to replace the read endpoint of the replica that was promoted to primary with the endpoint of the new replica.

The promotion process generally takes just a few minutes, which is much faster than recreating and provisioning a new primary, the process used when Multi-AZ is not enabled.

You can enable Multi-AZ with Automatic Failover using the ElastiCache console, the AWS CLI, or the ElastiCache API.

Notes on Redis Multi-AZ with Automatic Failover

The following points should be noted:

  • Multi-AZ with Automatic Failover is supported on Redis version 2.8.6 and later.

  • Redis Multi-AZ with Automatic Failover is not supported on t1 and t2 cache node types.

  • Redis replication is asynchronous. Therefore, when a primary node fails over to a replica, a small amount of data might be lost due to replication lag.

  • When selecting the replica to promote to primary, ElastiCache selects the replica with the least replication lag (that is, the one that is most current).

  • When you enable Multi-AZ with Automatic Failover on a cluster, a replica node cannot be manually promoted to primary. Thus, if the primary in AZ-a fails over to a replica in AZ-b, the primary stays in AZ-b. To promote the new replica node in AZ-a to primary, you must first disable Multi-AZ with Automatic Failover on the cluster, do the promotion, and then re-enable Multi-AZ with Automatic Failover.

  • ElastiCache Multi-AZ with Automatic Failover and append-only file (AOF) are mutually exclusive. If you enable one, you cannot enable the other.

  • When a node's failure is caused by the rare event of an entire Availability Zone failing, the replica replacing the failed primary is created only when the Availability Zone is back up. For example, consider a replication group with the primary in AZ-a and replicas in AZ-b and AZ-c. If the primary fails, the replica with the least replication lag is promoted to primary. Then, ElastiCache creates a new replica in AZ-a (where the failed primary was located) only when AZ-a is back up and available.

  • A customer-initiated reboot of a primary does not trigger Automatic Failover. Other reboots and failures do trigger Automatic Failover.

  • Whenever the primary is rebooted, it is cleared of data when it comes back online. When the read replicas see the cleared primary, they clear their copy of the data, which causes data loss.

  • After a read replica has been promoted, the other replicas sync with the new primary. After the initial sync, the replicas' content is deleted and they sync the data from the new primary, causing a brief interruption during which the replicas are not accessible. This sync process also causes a temporary load increase on the primary while syncing with the replicas. This behavior is native to Redis and isn’t unique to ElastiCache Multi-AZ. For details regarding this Redis behavior, see http://redis.io/topics/replication.
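
The disable, promote, and re-enable sequence described in the notes above can be sketched as follows. This is a minimal illustration, not a supported tool: it only builds the three AWS CLI invocations in order and does not run them. The function name and the node ID `my-replica-node` are hypothetical. In practice, each modification would need to finish (the replication group returning to the available state) before the next command is issued.

```python
from typing import List


def manual_promotion_commands(group_id: str, new_primary_id: str) -> List[List[str]]:
    """Return the three AWS CLI invocations, in order, as argument lists."""
    base = [
        "aws", "elasticache", "modify-replication-group",
        "--replication-group-id", group_id,
        "--apply-immediately",
    ]
    return [
        # 1. Disable Multi-AZ with Automatic Failover.
        base + ["--no-automatic-failover-enabled"],
        # 2. Promote the chosen replica node to primary.
        base + ["--primary-cluster-id", new_primary_id],
        # 3. Re-enable Multi-AZ with Automatic Failover.
        base + ["--automatic-failover-enabled"],
    ]


if __name__ == "__main__":
    # "my-replica-node" is a placeholder for the replica you want to promote.
    for command in manual_promotion_commands("myReplGroup", "my-replica-node"):
        print(" ".join(command))
```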

Important

  • Redis version 2.8.22 and later

    External replicas are not permitted.

     

  • Redis versions prior to 2.8.22

    We recommend that you do not connect an external Redis replica to an ElastiCache Redis cluster that is Multi-AZ with Automatic Failover enabled. This is an unsupported configuration that can create issues that prevent ElastiCache from properly performing failover and recovery. If you need to connect an external Redis replica to an ElastiCache cluster, make sure that Multi-AZ with Automatic Failover is disabled before you make the connection.

Failure Scenarios with Multi-AZ and Automatic Failover Responses

Prior to the introduction of Multi-AZ with Automatic Failover, ElastiCache detected and replaced a cluster's failed nodes by recreating and reprovisioning them. With Multi-AZ with Automatic Failover enabled, a failed primary node fails over to the replica with the least replication lag. The selected replica is automatically promoted to primary, which is much faster than creating and reprovisioning a new primary node. It usually takes just a few minutes until you can write to the cluster again.
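
The selection rule (promote the available replica with the least replication lag) can be illustrated with a small sketch. This is not ElastiCache's implementation; the `Replica` type, its fields, and the use of a byte count as the lag measure are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    node_id: str
    availability_zone: str
    available: bool
    lag_bytes: int  # hypothetical measure of how far the replica trails the primary


def select_replica_to_promote(replicas: List[Replica]) -> Optional[Replica]:
    """Pick the available replica that is most current, or None if none is available."""
    candidates = [r for r in replicas if r.available]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.lag_bytes)


if __name__ == "__main__":
    replicas = [
        Replica("replica-b", "AZ-b", True, 2048),
        Replica("replica-c", "AZ-c", True, 512),
        Replica("replica-d", "AZ-b", False, 0),  # failed node, never a candidate
    ]
    chosen = select_replica_to_promote(replicas)
    print(chosen.node_id)  # replica-c
```

Because replication is asynchronous, whatever lag the chosen replica has at the moment of failure corresponds to writes that may be lost.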

When Multi-AZ with Automatic Failover is enabled, ElastiCache continually monitors the state of the primary node. If the primary node fails, one of the following actions is performed.

 

When only the primary node fails

If only the primary node fails, the read replica with the least replication lag is promoted to primary, and a replacement read replica is created and provisioned in the same Availability Zone as the failed primary.

Automatic Failover for a failed primary node

ElastiCache Multi-AZ with Automatic Failover Actions when only the primary node fails

  1. The failed primary node is taken offline.

  2. The read replica with the least replication lag is promoted to primary.

    Writes can resume as soon as the promotion process is complete, typically within a few minutes. If your application is writing to the primary endpoint, there is no need to change the endpoint for writes because ElastiCache propagates the DNS of the promoted replica.

  3. A replacement read replica is launched and provisioned.

    The replacement read replica is launched in the Availability Zone that the failed primary node was in so that the distribution of nodes is maintained.

  4. The replicas sync with the new primary node.

You need to make the following changes to your application after the new replica is available:

  • Primary endpoint: Do not make any changes to your application, because the DNS of the new primary node is propagated to the primary endpoint.

  • Read endpoint: Replace the read endpoint of the failed primary with the read endpoint of the new replica.
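
On the application side, the read endpoint change amounts to swapping one host for another in whatever list of read endpoints the application keeps. The following sketch assumes the application stores its read endpoints as a list of strings; the function name and the host names are hypothetical. The primary endpoint is deliberately left out because it does not change.

```python
from typing import Dict, List


def update_read_endpoints(read_endpoints: List[str],
                          replacements: Dict[str, str]) -> List[str]:
    """Return a new list in which each replaced endpoint is swapped for its successor.

    Endpoints that are not keys in `replacements` are kept as they are,
    and the original order is preserved.
    """
    return [replacements.get(endpoint, endpoint) for endpoint in read_endpoints]


if __name__ == "__main__":
    # Placeholder host names, not real ElastiCache endpoints.
    current = ["node-a.example.internal:6379", "node-b.example.internal:6379"]
    updated = update_read_endpoints(
        current,
        {"node-a.example.internal:6379": "node-d.example.internal:6379"},
    )
    print(updated)
```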

 

When the primary node and some read replicas fail

If the primary and at least one read replica fail, the available replica with the least replication lag is promoted to primary, and new read replicas are created and provisioned in the same Availability Zones as the failed nodes and the replica that was promoted to primary.

Automatic Failover for a failed primary node and read replica

ElastiCache Multi-AZ Actions when the primary node and some read replicas fail

  1. The failed primary node and failed read replicas are taken offline.

  2. The available replica with the least replication lag is promoted to primary node.

    Writes can resume as soon as the promotion process is complete, typically within a few minutes. If your application is writing to the primary endpoint, there is no need to change the endpoint for writes because ElastiCache propagates the DNS of the promoted replica.

  3. Replacement replicas are created and provisioned.

    The replacement replicas are created in the Availability Zones of the failed nodes so that the distribution of nodes is maintained.

  4. All replicas sync with the new primary node.

You need to make the following changes to your application after the new nodes are available:

  • Primary endpoint: Do not make any changes to your application, because the DNS of the new primary node is propagated to the primary endpoint.

  • Read endpoint: Replace the read endpoints of the failed primary and failed replicas with the node endpoints of the new replicas.

 

When the entire cluster fails

If every node in the cluster fails, all the nodes are recreated and provisioned in the same Availability Zones as the original nodes.

In this scenario, all the data in the cluster is lost due to the failure of every node in the cluster. This is a rare occurrence.

Automatic Failover for a failed cluster

ElastiCache Multi-AZ Actions when the entire cluster fails

  1. The failed primary node and read replicas are taken offline.

  2. A replacement primary node is created and provisioned.

  3. Replacement replicas are created and provisioned.

    The replacements are created in the Availability Zones of the failed nodes so that the distribution of nodes is maintained.

    Note

    Because the entire cluster failed, data is lost and all the new nodes start cold.

Because each of the replacement nodes will have the same endpoint as the node it is replacing, there is no need for you to make any endpoint changes in your application.

We recommend that you create the primary node and read replicas in different Availability Zones to raise your fault tolerance level.

Enabling Multi-AZ with Automatic Failover

You can enable Multi-AZ with Automatic Failover when you create or modify a cluster (API/CLI: replication group) using the AWS console, AWS CLI, or the ElastiCache API.

Multi-AZ with Automatic Failover can only be enabled on Redis clusters that have at least one available read replica. For information about creating a cluster with replication, see Creating a Redis Cluster with Replicas. For information about adding a read replica to a cluster with replication, see Adding a Read Replica to a Redis Cluster.
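
The prerequisites stated in this topic (Redis 2.8.6 or later, a node type other than t1 or t2, at least one available read replica, and AOF disabled) can be collected into a simple pre-flight check. This is a sketch under stated assumptions: the function name is hypothetical, and it assumes node types are written in the form `cache.m3.medium`, with the family as the middle part. ElastiCache itself performs the authoritative validation.

```python
from typing import List

MIN_VERSION = (2, 8, 6)
UNSUPPORTED_FAMILIES = {"t1", "t2"}


def multi_az_blockers(engine_version: str, node_type: str,
                      available_replicas: int, aof_enabled: bool) -> List[str]:
    """List the reasons, per this topic, that Multi-AZ could not be enabled."""
    blockers = []
    if tuple(int(part) for part in engine_version.split(".")) < MIN_VERSION:
        blockers.append("Redis version is earlier than 2.8.6")
    # Assumes a node type such as "cache.m3.medium"; the family is the middle part.
    parts = node_type.split(".")
    family = parts[1] if len(parts) > 1 else parts[0]
    if family in UNSUPPORTED_FAMILIES:
        blockers.append("t1 and t2 node types are not supported")
    if available_replicas < 1:
        blockers.append("at least one available read replica is required")
    if aof_enabled:
        blockers.append("append-only file (AOF) must be disabled")
    return blockers


if __name__ == "__main__":
    # An empty list means none of the documented blockers apply.
    print(multi_az_blockers("2.8.24", "cache.m3.medium", 1, False))
    print(multi_az_blockers("2.8.24", "cache.t2.micro", 0, True))
```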

Enabling Multi-AZ with Automatic Failover (Console)

You can enable Multi-AZ with Automatic Failover using the ElastiCache console when you create a new Redis cluster or by modifying an existing Redis cluster with replication.

Enabling Multi-AZ with Automatic Failover When Creating a Cluster Using the ElastiCache Console

See the topic Creating a Redis (cluster mode disabled) Cluster (Console). Be sure to have one or more replicas and enable Multi-AZ with Automatic Failover.

Enabling Multi-AZ with Automatic Failover on an Existing Cluster (Console)

See the topic Modifying a Cluster (Console).

Enabling Multi-AZ with Automatic Failover (AWS CLI)

The following code example uses the AWS CLI to enable Multi-AZ with Automatic Failover for the replication group myReplGroup.

Important

The replication group myReplGroup must already exist and have at least one available read replica.

For Linux, macOS, or Unix:

    aws elasticache modify-replication-group \
        --replication-group-id myReplGroup \
        --automatic-failover-enabled

For Windows:

    aws elasticache modify-replication-group ^
        --replication-group-id myReplGroup ^
        --automatic-failover-enabled

For more information, see the AWS CLI topics, create-cache-cluster, create-replication-group, and modify-replication-group.

Enabling Multi-AZ with Automatic Failover (ElastiCache API)

The following code example uses the ElastiCache API to enable Multi-AZ with Automatic Failover for the replication group myReplGroup.

Note

The replication group myReplGroup must already exist and have at least one available read replica.

    https://elasticache.us-west-2.amazonaws.com/
        ?Action=ModifyReplicationGroup
        &AutomaticFailoverEnabled=true
        &ReplicationGroupId=myReplGroup
        &Version=2015-02-02
        &SignatureVersion=4
        &SignatureMethod=HmacSHA256
        &Timestamp=20140401T192317Z
        &X-Amz-Credential=<credential>

For more information, see the ElastiCache API reference for CreateCacheCluster, CreateReplicationGroup, and ModifyReplicationGroup.
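
If you assemble the query string yourself, the request parameters can be composed as in the following sketch. It uses only the Python standard library, and the function name is hypothetical. It sends the flag as `AutomaticFailoverEnabled`, the parameter name listed in the ModifyReplicationGroup API reference; verify this against the reference for your API version. The sketch deliberately omits Signature Version 4 signing, so the resulting URL would be rejected if sent as is.

```python
from urllib.parse import urlencode


def build_modify_url(endpoint: str, group_id: str,
                     api_version: str = "2015-02-02") -> str:
    """Compose the unsigned query URL that enables automatic failover."""
    params = [
        ("Action", "ModifyReplicationGroup"),
        ("ReplicationGroupId", group_id),
        ("AutomaticFailoverEnabled", "true"),
        ("Version", api_version),
    ]
    # urlencode keeps the order of a list of pairs and escapes values as needed.
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_modify_url("https://elasticache.us-west-2.amazonaws.com/", "myReplGroup"))
```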