Failover and Failback overview

In the event of a disaster, you will need to perform a Failover to AWS with the help of Elastic Disaster Recovery. Once the disaster has been mitigated, you will then need to perform a Failback to your original source infrastructure.

Elastic Disaster Recovery ensures that your recovery systems are ready in the case of a disaster. The actual failover is a networking operation that is performed outside of Elastic Disaster Recovery. You launch your recovery instances with Elastic Disaster Recovery, either from the most up-to-date state or from a specific Point in Time (PIT). Once your systems are up and running and you are ready to resume operations on your primary system, you will need to perform failback replication. Most likely, new data has been written while you were using your recovery system on AWS, and this data needs to be copied back to your primary system.

Elastic Disaster Recovery helps you be ready for a failover event by making drills easy to run. It allows you to launch your instances frequently for test and drill purposes without redirecting traffic.

In order to be prepared for a Failover, you need to perform regular drills by launching Drill instances in AWS through Elastic Disaster Recovery and testing these instances.

Performing drills is a key aspect of being prepared for a disaster. Once an actual disaster strikes, you can immediately perform a Failover by launching Recovery instances in AWS based on a chosen Point in Time snapshot.

Once the disaster is over, you can perform a Failback to your original source server or to any other server that meets the prerequisites by installing the Elastic Disaster Recovery Failback Client on the server. In order to use the Failback Client, you need to generate Elastic Disaster Recovery-specific credentials.

Once your Failback is complete, you can opt to either terminate, delete, or disconnect the Recovery instance.
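As a sketch of how this cleanup step could be scripted with the AWS SDK for Python (boto3), the `drs` client exposes `terminate_recovery_instances` and `disconnect_recovery_instance` operations. The instance IDs below are placeholders, and the request is only constructed here, not sent:

```python
# Sketch: cleaning up Recovery instances after failback with boto3.
# Terminating deletes the launched EC2 resources, while disconnecting
# stops replication but leaves the instances running. IDs are placeholders.

def cleanup_action(recovery_instance_ids, action="terminate"):
    """Return the boto3 drs operation name and request for the chosen action."""
    if action == "terminate":
        return "terminate_recovery_instances", {
            "recoveryInstanceIDs": recovery_instance_ids
        }
    if action == "disconnect":
        # DisconnectRecoveryInstance operates on one instance at a time.
        return "disconnect_recovery_instance", {
            "recoveryInstanceID": recovery_instance_ids[0]
        }
    raise ValueError(f"unknown action: {action}")


op, request = cleanup_action(["i-0123456789abcdef0"], action="terminate")
# With credentials configured, the call would be:
#     getattr(boto3.client("drs"), op)(**request)
```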

The following is the architectural diagram for DRS failback replication:

Understanding Drill and Recovery instances

Elastic Disaster Recovery allows you to launch Drill and Recovery instances for your source servers in AWS. Drill and Recovery instances are launched in a similar fashion. You can launch a Drill or Recovery instance from the most up-to-date state, typically achieving an RPO of seconds, or from one of the point-in-time states that the system maintains.
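A minimal sketch of how this launch could be driven through the AWS SDK for Python (boto3): the `drs` client's `start_recovery` call covers both cases, with the `isDrill` flag selecting Drill versus Recovery and an optional snapshot ID selecting a Point in Time. The server and snapshot IDs below are placeholders, and the API call itself is shown but not executed:

```python
# Sketch: launching Drill vs. Recovery instances with boto3.
# Source server and snapshot IDs below are placeholders.

def build_launch_request(source_server_id, snapshot_id=None, drill=True):
    """Build the start_recovery request for one source server.

    Omitting recoverySnapshotID launches from the most up-to-date state;
    passing one launches from that Point in Time. isDrill=True launches a
    Drill instance, isDrill=False a Recovery instance.
    """
    server = {"sourceServerID": source_server_id}
    if snapshot_id is not None:
        server["recoverySnapshotID"] = snapshot_id
    return {"sourceServers": [server], "isDrill": drill}


def launch(request):
    # Requires AWS credentials; not run in this sketch.
    import boto3
    return boto3.client("drs").start_recovery(**request)


# A drill from the latest state vs. a recovery from a specific PIT:
drill_req = build_launch_request("s-1234567890abcdef0", drill=True)
recovery_req = build_launch_request(
    "s-1234567890abcdef0", snapshot_id="pit-0123456789abcdef0", drill=False
)
```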

Understanding Point In Time states

Point in Time (PIT) is a disaster recovery feature that allows you to launch an instance from a snapshot captured at a specific point in time. As source servers are replicated, Point in Time states are recorded over time, and a retention policy determines which Points in Time are discarded after a defined duration.

Elastic Disaster Recovery has the following PIT state schedule:

  • Every 10 minutes for the last hour

  • Once an hour for the last 24 hours

  • Once a day for the last 7 days (or a different retention period, as configured)

You can increase or decrease the default 7-day snapshot retention period to anywhere between 1 day and 365 days in the Replication Settings. Learn more about managing Point in Time retention.

Note

Increasing the PIT retention period will result in additional costs.
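The schedule and retention settings above can be expressed programmatically as the `pitPolicy` parameter of the boto3 `drs` client's `update_replication_configuration` call. The sketch below assumes the current SDK field names, and the 30-day daily retention is an arbitrary example value:

```python
# Sketch: the default PIT schedule as a pitPolicy list for boto3's
# update_replication_configuration. Field names assume the current SDK;
# the 30-day daily retention below is an arbitrary example.

def build_pit_policy(daily_retention_days=7):
    """PIT rules: 10-minute snapshots kept for 1 hour, hourly kept for
    24 hours, daily kept for the configured period (1-365 days)."""
    if not 1 <= daily_retention_days <= 365:
        raise ValueError("daily retention must be between 1 and 365 days")
    return [
        {"ruleID": 1, "enabled": True, "units": "MINUTE",
         "interval": 10, "retentionDuration": 60},
        {"ruleID": 2, "enabled": True, "units": "HOUR",
         "interval": 1, "retentionDuration": 24},
        {"ruleID": 3, "enabled": True, "units": "DAY",
         "interval": 1, "retentionDuration": daily_retention_days},
    ]


pit_policy = build_pit_policy(daily_retention_days=30)
# With credentials configured, this would be applied per source server:
# boto3.client("drs").update_replication_configuration(
#     sourceServerID="s-1234567890abcdef0", pitPolicy=pit_policy)
```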

Upon launching Drill instances and Recovery instances, you will be prompted to select the Point in Time from which to launch the instances for the selected source servers.

When launching two or more source servers simultaneously, you can select which specific Points in Time to view.

Choose the Any option to view all of the points in time available for all of the selected source servers. Elastic Disaster Recovery will launch a drill instance for each source server that has a PIT snapshot taken at the chosen time. For any source server that does not have a corresponding PIT snapshot taken at the chosen time, a previous PIT will be used.

Choose the All option to only view points in time that include all of the selected servers. If there are no points in time that include all servers, the list will be empty.
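The Any/All behavior can be illustrated with a small sketch (the server names and timestamps are made-up illustration data): All shows only timestamps at which every selected server has a PIT, while Any lets a server without a PIT at the chosen time fall back to its most recent earlier one:

```python
# Sketch of the Any vs. All selection logic described above; the
# per-server snapshot timestamps are made-up illustration data.

def pits_for_all(snapshots_by_server):
    """'All': timestamps at which every selected server has a PIT."""
    sets = [set(ts) for ts in snapshots_by_server.values()]
    return sorted(set.intersection(*sets))


def pit_used_for_server(server_snapshots, chosen_ts):
    """'Any': the server's PIT at the chosen time, else its most recent
    earlier PIT (None if it has no PIT at or before that time)."""
    earlier = [ts for ts in server_snapshots if ts <= chosen_ts]
    return max(earlier) if earlier else None


snapshots = {
    "server-a": [100, 200, 300],
    "server-b": [100, 250, 300],
}
```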

The Servers in PIT and Servers to be recovered columns show the number of servers within the chosen PIT.

Under the Taken at time (UTC) column, you can either select the Use most recent data option, which will immediately create a new Point in Time and use that state, or you can select a previously taken snapshot from the list of available PIT states.

Note

If you selected the Use most recent data option but Elastic Disaster Recovery is unable to take a new PIT snapshot of the source server (due to a disaster, connectivity issues, and so on), then Elastic Disaster Recovery will automatically use the last PIT taken.

Understanding Recovery Objectives

The DRS service provides continuous block-level replication, recovery orchestration, and automated server conversion capabilities. These allow customers to achieve a crash-consistent recovery point objective (RPO) of seconds, and a recovery time objective (RTO) typically ranging from 5 to 20 minutes. Below is an explanation of how RPO and RTO are measured, how DRS enables these RPOs and RTOs, and which common environment conditions can impact them.

Recovery Point Objective (RPO)

How is RPO measured?

RPO is measured based on the latest point in time in which block data was written to the source server volume(s) and successfully copied in a crash-consistent state into the replication Staging Area located in the customer’s target AWS account.

How does DRS enable an RPO of seconds?

The AWS Replication Agent continuously monitors the blocks written to the source server volume(s), and immediately attempts to copy the blocks across the network and into the replication Staging Area Subnet located in the customer’s target AWS account. This continuous replication approach enables an RPO of seconds as long as the written data can be immediately copied across the network and into the replication Staging Area volumes.

Important

A crash-consistent recovery point enables the successful recovery of crash-consistent applications, such as databases. The recovery point will include any data that has been successfully written to the source server volume(s). Application data that is kept in memory is not replicated to the target replication Staging Area until it is written to the source server volume(s). Therefore, if a disruption occurs before in-memory application data is written to the volume(s), this data will not be available on the target server when launched for test or recovery purposes.

What environment conditions can impact the ability to achieve a typical RPO of seconds?

To achieve an RPO of seconds, DRS primarily requires that the outbound network, inbound network, and Staging Area resources allow data to be copied across the network and written to the target environment faster than the rate at which it is written to the source volume(s). If block writes burst at faster rates than these components can support, the RPO will temporarily increase until data replication catches up, at which point the RPO will return to seconds. Examples:

  1. Outbound network: If a source server writes block data at a rate of 10MB/second, the outbound network bandwidth must also support a rate of at least 10MB/second in order to maintain a seconds RPO. If the source network contains 10 servers that each write at an average rate of 10MB/second, the total bandwidth will need to support a rate of at least 100MB/second in order to enable a seconds RPO.

  2. Inbound network: Once the replicated data is sent from the source network, it must enter the target network at a rate at least equal to that at which the data is written to the source servers and sent from the source network in order to maintain a seconds RPO.

  3. Staging Area resources: When the data arrives at the target network, it is received by the DRS Replication Server instance(s), which in turn write the replicated data to attached EBS volumes. Both the Replication Server instance(s) and the attached EBS volumes must allow the data to be written at a rate faster than that at which it is written to the source servers and sent by the source network in order to maintain an RPO of seconds.
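The sizing arithmetic in the examples above reduces to a simple sum; the per-server write rates in this sketch are made-up illustration values:

```python
# Sketch of the bandwidth sizing arithmetic from the examples above;
# per-server write rates are made-up illustration values (MB/s).

def required_bandwidth_mb_s(server_write_rates_mb_s):
    """Minimum sustained throughput (MB/s) that the outbound network,
    inbound network, and Staging Area must each support to keep an RPO
    of seconds: the sum of all source servers' write rates."""
    return sum(server_write_rates_mb_s)


# Ten servers each writing 10 MB/s need at least 100 MB/s end to end.
total = required_bandwidth_mb_s([10] * 10)
```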

What happens if the block data written to the source volume(s) cannot be sent immediately to the target replication Staging Area Subnet?

If the block data written on the source volume(s) cannot be sent immediately to the target replication Staging Area, the RPO will increase until the data can be flushed across the network. During this time, you will still be able to recover your server(s), but to a recovery point older than seconds, in accordance with the increase in RPO. The RPO represents the latest crash-consistent point in time during which data was replicated.
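As a rough illustration of how such a backlog translates into an increased RPO (all rates and durations below are made up): while writes exceed replication throughput the lag grows by the difference, and once writes slow down the backlog drains at the spare throughput:

```python
# Rough arithmetic sketch of replication lag (backlog) growth and
# catch-up; all rates and durations are made-up illustration values.

def backlog_after_burst(write_mb_s, replicate_mb_s, burst_seconds):
    """MB of un-replicated data accumulated during a write burst."""
    return max(0, (write_mb_s - replicate_mb_s) * burst_seconds)


def catch_up_seconds(backlog_mb, write_mb_s, replicate_mb_s):
    """Seconds to drain the backlog once replication outpaces writes."""
    spare = replicate_mb_s - write_mb_s
    if spare <= 0:
        raise ValueError("replication must outpace writes to catch up")
    return backlog_mb / spare


# A 60 s burst writing 25 MB/s against a 10 MB/s link leaves a 900 MB
# backlog; at a steady 5 MB/s write rate it drains in 180 s.
backlog = backlog_after_burst(25, 10, 60)
recovery_time = catch_up_seconds(backlog, 5, 10)
```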

Recovery Time Objective (RTO)

How is RTO measured?

RTO is measured from the recovery job start time until the recovered target server is booted and has network access on AWS.

What environment conditions can impact the ability to achieve a typical RTO of 5-20 minutes?

When launching a recovery job, the DRS orchestration process creates cloned volumes from the replicated volumes in the replication Staging Area. During this process, DRS also initiates a process that converts all volumes that originated outside of AWS into AWS-compatible volumes, which are attached to EC2 instances that can boot natively on AWS. The job and boot time depend on the following environment conditions:

  1. OS type: The average recovered Linux server normally boots within 5 minutes, while the average recovered Windows server normally boots within 20 minutes because it is tied to the more resource-intensive Windows boot process.

  2. OS configuration: The OS configuration and application components it runs can impact the boot time. For example, some servers run heavier workloads and start additional services when booted, which may increase their total boot time.

  3. Target instance performance: DRS sets a default instance type based on the CPU and RAM provisioned on the source server. Changing to a lower performance instance type will result in a slower boot time than that of a higher performance instance type.

  4. Target volume performance: Using a lower performance volume type will result in a slower boot time than that of a higher performance volume type with more provisioned IOPS.