Disaster recovery options in the cloud
Disaster recovery strategies available to you within AWS can be broadly categorized into four approaches, ranging from the low cost and low complexity of making backups to more complex strategies using multiple active Regions. Active/passive strategies use an active site (such as an AWS Region) to host the workload and serve traffic. The passive site (such as a different AWS Region) is used for recovery. The passive site does not actively serve traffic until a failover event is triggered.
It is critical to regularly assess and test your disaster recovery strategy so that you
have confidence in invoking it, should it become necessary. Use AWS Resilience Hub
![Graph showing disaster recovery strategies and highlights of each strategy.](/images/whitepapers/latest/disaster-recovery-workloads-on-aws/images/disaster-recovery-strategies.png)
Figure 6 - Disaster recovery strategies
For a disaster event based on disruption or loss of one physical
data center for a
well-architected
When choosing your strategy, and the AWS resources to implement it, keep in mind that within AWS, we commonly divide services into the data plane and the control plane. The data plane is responsible for delivering real-time service, while the control plane is used to configure the environment. For maximum resiliency, you should use only data plane operations as part of your failover operation. This is because data planes typically have higher availability design goals than control planes.
Backup and restore
Backup and restore is a suitable approach for mitigating data loss or corruption.
This approach can also be used to mitigate a regional disaster by replicating data to
other AWS Regions, or to mitigate a lack of redundancy for workloads deployed to a single
Availability Zone. In addition to data, you must redeploy the infrastructure, configuration,
and application code in the recovery Region. To enable infrastructure to be redeployed quickly
without errors, you should always deploy using infrastructure as code (IaC) using services
such as AWS CloudFormation
![Architecture diagram showing backup and restore architecture](/images/whitepapers/latest/disaster-recovery-workloads-on-aws/images/backup-restore-architecture.png)
Figure 7 - Backup and restore architecture
AWS services
Your workload data will require a backup strategy that runs periodically or is continuous. How often you run your backup will determine your achievable recovery point (which should align with your RPO). The backup should also offer a way to restore it to the point in time at which it was taken. Backup with point-in-time recovery is available through the following services and resources:
- Amazon EFS backup (when using AWS Backup)
- Amazon FSx for Windows File Server, Amazon FSx for Lustre, Amazon FSx for NetApp ONTAP, and Amazon FSx for OpenZFS
For Amazon Simple Storage Service (Amazon S3), you can use
Amazon S3 Cross-Region Replication (CRR)
AWS Backup
- Amazon EC2 instances
- Amazon Relational Database Service (Amazon RDS) databases (including Amazon Aurora databases)
- Amazon DynamoDB tables
- Amazon Elastic File System (Amazon EFS) file systems
- AWS Storage Gateway volumes
- Amazon FSx for Windows File Server, Amazon FSx for Lustre, Amazon FSx for NetApp ONTAP, and Amazon FSx for OpenZFS
AWS Backup supports copying backups across Regions, such as to a disaster recovery Region.
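As a rough illustration of how backup frequency bounds the achievable recovery point, the following Python sketch estimates the worst-case data-loss window for a periodic backup schedule and compares it with an RPO target. The function names are hypothetical, and the model (data loss is bounded by the backup interval plus the time a backup takes to complete and copy to the recovery Region) is a simplifying assumption, not AWS guidance.

```python
from datetime import timedelta


def worst_case_recovery_point(
    backup_interval: timedelta,
    backup_duration: timedelta = timedelta(0),
) -> timedelta:
    """Upper bound on data loss for periodic backups (simplified model).

    Assumption: if a disaster strikes just before the next backup becomes
    usable, everything written since the previous backup started is lost.
    """
    return backup_interval + backup_duration


def meets_rpo(
    backup_interval: timedelta,
    rpo: timedelta,
    backup_duration: timedelta = timedelta(0),
) -> bool:
    """Return True if the schedule's worst-case recovery point is within the RPO."""
    return worst_case_recovery_point(backup_interval, backup_duration) <= rpo
```

Under this model, an hourly backup that takes 15 minutes to complete and copy gives a worst-case recovery point of 75 minutes, so it would satisfy a four-hour RPO but not a one-hour RPO.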
As an additional disaster recovery strategy for your Amazon S3 data, enable S3 object versioning. Object versioning protects your data in S3 from the consequences of deletion or modification actions by retaining the original version before the action. Object versioning can be a useful mitigation for human-error type disasters. If you are using S3 replication to back up data to your DR Region, then, by default, when an object is deleted in the source bucket, Amazon S3 adds a delete marker in the source bucket only. This approach protects data in the DR Region from malicious deletions in the source Region.
In addition to data, you must also back up the configuration and
infrastructure necessary to redeploy your workload and meet your
Recovery Time Objective (RTO).
AWS CloudFormation
Any data stored in the disaster recovery Region as backups must be restored at time of
failover. AWS Backup offers restore capability, but does not currently enable scheduled or
automatic restoration. You can implement automatic restore to the DR Region using the AWS
SDK to call APIs for AWS Backup. You can set this up as a regularly recurring job or trigger
restoration whenever a backup is completed. The following figure shows an example of
automatic restoration using Amazon Simple Notification Service (Amazon SNS)
![Diagram showing workflow of restoring and testing backups.](/images/whitepapers/latest/disaster-recovery-workloads-on-aws/images/restore-test-backups.png)
Figure 8 - Restoring and testing backups
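The restore step described above could be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function name and its parameters are hypothetical, and `backup_client` is assumed to behave like the AWS SDK for Python client created with `boto3.client("backup", region_name=...)` for the DR Region. It is passed in as an argument so the logic can be exercised without AWS credentials. The two calls used, `get_recovery_point_restore_metadata` and `start_restore_job`, are AWS Backup API operations. How the recovery point ARN is obtained (for example, from a notification when a backup completes) is left out.

```python
import uuid


def start_restore_in_dr_region(backup_client, vault_name, recovery_point_arn, iam_role_arn):
    """Start an AWS Backup restore job for one recovery point and return its job ID.

    Assumption: backup_client is an AWS Backup client for the DR Region.
    """
    # Fetch the metadata that was stored with the recovery point.
    details = backup_client.get_recovery_point_restore_metadata(
        BackupVaultName=vault_name,
        RecoveryPointArn=recovery_point_arn,
    )
    # Assumption: the stored metadata is sufficient as-is. Depending on the
    # resource type, it may need adjusting before a restore succeeds.
    metadata = details["RestoreMetadata"]

    response = backup_client.start_restore_job(
        RecoveryPointArn=recovery_point_arn,
        Metadata=metadata,
        IamRoleArn=iam_role_arn,
        # A unique token so that a retried call does not start a second restore.
        IdempotencyToken=str(uuid.uuid4()),
    )
    return response["RestoreJobId"]
```

A function like this could run on a schedule or be invoked whenever a backup completes. Either way, the restored resource should then be tested before it is relied on.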
Note
Your backup strategy must include testing your backups. See the Testing Disaster Recovery section for more
information. Refer to the AWS Well-Architected Lab: Testing Backup and Restore of Data
Pilot light
With the pilot light approach, you replicate your data from one Region to another and provision a copy of your core workload infrastructure. Resources required to support data replication and backup, such as databases and object storage, are always on. Other elements, such as application servers, are loaded with application code and configurations, but are “switched off” and are only used during testing or when disaster recovery failover is invoked. In the cloud, you have the flexibility to deprovision resources when you do not need them, and provision them when you do. A best practice for “switched off” is to not deploy the resource, and then create the configuration and capabilities to deploy it (“switch on”) when needed. Unlike the backup and restore approach, your core infrastructure is always available and you always have the option to quickly provision a full-scale production environment by switching on and scaling out your application servers.
![Reference architecture diagram for pilot light architecture](/images/whitepapers/latest/disaster-recovery-workloads-on-aws/images/pilot-light-architecture.png)
Figure 9 - Pilot light architecture
A pilot light approach minimizes the ongoing cost of disaster recovery by minimizing the active resources, and simplifies recovery at the time of a disaster because the core infrastructure requirements are all in place. This recovery option requires you to change your deployment approach. You need to make core infrastructure changes to each Region and deploy workload (configuration, code) changes simultaneously to each Region. This step can be simplified by automating your deployments and using infrastructure as code (IaC) to deploy infrastructure across multiple accounts and Regions (full infrastructure deployment to the primary Region and scaled-down or switched-off infrastructure deployment to DR Regions). It is recommended that you use a different account per Region to provide the highest level of resource and security isolation (in case compromised credentials are part of your disaster recovery plans as well).
With this approach, you must also mitigate a data disaster. Continuous data replication protects you against some types of disaster, but it may not protect you against data corruption or destruction unless your strategy also includes versioning of stored data or options for point-in-time recovery. You can back up the replicated data in the disaster recovery Region to create point-in-time backups in that same Region.
AWS services
In addition to using the AWS services covered in the Backup and Restore section to create point-in-time backups, also consider the following services for your pilot light strategy.
For pilot light, continuous data replication to live databases and data stores in the DR Region is the best approach for low RPO (when used in addition to the point-in-time backups discussed previously). AWS provides continuous, cross-Region, asynchronous replication for data using the following services and resources:
With continuous replication, versions of your data are available almost immediately in your DR Region. Actual replication times can be monitored using service features like S3 Replication Time Control (S3 RTC) for S3 objects and management features of Amazon Aurora global databases.
When failing over to run your read/write workload from the disaster recovery Region, you must promote an RDS read replica to become the primary instance. For DB instances other than Aurora, the process takes a few minutes to complete and rebooting is part of the process. For cross-Region replication and failover with RDS, using Amazon Aurora global database provides several advantages. Global database uses dedicated infrastructure that leaves your databases entirely available to serve your application, and can replicate to the secondary Region with typical latency of under a second (and, within an AWS Region, much less than 100 milliseconds). With Amazon Aurora global database, if your primary Region suffers a performance degradation or outage, you can promote one of the secondary Regions to take read/write responsibilities in less than one minute, even in the event of a complete regional outage. You can also configure Aurora to monitor the RPO lag time of all secondary clusters to make sure that at least one secondary cluster stays within your target RPO window.
A scaled down version of your core workload infrastructure with fewer or smaller
resources must be deployed in your DR Region. Using AWS CloudFormation, you can define your
infrastructure and deploy it consistently across AWS accounts and across AWS Regions.
AWS CloudFormation uses predefined pseudo
parameters to identify the AWS account and AWS Region in which it is deployed.
Therefore, you can implement condition logic
in your CloudFormation templates to deploy only the scaled-down version of your
infrastructure in the DR Region. For EC2 instance deployments, an Amazon Machine Image (AMI)
supplies information such as hardware configuration and installed software. You can
implement an Image Builder
pipeline that creates the AMIs you need and copies these to both your primary and backup
Regions. This helps to ensure that these golden AMIs have everything
you need to redeploy or scale out your workload in a new Region, in case of a disaster
event. Amazon EC2 instances are deployed in a scaled-down configuration (fewer instances than in
your primary Region).
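The condition logic described above can be sketched by building a CloudFormation template as a Python dictionary and serializing it to JSON. This is a minimal sketch under stated assumptions: the primary Region name, the resource names, the instance type, and the capacity figures are all hypothetical. `AWS::Region` is the CloudFormation pseudo parameter, and `Fn::Equals` and `Fn::If` are CloudFormation intrinsic functions. The AMI ID is a parameter because AMI IDs differ between Regions.

```python
import json

PRIMARY_REGION = "us-east-1"  # hypothetical primary Region


def build_template(primary_region: str = PRIMARY_REGION) -> dict:
    """Return a template that deploys full capacity only in the primary Region."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "AmiId": {"Type": "AWS::EC2::Image::Id"},
            "SubnetIds": {"Type": "List<AWS::EC2::Subnet::Id>"},
        },
        "Conditions": {
            # True only when the stack is created in the primary Region.
            "IsPrimaryRegion": {
                "Fn::Equals": [{"Ref": "AWS::Region"}, primary_region]
            },
        },
        "Resources": {
            "AppLaunchTemplate": {
                "Type": "AWS::EC2::LaunchTemplate",
                "Properties": {
                    "LaunchTemplateData": {
                        "ImageId": {"Ref": "AmiId"},
                        "InstanceType": "t3.micro",  # illustrative value
                    }
                },
            },
            "AppGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "MinSize": "1",
                    "MaxSize": "6",
                    # Full capacity in the primary Region, scaled down elsewhere.
                    "DesiredCapacity": {"Fn::If": ["IsPrimaryRegion", "6", "1"]},
                    "VPCZoneIdentifier": {"Ref": "SubnetIds"},
                    "LaunchTemplate": {
                        "LaunchTemplateId": {"Ref": "AppLaunchTemplate"},
                        "Version": {
                            "Fn::GetAtt": ["AppLaunchTemplate", "LatestVersionNumber"]
                        },
                    },
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_template(), indent=2))
```

The same template can then be deployed unchanged to every Region, which keeps the primary and DR definitions from drifting apart.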
To scale out the infrastructure to support production traffic, see Amazon EC2 Auto Scaling
For an active/passive configuration such as pilot light, all traffic initially goes to the primary Region and switches to the disaster recovery Region if the primary Region is no longer available. This failover operation can be initiated either automatically or manually. Automatically initiated failover based on health checks or alarms should be used with caution. Even using the best practices discussed here, recovery time and recovery point will be greater than zero, incurring some loss of availability and data. If you fail over when you don’t need to (false alarm), then you incur those losses. Manually initiated failover is therefore often used. In this case, you should still automate the steps for failover, so that the manual initiation is like the push of a button.
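One way to picture a manually initiated but fully automated failover is a runbook object whose steps are scripted in advance and only run after an explicit operator confirmation. The sketch below is illustrative only: the class, its method names, and the example step names are hypothetical, and real steps would call the relevant AWS data plane operations instead of the placeholder callables used here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class FailoverRunbook:
    """Ordered, pre-automated failover steps behind a single manual trigger."""

    steps: List[Tuple[str, Callable[[], None]]] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def add_step(self, name: str, action: Callable[[], None]) -> None:
        """Register a step; steps run in the order they were added."""
        self.steps.append((name, action))

    def initiate(self, operator: str, confirm: bool) -> bool:
        """Run every step in order, but only after explicit confirmation."""
        if not confirm:
            self.log.append(f"{operator}: failover not confirmed, nothing executed")
            return False
        for name, action in self.steps:
            action()  # an exception here stops the runbook at the failing step
            self.log.append(f"{operator}: completed {name}")
        return True
```

For example, a runbook might register "promote database", "scale out application tier", and "shift traffic" as steps. The operator's only action during an incident is then to call `initiate(...)` with `confirm=True`.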
There are several traffic management options to consider when using AWS services.
One option is to use Amazon Route 53
Another option is to use AWS Global Accelerator
Amazon CloudFront
AWS Elastic Disaster Recovery
AWS Elastic Disaster Recovery
![Architecture diagram showing AWS Elastic Disaster Recovery architecture.](/images/whitepapers/latest/disaster-recovery-workloads-on-aws/images/disaster-recovery-architecture.png)
Figure 10 - AWS Elastic Disaster Recovery architecture
Warm standby
The warm standby approach involves ensuring that there is a scaled-down, but fully functional, copy of your production environment in another Region. This approach extends the pilot light concept and decreases the time to recovery because your workload is always on in another Region. This approach also allows you to more easily perform testing or implement continuous testing to increase confidence in your ability to recover from a disaster.
![Architecture diagram showing warm standby architecture.](/images/whitepapers/latest/disaster-recovery-workloads-on-aws/images/warm-standby-architecture.png)
Figure 11 - Warm standby architecture
Note: The difference between pilot light and warm standby can sometimes be difficult to understand. Both include an environment in your DR Region with copies of your primary Region assets. The distinction is that pilot light cannot process requests without additional action taken first, whereas warm standby can handle traffic (at reduced capacity levels) immediately. The pilot light approach requires you to “turn on” servers, possibly deploy additional (non-core) infrastructure, and scale up, whereas warm standby only requires you to scale up (everything is already deployed and running). Use your RTO and RPO needs to help you choose between these approaches.
AWS services
All of the AWS services covered under backup and restore and pilot light are also used in warm standby for data backup, data replication, active/passive traffic routing, and deployment of infrastructure including EC2 instances.
Amazon EC2 Auto Scaling
Because Auto Scaling is a control plane activity, taking a dependency on it will lower the resiliency of your overall recovery strategy. It is a trade-off. You can choose to provision sufficient capacity such that the recovery Region can handle the full production load as deployed. This statically stable configuration is called hot standby (see the next section). Or you may choose to provision fewer resources which will cost less, but take a dependency on Auto Scaling. Some DR implementations will deploy enough resources to handle initial traffic, ensuring low RTO, and then rely on Auto Scaling to ramp up for subsequent traffic.
Multi-site active/active
You can run your workload simultaneously in multiple Regions as part of a multi-site active/active or hot standby active/passive strategy. Multi-site active/active serves traffic from all Regions to which it is deployed, whereas hot standby serves traffic only from a single Region, and the other Region(s) are only used for disaster recovery. With a multi-site active/active approach, users are able to access your workload in any of the Regions in which it is deployed. This approach is the most complex and costly approach to disaster recovery, but it can reduce your recovery time to near zero for most disasters with the correct technology choices and implementation (however, recovery from data corruption may need to rely on backups, which usually results in a non-zero recovery point). Hot standby uses an active/passive configuration where users are only directed to a single Region and DR Regions do not take traffic. Most customers find that if they are going to stand up a full environment in the second Region, it makes sense to use it active/active. Alternatively, if you do not want to use both Regions to handle user traffic, then warm standby offers a more economical and operationally less complex approach.
![Architecture diagram showing multi-site active/active architecture (change one Active path to Inactive for hot standby)](/images/whitepapers/latest/disaster-recovery-workloads-on-aws/images/multi-site-active-active-architecture.png)
Figure 12 - Multi-site active/active architecture (change one Active path to Inactive for hot standby)
With multi-site active/active, because the workload is running in more than one Region, there is no such thing as failover in this scenario. Disaster recovery testing in this case would focus on how the workload reacts to loss of a Region: Is traffic routed away from the failed Region? Can the other Region(s) handle all the traffic? Testing for a data disaster is also required. Backup and recovery are still required and should be tested regularly. It should also be noted that recovery times for a data disaster involving data corruption, deletion, or obfuscation will always be greater than zero and the recovery point will always be at some point before the disaster was discovered. If the additional complexity and cost of a multi-site active/active (or hot standby) approach is required to maintain near zero recovery times, then additional efforts should be made to maintain security and to prevent human error to mitigate against human disasters.
AWS services
All of the AWS services covered under backup and restore, pilot light, and warm standby are also used here for point-in-time data backup, data replication, active/active traffic routing, and deployment and scaling of infrastructure including EC2 instances.
For the active/passive scenarios discussed earlier (pilot light and warm standby), both Amazon Route 53 and AWS Global Accelerator can be used to route network traffic to the active Region. For the active/active strategy here, both of these services also enable the definition of policies that determine which users go to which active Regional endpoint. With AWS Global Accelerator you set a traffic dial to control the percentage of traffic that is directed to each application endpoint. Amazon Route 53 supports this percentage approach, and also multiple other available policies including geoproximity and latency-based ones. Global Accelerator automatically leverages the extensive network of AWS edge servers to onboard traffic to the AWS network backbone as soon as possible, resulting in lower request latencies.
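The percentage-based approach can be pictured as a weighted choice among Regional endpoints. The sketch below is a generic simulation in Python, assuming each request is sent to a Region with probability proportional to its weight. The function name and Region names are hypothetical, and the code does not model how Amazon Route 53 or AWS Global Accelerator work internally.

```python
import random
from typing import Dict


def pick_region(weights: Dict[str, int], rng: random.Random) -> str:
    """Choose a Region with probability proportional to its weight.

    A weight of 0 removes a Region from rotation, for example to drain
    traffic away from it during a failover test.
    """
    candidates = [region for region, weight in weights.items() if weight > 0]
    if not candidates:
        raise ValueError("at least one Region must have a positive weight")
    return rng.choices(
        candidates, weights=[weights[region] for region in candidates], k=1
    )[0]
```

Setting one Region's weight to 0 sends every request to the remaining Regions, which is one way to rehearse the loss of a Region.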
Asynchronous data replication with this strategy enables near-zero RPO. AWS services like Amazon Aurora global database use dedicated infrastructure that leaves your databases entirely available to serve your application, and can replicate to up to five secondary Regions with typical latency of under a second. With active/passive strategies, writes occur only to the primary Region. The difference with active/active is designing how data consistency with writes to each active Region is handled. It is common to design user reads to be served from the Region closest to them, known as read local. With writes, you have several options:
- A write global strategy routes all writes to a single Region. In case of failure of that Region, another Region would be promoted to accept writes. Aurora global database is a good fit for write global, as it supports synchronization with read replicas across Regions, and you can promote one of the secondary Regions to take read/write responsibilities in less than one minute. Aurora also supports write forwarding, which lets secondary clusters in an Aurora global database forward SQL statements that perform write operations to the primary cluster.
- A write local strategy routes writes to the closest Region (just like reads). Amazon DynamoDB global tables enable such a strategy, allowing reads and writes from every Region your global table is deployed to. Amazon DynamoDB global tables use a last writer wins reconciliation between concurrent updates.
- A write partitioned strategy assigns writes to a specific Region based on a partition key (like user ID) to avoid write conflicts. Amazon S3 replication configured bi-directionally can be used for this case, and currently supports replication between two Regions. When implementing this approach, make sure to enable replica modification sync on both buckets A and B to replicate replica metadata changes like object access control lists (ACLs), object tags, or object locks on the replicated objects. You can also configure whether or not to replicate delete markers between buckets in your active Regions. In addition to replication, your strategy must also include point-in-time backups to protect against data corruption or destruction events.
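The three write strategies above, and the last writer wins reconciliation mentioned for write local, can be sketched in Python as follows. Everything here is an illustrative assumption: the Region list, function names, and hashing scheme are hypothetical, and the tie-break by Region name is an arbitrary choice for this sketch, not a description of how Amazon DynamoDB resolves ties.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence

REGIONS = ("us-east-1", "eu-west-1")  # hypothetical active Regions


def write_region(
    strategy: str,
    *,
    primary: str = REGIONS[0],
    closest: Optional[str] = None,
    partition_key: Optional[str] = None,
    regions: Sequence[str] = REGIONS,
) -> str:
    """Pick the Region that should receive a write under each strategy."""
    if strategy == "global":
        return primary  # all writes go to a single Region
    if strategy == "local":
        if closest is None:
            raise ValueError("write local needs the caller's closest Region")
        return closest  # write where the user is, like reads
    if strategy == "partitioned":
        if partition_key is None:
            raise ValueError("write partitioned needs a partition key")
        # A stable hash keeps each key pinned to one Region, avoiding conflicts.
        digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
        return regions[int.from_bytes(digest[:8], "big") % len(regions)]
    raise ValueError(f"unknown strategy: {strategy}")


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: float  # time of the write
    region: str       # Region that accepted the write


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the later update; ties are broken by Region name (sketch-only rule)."""
    return max(a, b, key=lambda v: (v.timestamp, v.region))
```

With write partitioned, the same key always maps to the same Region, so concurrent writes to one item never originate in two Regions. With write local, two Regions may update the same item, and a rule such as `last_writer_wins` decides which update survives.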
AWS CloudFormation is a powerful tool to enforce consistently
deployed infrastructure among AWS accounts in multiple AWS
Regions.
AWS CloudFormation StackSets extends this functionality by
enabling you to create, update, or delete CloudFormation stacks
across multiple accounts and Regions with a single operation.
Although AWS CloudFormation uses YAML or JSON to define
Infrastructure as Code,
AWS Cloud Development Kit (AWS CDK)