Introduction - Disaster Recovery of Workloads on AWS: Recovery in the Cloud

Introduction

Your workload must perform its intended function correctly and consistently. To achieve this, you must architect for resiliency. Resiliency is the ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues.

Disaster recovery (DR) is an important part of your resiliency strategy and concerns how your workload responds when a disaster strikes (a disaster is an event that causes a serious negative impact on your business). This response must be based on your organization's business objectives which specify your workload's strategy for avoiding loss of data, known as the Recovery Point Objective (RPO), and reducing downtime where your workload is not available for use, known as the Recovery Time Objective (RTO). You must therefore implement resilience in the design of your workloads in the cloud to meet your recovery objectives (RPO and RTO) for a given one-time disaster event. This approach helps your organization to maintain business continuity as part of Business Continuity Planning (BCP).

This paper focuses on how to plan for, design, and implement architectures on AWS that meet the disaster recovery objectives for your business. The information shared here is intended for those in technology roles, such as chief technology officers (CTOs), architects, developers, and operations team members.

Disaster recovery and availability

Disaster recovery can be compared to availability, which is another important component of your resiliency strategy. Whereas disaster recovery measures objectives for one-time events, availability objectives measure mean values over a period of time.


        Image showing resiliency objectives for disaster recovery (RTO, RPO) and
          Availability (MTBF, MTTR).

Figure 1 - Resiliency Objectives

Availability is calculated using Mean Time Between Failures (MTBF) and Mean Time to Recover (MTTR):


        Availability equals Available for Use Time divided by Total Time equals MTBF divided
          by MTBF plus MTTR.

This approach is often referred to as “nines”, where a 99.9% availability target is referred to as “three nines”.

For your workload, it may be easier to count successful and failed requests instead of using a time-based approach. In this case, the following calculation can be used:


        Availability equals successful responses divided by valid requests.

Disaster recovery focuses on disaster events, whereas availability focuses on more common disruptions of smaller scale such as component failures, network issues, and load spikes. The objective of disaster recovery is business continuity, whereas availability concerns maximizing the time that a workload is available to perform its intended business functionality. Both should be part of your resiliency strategy.