Solutions and methodologies - Disaster Recovery of On-Premises Applications to AWS

Solutions and methodologies

The following section provides an overview of common solutions and methodologies that are advertised as disaster recovery, and explains the difference between these solutions and actual disaster recovery.

Various resilience solutions compared

  • Backup – Backup protects against data loss by storing historical data so that lost data can be recovered from the backup. Backup solutions can store historical data locally, in a remote location, or in both. Local backups offer faster recovery, whereas remote backups offer greater resiliency. Backup solutions often have a relatively low total cost of ownership (TCO), because the only infrastructure needed is storage, and the performance requirements for that storage are low (for example, some companies still use tape-based backup because of the low cost of tapes).

  • Archiving – A subcategory of backup solutions is archiving. Archives provide unchanged historical copies of data to meet legal and compliance requirements. Archives are normally kept for a longer term than standard backups. Unlike backup, which may provide quicker file restoration (normally measured in hours or days), archives are not utilized by routine business operations and can be stored in low-cost, off-site locations.

  • High availability (HA) – High availability enables an application to continue operating uninterrupted if a component of that application malfunctions. Detecting the malfunction and ensuring that the application continues to work as normal is almost always an automated process. Ideally, a user of the application would not experience anything unusual in case of such a failure. A typical example is a multi-node database.

    Most modern multi-node databases continue operation uninterrupted if a single component fails. High availability is normally introduced as part of the design and implementation of the system, as it is much harder to add to an existing application that was not designed with high availability in mind. High availability solutions ensure minimal impact on users (ideally, no impact) in case of issues. However, they are only meant to deal with a small localized event (for example, failure of a single server or subnet). A high availability solution will not be able to handle a wider disaster, such as the failure of an entire data center or a corrupted software update.

  • Disaster recovery – Disaster recovery helps ensure business continuity for applications in case of an issue that prevents the application from recovering automatically, or that requires a significant amount of time until recovery is achieved. Disaster recovery includes the ability to use a secondary application in a secondary location that will serve the application’s users until the original instance is fixed or recovered.

    Switching users to the secondary location is not an automatic process. Instead, it is performed on the basis of an explicit decision by an authorized person or group of people in the organization, because there are costs associated with it. For example, there is some downtime while the failover is in progress, and there is the cost of the labor for people participating in the switch. These implications need to be weighed against the chances of the source site returning to normal operation in a timely manner.

    Secondary location solutions usually have a higher TCO than backup, because the secondary site needs to be maintained at all times (during normal operation as well) and needs to be advanced enough to support the functionality of the application in case of a disaster.

These resilience solutions complement one another. Business requirements may dictate that a workload use a combination of them, depending on the resilience requirements of each application.

Disaster recovery compared to backup

There’s an important distinction between backup and disaster recovery. Backup is the process of making an extra copy (or multiple copies) of data. You back up data to be able to restore it in case it is lost or corrupted. You might need to restore backup data if you encounter an accidental deletion, database corruption, or problem with a software upgrade. It is important to have a backup solution in place. Backup protects your data in case of theft of equipment storing data, employee accidents (deletion of an important file), a technical issue (crashed hard drive), or malicious tampering (ransomware). With this protection, you can access a copy of your data and restore it easily.

Disaster recovery, on the other hand, refers to the plan and processes for quickly reestablishing access to applications, data, and IT resources after an outage. This plan might involve switching over to a redundant set of servers and storage systems until your source data center is functional again. For example, a disaster can lead to a disruption of your entire network, resulting in your employees being unable to work for the entire day (or even longer). However, a proper disaster recovery solution would allow your employees to continue to work using the mirrored system, while your IT team fixes the problem in the original network.

Some organizations mistake backup for disaster recovery. But as they may discover after a serious outage, simply having copies of data doesn’t mean you can keep your business running. To ensure business continuity, you need a robust, tested disaster recovery solution that enables maintaining normal operation until the disaster is resolved.

In terms of similarities, both backup and disaster recovery solutions maintain copies of historical data that may have changed in the source storage (often referred to as snapshots or point-in-time copies). In the case of backup solutions, this is a core part of the solution’s value: to be able to restore a previous version of data in case it was incorrectly modified or corrupted. In the case of disaster recovery solutions, this is done to enable successful recovery if the latest state of the data prevents normal operation. Database corruptions, ransomware data encryption, and incorrect software configuration all fall under this category and would require the disaster recovery site to be based on a previous version of the data.

However, when backup and disaster recovery are compared, there are multiple distinct differences that exist between the two:

  • Purpose – Backups work best when you need to gain access to a lost or damaged file or object, such as an email, PowerPoint presentation, or database. Backups are also used for long-term data archival, or for purposes such as data retention. However, if you want your business to quickly restore its functions after an unforeseen event, you should opt for disaster recovery. With both the disaster recovery site and solution in place, you can perform a failover to transfer applications to the disaster recovery site, and your business can continue to function as normal even if the production site is unavailable. On the other hand, restoring a single piece of data (such as a file) is much easier to do using a backup of that data, rather than recovering an entire server where that data was stored.

  • RTO and RPO – Setting a recovery time objective (RTO) and a recovery point objective (RPO) is crucial for any business. Because restoring data from backups often does not help with business continuity, the concepts of RTO and RPO generally do not apply to backup solutions. Disaster recovery, on the other hand, implies replicating your critical applications so that you can quickly fail over if necessary and ensure the business continuity of the affected applications.

  • Resource allocation – Backups are usually stored in a compressed state and do not need to be restored quickly. Therefore, backups normally use low-cost, low-performance storage (frequently off site). Disaster recovery, on the other hand, requires a separate site with operational IT infrastructure that is ready for failover at any time.

In recent years, the term disaster recovery solution has become very popular, with different meanings in different cases. Therefore, it’s important to analyze each product to make sure it fulfills the business continuity needs of the organization, including RPO, RTO, and the ability to quickly continue running the application from the disaster recovery site in case the source site loses functionality.
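One way to make such an analysis concrete is to compare the recovery point and recovery time that a product would actually achieve against the targets set for the application. The following Python sketch is only an illustration: the function names, the timestamps, and the 5-minute and 1-hour targets are hypothetical values chosen for the example, not figures from this paper.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recovery_point: datetime, disaster_time: datetime) -> timedelta:
    """Data-loss window: time between the last usable copy and the disaster."""
    return disaster_time - last_recovery_point


def achieved_rto(disaster_time: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the disaster and the application serving users again."""
    return service_restored - disaster_time


def meets_objectives(rpo: timedelta, rto: timedelta,
                     target_rpo: timedelta, target_rto: timedelta) -> bool:
    """True only if both the data-loss window and the downtime are within target."""
    return rpo <= target_rpo and rto <= target_rto


# Hypothetical targets and event times, chosen only for illustration.
TARGET_RPO = timedelta(minutes=5)
TARGET_RTO = timedelta(hours=1)
disaster = datetime(2024, 1, 10, 12, 0, 0)

# Assumed nightly backup taken at 02:00, with the restore finishing at 20:00.
backup_rpo = achieved_rpo(datetime(2024, 1, 10, 2, 0, 0), disaster)
backup_rto = achieved_rto(disaster, datetime(2024, 1, 10, 20, 0, 0))

# Assumed continuous replication 30 seconds behind, with failover done at 12:20.
dr_rpo = achieved_rpo(datetime(2024, 1, 10, 11, 59, 30), disaster)
dr_rto = achieved_rto(disaster, datetime(2024, 1, 10, 12, 20, 0))

print(meets_objectives(backup_rpo, backup_rto, TARGET_RPO, TARGET_RTO))  # False
print(meets_objectives(dr_rpo, dr_rto, TARGET_RPO, TARGET_RTO))          # True
```

With these assumed numbers, the nightly backup loses 10 hours of data and takes 8 hours to restore, so it misses both targets, while the replicated site stays within them.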

Disaster recovery compared to high availability

High availability (HA) and disaster recovery rely on some of the same best practices, such as monitoring for failures, deploying to multiple locations, and failing over. However, high availability focuses on a single component failure, whereas disaster recovery focuses on continuity in case of a wider failure of the entire application or significant parts of the application.

Disaster recovery has different objectives from high availability. Your disaster recovery strategy requires different approaches than those for high availability, focusing on deploying discrete systems (usually to multiple locations to minimize the impact of a local issue), so that you can fail over the entire application if necessary.

For example, an application that runs on a single virtual machine (VM) in a data center is not highly available. If a local flooding issue affects that data center, the application must fail over to another location to meet its recovery objectives. Compare this scenario to a highly available application that is deployed across multiple active Availability Zones in the same AWS Region, with all Availability Zones serving production traffic. In this case, even if one Availability Zone fails in a localized event, the high availability strategy is accomplished by automatically routing all traffic to the remaining functional Availability Zones.

How you approach data resilience is also different between high availability and disaster recovery. Consider a storage solution that synchronously replicates to a nearby storage appliance to achieve the high availability of persistent data. If a file or files are mistakenly deleted or corrupted, those destructive changes will be replicated to the secondary storage device. In this scenario, despite the high availability of the storage itself, there is no way to recover the data after deletion or corruption. A disaster recovery solution normally includes a point-in-time recovery capability, which in the same scenario can be used to return the data to its state before the deletion or corruption.
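The storage scenario above can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not a description of any real storage product: the `ReplicatedStore` class, its methods, and the file name are all hypothetical.

```python
import copy


class ReplicatedStore:
    """Toy model: every change to the primary is mirrored synchronously."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.snapshots = []  # (label, frozen copy of the primary) pairs

    def write(self, key, value):
        self.primary[key] = value
        self.replica[key] = value  # synchronous replication of the write

    def delete(self, key):
        self.primary.pop(key, None)
        self.replica.pop(key, None)  # the deletion is replicated as well

    def take_snapshot(self, label):
        """Record a point-in-time copy that later changes cannot alter."""
        self.snapshots.append((label, copy.deepcopy(self.primary)))

    def restore_snapshot(self, label):
        """Roll both copies back to the most recent snapshot with this label."""
        for name, data in reversed(self.snapshots):
            if name == label:
                self.primary = copy.deepcopy(data)
                self.replica = copy.deepcopy(data)
                return
        raise KeyError(label)


store = ReplicatedStore()
store.write("report.txt", "quarterly figures")
store.take_snapshot("before-mistake")

store.delete("report.txt")               # accidental deletion
print("report.txt" in store.replica)     # False: the replica lost it too

store.restore_snapshot("before-mistake")
print(store.primary["report.txt"])       # quarterly figures
```

The replica protects against the failure of the primary device, but only the point-in-time copy protects against the mistaken deletion.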

Another difference between high availability and disaster recovery is how a failover is initiated. In high availability solutions, a failover event is initiated automatically when needed (normally within seconds), which results in little to no impact on the end user. In disaster recovery solutions, failing over often incurs additional financial or non-financial impact (for example, the need to fail back all the new data after the disaster is over or the need to provision more resources in the disaster recovery site). Therefore, human intervention is required to initiate a failover event. Also, failing over is normally not instantaneous, and the application remains down until the failover is complete. A well-designed disaster recovery plan should define who is authorized to initiate a failover, how to reach these people, and what they need to consider when making the decision to fail over applications.
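The requirement that only authorized people can initiate a disaster recovery failover can be expressed as a simple approval gate. The sketch below is a minimal illustration, assuming a hypothetical `FailoverDecision` class and made-up role names; a real plan would also cover how to reach the approvers and what they must consider.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FailoverDecision:
    """Tracks explicit approvals before a disaster recovery failover may start."""

    authorized: Set[str]                 # people allowed to approve (hypothetical)
    required_approvals: int = 1
    approvals: Set[str] = field(default_factory=set)

    def approve(self, person: str) -> None:
        if person not in self.authorized:
            raise PermissionError(f"{person} is not authorized to approve a failover")
        self.approvals.add(person)

    def may_fail_over(self) -> bool:
        """Failover is allowed only once enough authorized people have approved."""
        return len(self.approvals) >= self.required_approvals


decision = FailoverDecision(authorized={"it-director", "cto"}, required_approvals=2)
decision.approve("it-director")
print(decision.may_fail_over())  # False: one more approval is needed
decision.approve("cto")
print(decision.may_fail_over())  # True
```

A high availability mechanism, by contrast, would have no such gate: the failover would be triggered directly by failure detection.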

Lastly, in most cases, high availability solutions need to be selected at the time an application is designed (or refactored), as they are an integral part of the application. Disaster recovery solutions may be added to an existing application without significant re-architecture or modification work in the application itself.

Which applications require disaster recovery?

A malfunction of almost any application has a negative impact on the organization. This holds regardless of the application’s size or role, and even for non-production applications; the more critical the application, the greater the impact. Therefore, all applications can benefit from a disaster recovery solution that can help quickly and easily mitigate any malfunction.

To determine whether to implement a disaster recovery solution, you need to consider the return on investment (ROI). On the one hand, each disaster recovery solution has direct and indirect costs such as software licenses and hardware, infrastructure, maintenance, and drills. On the other hand, every time a disaster strikes, it incurs costs of its own. To determine your maximum TCO for a disaster recovery solution for each of your applications, you’ll need to perform a disaster risk analysis: what is the probability of a disaster happening and what are the direct and indirect financial consequences of the disaster?
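The risk analysis described above can be approximated with simple expected-value arithmetic. The following Python sketch uses one common simplification (annual probability multiplied by the cost of the event); the function names and every number in it are hypothetical and are not taken from this paper.

```python
def expected_annual_loss(annual_probability: float,
                         downtime_hours: float,
                         cost_per_hour: float,
                         fixed_cost: float = 0.0) -> float:
    """Expected yearly cost of a disaster: probability times total cost per event."""
    return annual_probability * (downtime_hours * cost_per_hour + fixed_cost)


def max_justifiable_tco(loss_without_dr: float, loss_with_dr: float) -> float:
    """Upper bound on yearly DR spend: the expected loss that DR would avoid."""
    return loss_without_dr - loss_with_dr


# Hypothetical inputs for one application.
probability = 0.05       # assumed 5% chance of a disaster in a given year
cost_per_hour = 10_000   # assumed direct and indirect cost of one hour of downtime
fixed_cost = 20_000      # assumed per-event cost that does not depend on duration

without_dr = expected_annual_loss(probability, 48, cost_per_hour, fixed_cost)
with_dr = expected_annual_loss(probability, 1, cost_per_hour, fixed_cost)

print(f"Expected annual loss without DR: {without_dr:,.0f}")
print(f"Expected annual loss with DR:    {with_dr:,.0f}")
print(f"Maximum justifiable annual TCO:  {max_justifiable_tco(without_dr, with_dr):,.0f}")
```

Under these assumptions, a disaster recovery solution pays for itself only if its yearly TCO stays below the difference between the two expected losses (about 23,500 here).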