Appendix A: Glossary - Disaster Recovery of On-Premises Applications to AWS

ACID – Atomicity, consistency, isolation, durability (ACID) is a set of software properties (often applied to databases) needed to guarantee data validity despite errors, power failures, and other mishaps. For ACID-compliant applications, application-consistent recovery has no benefits over crash-consistent recovery, and crash-consistent recovery is possible without any performance impact on the application. This makes it easier to perform recovery of ACID-compliant applications. Almost all modern software, including databases (such as Microsoft SQL Server, Oracle, and SAP HANA) and file systems, is ACID-compliant.

Application consistency – Application consistency is a method used to help ensure that a recovered application functions correctly. This method requires the application to be aware of the replication process and participate in it. For example, an attempt to take an application-consistent snapshot of a database will cause the database to pause all pending transactions, complete all the transactions in flight, commit all the changes from them to the disk, take the snapshot, and then resume normal operation. Application consistency is the oldest consistency type; however, application-consistent snapshots are rarely needed today because the vast majority of modern applications are ACID-compliant (refer to the ACID entry in this glossary) and recover properly from a crash-consistent snapshot, which has significantly less performance impact when taken.
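The snapshot sequence described in the application consistency entry can be sketched as follows. This is only an illustrative model: ToyDatabase and application_consistent_snapshot are hypothetical names, the database is an in-memory stand-in, and real databases expose their own quiesce and flush mechanisms.

```python
class ToyDatabase:
    """Hypothetical in-memory database used only to illustrate the sequence."""

    def __init__(self):
        self.disk = {}        # durable, on-disk state
        self.in_flight = []   # transactions already executing
        self.pending = []     # transactions held back while paused
        self.paused = False

    def submit(self, key, value):
        # While paused, new work waits in the pending queue.
        target = self.pending if self.paused else self.in_flight
        target.append((key, value))

    def flush(self):
        # Complete in-flight transactions and commit their changes to disk.
        for key, value in self.in_flight:
            self.disk[key] = value
        self.in_flight.clear()

    def resume(self):
        # Release held transactions and return to normal operation.
        self.paused = False
        self.in_flight.extend(self.pending)
        self.pending.clear()


def application_consistent_snapshot(db):
    db.paused = True           # 1. pause pending transactions
    db.flush()                 # 2. finish in-flight work and commit it to disk
    snapshot = dict(db.disk)   # 3. copy the quiesced on-disk state
    db.resume()                # 4. resume normal operation
    return snapshot
```

The pause in step 1 is what costs performance: the application cannot start new work until the snapshot has been taken, which is why a crash-consistent snapshot is cheaper when the application is ACID-compliant.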

Application-level disaster recovery solution – A disaster recovery solution that operates on the application level. Both the data replication and recovery processes are application aware. The advantage of such solutions is that application awareness may offer extra functionality, such as application-consistent snapshots or replication from a larger server to a smaller server. The disadvantages are that each solution only functions for the application it is designed for, so there are not many options available even for popular applications, and that such solutions are usually expensive.

Business continuity – The ability of an organization to continue operating correctly following a disruptive incident.

Cold site – A cold site is a disaster recovery site maintained for disaster preparation that provides power, network connectivity, air conditioning, and other elements necessary to maintain data. In the case of a disaster, you need to install the hardware, software, and data before recovery can take place. Cold sites are among the most cost-effective disaster recovery solutions, but they are also the slowest and least reliable.

Continuous data protection – Real-time (or near real-time) protection of data as it changes. Each change made in the source application is copied to the disaster recovery site in real time. With continuous data protection, you can achieve minimal RPO and ensure the data on your disaster recovery site is up to date with your production site. There are two main categories of continuous data protection: synchronous and asynchronous.

Synchronous data replication first commits changes to the disaster recovery site and only then commits the data to persistent storage. This method achieves true zero RPO. However, because the network latency between the two sites is added to each disk write operation, it is only applicable for situations where that latency is very low. Therefore, synchronous data replication is mostly used for high availability within the same network, rather than for long-distance disaster recovery. Because synchronous replication has a significant impact on the I/O performance of the application, this needs to be considered when designing the application and defining its performance goals.

Asynchronous data replication does not affect the performance of the source application because it commits the changes to the disaster recovery site and persistent storage in parallel. This makes it the most common method of continuous data protection for disaster recovery. However, the RPO that normally can be achieved with this method is near zero (less than a second) rather than absolute zero.
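The difference between the two categories can be illustrated with a minimal sketch. All class and function names here are hypothetical, and the asynchronous variant is simplified to a queue that is drained later rather than a true parallel commit. It only shows why synchronous replication leaves nothing unreplicated, while asynchronous replication can leave a small backlog at the moment of a failure.

```python
from collections import deque


class Replica:
    """Stand-in for storage at the disaster recovery site (hypothetical)."""

    def __init__(self):
        self.blocks = {}

    def commit(self, key, value):
        self.blocks[key] = value


class SyncPrimary:
    """Completes a write only after the replica holds it."""

    def __init__(self, replica):
        self.replica = replica
        self.blocks = {}

    def write(self, key, value):
        self.replica.commit(key, value)  # inter-site round trip comes first
        self.blocks[key] = value         # then commit to persistent storage


class AsyncPrimary:
    """Commits to persistent storage at once and ships the change separately."""

    def __init__(self, replica):
        self.replica = replica
        self.blocks = {}
        self.pending = deque()

    def write(self, key, value):
        self.blocks[key] = value
        self.pending.append((key, value))  # not yet on the replica

    def drain(self):
        # Simplification: ship everything queued so far to the replica.
        while self.pending:
            self.replica.commit(*self.pending.popleft())


def unreplicated_writes(primary):
    """Writes that would be lost if the primary failed right now."""
    return len(getattr(primary, "pending", ()))
```

In this model, SyncPrimary.write pays the replica round trip on every call, which corresponds to the added latency on each disk write, whereas AsyncPrimary.write returns immediately and any entries still in pending correspond to the near-zero (but not zero) RPO.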

Disaster – An event that is partially or completely disruptive to the functioning of one or more applications and cannot be resolved automatically.

Disaster recovery – The process of preparing for and recovering from a disruptive event.

Disaster Recovery as a Service (DRaaS) – A method of disaster recovery whereby infrastructure, personnel, equipment, drills, and other associated components are outsourced to a third-party provider.

Disaster recovery drill – An implementation of the section of the disaster recovery plan dealing with response to a disaster for testing purposes rather than in case of a real disaster event. By following the exact steps in the plan and verifying that the disaster recovery site is functioning and is able to provide the required business continuity within the required RTO and RPO, you can confirm that this would also be the case if a real disaster strikes. Disaster recovery drills are performed periodically. The frequency of such a drill is based on multiple factors, such as requirements by compliance certifications and the cost of each drill for the organization.

Disaster recovery plan – A plan of action (that should be as detailed as possible) to get your IT systems back online in the case of a disaster. The disaster recovery plan (DRP) should provide clear instructions and documentation for how to decide on and implement a recovery in the event of a disaster as well as how to return to normal operation after the disaster is over.

Disaster recovery site – A site, data center, or cloud environment that is used to run recovered applications instead of the production environment in order to provide business continuity in case of a disaster.

Failback – The process of returning to your source site from your disaster recovery site following a failover after your original applications have been restored.

Failover – The process of switching from your production site to your disaster recovery site in the event of a disaster. After your production site is operational again, you implement a failback to return to normal operations.

Fault tolerance – The ability of a system to remain in operation even if some of the components used to build the system fail.

High availability – A system’s ability to continue functioning even if a component in the system experiences a malfunction. High availability is a part of the overall resilience strategy that is complementary to disaster recovery.

Hot site – A remote, fully operational copy of your source site that is ready to become active at any moment. In the event of a disaster, you can switch your operations to your hot site without additional steps. A hot site would typically provide the shortest RTO; however, its TCO is normally significantly higher than the alternatives.

Hypervisor-level disaster recovery solution – A disaster recovery solution that runs on a hypervisor level and allows disaster recovery of VMs without installing agents on the VMs themselves. The disadvantage of such solutions is that they are hypervisor specific for both source and recovery site.

Point-in-time recovery – A function of a disaster recovery solution that enables an organization to restore or recover data as it was saved at a specific time by using a point-in-time snapshot, which is a copy of all data saved to the disaster recovery site.
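As a rough illustration of point-in-time recovery, the sketch below selects which snapshot to restore from for a requested recovery time. The function name pick_snapshot and the representation of snapshots as a sorted list of timestamps are assumptions made for this example, not part of any particular product.

```python
from bisect import bisect_right
from datetime import datetime


def pick_snapshot(snapshot_times, target):
    """Return the latest snapshot time at or before target, or None.

    snapshot_times must be sorted in ascending order.
    """
    i = bisect_right(snapshot_times, target)
    return snapshot_times[i - 1] if i else None
```

For example, with snapshots taken at 10:00, 11:00, and 12:00, a request to recover to 11:30 selects the 11:00 snapshot, and a request for a time before 10:00 returns None because no snapshot that old exists.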

Server-level disaster recovery solution – A disaster recovery solution that operates from within every server (usually by an agent running on the server) and can be used to replicate various disks and partitions on the server regardless of the applications using them. The advantage of such a solution is that it is both hardware/hypervisor and application agnostic. The disadvantage is that you need to install an agent on each server, which in turn can introduce operational overhead because all the agents must be kept operational.

Source site – The original production site where your applications are running and their data is stored. A source site can be on premises or cloud based. You can have more than one source site.

Staging area subnet – A part of the disaster recovery site dedicated to storing the resources used during normal operation for replication purposes. This way such resources can be segregated from the section of the disaster recovery site where applications are launched during a failover or drill.

Recovery point objective (RPO) – RPO is the maximum acceptable gap between the data in the disaster recovery site and the latest data stored in the application when the disaster strikes. This objective determines what is considered an acceptable loss of data (measured in time units) that can be caused by a disaster.

Recovery time objective (RTO) – RTO is the maximum acceptable delay between the interruption of an application and the restoration of its service. This objective determines what is considered an acceptable time window for an application to be unavailable.
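To make the two objectives concrete, the sketch below checks the timestamps recorded during a drill or incident against an RPO and an RTO. The function names and the choice of timestamps are assumptions for illustration; how these moments are actually measured depends on the disaster recovery solution in use.

```python
from datetime import datetime, timedelta


def data_loss_window(last_replicated_at: datetime, disaster_at: datetime) -> timedelta:
    """Gap between the newest data at the recovery site and the disaster."""
    return disaster_at - last_replicated_at


def downtime(disaster_at: datetime, service_restored_at: datetime) -> timedelta:
    """Delay between the interruption and the restoration of service."""
    return service_restored_at - disaster_at


def meets_objectives(last_replicated_at, disaster_at, service_restored_at, rpo, rto):
    """Return (rpo_met, rto_met) for one drill or incident."""
    return (
        data_loss_window(last_replicated_at, disaster_at) <= rpo,
        downtime(disaster_at, service_restored_at) <= rto,
    )
```

With an RPO of one minute and an RTO of 30 minutes, an incident where the last replicated change was 30 seconds old but service took 45 minutes to restore would meet the RPO and miss the RTO.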

Snapshot shipping – A common method for data replication. It consists of taking a snapshot or a group of snapshots of the servers or disks on the source site and sending them over the network to the disaster recovery site, where they are stored, ready to be utilized to create new volumes or servers.

Storage-level disaster recovery solution – One of the most expensive and oldest types of disaster recovery solutions. In these solutions, replication and recovery are performed on the storage appliance level and the data is replicated into a secondary storage appliance in the disaster recovery site. Such solutions tend to be very well integrated into the appliances and provide very good RTOs and RPOs. One disadvantage is that every storage vendor usually has only one relevant disaster recovery solution, with no third-party alternatives. Another disadvantage is that they require the disaster recovery site to include a storage appliance as well, usually from the same manufacturer as the primary one, which makes this solution unsuitable for public clouds where customers don’t have access to the configuration of the low-level block storage.

Total cost of ownership (TCO) – The sum of all direct and indirect costs incurred as part of implementation and maintenance of a disaster recovery solution and plan. The TCO may have a fixed component (capital expenditures (CapEx)), a variable per-use component (operating expenses (OpEx)), or a combination of both.

Warm site/pilot light – A warm site is a disaster recovery site where you have a small but functional copy of the source site. In the event of a disaster, the site is scaled up so that it can withstand the entire load of the source site. The term pilot light is an analogy to a gas heater, in which a small idle flame that’s always on can quickly ignite the entire furnace to heat up a house as needed.