Multi-Region fundamental 2: Understanding the data
Managing data is a non-trivial problem in multi-Region architectures. The geographical distance between Regions imposes an unavoidable latency, which manifests as the time it takes to replicate data across Regions. A multi-Region architecture therefore forces trade-offs among availability, data consistency, and the additional latency introduced into the workload. Whether you use asynchronous or synchronous replication, you’ll need to modify your application to handle the behavioral changes the replication technology imposes. It’s very difficult to take an existing application that was designed for a single Region and make it multi-Region, due to the challenges around data consistency and latency. Understanding the data consistency requirements and data access patterns for a particular workload is critical to weighing these trade-offs.
2a: Understanding data consistency requirements
The CAP theorem provides a reference for reasoning about the trade-offs between data consistency, availability, and partition tolerance: a distributed system can satisfy only two of the three at the same time. Multi-Region by definition involves network partitions between Regions, so you have to choose between availability and consistency.
If you select availability of the data across Regions, transactional writes do not incur significant latency, because committed data is replicated between Regions asynchronously; the cost is reduced consistency across Regions until replication completes. With asynchronous replication, when there is a failure in the primary Region, there is a high probability of writes pending replication from that Region. This leads to a scenario where the latest data is unavailable until replication resumes, and a reconciliation process is needed to handle in-flight transactions that didn’t replicate from the Region that experienced the outage.
For workloads where asynchronous replication is favored, you can use services such as Amazon Aurora Global Database, Amazon DynamoDB global tables, and Amazon S3 Cross-Region Replication, each of which replicates data between Regions asynchronously.
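As a concrete example, the following sketch adds a replica Region to a DynamoDB table, making it a global table that replicates asynchronously. The table name and Regions are placeholders, and the table is assumed to already exist with DynamoDB Streams enabled:

```python
import boto3

# Assumes an existing table named "orders" in us-east-1 with streams enabled;
# the table name and Regions are placeholders.
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Adding a replica converts the table to a global table (version 2019.11.21)
# whose writes replicate to us-west-2 asynchronously.
dynamodb.update_table(
    TableName="orders",
    ReplicaUpdates=[{"Create": {"RegionName": "us-west-2"}}],
)
```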
Engineering the workload around an event-driven architecture benefits a multi-Region strategy, because the workload can embrace asynchronous replication of data and can reconstruct state by replaying events. Because streaming and messaging services buffer message payload data in a single Region, a Regional failover/failback process must include a mechanism to redirect client input data flows, as well as reconcile in-flight and/or undelivered payloads stored in the Region that experienced the outage.
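To make the replay idea concrete, here is a minimal sketch. The event shape is an assumption, and a plain list stands in for a stream (for example, an Amazon Kinesis shard) read back during recovery:

```python
# Rebuild materialized state by replaying an ordered event log. Events are
# assumed to be ordered and deterministic to apply, so replaying them after a
# failover yields the same state the failed Region had built.
def rebuild_balances(events):
    balances = {}
    for event in events:
        sign = 1 if event["type"] == "deposit" else -1
        account = event["account"]
        balances[account] = balances.get(account, 0) + sign * event["amount"]
    return balances

events = [
    {"type": "deposit", "account": "a1", "amount": 100},
    {"type": "withdrawal", "account": "a1", "amount": 40},
]
print(rebuild_balances(events))  # {'a1': 60}
```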
If consistency is selected, you will incur significant latency, because data is synchronously replicated during transactional writes. When writing to multiple Regions synchronously, if the write isn’t successful in all Regions, availability is potentially lowered, because the transaction will not commit and will need to be retried. Retries that attempt to write the data to all Regions synchronously cost additional latency with every attempt. At some point, when the retries have been exhausted, the decision will need to be made to either fail the transaction completely, thereby reducing availability, or commit the transaction to available Regions only, thereby leading to inconsistency. Quorum-forming consensus protocols such as Paxos offer a middle ground, allowing a transaction to commit once a majority of the participating Regions acknowledge it.
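The sketch below illustrates the majority-quorum idea behind such protocols. It is not a full consensus implementation; the Region list and the `write_fn` callable are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]  # hypothetical deployment

def quorum_write(write_fn, record, timeout_s=2.0):
    """Commit once a majority of Regions acknowledge the write, else raise."""
    quorum = len(REGIONS) // 2 + 1
    with ThreadPoolExecutor(max_workers=len(REGIONS)) as pool:
        futures = [pool.submit(write_fn, region, record) for region in REGIONS]
        acks = 0
        for future in futures:
            try:
                if future.result(timeout=timeout_s):
                    acks += 1
            except Exception:
                pass  # an unreachable Region simply doesn't count toward quorum
    if acks < quorum:
        # The caller now faces the trade-off described above:
        # retry, fail the transaction, or accept inconsistency.
        raise RuntimeError(f"only {acks}/{len(REGIONS)} Regions acknowledged")
    return acks
```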
When writes involve synchronous replication across multiple Regions to meet strong consistency requirements, write latency increases by an order of magnitude. A higher write latency is not something that can typically be retrofitted into an application without significant changes; ideally, it should be taken into consideration when the application is first being designed. For multi-Region workloads where synchronous replication is a priority, AWS Partner solutions that provide synchronous cross-Region replication can be evaluated to meet those requirements.
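A back-of-the-envelope calculation makes the point. All of the latency figures below are illustrative assumptions, not measurements:

```python
# Assumed figures: a single-Region commit time and inter-Region round trips.
INTRA_REGION_COMMIT_MS = 5
INTER_REGION_RTT_MS = {"us-west-2": 70, "eu-west-1": 80}

# A synchronous commit can't complete until the slowest Region acknowledges.
sync_commit_ms = INTRA_REGION_COMMIT_MS + max(INTER_REGION_RTT_MS.values())
print(f"{sync_commit_ms} ms, {sync_commit_ms / INTRA_REGION_COMMIT_MS:.0f}x slower")
# -> 85 ms, 17x slower: an order-of-magnitude increase on every write
```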
2b: Understanding the data access patterns
Workload data access patterns fall into one of two broad types: read-intensive or write-intensive. Understanding this characteristic for a particular workload guides the selection of an appropriate multi-Region architecture.
For read-intensive workloads, such as static content that is completely read-only, an active/active approach, in which the data is replicated to and served from every Region, is a natural fit.
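One common way to implement this is latency-based routing in Amazon Route 53, with one record per Region serving the replicated content. In this sketch the hosted zone ID, domain name, and alias target are all placeholders:

```python
import boto3

route53 = boto3.client("route53")

# One latency record per serving Region; Route 53 answers with the Region
# closest (by measured latency) to the caller.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",  # placeholder
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "static.example.com",
                "Type": "A",
                "SetIdentifier": "us-east-1",  # repeat per Region
                "Region": "us-east-1",
                "AliasTarget": {
                    "HostedZoneId": "Z00000000000000",  # the ALB's zone (placeholder)
                    "DNSName": "my-alb-123.us-east-1.elb.amazonaws.com",
                    "EvaluateTargetHealth": True,
                },
            },
        }]
    },
)
```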
For read-intensive workloads that have a larger percentage of reads than writes, a read local, write global strategy can be used: reads are served from a replica in the caller’s own Region, while all writes are directed to a single primary Region, which is the pattern Aurora Global Database implements.
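In application code the pattern is a simple endpoint split. This sketch assumes a PostgreSQL-compatible Aurora Global Database; the endpoint names are placeholders:

```python
import psycopg2

# Writes always target the writer endpoint in the primary Region; reads
# target the reader endpoint in the application's own Region.
WRITER = "myapp.cluster-abc123.us-east-1.rds.amazonaws.com"            # placeholder
LOCAL_READER = "myapp.cluster-ro-def456.eu-west-1.rds.amazonaws.com"   # placeholder

def connect(readonly: bool):
    host = LOCAL_READER if readonly else WRITER
    return psycopg2.connect(host=host, dbname="app", user="app", password="...")

# Local reads stay fast; a cross-Region write pays the distance once, and the
# change then replicates asynchronously out to every secondary Region.
```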
For write-intensive workloads, a primary Region should be selected, and the capability to fail over to a standby Region should be engineered into the workload. Compared to an active/active approach, a primary/standby approach avoids the complexity of detecting and resolving conflicting writes across Regions, because only one Region accepts writes at any given time.
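With Aurora Global Database, for example, a planned failover that promotes the standby Region can be initiated with a single API call. In this sketch the cluster identifiers are placeholders:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Promote the standby Region's cluster to be the new writer. For a managed
# planned failover, Aurora lets replication catch up before switching.
rds.failover_global_cluster(
    GlobalClusterIdentifier="myapp-global",  # placeholder
    TargetDbClusterIdentifier=(
        "arn:aws:rds:us-west-2:111122223333:cluster:myapp-standby"  # placeholder
    ),
)
```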
Most workloads that are looking at multi-Region for resilience won’t require an active/active approach. A sharding approach, in which the data is partitioned so that each Region serves as the primary for a distinct subset of it, can often provide much of the benefit with far less complexity.
The sharding approach can be combined with a primary/standby approach to provide failover capabilities for the shards. A tested failover process will need to be engineered into the workload, along with a data reconciliation process that ensures transactional consistency of the datastores after failover. Both are covered in greater detail later in this paper.
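A minimal sketch of the routing layer such a design needs; the shard map and Region assignments are hypothetical:

```python
import hashlib

# Each shard has a primary Region and a standby it can fail over to.
SHARD_MAP = {
    0: {"primary": "us-east-1", "standby": "us-west-2"},
    1: {"primary": "eu-west-1", "standby": "eu-central-1"},
}

def home_region(customer_id: str, failed_regions=frozenset()) -> str:
    """Route a customer to their shard's primary Region, or its standby."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    shard = SHARD_MAP[int(digest, 16) % len(SHARD_MAP)]
    if shard["primary"] in failed_regions:
        return shard["standby"]  # failover path; data reconciliation still applies
    return shard["primary"]

print(home_region("customer-42"))
```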
Key guidance
- There is a high probability that writes pending replication won’t be committed to the standby Region when there is a failure; the latest data will be unavailable until replication resumes (assuming asynchronous replication).
- As part of failover, a data reconciliation process will be needed to ensure a transactionally consistent state is maintained for datastores that use asynchronous replication.
- When strong consistency is required, workloads will need to be modified to tolerate the additional latency of a datastore that replicates synchronously.