Multi-Region fundamental 2: Understanding the data
Managing data is a non-trivial problem when you adopt a multi-Region architecture. The geographical distance between Regions imposes an unavoidable latency, which manifests as the time it takes to replicate data across Regions. A workload that uses a multi-Region architecture must therefore make trade-offs among availability, data consistency, and latency. Whether you use asynchronous or synchronous replication, you will need to modify your application to handle the behavioral changes that the replication technology imposes. These challenges around data consistency and latency make it very difficult to take an existing application that was designed for a single Region and make it multi-Region. Understanding the data consistency requirements and data access patterns of a particular workload is critical to weighing the trade-offs.
2.a: Understanding data consistency requirements
The CAP theorem provides a reference for reasoning about the trade-offs among data consistency, availability, and partition tolerance. A distributed data store can satisfy at most two of these three properties at the same time. By definition, a multi-Region architecture is subject to network partitions between Regions, so you have to choose between availability and consistency.
If you select availability of data across Regions, transactional write operations don't incur significant cross-Region latency, because committed data is replicated asynchronously between Regions. The cost is reduced consistency: Regions can diverge until replication completes. With asynchronous replication, when there is a failure in the primary Region, there is a high probability that some write operations will still be pending replication out of the primary Region. The latest data is then unavailable until replication resumes, and a reconciliation process is needed to handle in-flight transactions that didn't replicate from the Region that experienced the disruption. Building this process requires understanding your business logic and creating a specific procedure to replay the transactions or compare data stores between Regions.
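To make the reconciliation requirement concrete, here is a minimal sketch. It assumes the recovering Region exposes a readable transaction log and that the standby can be queried for what it has already applied; the `fetch_transaction_log`, `standby_has`, and `apply_transaction` helpers are hypothetical placeholders for workload-specific logic, not a real API.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    tx_id: str
    payload: dict

# Hypothetical helpers: real implementations depend on your data store
# and business logic. This is a sketch, not a library API.
def fetch_transaction_log(region: str) -> list[Transaction]:
    """Read the committed-transaction log from the recovering Region."""
    raise NotImplementedError

def standby_has(tx: Transaction) -> bool:
    """Check whether the standby Region already applied this transaction."""
    raise NotImplementedError

def apply_transaction(tx: Transaction) -> None:
    """Replay a missing transaction against the standby data store."""
    raise NotImplementedError

def reconcile(impaired_region: str) -> list[str]:
    """Replay transactions committed in the impaired Region that never
    replicated to the standby before the disruption."""
    replayed = []
    for tx in fetch_transaction_log(impaired_region):
        if not standby_has(tx):
            apply_transaction(tx)  # business logic decides ordering/conflicts
            replayed.append(tx.tx_id)
    return replayed
```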
For workloads where asynchronous replication is favored, you can use services such as Amazon Aurora global databases and Amazon DynamoDB global tables, which replicate data asynchronously across Regions.
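As an illustration, the snippet below adds a replica Region to an existing DynamoDB table, turning it into a global table with asynchronous cross-Region replication. The table name and Regions are placeholders, and the table is assumed to already have DynamoDB Streams enabled, as global tables require.

```python
import boto3

# Add an asynchronous replica of an existing DynamoDB table in a second
# Region. Table name and Regions are illustrative placeholders; the table
# must have DynamoDB Streams enabled (NEW_AND_OLD_IMAGES).
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.update_table(
    TableName="orders",
    ReplicaUpdates=[
        {"Create": {"RegionName": "us-west-2"}},
    ],
)

# Writes in either Region now replicate asynchronously, typically within
# seconds; last-writer-wins resolves concurrent updates to the same item.
```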
Engineering the workload to take advantage of event-driven architectures benefits a multi-Region strategy, because the workload can embrace asynchronous replication of data and can reconstruct state by replaying events. Because streaming and messaging services buffer message payload data in a single Region, a Regional failover or failback process must include a mechanism to redirect client input data flows. The process must also reconcile in-flight or undelivered payloads stored in the Region that experienced the disruption.
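A minimal sketch of state reconstruction by event replay, assuming events are retained in a durable, ordered log; the `read_events` source and the event shape are hypothetical:

```python
from typing import Iterable

# Hypothetical event shape: (sequence_number, event_type, data).
Event = tuple[int, str, dict]

def read_events(since_seq: int) -> Iterable[Event]:
    """Read events from a durable, ordered log (for example, a stream
    with extended retention). Placeholder for a real consumer."""
    raise NotImplementedError

def rebuild_state(last_applied_seq: int) -> dict:
    """Reconstruct application state in the recovery Region by replaying
    events that occurred after the last checkpoint."""
    state: dict = {}
    for seq, event_type, data in read_events(since_seq=last_applied_seq):
        if event_type == "put":
            state[data["key"]] = data["value"]
        elif event_type == "delete":
            state.pop(data["key"], None)
    return state
```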
If you choose the CAP consistency requirement and use a synchronously replicated database across Regions to support applications that run concurrently from multiple Regions, you remove the risk of data loss and keep the data in sync between Regions. However, this introduces higher latency, because writes must commit in more than one Region, and the Regions can be hundreds or thousands of miles from one another. You need to account for this latency characteristic in your application design. In addition, synchronous replication can introduce correlated failure modes, because writes must be committed to more than one Region to succeed. If there is an impairment within one Region, writes must still form a quorum to succeed. This typically involves setting up your database in three Regions and establishing a quorum of two out of three Regions. Consensus technologies such as Paxos or Raft can be used to agree on each write across Regions.
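The quorum idea can be sketched as a parallel fan-out in which a write commits once two of three Regions acknowledge it. The `write_to_region` function is a placeholder, and a production system would rely on a consensus protocol such as Paxos or Raft for ordering and conflict handling rather than this simplified fan-out:

```python
import concurrent.futures

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]  # illustrative choices
QUORUM = 2  # two out of three Regions must acknowledge the write

def write_to_region(region: str, key: str, value: str) -> bool:
    """Placeholder for a per-Region write; returns True on acknowledgment."""
    raise NotImplementedError

def quorum_write(key: str, value: str) -> bool:
    """Commit a write only if a majority of Regions acknowledge it.
    This sketch shows only the fan-out and vote counting; it does not
    handle ordering or conflicting proposals."""
    with concurrent.futures.ThreadPoolExecutor(len(REGIONS)) as pool:
        futures = [pool.submit(write_to_region, r, key, value) for r in REGIONS]
        acks = 0
        for future in concurrent.futures.as_completed(futures):
            try:
                if future.result():
                    acks += 1
            except Exception:
                pass  # an impaired Region counts as a missing ack
            if acks >= QUORUM:
                return True
    return acks >= QUORUM
```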
When writes involve synchronous replication across multiple Regions to meet strong consistency requirements, write latency increases by an order of magnitude. Higher write latency is not something that you can typically retrofit into an application without significant changes, such as revisiting the timeout and retry strategy of the application. Ideally, it should be taken into consideration when the application is first being designed. For multi-Region workloads where synchronous replication is a priority, AWS Partner solutions can provide this capability.
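As one example of revisiting timeouts and retries, the client configuration below raises timeouts and enables adaptive retries to absorb higher, more variable cross-Region write latency. The values are illustrative and should be derived from the latency you actually measure between your chosen Regions; the same principle applies to any client, not just AWS SDK clients.

```python
import boto3
from botocore.config import Config

# Timeout and retry settings sized for cross-Region write latency.
# The numbers here are placeholders; derive real values from measured
# p99 latency between your Regions.
cross_region_config = Config(
    connect_timeout=2,   # seconds
    read_timeout=10,     # allow for multi-Region commit time
    retries={"max_attempts": 5, "mode": "adaptive"},
)

client = boto3.client("dynamodb", config=cross_region_config)
```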
2.b: Understanding data access patterns
Workload data access patterns are either read-intensive or write-intensive. Understanding this characteristic for a particular workload will help you select an appropriate multi-Region architecture.
For read-intensive workloads, such as static content that is completely read-only, you can achieve an active-active architecture, because each Region can serve reads from its own replicated copy of the data without any write-consistency concerns.
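For static content, one common building block is S3 Cross-Region Replication, which copies objects asynchronously to a bucket in another Region so that each Region serves reads locally. A sketch, in which the bucket names and IAM role ARN are placeholders and both buckets are assumed to have versioning enabled:

```python
import boto3

s3 = boto3.client("s3")

# Replicate new objects from a source bucket to a bucket in another
# Region. Names and the role ARN are illustrative placeholders; both
# buckets must have versioning enabled.
s3.put_bucket_replication(
    Bucket="static-content-us-east-1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-static-content",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},
                "Destination": {"Bucket": "arn:aws:s3:::static-content-us-west-2"},
                "DeleteMarkerReplication": {"Status": "Disabled"},
            }
        ],
    },
)
```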
For read-intensive workloads that have a larger percentage of read traffic than write traffic, you can use a read local, write global strategy: each Region serves reads from a local replica, while all writes are routed to a single primary Region and then replicated outward. Aurora global databases support this pattern, with a writer in the primary Region and read-only replicas in secondary Regions.
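A minimal sketch of read local, write global routing, with endpoint names modeled loosely on Aurora global database cluster endpoints (all values are placeholders):

```python
import os

# Placeholder endpoints modeled on an Aurora global database: one writer
# in the primary Region, a local reader in every Region.
WRITER_ENDPOINT = "app-global.cluster-xyz.us-east-1.rds.amazonaws.com"
READER_ENDPOINTS = {
    "us-east-1": "app-global.cluster-ro-xyz.us-east-1.rds.amazonaws.com",
    "us-west-2": "app-global.cluster-ro-xyz.us-west-2.rds.amazonaws.com",
}

LOCAL_REGION = os.environ.get("AWS_REGION", "us-east-1")

def endpoint_for(query_is_write: bool) -> str:
    """Route writes to the global primary, reads to the local replica.
    Local reads may be slightly stale because replication from the
    primary Region is asynchronous."""
    if query_is_write:
        return WRITER_ENDPOINT
    return READER_ENDPOINTS[LOCAL_REGION]
```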
For write-intensive workloads, a primary Region should be selected, and the capability to fail over to a standby Region should be engineered into the workload. Compared with an active-active approach, a primary-standby approach is simpler to operate, because data is written in only one Region at a time.
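One common way to engineer the failover path is DNS failover routing: a health-checked record points at the primary Region and fails over to the standby when the health check fails. A sketch using Route 53, in which the hosted zone ID, health check ID, and domain names are placeholders:

```python
import boto3

route53 = boto3.client("route53")

# Failover routing: clients resolve to the primary Region while its
# health check passes, and to the standby otherwise. All identifiers
# below are illustrative placeholders.
for identifier, role, target, health_check in [
    ("primary", "PRIMARY", "app.us-east-1.example.com", "hc-primary-id"),
    ("standby", "SECONDARY", "app.us-west-2.example.com", None),
]:
    record = {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": identifier,
        "Failover": role,
        "TTL": 60,
        "ResourceRecords": [{"Value": target}],
    }
    if health_check:
        record["HealthCheckId"] = health_check
    route53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )
```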
Most workloads that use a multi-Region approach for resilience won't require an active-active approach. You can use a sharding strategy instead, in which the data is partitioned and each shard is assigned a home Region that handles its writes, so a Regional disruption affects only a subset of the data.
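A sketch of shard-to-Region assignment, here by hashing a customer ID into a static shard map; the map and the choice of shard key are illustrative:

```python
import hashlib

# Each shard has a home Region that owns writes for its customers.
# The mapping below is an illustrative placeholder.
SHARD_HOME_REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

def home_region(customer_id: str) -> str:
    """Deterministically map a customer to the Region that owns its shard."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    shard = digest[0] % len(SHARD_HOME_REGIONS)
    return SHARD_HOME_REGIONS[shard]

# Example: route this customer's writes to their shard's home Region.
region = home_region("customer-1234")
```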
You can combine the sharding approach with a primary-standby approach to provide failover capabilities for the shards. You will need to engineer a tested failover process into the workload, as well as a data reconciliation process to ensure transactional consistency of the data stores after failover. Both are covered in greater detail later in this guide.
Key guidance
- With asynchronous replication, there is a high probability that writes pending replication won't be committed to the standby Region when there is a failure, and that data will be unavailable until replication resumes.
- As part of failover, a data reconciliation process is needed to bring data stores that use asynchronous replication back to a transactionally consistent state. This requires specific business logic and is not handled by the data store itself.
- When strong consistency is required, workloads need to be modified to tolerate the higher latency of a data store that replicates synchronously.