
Multi-Region fundamental 2: Understanding the data

Managing data is a non-trivial problem when you adopt multi-Region architectures. The geographical distance between Regions imposes an unavoidable latency that manifests as the time it takes to replicate data across Regions. A workload that uses a multi-Region architecture must make trade-offs among availability, data consistency, and the higher latency that replication introduces. Whether you use asynchronous or synchronous replication, you will need to modify your application to handle the behavioral changes that the replication technology imposes. Challenges around data consistency and latency make it very difficult to take an existing application that was designed for a single Region and make it multi-Region. Understanding the data consistency requirements and data access patterns of a particular workload is critical to weighing these trade-offs.

2.a: Understanding data consistency requirements

The CAP theorem provides a reference for reasoning about the trade-offs among data consistency, availability, and partition tolerance. Only two of these properties can be satisfied at the same time for a workload. By definition, a multi-Region architecture is subject to network partitions between Regions, so you have to choose between availability and consistency.

If you choose availability of data across Regions, you rely on asynchronous replication of committed data between Regions. Write operations don't incur significant additional latency, but the data is inconsistent across Regions until replication completes. With asynchronous replication, when there is a failure in the primary Region, there is a high probability that some write operations will still be pending replication from the primary Region. The latest data is then unavailable until replication resumes, and a reconciliation process is needed to handle in-flight transactions that didn't replicate from the Region that experienced the disruption. This reconciliation requires understanding your business logic and creating a specific process to replay the transactions or to compare the data stores between Regions.
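
As a minimal illustration, the following sketch shows one way such a reconciliation check could work, assuming each record carries a monotonically increasing version (for example, a sequence number or last-updated timestamp). The function and record names are hypothetical; the business-specific replay or resolution logic is yours to supply.

```python
# Illustrative reconciliation sketch (hypothetical helper and record names):
# compare a snapshot of records from the impaired Region with the standby
# Region's data store to find transactions that were still in flight.
def find_unreplicated(primary_records: dict, standby_records: dict) -> list:
    """Return record IDs whose latest version never reached the standby Region."""
    missing = []
    for record_id, version in primary_records.items():
        if standby_records.get(record_id, -1) < version:
            missing.append(record_id)
    return missing

# Records flagged here would be replayed or resolved by business-specific logic.
unreplicated = find_unreplicated(
    {"order-1001": 7, "order-1002": 3},
    {"order-1001": 7, "order-1002": 2},
)
print(unreplicated)  # ['order-1002']
```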

For workloads where asynchronous replication is favored, you can use services such as Amazon Aurora and Amazon DynamoDB for asynchronous cross-Region replication. Both Amazon Aurora global databases and Amazon DynamoDB global tables provide default Amazon CloudWatch metrics that help you monitor replication lag. An Aurora global database consists of one primary Region, where your data is written, and up to five read-only secondary Regions. DynamoDB global tables consist of multi-active replica tables in any number of Regions, and your data can be written to and read from any replica.
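
As a minimal sketch, the following Python (Boto3) code shows how you might poll those replication-lag metrics. The cluster identifier, table name, and Regions are placeholders, and the metric and dimension names shown are assumptions to verify against the service documentation for your setup.

```python
import datetime
import boto3

# Sketch: query replication-lag metrics from CloudWatch.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def average_lag(namespace, metric_name, dimensions):
    now = datetime.datetime.utcnow()
    response = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=dimensions,
        StartTime=now - datetime.timedelta(minutes=15),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    return [point["Average"] for point in response["Datapoints"]]

# Aurora global database replication lag (milliseconds), per secondary cluster.
aurora_lag = average_lag(
    "AWS/RDS",
    "AuroraGlobalDBReplicationLag",
    [{"Name": "DBClusterIdentifier", "Value": "my-secondary-cluster"}],
)

# DynamoDB global tables replication latency (milliseconds), per receiving Region.
dynamodb_lag = average_lag(
    "AWS/DynamoDB",
    "ReplicationLatency",
    [{"Name": "TableName", "Value": "my-global-table"},
     {"Name": "ReceivingRegion", "Value": "us-west-2"}],
)
```

You would typically alarm on these values so that a failover decision can account for how much data is still pending replication.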

Engineering the workload to take advantage of event-driven architectures benefits a multi-Region strategy, because the workload can embrace asynchronous replication of data and reconstruct state by replaying events. Because streaming and messaging services buffer message payload data in a single Region, a Regional failover or failback process must include a mechanism to redirect client input data flows. The process must also reconcile in-flight or undelivered payloads stored in the Region that experienced the disruption.
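
Replaying events safely requires idempotent consumers, so that an event delivered again during failover or failback isn't applied twice. The following is a minimal sketch of that pattern, assuming a hypothetical processed-events table and a hypothetical apply_business_logic function; it uses a conditional write to record each event ID exactly once.

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def apply_business_logic(payload: dict) -> None:
    """Hypothetical downstream processing for the event payload."""
    ...

def handle_event(event_id: str, payload: dict) -> None:
    try:
        # Record the event ID first; the conditional write fails if this
        # event was already processed (for example, during a replay).
        dynamodb.put_item(
            TableName="processed-events",   # hypothetical table name
            Item={"eventId": {"S": event_id}},
            ConditionExpression="attribute_not_exists(eventId)",
        )
    except ClientError as error:
        if error.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # Duplicate delivery or replay; safe to skip.
        raise
    apply_business_logic(payload)
```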

If you choose the CAP consistency requirement and use a database that replicates synchronously across Regions to support applications that run concurrently from multiple Regions, you remove the risk of data loss and keep the data in sync between Regions. However, this introduces higher latency, because writes have to commit to more than one Region, and the Regions can be hundreds or thousands of miles from one another. You need to account for this latency characteristic in your application design. In addition, synchronous replication can introduce correlated failures, because a write must be committed to more than one Region to succeed. To tolerate an impairment within one Region, you need a quorum for writes; this typically involves setting up your database in three Regions and requiring a quorum of two out of three Regions. Consensus technologies such as Paxos can help replicate and commit data synchronously but require significant developer investment.
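
The following sketch illustrates the two-out-of-three quorum idea only; it is not a consensus protocol such as Paxos, and the per-Region write function and Region names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]   # placeholder Regions
QUORUM = len(REGIONS) // 2 + 1                      # 2 of 3

def write_to_region(region: str, record: dict) -> bool:
    """Hypothetical per-Region write; replace with your data store's client."""
    ...
    return True

def quorum_write(record: dict) -> bool:
    # Attempt the write in all Regions in parallel; succeed only if a
    # majority of Regions acknowledge the write.
    with ThreadPoolExecutor(max_workers=len(REGIONS)) as executor:
        acks = list(executor.map(lambda r: write_to_region(r, record), REGIONS))
    return sum(1 for ack in acks if ack) >= QUORUM
```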

When writes are synchronously replicated across multiple Regions to meet strong consistency requirements, write latency can increase by an order of magnitude. Higher write latency is not something that you can typically retrofit into an application without significant changes, such as revisiting the application's timeout and retry strategy. Ideally, it should be taken into consideration when the application is first designed. For multi-Region workloads where synchronous replication is a priority, AWS Partner solutions can help.
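
Revisiting timeouts and retries often means configuring them explicitly rather than relying on defaults. The following sketch shows that kind of adjustment for an AWS SDK for Python (Boto3) client; the client type and timeout values are illustrative, and you would derive your own values from measured cross-Region commit latency.

```python
import boto3
from botocore.config import Config

# Allow for the higher write latency of a synchronously replicated,
# multi-Region data store instead of relying on default timeouts.
config = Config(
    connect_timeout=2,   # seconds to establish a connection
    read_timeout=5,      # headroom for cross-Region commit latency
    retries={"max_attempts": 4, "mode": "standard"},
)

# The client shown is a placeholder; apply the same Config to whichever
# SDK client your application uses to reach its data store.
client = boto3.client("dynamodb", region_name="us-east-1", config=config)
```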

2.b: Understanding data access patterns

Workload data access patterns are typically either read-intensive or write-intensive. Understanding this characteristic for a particular workload helps you select an appropriate multi-Region architecture.

For read-intensive workloads such as static content that is completely read-only, you can achieve an active-active multi-Region architecture with less engineering complexity than a write-intensive workload requires. Serving static content at the edge by using a content delivery network (CDN) maintains availability by caching content close to the end user; features such as origin failover in Amazon CloudFront can help achieve this. Another option is to deploy stateless compute in multiple Regions and use DNS to route users to the closest Region to read the content. You can use Amazon Route 53 with a geolocation routing policy to achieve this.
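
As a sketch of the Route 53 option, the following Python (Boto3) code creates geolocation records that send users on each continent to a Regional endpoint. The hosted zone ID, domain name, and endpoint values are placeholders, and you would normally also add a default record for locations that don't match any rule.

```python
import boto3

route53 = boto3.client("route53")

def geolocation_record(set_id, continent, target):
    # One record per continent; SetIdentifier distinguishes the records.
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com",
            "Type": "CNAME",
            "SetIdentifier": set_id,
            "GeoLocation": {"ContinentCode": continent},
            "TTL": 60,
            "ResourceRecords": [{"Value": target}],
        },
    }

route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",   # placeholder hosted zone
    ChangeBatch={
        "Changes": [
            geolocation_record("eu-users", "EU", "app.eu-west-1.example.com"),
            geolocation_record("na-users", "NA", "app.us-east-1.example.com"),
        ]
    },
)
```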

For read-intensive workloads that have a larger percentage of read traffic than write traffic, you can use a read local, write global strategy: all write requests go to a database in a specific Region, the data is replicated asynchronously to all other Regions, and reads can be served from any Region. This approach requires the workload to embrace eventual consistency, because local reads can return stale data while cross-Region replication of writes is still in progress.
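
A minimal sketch of that routing decision is shown below. The DynamoDB client is only a stand-in for whichever client your data store uses, the Region names are placeholders, and the calling code must tolerate stale local reads.

```python
import boto3

PRIMARY_REGION = "us-east-1"   # placeholder: the Region that accepts writes

class RegionalRouter:
    """Read local, write global: reads stay in the caller's Region,
    writes always go to the primary Region."""

    def __init__(self, local_region: str):
        self.read_client = boto3.client("dynamodb", region_name=local_region)
        self.write_client = boto3.client("dynamodb", region_name=PRIMARY_REGION)

    def get_item(self, **kwargs):
        return self.read_client.get_item(**kwargs)    # local, possibly stale

    def put_item(self, **kwargs):
        return self.write_client.put_item(**kwargs)   # always the primary Region
```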

With Aurora global databases, you can provision read replicas in a standby Region to handle all read traffic locally and a single primary data store in a specific Region to handle write traffic. Data is asynchronously replicated from the primary database to the standby databases (read replicas), and a standby database can be promoted to primary if you need to fail over operations to the standby Region. You can also use DynamoDB in this approach. DynamoDB global tables can provision replica tables across Regions that can each scale to support any volume of local read or write traffic. When an application writes data to a replica table in one Region, DynamoDB automatically propagates the write to the replica tables in the other Regions. With this configuration, data is asynchronously replicated from a defined primary Region to replica tables in standby Regions. Replica tables in any Region can always accept writes, so promoting a standby Region to primary is managed at the application level. Again, the workload has to embrace eventual consistency, which might require it to be rewritten if it wasn't designed for this from the start.
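
For the Aurora case, promotion of a secondary Region can be initiated through the RDS API. The following is a sketch that assumes the FailoverGlobalCluster operation; the identifiers are placeholders, and you should confirm the operation's behavior (planned switchover versus failover with potential data loss) for your engine version before relying on it.

```python
import boto3

# Initiate promotion of the secondary cluster in the standby Region.
rds = boto3.client("rds", region_name="us-west-2")

rds.failover_global_cluster(
    GlobalClusterIdentifier="my-global-cluster",   # placeholder identifiers
    TargetDbClusterIdentifier=(
        "arn:aws:rds:us-west-2:111122223333:cluster:my-secondary-cluster"
    ),
)
```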

For write-intensive workloads, select a primary Region and engineer into the workload the capability to fail over to a standby Region. Compared with a primary-standby approach, an active-active approach has additional trade-offs, because the workload has to be rewritten to handle intelligent routing to Regions, establish session affinity, ensure idempotent transactions, and handle potential conflicts.

Most workloads that use a multi-Region approach for resilience don't require an active-active approach. You can use a sharding strategy to increase resilience by limiting the scope of impact of an impairment across the client base. If you can effectively shard the client base, you can select a different primary Region for each shard. For example, you can shard clients so that half of the clients are aligned to Region one and half are aligned to Region two. By treating Regions as cells, you can create a multi-Region cell approach, which reduces the scope of impact for your workload. For more information, see the AWS re:Invent presentation about this approach.
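
One simple way to align clients to Regions deterministically is to hash the client identifier, as in the following sketch. The Region list is a placeholder, and a production implementation would usually keep an explicit shard-to-Region mapping so that individual shards can be rebalanced or failed over.

```python
import hashlib

REGIONS = ["us-east-1", "us-west-2"]   # placeholder Regions, one cell each

def home_region(client_id: str) -> str:
    # Hashing gives a stable, even split of clients across Regions.
    digest = hashlib.sha256(client_id.encode("utf-8")).hexdigest()
    return REGIONS[int(digest, 16) % len(REGIONS)]

print(home_region("customer-42"))   # always the same Region for this client
```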

You can combine the sharding approach with a primary-standby approach to provide failover capabilities for the shards. You will need to engineer a tested failover process into the workload, as well as a data reconciliation process, to ensure transactional consistency of the data stores after failover. These are covered in greater detail later in this guide.

Key guidance

  • With asynchronous replication, there is a high probability that writes pending replication won't have been committed to the standby Region when a failure occurs. That data will be unavailable until replication resumes.

  • As part of failover, a data reconciliation process will be needed to ensure that a transactionally consistent state is maintained for data stores that use asynchronous replication. This requires specific business logic and is not something that is handled by the data store itself.

  • When strong consistency is required, workloads will need to be modified to tolerate the required latency of a data store that synchronously replicates.