AWS Multi-Region Fundamentals

Multi-Region fundamental 2: Understanding the data

Managing data is a non-trivial problem in multi-Region architectures. The geographical distance between Regions imposes an unavoidable latency, which manifests as the time it takes to replicate data across Regions. A multi-Region architecture therefore forces trade-offs among availability, data consistency, and the order-of-magnitude increase in latency the workload can tolerate. Whether you use asynchronous or synchronous replication, you’ll need to modify your application to handle the behavioral changes the replication technology imposes. It’s very difficult to take an existing application that was designed for a single Region and make it multi-Region because of challenges around data consistency and latency. Understanding the data consistency requirements and data access patterns of a particular workload is critical to weighing these trade-offs.

2a: Understanding data consistency requirements

The CAP theorem provides a framework for reasoning about the trade-offs between data consistency, availability, and partition tolerance: a distributed data store can satisfy only two of the three at the same time. A multi-Region architecture by definition involves network partitions between Regions, so you have to choose between availability and consistency.

If you select availability of the data across Regions, transactional writes do not incur significant latency, because committed data is replicated asynchronously between Regions; the cost is reduced consistency across Regions until replication completes. With asynchronous replication, a failure in the primary Region will very likely leave some writes pending replication. The latest data is then unavailable until replication resumes, and a reconciliation process is needed to handle in-flight transactions that didn’t replicate out of the Region that experienced the outage.

For workloads where asynchronous replication is favored, you can use services such as Amazon Aurora and Amazon DynamoDB, which provide asynchronous cross-Region replication. Both Amazon Aurora Global Database and Amazon DynamoDB global tables have default Amazon CloudWatch metrics to aid in monitoring replication lag.
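Those replication-lag metrics (for example, AuroraGlobalDBReplicationLag for Aurora Global Database, or ReplicationLatency for DynamoDB global tables, both reported in milliseconds) are only useful if you evaluate them against a target. The sketch below shows one illustrative way to do that: compare sampled lag against a recovery point objective (RPO) and alarm only on sustained breaches. The threshold logic is an assumption for illustration, not a prescribed policy.

```python
# Minimal sketch: evaluate sampled replication-lag datapoints (milliseconds)
# against an RPO target. Alarming only on consecutive breaches filters out
# transient replication spikes; the sample values below are hypothetical.

def lag_breaches_rpo(lag_samples_ms, rpo_ms, min_breaching=3):
    """Return True when several consecutive samples exceed the RPO target."""
    consecutive = 0
    for lag in lag_samples_ms:
        consecutive = consecutive + 1 if lag > rpo_ms else 0
        if consecutive >= min_breaching:
            return True
    return False

# A brief spike stays below the alarm threshold...
print(lag_breaches_rpo([120, 950, 1100, 130, 140], rpo_ms=1000))   # False
# ...but sustained lag above the RPO warrants action before a failover.
print(lag_breaches_rpo([120, 1100, 1500, 2100, 90], rpo_ms=1000))  # True
```

In practice, the same evaluation can be delegated to a CloudWatch alarm on the metric itself; the point is that the lag threshold should be derived from the workload's RPO.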

Engineering the workload to take advantage of event-driven architectures is a benefit for a multi-Region strategy because it means the workload can embrace asynchronous replication of data, and enables the reconstruction of state by replaying events. Because streaming and messaging services buffer message payload data in a single Region, a Regional failover/failback process must include a mechanism to redirect client input data flows, as well as reconcile in-flight and/or undelivered payloads stored in the Region that experienced the outage.
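The "reconstruction of state by replaying events" mentioned above can be sketched as a simple fold over an ordered event log. The event shape, IDs, and account-balance example below are hypothetical; the key property is that replay is idempotent, so duplicate deliveries during failover/failback are harmless.

```python
# Minimal sketch of reconstructing state by replaying events, assuming each
# event carries a unique ID so replay after failover applies it exactly once.

def replay(events, state=None, seen=None):
    """Rebuild account balances by folding events in order; duplicate
    deliveries (common during failover/failback) are skipped by ID."""
    state, seen = state or {}, seen or set()
    for event in events:
        if event["id"] in seen:      # already applied; safe to skip on replay
            continue
        seen.add(event["id"])
        acct = event["account"]
        state[acct] = state.get(acct, 0) + event["amount"]
    return state

events = [
    {"id": "e1", "account": "a", "amount": 100},
    {"id": "e2", "account": "a", "amount": -30},
    {"id": "e1", "account": "a", "amount": 100},  # duplicate delivery
]
print(replay(events))  # {'a': 70}
```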

If consistency is selected, you will incur significant latency as data is synchronously replicated during transactional writes. When writing to multiple Regions synchronously, if the write isn’t successful in all Regions, availability is potentially lowered because the transaction will not commit and must be retried. Every retry that attempts to write the data to all Regions synchronously adds latency. Once retries are exhausted, you must decide either to fail the transaction completely, reducing availability, or to commit the transaction to the available Regions only, introducing inconsistency. Quorum-based consensus protocols such as Paxos can help replicate and commit data synchronously, but they require significant developer investment.
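The fail-completely-or-commit-partially decision described above can be made concrete with a short sketch. The per-Region client interface here is hypothetical (a callable that returns success or failure); a real implementation would sit in front of actual database endpoints.

```python
# Illustrative sketch of the synchronous-write trade-off: attempt the write
# in every Region with retries, then either abort (favoring consistency) or
# commit to the Regions that succeeded (favoring availability).

def write_all_regions(regions, key, value, retries=2, favor_consistency=True):
    committed = set()
    for region, client in regions.items():
        for _ in range(retries + 1):          # each retry adds write latency
            if client(key, value):
                committed.add(region)
                break
    if committed == set(regions):
        return "committed", committed
    if favor_consistency:
        return "aborted", set()               # reduced availability
    return "partial", committed               # reduced consistency

ok = lambda k, v: True       # stand-in for a healthy Regional endpoint
down = lambda k, v: False    # stand-in for an unreachable Region
print(write_all_regions({"us-east-1": ok, "eu-west-1": down}, "k", "v"))
# → ('aborted', set())
```

Flipping `favor_consistency` to `False` yields a partial commit instead, which is exactly the inconsistency the text warns must then be reconciled.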

When writes involve synchronous replication across multiple Regions to meet strong consistency requirements, write latency increases by an order of magnitude. Higher write latency is not something that can typically be retrofitted into an application without significant changes; ideally, it should be taken into account when the application is first designed. For multi-Region workloads where synchronous replication is a priority, AWS Partner solutions can help.

2b: Understanding the data access patterns

Workload data access patterns are typically either read-intensive or write-intensive. Understanding this characteristic for a particular workload will guide selection of an appropriate multi-Region architecture.

For read-intensive workloads that are completely read-only, such as static content, an active/active multi-Region architecture can be achieved without significant complexity. Serving static content at the edge using a Content Delivery Network (CDN) ensures availability by caching content closest to the end user; feature sets such as origin failover in Amazon CloudFront can help achieve this. Another option is to deploy stateless compute in multiple Regions and use DNS to route users to the closest Region to read the content; Route 53 with a geolocation routing policy can be used for this.
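Conceptually, a geolocation routing policy is a lookup from the caller's location to a Regional endpoint, with a default record for unmatched locations. The sketch below simulates that logic in code; the country-to-Region table and endpoint names are hypothetical, and in production this mapping lives in Route 53 record sets rather than application code.

```python
# Rough sketch of what a Route 53 geolocation routing policy expresses:
# map the caller's country to the nearest Regional endpoint, falling back
# to a default for locations with no specific record.

GEO_ROUTES = {
    "US": "https://app.us-east-1.example.com",
    "DE": "https://app.eu-central-1.example.com",
    "JP": "https://app.ap-northeast-1.example.com",
}
DEFAULT_ROUTE = GEO_ROUTES["US"]   # the "default" geolocation record

def resolve(country_code):
    return GEO_ROUTES.get(country_code, DEFAULT_ROUTE)

print(resolve("DE"))  # https://app.eu-central-1.example.com
print(resolve("BR"))  # no specific record; falls through to the default
```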

For read-intensive workloads that have a larger percentage of reads than writes, a read local, write global strategy can be used: all writes go to a database in a specific Region, data is asynchronously replicated to all other Regions, and reads can be served from any Region. This approach requires the workload to embrace eventual consistency, because local reads may be stale until cross-Region replication of writes catches up.
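A read local, write global data access layer can be sketched in a few lines. The in-memory dictionaries below are stand-ins for a real primary database and a local replica; the class name and interface are illustrative only.

```python
# Minimal sketch of a read-local/write-global data access layer, assuming
# one writer endpoint in the primary Region and a read replica in every
# Region. Reads are low-latency but eventually consistent.

class ReadLocalWriteGlobal:
    def __init__(self, primary, local_replica):
        self.primary = primary        # global write endpoint (cross-Region)
        self.local = local_replica    # replica in this Region

    def write(self, key, value):
        self.primary[key] = value     # all writes go to the primary Region

    def read(self, key):
        return self.local.get(key)    # served locally; may be stale

primary, replica = {}, {}
dal = ReadLocalWriteGlobal(primary, replica)
dal.write("user:1", "active")
print(dal.read("user:1"))   # None — async replication hasn't caught up yet
replica.update(primary)     # replication completes
print(dal.read("user:1"))   # active
```

The stale first read is the eventual consistency the workload must be designed to tolerate.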

Aurora Global Database lets you provision read replicas in a standby Region to handle all read traffic locally, with a single primary database in a specific Region handling writes. Data is asynchronously replicated from the primary to the standby databases (read replicas), and a standby database can be promoted to primary if you need to fail over operations to the standby Region. If a workload is better suited to a non-relational data model, DynamoDB can be used in this approach as well. Again, the workload needs to embrace eventual consistency, which may require rewriting it if it wasn’t designed for this from the start.

For write-intensive workloads, a primary Region should be selected and the capability to fail over to a standby Region should be engineered into the workload. A primary/standby approach is less complicated than an active/active approach, because an active/active architecture requires the workload to be rewritten to handle intelligent routing to Regions, establish session affinity, ensure idempotent transactions, and resolve potential conflicts.
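One of the active/active requirements listed above, idempotent transactions, is worth illustrating. The common pattern is for clients to attach an idempotency key, so a request retried against (or replayed in) another Region is applied exactly once. The ledger, storage, and key format below are hypothetical.

```python
# Sketch of idempotent transaction handling: duplicate requests identified
# by the same idempotency key return the cached result instead of being
# applied again.

class IdempotentLedger:
    def __init__(self):
        self.balance = 0
        self.applied = {}                      # idempotency key -> result

    def debit(self, idempotency_key, amount):
        if idempotency_key in self.applied:    # duplicate: no second effect
            return self.applied[idempotency_key]
        self.balance -= amount
        self.applied[idempotency_key] = self.balance
        return self.balance

ledger = IdempotentLedger()
print(ledger.debit("req-42", 25))  # -25
print(ledger.debit("req-42", 25))  # -25  (retry has no further effect)
```

In a real multi-Region system the `applied` table would itself need to be replicated or partitioned so both Regions see the same keys.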

Most workloads considering multi-Region for resilience won’t require an active/active approach. A sharding strategy can provide increased resilience by limiting the blast radius of an impairment across the client base. If you can effectively shard the client base, a different primary Region can be selected for each shard. For example, if half of the clients are aligned to Region One and half to Region Two, treating each Region as a cell creates a multi-Region cell architecture that reduces the blast radius of impact for your workload.
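A simple way to implement the shard-to-Region alignment described above is a stable hash of the client identifier, so every component routes a given client to the same cell. The Region names and two-cell split are examples only.

```python
# Sketch of sharding clients across Regions as cells: a stable hash of the
# client ID picks the client's primary Region, so an impairment in one
# Region affects only the clients in that cell.

import hashlib

CELLS = ["us-east-1", "us-west-2"]   # one cell per Region

def primary_region(client_id):
    digest = hashlib.sha256(client_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

# Deterministic: every service instance computes the same placement.
print(primary_region("customer-1001") == primary_region("customer-1001"))  # True
```

Using a cryptographic hash rather than, say, alphabetical ranges keeps the shards evenly sized as the client base grows; a lookup table works too if shards must be reassignable.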

The sharding approach can be combined with a primary/standby approach to provide failover capabilities for the shards. A tested failover process will need to be engineered into the workload, along with a data reconciliation process to ensure transactional consistency of the datastores after failover. Both are covered in greater detail later in this paper.

Key guidance

  • There is a high probability that writes pending replication won’t have been committed to the standby Region when a failure occurs. That data will be unavailable until replication resumes (assuming asynchronous replication).

  • As part of failover, a data reconciliation process will be needed to ensure a transactionally consistent state is maintained for datastores using asynchronous replication.

  • When strong consistency is required, workloads will need to be modified to tolerate the additional write latency of a datastore that replicates synchronously.