Regional services - AWS Fault Isolation Boundaries

Regional services

Regional services are services that AWS has built on top of multiple Availability Zones so that customers don’t have to figure out how to make the best use of zonal services. AWS logically groups the parts of a service that are deployed across multiple Availability Zones and presents them to customers as a single Regional endpoint. Amazon SQS and Amazon DynamoDB are examples of Regional services. They use the independence and redundancy of Availability Zones to minimize infrastructure failure as a category of availability and durability risk. Amazon S3, for example, spreads requests and data across multiple Availability Zones and is designed to automatically recover from the failure of an Availability Zone. However, you only ever interact with the Regional endpoint of the service, not with an individual Availability Zone.
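To make the idea of a single Regional endpoint concrete, the following minimal sketch builds the common public endpoint hostname for a Regional service. The helper name regional_endpoint is hypothetical, and the sketch assumes the widely used "service.region.amazonaws.com" naming pattern in the standard AWS partition; some services and partitions use different formats, so treat it as an illustration rather than a complete rule.

```python
def regional_endpoint(service: str, region: str) -> str:
    """Return the common Regional endpoint URL for a service.

    Assumption: the standard-partition pattern
    https://<service>.<region>.amazonaws.com applies. Note that the
    URL names a Region, never an Availability Zone.
    """
    return f"https://{service}.{region}.amazonaws.com"


if __name__ == "__main__":
    for service in ("sqs", "dynamodb"):
        print(regional_endpoint(service, "us-east-1"))
```

The point of the sketch is what is absent: nothing in the address identifies an Availability Zone, because the service handles zonal placement and recovery behind the Regional endpoint.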

AWS believes that most customers can achieve their resilience goals in a single Region by using Regional services or Multi-AZ architectures that rely on zonal services. However, some workloads may require additional redundancy, and you can use the isolation of AWS Regions to create Multi-Region architectures for high availability (HA) or business continuity purposes. The physical and logical separation between AWS Regions avoids correlated failures between them. In other words, just as an Amazon EC2 customer benefits from the isolation of Availability Zones by deploying across them, you can get the same benefit for Regional services by deploying across multiple Regions. This requires that you implement a Multi-Region architecture for your application, which can help you be resilient to the impairment of a Regional service.

However, achieving the benefits of a Multi-Region architecture can be difficult; it requires careful work to take advantage of Regional isolation without undermining that isolation at the application level. For example, if you’re failing over an application between Regions, you need to maintain strict separation between your application stacks in each Region, be aware of all the application dependencies, and fail over all parts of the application together. Achieving this with a complex, microservices-based architecture that has many dependencies between applications requires planning and coordination among many engineering and business teams. Allowing individual workloads to make their own failover decisions makes the coordination less complex, but it introduces modal behavior: calls that cross Regions have significantly higher latency than calls that stay inside a single Region.
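The dependency problem described above can be sketched as a small graph exercise. The example below is a simplified model, not an AWS tool: the service names, the depends_on map, and both helper functions are hypothetical. It computes every component that has to move with an application for the whole stack to fail over together, and it lists the calls that end up crossing Regions when one workload fails over on its own.

```python
from collections import deque


def failover_group(root, depends_on):
    """Return root plus everything it depends on, directly or transitively."""
    seen = {root}
    queue = deque([root])
    while queue:
        service = queue.popleft()
        for dep in depends_on.get(service, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def cross_region_calls(placement, depends_on):
    """List (caller, dependency) pairs whose two ends run in different Regions."""
    return sorted(
        (service, dep)
        for service, deps in depends_on.items()
        for dep in deps
        if placement[service] != placement[dep]
    )


if __name__ == "__main__":
    # Hypothetical microservice dependency map.
    depends_on = {
        "checkout": ["cart", "payments"],
        "cart": ["inventory"],
        "payments": [],
        "inventory": [],
    }
    print(sorted(failover_group("checkout", depends_on)))

    # Only "checkout" fails over; its dependencies stay in the original Region.
    placement = {
        "checkout": "region-b",
        "cart": "region-a",
        "payments": "region-a",
        "inventory": "region-a",
    }
    print(cross_region_calls(placement, depends_on))
```

In this model, a coordinated failover moves the whole group returned by failover_group, while an independent failover leaves the pairs returned by cross_region_calls paying cross-Region latency, which is the modal behavior the text warns about.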

AWS does not provide a synchronous Cross-Region replication feature at this time. When using an asynchronously replicated data store (provided by AWS) across Regions, there is the possibility of data loss or inconsistency when you fail over your application between Regions. To mitigate possible inconsistencies, you need a reliable data reconciliation process that you have confidence in, and that process may need to operate on multiple data stores across your workload portfolio; otherwise, you need to be willing to accept data loss. Finally, you need to practice the failover to know that it will work when you need it. Regularly rotating your application between Regions to practice failover is a substantial time and resource investment.

If you decide to use a synchronously replicated data store across Regions to support applications running from more than one Region concurrently, the performance characteristics and latency of a database that spans hundreds or thousands of miles are very different from those of a database operating in a single Region. This requires you to plan your application stack from the ground up to account for this behavior. It also makes the availability of both Regions a hard dependency, which could result in decreased resilience of your workload.
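The two trade-offs above can be illustrated with some rough arithmetic. The sketch below is a simplified model under stated assumptions, not a description of how any AWS service replicates data: it assumes a constant replication lag, and it approximates signal speed in optical fiber as about 200 km per millisecond (roughly two thirds of the speed of light in a vacuum). The function names and sample numbers are hypothetical.

```python
# Assumption: light travels through optical fiber at roughly 200 km per ms.
FIBER_KM_PER_MS = 200.0


def unreplicated_writes(write_times, failure_time, replication_lag):
    """Return acknowledged writes that had not reached the replica at failure.

    Simplified model: a write made at time t arrives in the other Region at
    t + replication_lag. Anything that would arrive after failure_time is lost
    (or needs reconciliation) when you fail over.
    """
    return [
        t for t in write_times
        if t <= failure_time and t + replication_lag > failure_time
    ]


def min_sync_round_trip_ms(distance_km):
    """Lower bound on the extra latency of one synchronous cross-Region commit.

    Counts only propagation delay over a straight fiber path, there and back.
    Real network paths and processing time add to this.
    """
    return 2 * distance_km / FIBER_KM_PER_MS


if __name__ == "__main__":
    # Asynchronous case: times in seconds, with 0.8 s of replication lag.
    lost = unreplicated_writes([0.0, 1.0, 2.0, 2.5, 2.9], failure_time=3.0,
                               replication_lag=0.8)
    print("writes at risk:", lost)

    # Synchronous case: Regions about 4,000 km apart.
    print("minimum added commit latency (ms):", min_sync_round_trip_ms(4000))
```

Under these assumptions, asynchronous replication keeps writes fast but leaves a window of recent writes at risk, while synchronous replication closes that window at the cost of tens of milliseconds on every commit.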