Zonal services - AWS Fault Isolation Boundaries

Zonal services

Availability Zone Independence (AZI) enables AWS to offer zonal services, like Amazon EC2 and Amazon EBS. A zonal service is one that lets you specify which Availability Zone its resources are deployed into. These services operate independently in each Availability Zone within a Region and, more importantly, fail independently in each Availability Zone as well. This means that components of a service in one Availability Zone don’t take dependencies on components in other Availability Zones. We can do this because a zonal service has zonal data planes. In some cases, such as with EC2, the service also includes zonal control planes for zonally aligned operations, such as launching an EC2 instance. For those services, AWS also provides a Regional control plane endpoint to make it easy to interact with the service. The Regional control plane also provides Regionally scoped functionality and serves as an aggregation and routing layer on top of the zonal control planes. This is shown in the following figure.


A zonal service with zonally isolated control planes and data planes

Availability Zones give customers the ability to operate production workloads that are more highly available, fault tolerant, and scalable than would be possible from a single data center. When a workload uses multiple Availability Zones, customers are better isolated and protected from issues that impact a single Availability Zone’s physical infrastructure. This helps customers to build services that are redundant across Availability Zones and, if architected correctly, remain operational even if one Availability Zone experiences failures. Customers can take advantage of AZI to create highly available and resilient workloads. Implementing AZI in your architecture helps you to quickly recover from an isolated Availability Zone failure because your resources in one Availability Zone minimize or eliminate interaction with resources in other Availability Zones. This helps remove cross-Availability Zone dependencies, which simplifies Availability Zone evacuation. Refer to Advanced Multi-AZ Resilience Patterns for more details on creating Availability Zone evacuation mechanisms. Additionally, you can further take advantage of Availability Zones by following some of the same best practices AWS uses for its own services, such as only deploying changes to a single Availability Zone at a time or removing an Availability Zone from service if a change in that Availability Zone goes badly.
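The practice of deploying to one Availability Zone at a time, and removing a zone from service if the change goes badly, can be sketched roughly as follows. This is a minimal illustration, not an AWS API: the function name and the `deploy`, `is_healthy`, and `evacuate` callables are hypothetical placeholders for whatever deployment, health-check, and evacuation mechanisms your workload uses.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def deploy_one_zone_at_a_time(
    zones: Iterable[str],
    deploy: Callable[[str], None],       # hypothetical: applies the change in one zone
    is_healthy: Callable[[str], bool],   # hypothetical: checks the zone after the change
    evacuate: Callable[[str], None],     # hypothetical: shifts traffic away from the zone
) -> Tuple[List[str], Optional[str]]:
    """Roll a change out zone by zone, stopping at the first unhealthy zone.

    Returns the zones that were updated successfully and the zone that was
    evacuated (or None if every zone stayed healthy).
    """
    completed: List[str] = []
    for zone in zones:
        deploy(zone)
        if not is_healthy(zone):
            # The change went badly here: take this zone out of service and
            # stop, so the remaining zones never receive the change.
            evacuate(zone)
            return completed, zone
        completed.append(zone)
    return completed, None
```

Because each zone is changed and verified before the next one is touched, a bad change is contained to a single Availability Zone, which the remaining zones can absorb only if they do not depend on it.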

Static stability is also an important concept for Multi-Availability Zone architectures. One of the failure modes you should plan for with Multi-Availability Zone architectures is the loss of an Availability Zone, which can result in the loss of an Availability Zone’s worth of capacity. If you haven’t pre-provisioned enough capacity to handle the loss of an Availability Zone, this could result in your remaining capacity being overwhelmed by the current load. Additionally, you will need to depend on the control planes of the zonal services you use to replace that lost capacity, which can be less reliable than a statically-stable design. In this case, pre-provisioning enough extra capacity can help you be statically-stable to the loss of a fault domain, such as an Availability Zone, by being able to continue normal operations without the need for dynamic changes.

You may choose to use an auto scaling group of EC2 instances deployed across multiple Availability Zones to dynamically scale in and out, based on the needs of your workload. Auto scaling works well for gradual changes in usage that occur over minutes to tens of minutes. However, launching new EC2 instances takes time, especially if your instances require bootstrapping (such as installing agents, application binaries, or configuration files). During this time, your remaining capacity could be overwhelmed by the current load. Additionally, deploying new instances through auto scaling relies on the EC2 control plane. This presents a trade-off: To be statically-stable to the loss of a single Availability Zone, you need to pre-provision enough EC2 instances in the other Availability Zones to handle the load that has been shifted away from the impaired Availability Zone, instead of relying on auto scaling to provision new instances. However, pre-provisioning extra capacity can incur additional cost.

For example, during normal operation, let’s assume your workload requires six instances to serve customer traffic across three Availability Zones. To be statically-stable against a single Availability Zone failure, you would deploy three instances in each Availability Zone, for a total of nine. If a single Availability Zone-worth of instances failed, you would still have six left and be able to continue to serve your customer traffic without the need to provision and configure new instances during the failure. Achieving static stability for your EC2 capacity has additional cost, since, in this case, you are running 50% more instances than normal operation requires. Not every pre-provisioned resource incurs additional cost; for example, pre-provisioning an S3 bucket or a user does not. You will need to weigh any trade-offs of implementing static stability against the risk of exceeding the desired recovery time for your workload.
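The arithmetic in this example generalizes to other fleet sizes and zone counts. The following is a minimal sketch, assuming capacity is spread evenly across zones and measured in whole instances; the function names are illustrative and not part of any AWS SDK.

```python
import math


def instances_per_zone(required: int, zones: int, zones_lost: int = 1) -> int:
    """Instances to run in each zone so that `required` remain after losing `zones_lost` zones."""
    surviving = zones - zones_lost
    if required < 1 or surviving < 1:
        raise ValueError("need at least one required instance and one surviving zone")
    # Round up: a fractional instance cannot be provisioned.
    return math.ceil(required / surviving)


def static_stability_plan(required: int, zones: int, zones_lost: int = 1) -> dict:
    """Summarize the pre-provisioned fleet and its overhead versus normal-operation needs."""
    per_zone = instances_per_zone(required, zones, zones_lost)
    total = per_zone * zones
    overhead_pct = round((total - required) / required * 100, 1)
    return {"per_zone": per_zone, "total": total, "overhead_pct": overhead_pct}


# The example from the text: six instances needed, three Availability Zones.
print(static_stability_plan(required=6, zones=3))
# {'per_zone': 3, 'total': 9, 'overhead_pct': 50.0}
```

Under the same assumptions, spreading the workload over more Availability Zones lowers the overhead, because each zone carries a smaller share of the total capacity.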

AWS Local Zones and Outposts bring the data plane of select AWS services closer to end users. The control planes for these services reside in the parent Region. Your Local Zone or Outposts instances will have control plane dependencies for zonal services like EC2 and EBS on the Availability Zone where you created your Local Zone or Outposts subnet. They will also have dependencies on Regional control planes for Regional services like Elastic Load Balancing (ELB), security groups, and the Amazon Elastic Kubernetes Service (Amazon EKS)-managed Kubernetes control plane (if you use EKS). For additional information specific to Outposts, refer to the Outposts documentation and the support and maintenance FAQ. Implement static stability when using Local Zones or Outposts to help improve resilience to control plane impairments or interruptions in network connectivity to the parent Region.