Static stability - AWS Fault Isolation Boundaries

Static stability

One of the most important resilience characteristics of AWS services is what AWS calls static stability. A statically stable system operates in a static state and continues to operate as normal, without needing to make changes, when its dependencies fail or become unavailable. One way we achieve this is by preventing circular dependencies in our services that could stop one of those services from recovering. Another way is by maintaining existing state. Control planes are statistically more likely to fail than data planes. Although the data plane typically depends on data that arrives from the control plane, the data plane maintains its existing state and continues working even when the control plane is impaired. Once a resource is provisioned, data plane access to it has no dependency on the control plane and is therefore not affected by a control plane impairment. In other words, even if the ability to create, modify, or delete resources is impaired, existing resources remain available. This makes AWS data planes statically stable against control plane impairments. You can implement different patterns to be statically stable against different types of dependency failures.
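The idea of a data plane that keeps its existing state can be sketched in a few lines of Python. This is an illustrative sketch only, not how any AWS service is implemented: the class `StaticallyStableDataPlane`, the exception `ControlPlaneUnavailable`, the helper `make_flaky_control_plane`, and the sample routing entry are all hypothetical names introduced for the example.

```python
from typing import Callable, Dict, Optional


class ControlPlaneUnavailable(Exception):
    """Raised by the (hypothetical) control plane client when it is impaired."""


class StaticallyStableDataPlane:
    """Serves requests from the last state it successfully received."""

    def __init__(self, fetch_config: Callable[[], Dict[str, str]]) -> None:
        self._fetch_config = fetch_config
        self._last_known_good: Optional[Dict[str, str]] = None

    def refresh(self) -> bool:
        """Try to pull new state; keep the existing state if the fetch fails."""
        try:
            new_config = self._fetch_config()
        except ControlPlaneUnavailable:
            return False  # no change made: keep serving from existing state
        self._last_known_good = dict(new_config)
        return True

    def route(self, key: str) -> str:
        """Request path: reads local state only, never calls the control plane."""
        if self._last_known_good is None:
            raise RuntimeError("no configuration has ever been received")
        return self._last_known_good[key]


def make_flaky_control_plane() -> Callable[[], Dict[str, str]]:
    """Hypothetical control plane that answers once and is impaired afterwards."""
    calls = {"count": 0}

    def fetch() -> Dict[str, str]:
        calls["count"] += 1
        if calls["count"] > 1:
            raise ControlPlaneUnavailable("control plane impaired")
        return {"orders": "10.0.0.5"}

    return fetch


plane = StaticallyStableDataPlane(make_flaky_control_plane())
first_refresh = plane.refresh()    # control plane healthy: state is stored
second_refresh = plane.refresh()   # control plane impaired: state is kept
target = plane.route("orders")     # still served from the stored state
```

The design choice to notice is that `route` never reaches the control plane. An impairment can delay new configuration, but it cannot take away configuration the data plane already holds.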

An example of static stability can be found in Amazon EC2. Once an EC2 instance has been launched, it is just as available as a physical server in a data center. It does not depend on any control plane APIs to stay running, or to start running again after a reboot. The same property holds for other AWS resources such as VPCs, Amazon S3 buckets and objects, and Amazon EBS volumes.

Static stability is deeply ingrained in how AWS designs its services, but it is also a pattern that customers can use. In fact, most of the best practice guidance for using the different types of AWS services in a resilient way amounts to implementing static stability in production environments. The most reliable recovery and mitigation mechanisms are the ones that require the fewest changes to achieve recovery. For example, instead of relying on the EC2 control plane to launch new EC2 instances to recover from a failed Availability Zone, you can achieve static stability by having that extra capacity pre-provisioned. Eliminating dependencies on control planes (the APIs that implement changes to resources) from your recovery path therefore helps produce more resilient workloads. For more details on static stability, control planes, and data planes, refer to the Amazon Builders’ Library article Static stability using Availability Zones.
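To make the pre-provisioning idea concrete, the following Python sketch works out how many instances to run in each Availability Zone so that the remaining zones can carry the full load after a zone is lost, with no new launches required. It assumes load is spread evenly across zones. The function names `per_az_capacity` and `total_capacity` and the sample figures are hypothetical, chosen only for illustration.

```python
def per_az_capacity(required: int, az_count: int, az_failures_tolerated: int = 1) -> int:
    """Instances to pre-provision in each Availability Zone.

    `required` is the number of instances needed to serve peak load.
    The surviving zones alone must be able to provide that many.
    """
    if required < 0:
        raise ValueError("required must be non-negative")
    surviving = az_count - az_failures_tolerated
    if surviving <= 0:
        raise ValueError("at least one Availability Zone must survive")
    # Integer ceiling division: round up so the survivors are never short.
    return -(-required // surviving)


def total_capacity(required: int, az_count: int, az_failures_tolerated: int = 1) -> int:
    """Total instances running across all zones during normal operation."""
    return az_count * per_az_capacity(required, az_count, az_failures_tolerated)


# Sample figures: 6 instances needed at peak, spread across 3 zones.
each = per_az_capacity(6, 3)   # 3 per zone, so any 2 zones still provide 6
total = total_capacity(6, 3)   # 9 in total, 50% more than the 6 required
```

The extra instances are the cost of not depending on the EC2 control plane during recovery. With more zones the overhead shrinks: under the same assumptions, 6 required instances across 4 zones need 2 per zone, or 8 in total.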