Gray failures
Gray failures are defined by the characteristic of differential observability.
Differential observability
The workloads that you operate typically have dependencies. For example, these can be the AWS cloud services that you use to build your workload or a third-party identity provider (IdP) that you use for federation. Those dependencies almost always implement their own observability, recording metrics about errors, availability, and latency, among other things, generated by their customers' usage. When a threshold is crossed for one of these metrics, the dependency usually takes some action to correct it.
These dependencies usually have multiple consumers of their services. Consumers also implement their own observability, recording metrics and logs about their interactions with their dependencies: things like how much latency there is in disk reads, how many API requests failed, or how long a database query took.
These interactions and measurements are depicted in an abstract model in the following figure.
First, we have the system, which is a dependency for consumers App 1, App 2, and App 3 in this scenario. The system has a failure detector that examines metrics created from the core business process. It also has a failure response mechanism to mitigate or correct problems that are observed by the failure detector. The system sees an overall average latency of 53 ms and has set a threshold to invoke the failure response mechanism when average latency exceeds 60 ms. App 1, App 2, and App 3 are also making their own observations about their interaction with the system, recording an average latency of 50 ms, 53 ms, and 56 ms respectively.
Differential observability is the situation where one of the system's consumers detects that the system is unhealthy, but the system's own monitoring does not detect the problem, or the impact does not cross an alarm threshold. Let's imagine that App 1 starts experiencing an average latency of 70 ms instead of 50 ms. App 2 and App 3 don't see a change in their average latencies. This increases the average latency of the underlying system to roughly 59.67 ms, which does not cross the latency threshold that activates the failure response mechanism. However, App 1 sees a 40% increase in latency. This could impact its availability by exceeding the configured client timeout for App 1, or it might cause cascading impacts in a longer chain of interactions. From the perspective of App 1, the underlying system it depends on is unhealthy, but from the perspective of the system itself as well as App 2 and App 3, the system is healthy. The following figure summarizes these different perspectives.
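To make the arithmetic concrete, the following Python sketch reproduces the numbers from the figure. The 60 ms system threshold comes from the scenario above; the client timeout value is purely illustrative and is not part of the original example.

```python
# Reproduces the latency numbers from the differential observability example.
SYSTEM_THRESHOLD_MS = 60   # the system's failure-response threshold (from the scenario)
APP1_TIMEOUT_MS = 65       # hypothetical client-side timeout for App 1 (illustrative only)

# Average latency each consumer observes after App 1 degrades from 50 ms to 70 ms.
consumer_latency_ms = {"App 1": 70, "App 2": 53, "App 3": 56}

system_average = sum(consumer_latency_ms.values()) / len(consumer_latency_ms)
print(f"System-wide average latency: {system_average:.2f} ms")                      # ~59.67 ms
print(f"System failure response invoked: {system_average > SYSTEM_THRESHOLD_MS}")   # False

# App 1's own view: a 40% latency increase that now exceeds its timeout,
# even though the system's aggregate metric still looks healthy.
print(f"App 1 sees impact: {consumer_latency_ms['App 1'] > APP1_TIMEOUT_MS}")       # True
```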
A failure can also move between the quadrants of this model. An event could start as a gray failure, then become a detected failure, then move to a masked failure, and then perhaps back to a gray failure. There is no defined cycle, and there's almost always a chance of the failure recurring until its root cause is addressed.
The conclusion we draw from this is that workloads can’t always rely on the underlying system to detect and mitigate the failure. No matter how sophisticated and resilient the underlying system is, there will always be the chance that a failure could go undetected or stay under the reaction threshold. The consumers of that system, like App 1, need to be equipped to both quickly detect and mitigate the impact a gray failure causes. This requires building observability and recovery mechanisms for these situations.
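As an example of what that consumer-side observability can look like, the following sketch records the latency that a consumer such as App 1 observes for each dependency call and publishes it with the CloudWatch PutMetricData API. The namespace, metric name, and dimensions are assumptions chosen for illustration, not a prescribed schema.

```python
import time
import boto3

cloudwatch = boto3.client("cloudwatch")

def call_with_latency_metric(call_dependency, dependency_name, az_id):
    """Invoke a dependency and publish the latency this consumer observed.

    The namespace, metric name, and dimensions are illustrative. The point is
    that the consumer records its own view of the dependency, per dependency
    and per Availability Zone, so it can alarm on a gray failure that the
    dependency's own monitoring doesn't surface.
    """
    start = time.monotonic()
    try:
        return call_dependency()
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        cloudwatch.put_metric_data(
            Namespace="App1/DependencyMetrics",
            MetricData=[
                {
                    "MetricName": "DependencyLatency",
                    "Dimensions": [
                        {"Name": "Dependency", "Value": dependency_name},
                        {"Name": "AZ-ID", "Value": az_id},
                    ],
                    "Value": elapsed_ms,
                    "Unit": "Milliseconds",
                }
            ],
        )
```

In practice you would batch these publishes or emit them through an agent, and pair them with alarms on percentile latency and error rate; the essential idea is that detection is based on what the consumer experiences, not only on what the dependency reports about itself.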
Gray failure example
Gray failures can impact multi-AZ systems in AWS. For example, take a fleet of Amazon EC2 instances in an Auto Scaling group, spread across multiple Availability Zones behind a Network Load Balancer and backed by an Amazon RDS database cluster, and suppose the workload starts experiencing impact to its availability that is isolated to Availability Zone 1.
In this example, Amazon EC2 sees the instances in Availability Zone 1 as healthy because they continue to pass system and instance status checks. Amazon EC2 Auto Scaling also doesn't detect direct impact to any Availability Zone, and continues to launch capacity in the configured Availability Zones. The Network Load Balancer (NLB) also sees the instances behind it as healthy, as do the Route 53 health checks that are performed against the NLB endpoint. Similarly, Amazon Relational Database Service (Amazon RDS) sees the database cluster as healthy and does not trigger an automated failover. Many different services all see their resources as healthy, but the workload still detects a failure that impacts its availability. This is a gray failure.
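Because every service-side signal in this example reports healthy, detection has to come from the workload's own metrics, broken out by Availability Zone. The following sketch shows the general idea; the thresholds and data shapes are assumptions for illustration rather than a prescribed detection algorithm.

```python
def find_outlier_az(errors_by_az, requests_by_az, min_error_rate=0.01, min_ratio=3.0):
    """Return the Availability Zone whose error rate stands out from its peers, or None.

    errors_by_az and requests_by_az map an AZ ID (for example "use1-az1") to counts
    taken from the workload's own metrics. The thresholds here are illustrative;
    production detectors typically use statistical tests over several evaluation periods.
    """
    rates = {
        az: errors_by_az.get(az, 0) / requests_by_az[az]
        for az in requests_by_az
        if requests_by_az[az] > 0
    }
    if len(rates) < 2:
        return None
    suspect = max(rates, key=rates.get)
    peer_rate = max(rate for az, rate in rates.items() if az != suspect)
    # Flag the AZ only if its error rate is both material and far above its peers.
    if rates[suspect] >= min_error_rate and rates[suspect] >= min_ratio * max(peer_rate, 1e-9):
        return suspect
    return None

# A single-AZ gray failure: only Availability Zone 1 (use1-az1) sees elevated errors.
print(find_outlier_az(
    errors_by_az={"use1-az1": 40, "use1-az2": 1, "use1-az3": 2},
    requests_by_az={"use1-az1": 1000, "use1-az2": 1000, "use1-az3": 1000},
))  # -> use1-az1
```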
Responding to gray failures
When you experience a gray failure in your AWS environment, you generally have three available options:
- Do nothing and wait for the impairment to end.
- If the impairment is isolated to a single Availability Zone, evacuate that Availability Zone.
- Fail over to another AWS Region and use the benefits of AWS Regional isolation to mitigate the impact.
Many AWS customers are fine with option one for a majority of their workloads. They accept a possibly extended Recovery Time Objective (RTO) as the tradeoff for not having to build additional observability or resilience solutions. Other customers choose to implement the third option, multi-Region disaster recovery (DR). However, there are several considerations with this approach.
First, building and operating a multi-Region architecture can be a challenging, complex, and potentially expensive endeavor. Multi-Region architectures require careful consideration of which DR strategy you select. It might not be fiscally viable to implement a multi-Region active-active DR solution just to handle zonal impairments, while a backup and restore strategy might not meet your resilience requirements. Additionally, multi-Region failovers must be continuously practiced in production so that you are confident they will work when needed. This all requires a lot of dedicated time and resources to build, operate, and test.
Second, data replication across AWS Regions using AWS services is currently done asynchronously, and asynchronous replication can result in data loss. This means that during a Regional failover, there is a chance of some amount of data loss and inconsistency. Your tolerance for the amount of data loss is defined as your Recovery Point Objective (RPO). Customers for whom strong data consistency is a requirement have to build reconciliation systems to fix these consistency issues when the primary Region becomes available again, or they have to build their own synchronous replication or dual-write systems, which can have significant impacts on response latency, cost, and complexity. They also make the secondary Region a hard dependency for every transaction, which can potentially reduce the availability of the overall system.
Finally, for many workloads using an active/standby approach, there is a non-zero amount of time required to perform the failover to another Region. Your portfolio of workloads might need to be brought down in the primary Region in a specific order, need to drain connections, or stop specific processes. Then, the services might need to be brought back up in a specific order. New resources might also need to be provisioned or require time to pass health checks before being brought into service. This failover process can be experienced as a period of complete unavailability. This is what RTOs are concerned with.
Inside a Region, many AWS services offer strongly consistent data persistence. Amazon RDS Multi-AZ deployments use synchronous replication, and Amazon Simple Storage Service (Amazon S3) provides strong read-after-write consistency.
Evacuating an Availability Zone can have a lower RTO than a multi-Region strategy, because your infrastructure and resources are already provisioned across Availability Zones. Instead of needing to carefully order services being brought down and back up, or draining connections, multi-AZ architectures can continue operating in a static way when an Availability Zone is impaired. Instead of a period of complete unavailability that can occur during a Regional failover, many systems see only a slight degradation during an Availability Zone evacuation, as work is shifted to the remaining Availability Zones. If the system has been designed to be statically stable, meaning it is pre-provisioned with enough capacity in the remaining Availability Zones to absorb the load of the impaired one without having to launch new resources, it might see no impact at all.
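As one concrete way to shift work away from an impaired Availability Zone for a load balancer, a zonal shift in Route 53 Application Recovery Controller can temporarily stop the NLB from routing traffic to that Availability Zone. The sketch below is a minimal illustration; the load balancer ARN and Availability Zone ID are placeholders, and the expiry value is arbitrary.

```python
import boto3

# Placeholders: substitute your load balancer ARN and the AZ ID you need to shift away from.
NLB_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/my-nlb/abcdef1234567890"
IMPAIRED_AZ_ID = "use1-az1"

zonal_shift = boto3.client("arc-zonal-shift")

# Start a temporary zonal shift so the load balancer stops sending traffic to the
# impaired Availability Zone; the shift expires automatically unless extended.
response = zonal_shift.start_zonal_shift(
    resourceIdentifier=NLB_ARN,
    awayFrom=IMPAIRED_AZ_ID,
    expiresIn="60m",  # illustrative expiry; extend or cancel it as the event evolves
    comment="Evacuating impaired Availability Zone",
)
print(response["zonalShiftId"])
```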
It's possible that the impairment of a single Availability Zone also impacts one or more AWS Regional services in addition to your workload. If you observe Regional impact, you should treat the event as a Regional service impairment, even though the source of that impact is a single Availability Zone. Evacuating an Availability Zone will not mitigate this type of problem. When this occurs, use the response plans you have in place for responding to a Regional service impairment.
The rest of this document focuses on the second option, evacuating the Availability Zone, as a way to achieve lower RTOs and RPOs for single-AZ gray failures. These patterns help you get more value and efficiency from multi-AZ architectures and, for most classes of workloads, can reduce the need to build multi-Region architectures to handle these types of events.