Increasing MTBF
The final component to improving availability is increasing the MTBF. This applies both to the software and to the AWS services used to run it.
Increasing distributed system MTBF
One way to increase MTBF is to reduce defects in the software. There are several ways to do this. Customers can use tools like Amazon CodeGuru Reviewer to help detect defects during code reviews.
Deploying smaller changes can also help prevent unexpected outcomes by reducing the complexity of each change. Each of these activities provides an opportunity to identify and fix defects before they can ever be invoked.
Another approach to preventing failure is regular testing. Implementing a chaos engineering program can help test how your workload fails, validate recovery procedures, and help find and fix failure modes before they occur in production. Customers can use AWS Fault Injection Simulator to run controlled fault injection experiments against their workloads.
Fault tolerance is another way to prevent failure in a distributed system. Fail-fast modules, retries with exponential backoff and jitter, transactions, and idempotency are all techniques to help make workloads fault tolerant.
Transactions are groups of operations that adhere to the ACID properties. They are as follows:
- Atomicity – Either all of the actions happen or none of them happen.
- Consistency – Each transaction leaves the workload in a valid state.
- Isolation – Transactions performed concurrently leave the workload in the same state as if they had been performed sequentially.
- Durability – Once a transaction commits, all of its effects are preserved even in the case of workload failure.
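Idempotency, mentioned above as a fault-tolerance technique, lets a request be safely retried without applying its side effects twice. A minimal sketch, with a hypothetical in-memory store of client-supplied idempotency keys (a production workload would persist them with a TTL):

```python
class IdempotentHandler:
    """Deduplicate retried requests using a client-supplied idempotency key."""

    def __init__(self, apply):
        self._apply = apply   # the actual side-effecting operation
        self._results = {}    # idempotency_key -> result of the first attempt

    def handle(self, idempotency_key, request):
        # A retry of an already-applied request returns the stored result
        # instead of performing the side effect a second time.
        if idempotency_key not in self._results:
            self._results[idempotency_key] = self._apply(request)
        return self._results[idempotency_key]
```

With this pattern, a client that times out and retries the same request (same key) observes exactly-once application of the operation.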
Retries with exponential backoff and jitter give operations affected by transient failures a chance to succeed on a subsequent attempt without overwhelming the dependency being called.
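Retries with exponential backoff and jitter can be sketched as follows; the helper name and parameters are illustrative, and the "full jitter" variant shown here is one common choice:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a zero-argument callable that raises on transient failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Full jitter: sleep a random amount up to the capped exponential
            # backoff, which spreads retries out and avoids synchronized
            # retry storms against the dependency.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```

A Heisenbug that fails one attempt will often not recur on the retry, which is exactly the class of failure this technique absorbs.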
If we consider the effect of a Heisenbug on a fault-tolerant hardware configuration, we'd be fairly unconcerned, since the probability of the Heisenbug appearing on both the primary and redundant subsystems is infinitesimally small. (See Jim Gray, "Why Do Computers Stop and What Can Be Done About It?", 1985.)
When a Heisenbug is invoked, it's imperative that the software quickly detects the incorrect operation and fails so that the work can be tried again. This is achieved through defensive programming and validating inputs, intermediate results, and outputs. Additionally, processes are isolated and share no state with other processes.
This modular approach ensures that the scope of impact during failure is limited. Processes fail independently. When a process does fail, the software should use “process-pairs” to retry the work, meaning a new process can assume the work of a failed one. To maintain the reliability and integrity of the workload, each operation should be treated as an ACID transaction.
This allows a process to fail without corrupting the state of the workload: the transaction is aborted and any changes it made are rolled back, so the recovery process can retry the transaction from a known-good state and restart gracefully. This is how software can be made fault tolerant to Heisenbugs.
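The fail-fast-and-retry-as-a-transaction pattern can be sketched with SQLite, whose connection context manager commits on success and rolls back on any exception. The schema and transfer operation are hypothetical examples, not from the source:

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Apply a debit and credit as a single ACID transaction.

    Assumed schema: accounts(name TEXT PRIMARY KEY, balance INTEGER).
    """
    with conn:  # commits on success, rolls back on any exception
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src))
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
        if balance < 0:
            # Validate the intermediate result and fail fast; the rollback
            # restores the known-good state the transaction started from,
            # so a surviving process of the pair can simply retry.
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst))
```

Because the abort leaves no partial changes behind, a retry by a new process starts from exactly the state the failed process saw.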
However, you should not aim to make software fault tolerant to Bohrbugs. These defects must be found and removed before the workload enters production, since no level of redundancy will ever achieve a correct outcome. (See Jim Gray, "Why Do Computers Stop and What Can Be Done About It?", 1985.)
The final way to increase MTBF is to reduce the scope of impact from failure. Using fault isolation through modularization to create fault containers is a primary way to do so, as outlined earlier in Fault tolerance and fault isolation. Reducing the failure rate improves availability. AWS uses techniques like dividing services into control planes and data planes, Availability Zone Independence (AZI), and cell-based architectures to contain the scope of impact of failures.
For example, let's review a scenario where a workload placed customers into different fault containers of its infrastructure, each servicing at most 5% of the total customers. One of these fault containers experiences an event that increases latency beyond the client timeout for 10% of requests. During this event, for 95% of customers, the service was 100% available. For the other 5%, the service appeared to be 90% available. This results in an availability of 1 − (5% of customers × 10% of their requests) = 99.5%, instead of 10% of requests failing for 100% of customers (resulting in a 90% availability).
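The arithmetic in this scenario generalizes to any fault-container sizing; a small sketch of the calculation:

```python
def partial_impact_availability(affected_customer_fraction, failed_request_fraction):
    """Availability when failure is contained to one fault container.

    Only the affected fraction of customers sees the elevated failure rate.
    """
    return 1 - affected_customer_fraction * failed_request_fraction

# Fault isolation: one container holding 5% of customers fails 10% of requests.
isolated = partial_impact_availability(0.05, 0.10)    # 99.5% available
# No isolation: every customer sees the 10% failure rate.
unisolated = partial_impact_availability(1.00, 0.10)  # 90% available
```

Shrinking the fault container directly shrinks the first factor, which is why smaller containers yield higher workload availability during an event.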
Rule 11
Fault isolation decreases scope of impact and increases the MTBF of the workload by reducing the overall failure rate.
Increasing dependency MTBF
The first method to increase your AWS dependency MTBF is through fault isolation. Many AWS services offer isolation at the Availability Zone (AZ) level, meaning a failure in one AZ does not affect the service in a different AZ.
Using redundant EC2 instances in multiple AZs increases subsystem availability. Availability Zone Independence (AZI) provides a sparing capability within a single Region, allowing you to increase the availability of workloads built on AZI services.
However, not all AWS services operate at the AZ level. Many others offer regional isolation. In this case, where the designed-for availability of the regional service doesn't support the overall availability required for your workload, you might consider a multi-Region approach. Each Region offers an isolated instantiation of the service, equivalent to sparing.
Various services, such as Amazon Route 53, AWS Global Accelerator, and Amazon DynamoDB global tables, can help make building a multi-Region workload easier.
This document doesn't delve into the strategies of building multi-Region workloads, but you should weigh the availability benefits of multi-Region architectures against the additional cost, complexity, and operational practices they require to meet your desired availability goals.
The next method to increase dependency MTBF is to design your workload to be statically stable. For example, consider a workload that serves product information. When your customers make a request for a product, your service makes a request to an external metadata service to retrieve product details. Then your workload returns all of that information to the user.
However, if the metadata service is unavailable, the requests made by your customers fail. Instead, you can asynchronously pull or push the metadata locally to your service to be used to answer requests. This eliminates the synchronous call to the metadata service from your critical path.
Additionally, because your service remains available even when the metadata service is not, you can remove it as a dependency in your availability calculation. This example depends on the assumption that the metadata doesn't change frequently and that serving stale metadata is better than failing the request. A similar example is serve-stale DNS, where a resolver continues to answer from expired cached records when it cannot refresh them.
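The statically stable metadata pattern described above can be sketched as a local cache refreshed asynchronously; the class and the `fetch_all` callable are hypothetical names, not part of any AWS API:

```python
import threading

class StaticallyStableMetadata:
    """Serve metadata from a local cache refreshed off the critical path.

    'fetch_all' is an assumed callable returning {product_id: details}
    from the external metadata service.
    """

    def __init__(self, fetch_all):
        self._fetch_all = fetch_all
        self._cache = {}
        self._lock = threading.Lock()

    def refresh(self):
        # Run periodically (e.g. from a background thread or timer). If the
        # metadata service is down, keep serving the last known-good copy.
        try:
            fresh = self._fetch_all()
        except Exception:
            return  # stale data is preferable to failing requests
        with self._lock:
            self._cache = fresh

    def get(self, product_id):
        # The request path never calls the metadata service synchronously.
        with self._lock:
            return self._cache.get(product_id)
```

Because `get` only reads local state, an outage of the metadata service degrades freshness, not availability.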
The final method to increase dependency MTBF is to reduce the scope of impact from failure. As discussed earlier, failure is not a binary event; there are degrees of failure. This is the effect of modularization: failure is contained to just the requests or users being serviced by that fault container.
This results in fewer failures during an event which ultimately increases availability of the overall workload by limiting the scope of impact.
Reducing common sources of impact
In 1985, Jim Gray discovered during a study at Tandem Computers that failure was primarily driven by two things: software and operations. (See Jim Gray, "Why Do Computers Stop and What Can Be Done About It?", 1985.)
Stability compared with features
If we refer back to the failure rates for software and hardware graph in the section Distributed system availability, we can see that defects are added in each software release. This means that any change to the workload introduces an increased risk of failure. These changes are typically things like new features, which provides a corollary: higher-availability workloads will favor stability over new features. Thus, one of the simplest ways to improve availability is to deploy less often or deliver fewer features. Workloads that deploy more frequently will inherently have lower availability than those that do not. However, workloads that fail to add features do not keep up with customer demand and can become less useful over time.
So, how do we continue to innovate and release features safely? The answer is standardization. What is the correct way to deploy? How do you order deployments? What are the standards for testing? How long do you wait between stages? Do your unit tests cover enough of the software? These are questions that standardization answers, preventing issues caused by things like skipping load tests, skipping deployment stages, or deploying too quickly to too many hosts.
The way that you implement standardization is through automation. It reduces the chance of human mistakes and lets computers do what they're good at: doing the same thing the same way every time. You tie standardization and automation together by setting goals, such as no manual changes, host access only through contingent authorization systems, and load tests for every API. Operational excellence is a cultural norm that can require substantial change. Establishing and tracking performance against a goal helps drive cultural change that will have a broad impact on workload availability. The AWS Well-Architected Operational Excellence pillar provides comprehensive best practices for operational excellence.
Operator safety
The other major contributor to operational events that introduce failure is people. Humans make mistakes. They might use the wrong credentials, enter the wrong command, press Enter too soon, or miss a critical step. Manual actions, performed often enough, eventually result in errors, and errors lead to failure.
One of the major causes of operator error is confusing, unintuitive, or inconsistent user interfaces. Jim Gray also noted in his 1985 study that "interfaces that ask the operator for information or ask him to perform some function must be simple, consistent, and operator fault-tolerant." (See Jim Gray, "Why Do Computers Stop and What Can Be Done About It?", 1985.)
Rule 12
Make it easy for operators to do the right thing.
Preventing overload
The final common contributor of impact is your customers, the actual users of your workload. Successful workloads tend to get used a lot, and sometimes that usage outpaces the workload's ability to scale. Many things can happen: disks can become full, thread pools can be exhausted, network bandwidth can be saturated, or database connection limits can be reached.
There is no failproof method to eliminate these failures, but proactive monitoring of capacity and utilization through operational health metrics will provide early warning when they might occur. Techniques like load shedding, where excess work is rejected quickly rather than queued, can help keep the workload within its known capacity during overload.
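Load shedding can be sketched with a fixed concurrency limit; the class name and limit are illustrative, with `max_in_flight` assumed to be one server's measured safe capacity:

```python
import threading

class LoadShedder:
    """Reject work beyond a fixed concurrency limit instead of queuing it."""

    def __init__(self, max_in_flight):
        self._slots = threading.Semaphore(max_in_flight)

    def submit(self, work):
        # Non-blocking acquire: if no slot is free, shed the request
        # immediately (fail fast) rather than letting queues build up
        # and drive every request past the client timeout.
        if not self._slots.acquire(blocking=False):
            return None  # caller translates this into e.g. an HTTP 503
        try:
            return work()
        finally:
            self._slots.release()
```

Shedding the excess keeps the requests that are accepted within the workload's known capacity, so an overload degrades throughput rather than availability for everyone.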
If you need to ensure continuously available capacity for customers, you have to make tradeoffs between availability and cost. One way to ensure lack of capacity doesn't lead to unavailability is to provide each customer with a quota and ensure your workload's capacity is scaled to provide 100% of the allocated quotas. When customers exceed their quota, they get throttled, which isn't a failure and doesn't count against availability. You will also need to closely track your customer base and forecast future utilization to keep enough capacity provisioned. This ensures your workload isn't driven to failure scenarios through overconsumption by your customers.
For example, let's examine a workload that provides a storage service. Each server in the workload can support 100 downloads per second, customers are provided a quota of 200 downloads per second, and there are 500 customers. To support this volume of customers, the service needs to provide capacity for 100,000 downloads per second, which requires 1,000 servers. If any customer exceeds their quota, they get throttled, which ensures sufficient capacity for every other customer. This is a simple example of one way to avoid overload without rejecting units of work.
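The capacity arithmetic in this example can be sketched as a small provisioning helper (the function name is illustrative):

```python
def servers_needed(customers, quota_per_customer, capacity_per_server):
    """Servers required to honor 100% of allocated customer quotas."""
    total_demand = customers * quota_per_customer  # peak downloads/second
    # Ceiling division: fractional demand still needs a whole server.
    return -(-total_demand // capacity_per_server)

# Worked example from the text: 500 customers with a 200 downloads/second
# quota, served by 100 downloads/second servers, needs 1,000 servers.
```

Re-running this calculation as the customer base and forecast utilization change is what keeps provisioned capacity ahead of aggregate quota.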