REL05-BP07 Implement emergency levers - Reliability Pillar

This best practice was updated with new guidance on December 6, 2023.

Emergency levers are rapid processes that can mitigate availability impact on your workload.

Emergency levers work by disabling, throttling, or changing the behavior of components or dependencies using known and tested mechanisms. This can alleviate workload impairments caused by resource exhaustion due to unexpected increases in demand and reduce the impact of failures in non-critical components within your workload.
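As an illustration of the "disabling" case, the following is a minimal sketch of a lever that switches off a non-critical component while the critical path keeps working. The names `EmergencyLevers`, `fetch_recommendations`, and `render_order_page` are hypothetical, and the in-memory state stands in for whatever shared configuration store a real workload would use.

```python
from dataclasses import dataclass, field


@dataclass
class EmergencyLevers:
    """Holds which components are currently switched off.

    Assumption: state is kept in memory for the sketch; a real workload
    would read it from a shared, tested configuration mechanism.
    """
    disabled: set = field(default_factory=set)

    def pull(self, component):
        """Activate the lever: disable the named component."""
        self.disabled.add(component)

    def release(self, component):
        """Deactivate the lever: re-enable the named component."""
        self.disabled.discard(component)

    def is_disabled(self, component):
        return component in self.disabled


def fetch_recommendations(user_id):
    # Hypothetical non-critical dependency.
    return [f"item-for-{user_id}"]


def render_order_page(user_id, levers):
    # The order form is the business-critical function and is always served.
    page = {"user": user_id, "order_form": "enabled"}
    if levers.is_disabled("recommendations"):
        # Lever pulled: skip the dependency call entirely and degrade.
        page["recommendations"] = []
    else:
        page["recommendations"] = fetch_recommendations(user_id)
    return page
```

Pulling the lever with `levers.pull("recommendations")` removes the dependency call from the request path. Releasing it restores normal behavior without a deployment.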

Desired outcome: By implementing emergency levers, you can establish known-good processes to maintain the availability of critical components in your workload. The workload should degrade gracefully and continue to perform its business-critical functions during the activation of an emergency lever. For more detail on graceful degradation, see REL05-BP01 Implement graceful degradation to transform applicable hard dependencies into soft dependencies.

Common anti-patterns:

  • Failure of non-critical dependencies impacts the availability of your core workload.

  • Not testing or verifying critical component behavior during non-critical component impairment.

  • No clear and deterministic criteria defined for activation or deactivation of an emergency lever.

Benefits of establishing this best practice: Implementing emergency levers can improve the availability of the critical components in your workload by providing your resolvers with established processes to respond to unexpected spikes in demand or failures of non-critical dependencies.

Level of risk exposed if this best practice is not established: Medium

Implementation guidance

  • Identify critical components in your workload.

  • Design and architect the critical components in your workload to withstand failure of non-critical components.

  • Conduct testing to validate the behavior of your critical components during the failure of non-critical components.

  • Define and monitor relevant metrics or triggers to initiate emergency lever procedures.

  • Define the procedures (manual or automated) that comprise the emergency lever.
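The testing guidance above could be sketched as a pair of small tests that inject a failure into a non-critical dependency and assert that the critical function still succeeds. Everything here is an illustrative assumption: `checkout`, the loyalty lookup functions, and `DependencyError` are hypothetical names, not part of any AWS API.

```python
class DependencyError(Exception):
    """Raised by the hypothetical non-critical dependency when it fails."""


def checkout(order_total, loyalty_lookup):
    """Critical path: always returns a payable total.

    loyalty_lookup is a non-critical dependency. If it fails, the order
    proceeds without a discount instead of failing.
    """
    try:
        discount = loyalty_lookup()
    except DependencyError:
        discount = 0.0
    return {
        "total": round(order_total * (1 - discount), 2),
        "discount_applied": discount > 0,
    }


def healthy_lookup():
    return 0.10


def failing_lookup():
    # Failure injection: simulates the dependency being impaired.
    raise DependencyError("loyalty service unavailable")


def test_checkout_survives_loyalty_failure():
    result = checkout(100.0, failing_lookup)
    assert result == {"total": 100.0, "discount_applied": False}


def test_checkout_applies_discount_when_healthy():
    result = checkout(100.0, healthy_lookup)
    assert result == {"total": 90.0, "discount_applied": True}
```

Tests in this shape can be collected by a runner such as pytest. The point of the design is that the dependency is passed in, so a failing version can be substituted without touching the critical code.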

Implementation steps

  • Identify business-critical components in your workload.

  • Design and architect the critical components in your workload to withstand failure of non-critical components.

    • During dependency analysis, consider all potential failure modes, and verify that your emergency lever mechanisms deliver the critical functionality to downstream components.

  • Conduct testing to validate the behavior of your critical components during activation of your emergency levers.

  • Define, monitor, and alert on relevant metrics to initiate the emergency lever procedure.

    • Finding the right metrics to monitor depends on your workload. Some example metrics are latency or the number of failed requests to a dependency.

  • Define the procedures, manual or automated, that comprise the emergency lever.
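One way to make activation and deactivation criteria clear and deterministic is to evaluate a metric over a fixed window against two separate thresholds. The sketch below assumes a failure-rate metric over the most recent requests to a dependency. The class name `LeverTrigger`, the window size, and the threshold values are illustrative assumptions and would need tuning for a real workload.

```python
from collections import deque


class LeverTrigger:
    """Decides when an emergency lever should be active.

    Activates when the failure rate over a full window reaches
    activate_at, and deactivates only when it falls to deactivate_at
    or below. The gap between the two thresholds avoids rapid flapping.
    """

    def __init__(self, activate_at=0.5, deactivate_at=0.1, window=10):
        if not 0 <= deactivate_at < activate_at <= 1:
            raise ValueError("require 0 <= deactivate_at < activate_at <= 1")
        self.activate_at = activate_at
        self.deactivate_at = deactivate_at
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.active = False

    def record(self, success):
        """Record one request outcome and return the lever state."""
        self.results.append(success)
        # Only evaluate once the window is full, so a single early
        # failure cannot trigger the lever.
        if len(self.results) == self.results.maxlen:
            failure_rate = self.results.count(False) / len(self.results)
            if not self.active and failure_rate >= self.activate_at:
                self.active = True
            elif self.active and failure_rate <= self.deactivate_at:
                self.active = False
        return self.active
```

With `LeverTrigger(activate_at=0.5, deactivate_at=0.25, window=4)`, four consecutive failures activate the lever, and it stays active until three of the last four outcomes are successes. If the lever disables the dependency entirely, the outcomes fed to `record` would have to come from health probes rather than live traffic.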
