Appendix B: Example ORR questions

These are a few example questions that you can use to get started building your own ORR program.

Architecture

What have you included in your architectural design to reduce the blast radius of failures?

Please provide a table enumerating all customer impacting APIs, an explanation for what each does, and the components and dependencies of your service that it touches. Include all APIs whether they are public or private from the customer’s perspective.

Please construct a failure model listing soft and hard failure modes for each of your components and dependencies.

Guidance

Your failure model should include columns for Component or Dependency, Failure Type, Service Impact, and Customer Impact.

Please address an outage of your service in its largest blast radius unit, such as a cell, an Availability Zone (AZ), or a Region, as well as a total infrastructure outage in its largest blast radius unit (an AZ).

What is the retry and back-off strategy for each of your dependencies?

Guidance

For dependency calls made within the context of a synchronous API call, you should generally retry once immediately, then give up.

For dependency calls made within the context of work requested through an asynchronous API call, you should generally exponentially back off and retry with jitter for retryable failures.
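
For illustration, here is a minimal sketch (not from this whitepaper) of capped exponential backoff with full jitter for retryable dependency failures; the function and exception names are placeholders.

```python
# Minimal sketch of exponential backoff with full jitter for retryable
# dependency failures. All names and values here are illustrative.
import random
import time

class RetryableError(Exception):
    """Marker for failures that are safe to retry (throttling, timeouts, etc.)."""

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Invoke `operation`, retrying retryable failures with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random duration up to the capped exponential delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```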

Have you intentionally set appropriate retry and socket timeout configuration for all SDK usage?

Guidance

Not setting appropriate retry and timeout configuration for your AWS SDK clients can lead to a thread pool in which every thread is tied up in dependency operations. For dependency calls made within the context of synchronous calls from customers, it's better to fail fast and return a response so the client can decide how and when to retry, rather than letting customer requests time out.
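
As a rough sketch of what explicit configuration might look like with the Python SDK (the timeout and retry values are illustrative, not recommendations):

```python
# Hypothetical example: set connection/read timeouts and retry behavior
# explicitly instead of relying on SDK defaults. Tune values to your SLAs.
import boto3
from botocore.config import Config

client_config = Config(
    connect_timeout=1,                          # seconds to establish a connection
    read_timeout=2,                             # seconds to wait for a response
    retries={"max_attempts": 2, "mode": "standard"},
)

dynamodb = boto3.client("dynamodb", config=client_config)
```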

What throttling techniques are you using to defensively protect your service from customers?

Guidance

See Fairness in multi-tenant systems

Are you using distributed throttling on your front-end?

Do you have pre-authentication throttles?

Are limits on request size enforced before authentication?
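
To illustrate the basic idea behind defensive, per-customer throttling, here is a minimal single-host token bucket sketch; a real front end would typically use a distributed implementation and apply limits before authentication as well. All names and limits are placeholders.

```python
# Minimal single-host sketch of per-customer token-bucket throttling.
import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per customer; the limits here are illustrative only.
buckets = defaultdict(lambda: TokenBucket(rate_per_sec=10, burst=20))

def should_throttle(customer_id: str) -> bool:
    """Return True if the request should be rejected with a throttling error."""
    return not buckets[customer_id].allow()
```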

If your service was temporarily deactivated or shut down, what is your recovery time objective (RTO) for restarting your service?

Guidance

Have you practiced it? Do you have a runbook for restarting your service? Are there any off-box dependencies that are mandatory for restart? Have you confirmed that there are no circular dependencies?

Release quality

Do your customer-impacting deployments automatically roll back incorrect deployments before they breach your internal SLAs?

Guidance

AWS CodeDeploy: Automatic Rollback on Amazon CloudWatch Alarm

AWS Lambda: Possible with traffic shifting using CodeDeploy, supported in the AWS Serverless Application Model (SAM). Specify a list of CloudWatch alarms in the DeploymentPreference property to enable Lambda deployments with automatic rollback.

☐ Yes | No Risk

☐ No | High Risk
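
As a sketch of the CodeDeploy guidance above, a deployment group can be wired to a CloudWatch alarm so deployments roll back automatically when the alarm fires; the application, group, and alarm names below are placeholders.

```python
# Hedged sketch: attach a CloudWatch alarm to a CodeDeploy deployment group
# and enable automatic rollback when that alarm goes into ALARM state.
import boto3

codedeploy = boto3.client("codedeploy")

codedeploy.update_deployment_group(
    applicationName="my-service",
    currentDeploymentGroupName="production",
    alarmConfiguration={
        "enabled": True,
        "alarms": [{"name": "my-service-5xx-error-rate"}],
    },
    autoRollbackConfiguration={
        "enabled": True,
        "events": ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"],
    },
)
```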

Do your customer-impacting deployments run on-host validation tests to verify that the software has started successfully and is responding correctly to health checks on localhost before re-registering with the load balancer?

Guidance

CodeDeploy: Write validation scripts and execute them from the ValidateService lifecycle event hook

Lambda: Select Not Applicable

☐ Yes | No Risk

☐ No | High Risk

☐ Not Applicable
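
A minimal sketch of an on-host validation script that a ValidateService hook could run is shown below; the port and health check path are assumptions.

```python
#!/usr/bin/env python3
# Hypothetical on-host validation script: probe the local health check
# endpoint and exit non-zero if it does not respond with HTTP 200.
import sys
import urllib.request

HEALTH_URL = "http://localhost:8080/healthcheck"  # placeholder endpoint

def main() -> int:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            if resp.status == 200:
                return 0          # success: the lifecycle event passes
    except Exception as exc:
        print(f"health check failed: {exc}", file=sys.stderr)
    return 1                      # non-zero exit fails the lifecycle event

if __name__ == "__main__":
    sys.exit(main())
```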

Do you have a mechanism in place to ensure all code changes (software, configuration, infrastructure, canary, and ops tools) to production systems are reviewed and approved by someone other than the code author?

Guidance

You may choose to add a manual approval action into your AWS CodePipeline pipelines and limit permissions to a set of approvers.

☐ Yes | No Risk

☐ No | High Risk
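
For illustration, a manual approval gate in a CodePipeline pipeline definition might look like the following; the stage name and SNS topic ARN are placeholders.

```python
# Hypothetical stage definition for a CodePipeline pipeline: a Manual
# approval action that blocks promotion until an approver responds.
approval_stage = {
    "name": "ProductionApproval",
    "actions": [
        {
            "name": "RequireSecondReviewer",
            "actionTypeId": {
                "category": "Approval",
                "owner": "AWS",
                "provider": "Manual",
                "version": "1",
            },
            "configuration": {
                # Optional: notify approvers through an SNS topic (placeholder ARN).
                "NotificationArn": "arn:aws:sns:us-east-1:123456789012:deploy-approvals",
            },
            "runOrder": 1,
        }
    ],
}
```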

What is your load test plan?

Guidance

You should assume that you will find the breaking point of your service multiple times, iteratively addressing the performance bottlenecks you uncover and repeating the load test. Your plan should consider a small number of very large customers, a large number of very small customers, and sinusoidal load.

Conduct a load test that simulates a surge of traffic from a single customer to validate behavior under this kind of load. Conduct one large-scale load test against your production environment (a) before you launch, and (b) subsequently each quarter, to validate proper scaling as you grow (or for any potential peak usage).
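
As a small illustration of the sinusoidal load component mentioned above, a load generator could vary its target request rate over time like this (the rates and period are placeholders):

```python
# Sketch of a sinusoidal target request rate for a load generator.
import math
import time

def sinusoidal_rate(t_seconds: float, min_tps: float = 50, max_tps: float = 500,
                    period_seconds: float = 3600) -> float:
    """Target transactions per second at time t, oscillating between min and max."""
    mid = (min_tps + max_tps) / 2
    amplitude = (max_tps - min_tps) / 2
    return mid + amplitude * math.sin(2 * math.pi * t_seconds / period_seconds)

start = time.monotonic()
while True:
    target_tps = sinusoidal_rate(time.monotonic() - start)
    # ... dispatch roughly `target_tps` requests over the next second ...
    time.sleep(1)
```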

Do you publish your canary synthetic errors to an independent metric? Do you also alarm on this metric?

Guidance

Ensure that your canary synthetic errors are published to their own metric, rather than being combined with all other errors. This allows your service to alarm on the canary error rate on its own.

☐ Yes | No Risk

☐ No | High Risk
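
A hedged sketch of publishing canary failures to a dedicated metric and alarming on it; the namespace, metric, and alarm names are placeholders.

```python
# Sketch: keep canary errors in their own CloudWatch metric and alarm on it.
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_canary_error():
    """Called by the canary when a probe fails; canary errors stay separate."""
    cloudwatch.put_metric_data(
        Namespace="MyService/Canary",
        MetricData=[{"MetricName": "CanaryErrors", "Value": 1, "Unit": "Count"}],
    )

# One-time setup: alarm when canary errors appear in two consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="my-service-canary-errors",
    Namespace="MyService/Canary",
    MetricName="CanaryErrors",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
)
```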

Event management

Have you performed a gameday to verify that your service's monitoring and alarming function as expected and your on-call engineers are engaged and able to rapidly diagnose and remediate failures?

Do your canary synthetic tests detect and alarm on shallow API test failures in under five minutes?

Do you monitor (and alarm on) your JVM statistics? Do you monitor (and alarm on) your hosts for file system, inode, and file descriptor utilization? Do you monitor (and alarm on) your hosts for CPU and memory utilization?
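
As an illustration of the host-level statistics mentioned above, a lightweight collector might gather them like this before publishing them as metrics (psutil is an assumed dependency; names are placeholders):

```python
# Sketch: collect host-level utilization statistics for publication as metrics.
import os
import psutil

def host_stats(path: str = "/") -> dict:
    vfs = os.statvfs(path)
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "filesystem_percent": psutil.disk_usage(path).percent,
        # Fraction of inodes in use on the file system backing `path`.
        "inode_percent": 100 * (1 - vfs.f_ffree / vfs.f_files),
        # Open file descriptors for this process (Unix only).
        "fd_count": psutil.Process().num_fds(),
    }
```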

When do you look at your weekly and operator dashboards?

Guidance

Here’s an example schedule for an ops meeting:

  • Review outstanding action items from the previous week.

  • Review last week’s high severity tickets.

  • Review pipelines for things like rollbacks or blocks.

  • Review open customer support tickets.

  • Review open high severity tickets.

  • What new runbook entries were added this week?

  • Review the detailed metrics dashboard for one of your components.

  • Discuss on-call rotation.

Does your operational dashboard contain a view with metrics for critical dependencies?

Do your performance synthetics measure P50, P99, and P99.9 latencies to track variability (including tail latency)?

Guidance

Performance variability should be measured alongside median performance, because edge cases can affect both overall performance and customer perception. Understanding this variability allows your service to improve the customer experience.
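
For example, latency percentiles such as P50, P99, and P99.9 can be retrieved from CloudWatch for a dashboard or alarm; the namespace and metric name below are placeholders.

```python
# Sketch: pull latency percentiles from CloudWatch for the last hour.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="MyService/API",
    MetricName="Latency",
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=1),
    EndTime=datetime.datetime.utcnow(),
    Period=60,
    ExtendedStatistics=["p50", "p99", "p99.9"],
)
for point in response["Datapoints"]:
    print(point["Timestamp"], point["ExtendedStatistics"])
```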

Do you have the ability to weight your workload out of an AZ within 15 minutes in the event of an issue, and do you have a runbook that clearly documents the process?

Guidance

Choose Not Applicable only if your workload has no AZ weight away requirements.

☐ Yes | No Risk

☐ No | High Risk

☐ Not Applicable
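
One possible mechanism, sketched with placeholder values: if each AZ has its own weighted Route 53 record, setting the impaired AZ's record weight to 0 shifts new traffic to the remaining AZs. This is only an illustration; your actual weight-away mechanism may differ.

```python
# Sketch: weight traffic away from one AZ by zeroing its Route 53 record weight.
# Hosted zone ID, record name, set identifier, and endpoint are placeholders.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z1234567890EXAMPLE",
    ChangeBatch={
        "Comment": "Weight traffic away from use1-az1 during an impairment",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com",
                "Type": "CNAME",
                "SetIdentifier": "use1-az1",
                "Weight": 0,            # 0 = stop sending new traffic to this AZ
                "TTL": 60,
                "ResourceRecords": [{"Value": "az1-endpoint.example.com"}],
            },
        }],
    },
)
```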

Would your workload be able to withstand the loss of an AZ without causing customer impact? Have you architected your workload to be statically stable during an AZ failure so that you do not have to make changes or deploy new capacity in response?

☐ Yes | No Risk

☐ No | High Risk