
Stage 3: Inspect, adapt and iterate

After you implement your observability system, we recommend that you continually review, assess, learn, adapt, and improve your implementation. You can use the AWS Observability Maturity Model as a tool to assess the maturity of your implementation and to identify and prioritize areas for improvement.

Implement regular reviews

Observability is an iterative process. It requires regular audits and assessments of existing components, together with changes and enhancements that drive continual improvement. We recommend that you perform regular reviews to reevaluate SLOs, alert thresholds, dashboards, metric granularity, retention policies, sampling strategies, and so on, to ensure that they continue to drive value for your teams and business. By connecting observability costs to specific teams and services, you can enable data-driven decisions about coverage and resource allocation.
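For example, a simple audit script can surface review candidates automatically. The following Python sketch is a minimal illustration, assuming CloudWatch is your telemetry backend and that AWS credentials and a default Region are configured. It flags log groups that have no retention policy and alarms that are no longer receiving data, both of which are common findings in these reviews.

# Minimal review helper: flags CloudWatch Logs groups with no retention policy
# and alarms that are not receiving data. Illustrative sketch only.
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Log groups without a retention policy never expire, which inflates storage cost.
no_retention = []
for page in logs.get_paginator("describe_log_groups").paginate():
    for group in page["logGroups"]:
        if "retentionInDays" not in group:
            no_retention.append(group["logGroupName"])

# Alarms in INSUFFICIENT_DATA often point to stale instrumentation or
# metrics that are no longer emitted.
stale_alarms = []
for page in cloudwatch.get_paginator("describe_alarms").paginate(
    StateValue="INSUFFICIENT_DATA"
):
    for alarm in page["MetricAlarms"]:
        stale_alarms.append(alarm["AlarmName"])

print(f"Log groups with no retention policy: {len(no_retention)}")
print(f"Alarms with insufficient data: {len(stale_alarms)}")

You can run a script like this on a schedule and include its output in the review agenda, or extend it to attribute costs to teams by using cost allocation tags.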

At Amazon, we conduct weekly Operational Readiness Reviews (ORRs) to audit teams' processes and observability postures against best practices. This is a non-blocking exercise, which suits the number of services and the frequency of releases at Amazon.

Depending on the size of your organization, you can also establish a business as usual (BAU) roster, in which one member of each team is responsible for reporting on anomalies and trends, uncovering unknown unknowns, removing unwanted instrumentation and alerts, improving dashboards, and ensuring that the observability solution continues to work for the team and remains aligned with the team's objectives and success metrics. This is also an opportunity to reassess the alerting strategy so that it is more responsive, proactive, and closer to the user. The goal of these reviews is to create a virtuous cycle, as shown in the following illustration, and to improve the maturity of your observability posture, as described in the AWS Observability Maturity Model.

Feedback and review cycle in the iterative observability process.
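To move alerting closer to the user, for example, you can alarm on a latency percentile that users actually experience instead of an infrastructure metric such as CPU utilization. The following sketch is illustrative only; the alarm name, load balancer, threshold, and SNS topic are placeholders that you would replace with values that match your own SLOs.

# Illustrative alarm on a user-facing latency percentile (ALB p99 response time).
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="checkout-p99-latency-high",           # hypothetical name
    AlarmDescription="p99 response time experienced by users exceeds 2 seconds",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[
        {"Name": "LoadBalancer", "Value": "app/checkout-alb/0123456789abcdef"}  # placeholder
    ],
    ExtendedStatistic="p99",                          # percentile, not average
    Period=60,
    EvaluationPeriods=5,
    DatapointsToAlarm=3,                              # tolerate brief spikes
    Threshold=2.0,                                    # seconds; tune to your SLO
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:oncall-notifications"],  # placeholder
)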

Identify the playbooks that are accessed most frequently, and consider improving your application or adding more instrumentation to address the recurring issues behind them. Identify the runbooks that are executed most frequently, and consider automating them.
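If your runbooks are implemented as AWS Systems Manager Automation documents (an assumption for this sketch), you can count recent executions per document to find the candidates that would benefit most from automation:

# Sketch: count recent SSM Automation executions per runbook document.
from collections import Counter
from datetime import datetime, timedelta, timezone

import boto3

ssm = boto3.client("ssm")
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

counts = Counter()
token = None
while True:
    kwargs = {"NextToken": token} if token else {}
    response = ssm.describe_automation_executions(**kwargs)
    for execution in response["AutomationExecutionMetadataList"]:
        # Count only executions started within the review window.
        start = execution.get("ExecutionStartTime")
        if start and start >= cutoff:
            counts[execution["DocumentName"]] += 1
    token = response.get("NextToken")
    if not token:
        break

print("Most frequently executed runbooks in the last 30 days:")
for document, count in counts.most_common(10):
    print(f"  {document}: {count}")

Runbooks near the top of this list are good candidates for being triggered automatically, for example from an alarm action, instead of being started manually.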

The learnings from these reviews are also shared with the observability squad and specialists to highlight improvements that are needed in central programs and the observability platform. For example, depending on the frequency of deployment-triggered events, you might decide to prioritize improvements to the deployment pipeline over other components. If the MTTR is high because of monitoring gaps, you can prioritize improving the observability platform and its configuration.

Celebrate wins

Share success stories from teams that use observability tools. For example, highlight the success of a team that used observability metrics to implement an alternative solution that is more efficient and leads to lower latency or cost. Communicating this success underscores the importance of observability and motivates other teams to improve their observability posture and strive for similar success.

Learn from incidents

Conduct blameless post-incident exercises similar to the correction of errors (COE) process at Amazon to identify areas for improvement and to prevent future issues. As with wins, the learnings from this exercise can be shared broadly with other teams to reinforce the value of observability and best practices.