Appendix B ‒ Quantitative and qualitative measures - AWS Prescriptive Guidance

This section outlines quantitative metrics to track operational improvements and qualitative measures to assess broader organizational results from chaos engineering practices.

Quantitative measures

The following quantitative measures provide a framework for tracking key metrics that demonstrate the direct improvements in incident handling and operations achieved through chaos engineering practices:

  • Incidents:

    • Incident frequency ‒ Track the number of incidents within an incident classification framework and classify them by their criticality (critical, major, minor) over a period of time. For more information about the incident classification framework, see Appendix C.

    • Downtime and degradation ‒ Measure the total duration of downtime or service degradation for each incident classification.

    • Incident response metrics ‒ To understand how incidents unfold, measure Time to Detect, Time to Identify, Time to Mitigate, Time to Recover, Time to Escalate, and other related metrics for each incident classification.

    • Customer-impacting incidents ‒ Track the number of incidents that impact customers or the percentage of incidents that were contained before customer impact.

    • Runbook changes ‒ Track the number of runbook updates or revisions resulting from insights gained through chaos experiments. A runbook provides detailed instructions for performing a particular operation or procedure to recover from a particular type of incident.

  • Costs:

    • Infrastructure costs ‒ Collect data on infrastructure costs, including cloud computing resources and redundancy measures that are required by the actions taken to improve resilience.

    • Customer impact ‒ Measure impacts to the customer experience, churn rates, and revenue loss associated with system failures or downtime.

    • Staff productivity ‒ Track the time spent by engineering and operations teams on incident response, firefighting, writing postmortems, and other reactive tasks related to system failures.

  • Continuous system improvements ‒ Count the number of process improvements, architectural changes, or automated recovery mechanisms implemented as a direct result of insights from chaos experiments.

  • Compliance ‒ Track the cost and effort required to meet regulatory requirements or industry standards related to operational resilience.

  • Adoption ‒ Track the adoption rate of chaos engineering practices across the organization.

  • Customer satisfaction ‒ Measure changes in customer satisfaction metrics to gauge how improved system reliability affects the business.
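
As an illustration of how the incident measures above might be computed, the following Python sketch summarizes a list of incident records by criticality. The `Incident` record, its field names, and the sample data are hypothetical and not part of this guidance. The sketch also assumes that Time to Detect and Time to Recover are both measured from the start of the incident, which your own incident classification framework might define differently.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record layout; adapt it to your incident tracking tool.
    severity: str             # "critical", "major", or "minor"
    started: datetime         # when the fault began
    detected: datetime        # when monitoring or a person noticed it
    recovered: datetime       # when normal service was restored
    customer_impacting: bool  # whether customers were affected


def minutes(delta):
    """Convert a timedelta to minutes."""
    return delta.total_seconds() / 60


def summarize(incidents):
    """Return count, total downtime, and mean response times per severity."""
    counts = Counter(i.severity for i in incidents)
    downtime = defaultdict(float)
    detect = defaultdict(list)
    recover = defaultdict(list)
    for i in incidents:
        downtime[i.severity] += minutes(i.recovered - i.started)
        # Assumption: both intervals are measured from the incident start.
        detect[i.severity].append(minutes(i.detected - i.started))
        recover[i.severity].append(minutes(i.recovered - i.started))
    return {
        sev: {
            "count": counts[sev],
            "downtime_minutes": downtime[sev],
            "mean_time_to_detect_minutes": mean(detect[sev]),
            "mean_time_to_recover_minutes": mean(recover[sev]),
        }
        for sev in counts
    }


def contained_percentage(incidents):
    """Percentage of incidents contained before customer impact."""
    if not incidents:
        return 0.0
    contained = sum(1 for i in incidents if not i.customer_impacting)
    return 100.0 * contained / len(incidents)


# Made-up sample data for illustration only.
t0 = datetime(2024, 1, 1, 12, 0)
sample = [
    Incident("critical", t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=65), True),
    Incident("minor", t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=30), False),
    Incident("minor", t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=50), False),
]
report = summarize(sample)
contained = contained_percentage(sample)
```

Running the same summary over consecutive reporting periods, for example before and after a series of chaos experiments, gives a trend for each incident classification rather than a single snapshot.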

Qualitative measures

The following qualitative measures provide a framework for tracking the broader organizational results achieved through chaos engineering practices:

  • Employee confidence and preparedness:

    • Survey teams periodically to measure their confidence levels in handling real-world incidents and their perceived preparedness for on-call rotations.

    • Track the percentage of on-call engineers who have participated in chaos experiments as part of their training.

  • Cultural shift:

    • Assess the degree to which a resilience mindset has permeated the organization through surveys, feedback sessions, or audits.

    • Track the number of teams actively championing and advocating for chaos engineering practices.

  • Cross-functional collaboration and knowledge sharing:

    • Track the frequency and attendance of cross-team knowledge-sharing sessions or workshops related to chaos engineering learning.

    • Track the number of joint chaos engineering initiatives involving multiple teams or departments.

  • Training effectiveness:

    • Evaluate the effectiveness of chaos engineering training programs by conducting post-training surveys or assessments.

    • Track the number of engineers who participate in chaos engineering training programs and who read postmortems.

  • Talent attraction and retention:

    • Evaluate whether the chaos engineering program helps attract and retain top engineering talent by reducing the time and effort spent on fixing outages.

  • Brand reputation:

    • Track any changes in brand perception or reputation related to the organization's demonstrated commitment to operational resilience.

  • Competitive advantage:

    • Compare system availability against that of industry peers to gauge any competitive edge.
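
Some of the qualitative measures above can still be reduced to simple numbers for tracking over time. The following Python sketch shows one possible approach for two of them: the percentage of on-call engineers who have participated in chaos experiments, and the change in average survey confidence between two survey rounds. The function names, the engineer identifiers, and the 1 to 5 confidence scale are assumptions made for illustration.

```python
from statistics import mean


def participation_percentage(on_call, participants):
    """Percentage of on-call engineers who took part in chaos experiments."""
    on_call = set(on_call)
    if not on_call:
        return 0.0
    # Only count participants who are actually on the on-call roster.
    return 100.0 * len(on_call & set(participants)) / len(on_call)


def confidence_change(previous_scores, current_scores):
    """Difference in mean survey confidence between two survey rounds."""
    return mean(current_scores) - mean(previous_scores)


# Made-up sample data; scores assume a 1 to 5 confidence scale.
on_call = ["ana", "ben", "chi", "dev"]
participants = ["ana", "chi", "eli"]
participation = participation_percentage(on_call, participants)
change = confidence_change([3, 3, 4, 2], [4, 4, 4, 3])
```

A positive `change` suggests that reported confidence rose between survey rounds, although survey results are best read alongside the other qualitative measures rather than in isolation.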