PERF05-BP07 Review metrics at regular intervals - AWS Well-Architected Framework

PERF05-BP07 Review metrics at regular intervals

As part of routine maintenance or in response to events or incidents, review which metrics are collected. Use these reviews to identify which metrics were essential in addressing issues and which additional metrics, if they were being tracked, could help identify, address, or prevent issues.

Common anti-patterns:

  • You allow metrics to stay in an alarm state for an extended period of time.

  • You create alarms that are not actionable by an automation system.
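
A periodic review can flag both anti-patterns automatically. The following is a minimal Python sketch. It assumes alarm records have already been exported into a hypothetical `Alarm` class whose fields loosely mirror the `AlarmName`, `StateValue`, `StateUpdatedTimestamp`, and `AlarmActions` fields that the CloudWatch DescribeAlarms API returns. The 24-hour threshold for a "stale" alarm is an arbitrary assumption. An alarm with no configured actions is used as a simple proxy for "not actionable." The action names are placeholders for real action ARNs.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Alarm:
    """Hypothetical alarm record; fields loosely mirror CloudWatch's
    AlarmName, StateValue, StateUpdatedTimestamp, and AlarmActions."""
    name: str
    state: str                 # "OK", "ALARM", or "INSUFFICIENT_DATA"
    state_updated: datetime    # when the alarm last changed state (timezone-aware)
    actions: list[str] = field(default_factory=list)


def review_alarms(alarms, now, max_alarm_age=timedelta(hours=24)):
    """Return two lists of alarm names: alarms stuck in ALARM longer than
    max_alarm_age (assumed threshold), and alarms with no configured actions."""
    stale = [
        a.name
        for a in alarms
        if a.state == "ALARM" and now - a.state_updated > max_alarm_age
    ]
    # No actions configured: the alarm triggers nothing, human or automated.
    not_actionable = [a.name for a in alarms if not a.actions]
    return stale, not_actionable


now = datetime.now(timezone.utc)
alarms = [
    Alarm("api-p99-latency", "ALARM", now - timedelta(days=5), ["notify-oncall"]),
    Alarm("queue-depth", "OK", now - timedelta(hours=2), []),
    Alarm("cpu-high", "ALARM", now - timedelta(minutes=30), ["scale-out"]),
]
stale, not_actionable = review_alarms(alarms, now)
print("Stale alarms:", stale)
print("Alarms without actions:", not_actionable)
```

Stale alarms are candidates for retuning thresholds or fixing the underlying issue. Alarms without actions are candidates for wiring to a notification or automated remediation, or for removal.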

Benefits of establishing this best practice: Continually reviewing the metrics you collect verifies that they properly identify, address, or prevent issues. Regular reviews also surface metrics that have gone stale because they were left in an alarm state for an extended period of time.

Level of risk exposed if this best practice is not established: Medium

Implementation guidance

Continually improve metric collection and monitoring. As part of responding to incidents or events, evaluate which metrics helped address the issue and which metrics that are not currently tracked could have helped. Use this evaluation to improve the quality of the metrics you collect so that you can prevent future incidents or resolve them more quickly.
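
Recording each post-incident evaluation in a consistent structure makes patterns visible across incidents. This minimal Python sketch assumes each review captures two hypothetical fields: `helpful` (metrics that helped responders) and `wished_for` (metrics responders needed but did not have). It counts how often each collected metric helped, lists gaps to start tracking, and lists collected metrics that never helped. Those last metrics are candidates for review, not automatic removal. All metric names are illustrative.

```python
from collections import Counter


def review_incident_metrics(collected, incidents):
    """Summarize post-incident metric reviews.

    collected: set of metric names currently collected.
    incidents: list of dicts with hypothetical keys "helpful" and
        "wished_for", each a set of metric names.
    Returns (helpful_counts, gaps, never_helpful).
    """
    helpful_counts = Counter()
    gaps = set()
    for incident in incidents:
        # Count only metrics that are actually being collected.
        helpful_counts.update(incident["helpful"] & collected)
        # Metrics responders needed but that are not collected yet.
        gaps |= incident["wished_for"] - collected
    # Collected metrics that did not help in any reviewed incident.
    never_helpful = collected - set(helpful_counts)
    return helpful_counts, gaps, never_helpful


collected = {"cpu_utilization", "p99_latency", "request_count", "disk_queue_depth"}
incidents = [
    {"helpful": {"p99_latency", "cpu_utilization"}, "wished_for": {"db_connection_count"}},
    {"helpful": {"p99_latency"}, "wished_for": {"gc_pause_time"}},
]
counts, gaps, never_helpful = review_incident_metrics(collected, incidents)
print("Most useful:", counts.most_common())
print("Start tracking:", sorted(gaps))
print("Review for relevance:", sorted(never_helpful))
```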

Implementation steps

  1. Define critical performance metrics to monitor that are aligned to your workload objectives.

  2. Set a baseline and a desired value for each metric.

  3. Set a cadence (like weekly or monthly) to review critical metrics.

  4. During each review, assess trends and deviations from the baseline values. Look for performance bottlenecks or anomalies.

  5. For identified issues, conduct in-depth root cause analysis to understand the main reason behind the issue.

  6. Document your findings and define strategies to address the identified issues and bottlenecks.

  7. Continually assess and improve the metrics review process.
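
Steps 2 and 4 can be sketched as follows. This is a minimal Python example under stated assumptions: the `MetricTarget` class, the 20 percent tolerance, and the least-squares slope used as a trend indicator are illustrative choices, not part of any AWS API. It compares one review period's samples with the baseline and the desired value, and reports the direction of the trend.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MetricTarget:
    """Hypothetical record for a critical metric (steps 1 and 2)."""
    name: str
    baseline: float              # typical value under normal load
    desired: float               # value the workload objective calls for
    lower_is_better: bool = True
    tolerance_pct: float = 20.0  # assumed allowed deviation from baseline


def review_metric(target: MetricTarget, samples: list[float]) -> dict:
    """Compare one review period's samples with the baseline (step 4)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a trend")
    if target.baseline == 0:
        raise ValueError("baseline must be non-zero")
    avg = mean(samples)
    deviation_pct = (avg - target.baseline) * 100 / target.baseline
    # Ordinary least-squares slope of value vs. sample index: a simple trend signal.
    n = len(samples)
    x_mean = (n - 1) / 2
    numerator = sum((i - x_mean) * (y - avg) for i, y in enumerate(samples))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    slope = numerator / denominator
    if target.lower_is_better:
        meets_desired = avg <= target.desired
    else:
        meets_desired = avg >= target.desired
    return {
        "metric": target.name,
        "average": avg,
        "deviation_pct": deviation_pct,
        "trend_per_sample": slope,
        "flagged": abs(deviation_pct) > target.tolerance_pct,
        "meets_desired": meets_desired,
    }


latency = MetricTarget("p99_latency_ms", baseline=200.0, desired=150.0)
print(review_metric(latency, [190, 210, 240, 260, 280]))
```

In this example, the average latency stays within the assumed tolerance of the baseline, so it is not flagged. However, the positive slope shows it rising steadily. Step 4 asks reviewers to catch exactly this kind of trend before it becomes a bottleneck.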