OPS08-BP04 Create actionable alerts - AWS Well-Architected Framework

OPS08-BP04 Create actionable alerts

Promptly detecting and responding to deviations in your application's behavior is crucial. It is especially important to recognize when outcomes based on key performance indicators (KPIs) are at risk or when unexpected anomalies arise. Basing alerts on KPIs ties the signals you receive directly to business or operational impact. Alerts built this way prompt a proactive response and help maintain system performance and reliability.

Desired outcome: Receive timely, relevant, and actionable alerts for rapid identification and mitigation of potential issues, especially when KPI outcomes are at risk.

Common anti-patterns:

  • Setting up too many non-critical alerts, leading to alert fatigue.

  • Not prioritizing alerts based on KPIs, making it hard to understand the business impact of issues.

  • Neglecting to address root causes, leading to repetitive alerts for the same issue.

Benefits of establishing this best practice:

  • Reduced alert fatigue by focusing on actionable and relevant alerts.

  • Improved system uptime and reliability through proactive issue detection and mitigation.

  • Enhanced team collaboration and quicker issue resolution by integrating with popular alerting and communication tools.

Level of risk exposed if this best practice is not established: High

Implementation guidance

To create an effective alerting mechanism, it's vital to use metrics, logs, and trace data that flag when outcomes based on KPIs are at risk or anomalies are detected.
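As one illustration of tying an alarm to a KPI, the sketch below builds the arguments for the CloudWatch `put_metric_alarm` API as exposed by boto3. The namespace, metric name, threshold, SNS topic ARN, and runbook URL are hypothetical placeholders, and the evaluation settings (3 breaching datapoints out of 5 one-minute periods) are an assumption chosen to avoid paging on a single blip, not a recommended value.

```python
from typing import Any, Dict


def kpi_alarm_params(
    alarm_name: str,
    namespace: str,
    metric_name: str,
    threshold: float,
    sns_topic_arn: str,
    runbook_url: str,
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm (no AWS call here)."""
    return {
        "AlarmName": alarm_name,
        # State the impact and the next step so the alert is actionable.
        "AlarmDescription": (
            f"KPI at risk: {metric_name} above {threshold}. Runbook: {runbook_url}"
        ),
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Average",
        "Period": 60,                # seconds per datapoint
        "EvaluationPeriods": 5,      # look at the last 5 datapoints
        "DatapointsToAlarm": 3,      # alarm only if 3 of those 5 breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Send the alarm definition to CloudWatch; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above has no AWS dependency

    boto3.client("cloudwatch").put_metric_alarm(**params)


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    example = kpi_alarm_params(
        alarm_name="checkout-latency-high",
        namespace="ExampleShop/Checkout",
        metric_name="CheckoutLatencyMs",
        threshold=800.0,
        sns_topic_arn="arn:aws:sns:us-east-1:123456789012:ops-alerts",
        runbook_url="https://example.com/runbooks/checkout-latency",
    )
    print(example)
```

Keeping the parameter builder separate from the API call lets the alarm definition be reviewed and tested without AWS access.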

Implementation steps

  1. Determine key performance indicators (KPIs): Identify your application's KPIs. Alerts should be tied to these KPIs to reflect the business impact accurately.

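  2. Implement anomaly detection: Use Amazon CloudWatch anomaly detection to identify unusual patterns automatically, so that alerts are generated only for genuine anomalies.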

  3. Implement actionable alerts: Design alerts that provide adequate information for immediate action.

  4. Reduce alarm fatigue: Minimize non-critical alerts. Overwhelming teams with numerous insignificant alerts can lead to oversight of critical issues and diminish the overall effectiveness of the alerting mechanism.

  5. Set up composite alarms: Use Amazon CloudWatch composite alarms to consolidate multiple alarms.

  6. Integrate with alerting tools: Incorporate tools like Opsgenie and PagerDuty.

  7. Engage Amazon Q Developer in chat applications: Integrate Amazon Q Developer in chat applications to relay alerts to Amazon Chime, Microsoft Teams, and Slack.

  8. Alert based on logs: Use log metric filters in CloudWatch to create alarms based on specific log events.

  9. Review and iterate: Regularly revisit and refine alert configurations.
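The composite alarm step above can be sketched as follows. This is a minimal example, assuming two existing child alarms whose names are hypothetical; it builds an `AlarmRule` expression and the arguments for the CloudWatch `put_composite_alarm` API as exposed by boto3, so that the team is notified once when both conditions hold rather than once per child alarm.

```python
from typing import Any, Dict, List


def alarm_rule_all(alarm_names: List[str]) -> str:
    """Return an AlarmRule that is in ALARM only when every child alarm is in ALARM."""
    if not alarm_names:
        raise ValueError("at least one child alarm name is required")
    for name in alarm_names:
        if '"' in name:
            # Keep the sketch simple: do not attempt to escape quotes in names.
            raise ValueError(f"unsupported character in alarm name: {name!r}")
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_alarm_params(
    name: str, child_alarms: List[str], sns_topic_arn: str
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_composite_alarm (no AWS call here)."""
    return {
        "AlarmName": name,
        "AlarmRule": alarm_rule_all(child_alarms),
        "AlarmDescription": "Fires only when all of: " + ", ".join(child_alarms),
        "ActionsEnabled": True,
        "AlarmActions": [sns_topic_arn],
    }


def create_composite_alarm(params: Dict[str, Any]) -> None:
    """Send the definition to CloudWatch; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders above have no AWS dependency

    boto3.client("cloudwatch").put_composite_alarm(**params)


if __name__ == "__main__":
    # Alarm names and the topic ARN are hypothetical placeholders.
    print(
        composite_alarm_params(
            "checkout-degraded",
            ["api-5xx-high", "checkout-latency-high"],
            "arn:aws:sns:us-east-1:123456789012:ops-alerts",
        )
    )
```

Rules can also combine `OR` and `NOT`; the `AND`-only helper here is a deliberate simplification.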

Level of effort for the implementation plan: Medium
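For the log-based alerting step, a minimal sketch is shown below, assuming an application that writes the literal term ERROR to a CloudWatch Logs log group. It builds the arguments for the CloudWatch Logs `put_metric_filter` API and for a matching `put_metric_alarm` call, both as exposed by boto3. The log group, namespace, metric name, threshold, and topic ARN are hypothetical placeholders.

```python
from typing import Any, Dict


def error_metric_filter_params(
    log_group: str, namespace: str, metric_name: str
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch Logs put_metric_filter."""
    return {
        "logGroupName": log_group,
        "filterName": f"{metric_name}-filter",
        "filterPattern": '"ERROR"',  # match log events containing the term ERROR
        "metricTransformations": [
            {
                "metricName": metric_name,
                "metricNamespace": namespace,
                "metricValue": "1",    # count one per matching log event
                "defaultValue": 0.0,   # report 0 when nothing matches
            }
        ],
    }


def error_alarm_params(
    namespace: str, metric_name: str, max_errors_per_minute: int, sns_topic_arn: str
) -> Dict[str, Any]:
    """Build keyword arguments for put_metric_alarm on the filtered metric."""
    return {
        "AlarmName": f"{metric_name}-high",
        "AlarmDescription": (
            f"More than {max_errors_per_minute} ERROR log events per minute"
        ),
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": float(max_errors_per_minute),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def apply(filter_params: Dict[str, Any], alarm_params: Dict[str, Any]) -> None:
    """Create the filter and the alarm; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders above have no AWS dependency

    boto3.client("logs").put_metric_filter(**filter_params)
    boto3.client("cloudwatch").put_metric_alarm(**alarm_params)


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    print(error_metric_filter_params("/example/app", "ExampleShop/Logs", "AppErrors"))
    print(
        error_alarm_params(
            "ExampleShop/Logs",
            "AppErrors",
            10,
            "arn:aws:sns:us-east-1:123456789012:ops-alerts",
        )
    )
```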
