OPS08-BP04 Create actionable alerts - Operational Excellence Pillar

OPS08-BP04 Create actionable alerts

Promptly detecting and responding to deviations in your application's behavior is crucial. It is especially important to recognize when outcomes based on key performance indicators (KPIs) are at risk or when unexpected anomalies arise. Basing alerts on KPIs ties the signals you receive directly to business or operational impact. Alerts that are actionable in this way promote proactive responses and help maintain system performance and reliability.

Desired outcome: Receive timely, relevant, and actionable alerts for rapid identification and mitigation of potential issues, especially when KPI outcomes are at risk.

Common anti-patterns:

  • Setting up too many non-critical alerts, leading to alert fatigue.

  • Not prioritizing alerts based on KPIs, making it hard to understand the business impact of issues.

  • Neglecting to address root causes, leading to repetitive alerts for the same issue.

Benefits of establishing this best practice:

  • Reduced alert fatigue by focusing on actionable and relevant alerts.

  • Improved system uptime and reliability through proactive issue detection and mitigation.

  • Enhanced team collaboration and quicker issue resolution by integrating with popular alerting and communication tools.

Level of risk exposed if this best practice is not established: High

Implementation guidance

To create an effective alerting mechanism, it's vital to use metrics, logs, and trace data that flag when outcomes based on KPIs are at risk or anomalies are detected.
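
As a minimal sketch of this guidance, the Python below builds request parameters for two Amazon CloudWatch alarms on a single KPI: one with a static threshold and one on an anomaly detection band. The namespace, metric name, SNS topic ARN, and runbook URL are hypothetical placeholders, not values from this guidance. The functions only return dictionaries, so nothing is created until you pass them to a CloudWatch client.

```python
"""Sketch: request parameters for KPI-based CloudWatch alarms."""

# Hypothetical KPI metric and notification target; replace with your own.
KPI_NAMESPACE = "Shop/Checkout"
KPI_METRIC = "OrderSuccessRate"
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:kpi-alerts"
RUNBOOK_URL = "https://example.com/runbooks/checkout-success-rate"


def kpi_threshold_alarm(threshold_percent: float) -> dict:
    """Keyword arguments for put_metric_alarm with a static KPI threshold."""
    return {
        "AlarmName": f"{KPI_METRIC}-below-{threshold_percent:g}",
        # State the impact and the next step so the alert is actionable.
        "AlarmDescription": (
            f"{KPI_METRIC} below {threshold_percent:g}%. Runbook: {RUNBOOK_URL}"
        ),
        "Namespace": KPI_NAMESPACE,
        "MetricName": KPI_METRIC,
        "Statistic": "Average",
        "Period": 60,
        # Require 3 breaching datapoints out of 5 so one blip does not alert.
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 3,
        "Threshold": threshold_percent,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [ALERT_TOPIC_ARN],
    }


def kpi_anomaly_alarm(band_width: float = 2.0) -> dict:
    """Keyword arguments for put_metric_alarm on an anomaly detection band."""
    return {
        "AlarmName": f"{KPI_METRIC}-anomaly",
        "AlarmDescription": (
            f"{KPI_METRIC} outside its expected band. Runbook: {RUNBOOK_URL}"
        ),
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": KPI_NAMESPACE,
                        "MetricName": KPI_METRIC,
                    },
                    "Period": 60,
                    "Stat": "Average",
                },
                "ReturnData": True,
            },
            {
                # The band is computed by CloudWatch from the metric's history.
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width:g})",
                "ReturnData": True,
            },
        ],
        "ThresholdMetricId": "band",
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "AlarmActions": [ALERT_TOPIC_ARN],
    }
```

Assuming boto3 is installed and credentials are configured, the dictionaries could be applied with `boto3.client("cloudwatch").put_metric_alarm(**kpi_threshold_alarm(99.5))`. The 3-of-5 datapoint setting is one way to trade detection speed against noise; tune it to how quickly the KPI must be defended.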

Implementation steps

  1. Determine key performance indicators (KPIs): Identify your application's KPIs. Alerts should be tied to these KPIs to reflect the business impact accurately.

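  2. Implement anomaly detection: Use Amazon CloudWatch anomaly detection to automatically flag metric behavior that falls outside its expected range.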

  3. Implement actionable alerts: Design alerts that provide adequate information for immediate action.

    1. Monitor AWS Health events with Amazon EventBridge rules, or integrate programmatically with the AWS Health API to automate actions when you receive AWS Health events. These can be general actions, such as sending all planned lifecycle event messages to a chat interface, or specific actions, such as initiating a workflow in an IT service management tool.

  4. Reduce alert fatigue: Minimize non-critical alerts. When teams are overwhelmed with numerous insignificant alerts, they can lose sight of critical issues, which diminishes the overall effectiveness of the alert mechanism.

  5. Set up composite alarms: Use Amazon CloudWatch composite alarms to consolidate multiple alarms.

  6. Integrate with alert tools: Incorporate tools like Opsgenie and PagerDuty.

  7. Engage AWS Chatbot: Integrate AWS Chatbot to relay alerts to Amazon Chime, Microsoft Teams, and Slack.

  8. Alert based on logs: Use log metric filters in CloudWatch to create alarms based on specific log events.

  9. Review and iterate: Regularly revisit and refine alert configurations.
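
Three of the steps above (AWS Health events, composite alarms, and log-based alarms) can be sketched as request parameters for the corresponding AWS API calls. This is an illustration under stated assumptions, not a complete setup: the alarm names, log group, metric namespace, and rule name are hypothetical, and the functions only build dictionaries rather than calling AWS.

```python
"""Sketch: parameters for a composite alarm, a log metric filter, and a Health rule."""
import json
from typing import List


def composite_alarm(name: str, child_alarms: List[str], actions: List[str]) -> dict:
    """Keyword arguments for CloudWatch put_composite_alarm.

    The rule is true only when every child alarm is in ALARM state,
    so responders get one notification instead of several.
    """
    rule = " AND ".join(f'ALARM("{child}")' for child in child_alarms)
    return {"AlarmName": name, "AlarmRule": rule, "AlarmActions": actions}


def error_log_metric_filter(log_group: str) -> dict:
    """Keyword arguments for CloudWatch Logs put_metric_filter.

    Emits 1 to a custom metric for each log event containing the term ERROR;
    an ordinary metric alarm can then be set on that metric.
    """
    return {
        "logGroupName": log_group,
        "filterName": "error-count",  # hypothetical filter name
        "filterPattern": "ERROR",
        "metricTransformations": [
            {
                "metricName": "ErrorCount",  # hypothetical metric name
                "metricNamespace": "Shop/Checkout",  # hypothetical namespace
                "metricValue": "1",
                "defaultValue": 0.0,
            }
        ],
    }


def health_event_rule(name: str) -> dict:
    """Keyword arguments for EventBridge put_rule matching AWS Health events."""
    pattern = {"source": ["aws.health"], "detail-type": ["AWS Health Event"]}
    return {"Name": name, "EventPattern": json.dumps(pattern), "State": "ENABLED"}
```

Assuming boto3 is available, these map to `put_composite_alarm` on the `cloudwatch` client, `put_metric_filter` on the `logs` client, and `put_rule` on the `events` client. The EventBridge rule still needs a target, such as an SNS topic or a chat integration, before it delivers anything.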

Level of effort for the implementation plan: Medium
