OPS08-BP04 Create actionable alerts

Promptly detecting and responding to deviations in your application's behavior is crucial. Especially vital is recognizing when outcomes based on key performance indicators (KPIs) are at risk or when unexpected anomalies arise. Basing alerts on KPIs ensures that the signals you receive are directly tied to business or operational impact. This approach to actionable alerts promotes proactive responses and helps maintain system performance and reliability.

Desired outcome: Receive timely, relevant, and actionable alerts for rapid identification and mitigation of potential issues, especially when KPI outcomes are at risk.

Common anti-patterns:

Setting up too many non-critical alerts, leading to alert fatigue.
Not prioritizing alerts based on KPIs, making it hard to understand the business impact of issues.
Neglecting to address root causes, leading to repetitive alerts for the same issue.

Benefits of establishing this best practice:

Reduced alert fatigue by focusing on actionable and relevant alerts.
Improved system uptime and reliability through proactive issue detection and mitigation.
Enhanced team collaboration and quicker issue resolution by integrating with popular alerting and communication tools.

Level of risk exposed if this best practice is not established: High

Implementation guidance

To create an effective alerting mechanism, it's vital to use metrics, logs, and trace data that flag when outcomes based on KPIs are at risk or anomalies are detected.

Implementation steps

Determine key performance indicators (KPIs): Identify your application's KPIs. Alerts should be tied to these KPIs to reflect the business impact accurately.
Implement anomaly detection:
- Use Amazon CloudWatch anomaly detection: Set up Amazon CloudWatch anomaly detection to automatically detect unusual patterns, which helps you only generate alerts for genuine anomalies.
- Use AWS X-Ray Insights:
  1. Set up X-Ray Insights to detect anomalies in trace data.
  2. Configure notifications for X-Ray Insights to be alerted on detected issues.
- Integrate with Amazon DevOps Guru:
  1. Leverage Amazon DevOps Guru for its machine learning capabilities in detecting operational anomalies with existing data.
  2. Navigate to the notification settings in DevOps Guru to set up anomaly alerts.
Implement actionable alerts: Design alerts that provide adequate information for immediate action.
1. Monitor AWS Health events with Amazon EventBridge rules, or integrate programatically with the AWS Health API to automate actions when you receive AWS Health events. These can be general actions, such as sending all planned lifecycle event messages to a chat interface, or specific actions, such as the initiation of a workflow in an IT service management tool.
Reduce alert fatigue: Minimize non-critical alerts. When teams are overwhelmed with numerous insignificant alerts, they can lose oversight of critical issues, which diminishes the overall effectiveness of the alert mechanism.
Set up composite alarms: Use Amazon CloudWatch composite alarms to consolidate multiple alarms.
Integrate with alert tools: Incorporate tools like Ops Genie and PagerDuty.
Engage AWS Chatbot: Integrate AWS Chatbot to relay alerts to Amazon Chime, Microsoft Teams, and Slack.
Alert based on logs: Use log metric filters in CloudWatch to create alarms based on specific log events.
Review and iterate: Regularly revisit and refine alert configurations.

Level of effort for the implementation plan: Medium

Resources

Related best practices:

Related documents:

Related videos:

Related examples:

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

OPS08-BP03 Analyze workload traces

OPS08-BP05 Create dashboards