OPS09-BP06 Alert when operations outcomes are at risk
Whenever operations outcomes are at risk, an alert must be raised and acted upon. Operations outcomes cover any activity that supports a workload in production, from deploying new versions of applications to recovering from an outage. Operations outcomes must be treated with the same importance as business outcomes.
Software teams should identify key operations metrics and activities and build alerts for them. Alerts must be timely and actionable. If an alert is raised, a reference to a corresponding runbook or playbook should be included. Alerts without a corresponding action can lead to alert fatigue.
Desired outcome: When operations activities are at risk, alerts are sent to drive action. The alerts contain context on why an alert is being raised and point to a playbook to investigate or a runbook to mitigate. Where possible, runbooks are automated and notifications are sent.
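To make the desired outcome concrete, the following is a minimal sketch of an alert payload that carries its own context and points to a playbook or runbook. The class name, field names, and URL are hypothetical and not part of any AWS API; the sketch only assumes that an alert without an associated action should be rejected before it is sent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OperationsAlert:
    """Hypothetical alert payload; field names are illustrative only."""

    title: str
    reason: str                          # why the alert is being raised
    playbook_url: Optional[str] = None   # where to investigate
    runbook_url: Optional[str] = None    # how to mitigate

    def __post_init__(self) -> None:
        # An alert with no corresponding action contributes to alert fatigue,
        # so refuse to construct one.
        if not (self.playbook_url or self.runbook_url):
            raise ValueError(f"alert {self.title!r} has no playbook or runbook")
        if not self.reason.strip():
            raise ValueError(f"alert {self.title!r} has no context")


def render(alert: OperationsAlert) -> str:
    """Format the alert as a plain-text notification body."""
    lines = [f"ALERT: {alert.title}", f"Why: {alert.reason}"]
    if alert.playbook_url:
        lines.append(f"Investigate: {alert.playbook_url}")
    if alert.runbook_url:
        lines.append(f"Mitigate: {alert.runbook_url}")
    return "\n".join(lines)
```

Validating at construction time is one possible design choice; a team could equally enforce the same rule in a review checklist or in the configuration of its observability platform.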
Common anti-patterns:
-
You are investigating an incident and support cases are being filed. The support cases are breaching the service level agreement (SLA) but no alerts are being raised.
-
A deployment to production scheduled for midnight is delayed due to last-minute code changes. No alert is raised and the deployment hangs.
-
A production outage occurs but no alerts are sent.
-
Your deployment time consistently runs behind estimates. No action is taken to investigate.
Benefits of establishing this best practice:
-
Alerting when operations outcomes are at risk boosts your ability to support your workload by staying ahead of issues.
-
Business outcomes are improved due to healthy operations outcomes.
-
Detection and remediation of operations issues are improved.
-
Overall operational health is increased.
Level of risk exposed if this best practice is not established: Medium
Implementation guidance
Operations outcomes must be defined before you can alert on them. Start by defining what operations activities are most important to your organization. Is it deploying to production in under two hours or responding to a support case within a set amount of time? Your organization must define key operations activities and how they are measured so that they can be monitored, improved, and alerted on. You need a central location where workload and operations telemetry is stored and analyzed. The same mechanism should be able to raise an alert when an operations outcome is at risk.
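As an illustration of defining an operations outcome so that it can be measured and alerted on, the sketch below models the "deploy to production in under two hours" example as a target and classifies an in-flight activity as ok, at risk, or breached. The names and the 80% warning ratio are assumptions chosen for the example, not values from this guidance.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class OperationsTarget:
    """Hypothetical definition of a measurable operations outcome."""

    name: str
    target: timedelta
    warn_ratio: float = 0.8  # assumption: warn once 80% of the target is used


def evaluate(target: OperationsTarget, elapsed: timedelta) -> str:
    """Return 'ok', 'at_risk', or 'breached' for an in-flight activity."""
    if elapsed >= target.target:
        return "breached"
    if elapsed >= target.target * target.warn_ratio:
        return "at_risk"
    return "ok"


deploy = OperationsTarget("deploy-to-production", timedelta(hours=2))
print(evaluate(deploy, timedelta(minutes=100)))  # past the 96-minute warning point
```

Separating "at risk" from "breached" lets an alert fire while there is still time to act, which is the point of alerting on outcomes rather than on failures alone.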
Customer example
During a routine deployment at AnyCompany Retail, a CloudWatch alarm was initiated because the lead time for deployment was breached. Amazon EventBridge created an OpsItem in AWS Systems Manager OpsCenter. The Cloud Operations team used a playbook to investigate the issue and identified that a schema change was taking longer than expected. They alerted the on-call developer and continued monitoring the deployment. Once the deployment was complete, the Cloud Operations team resolved the OpsItem. The team will analyze the incident during a postmortem.
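A setup along the lines of this example could be sketched as follows. The namespace, metric name, alarm name, rule name, and threshold are hypothetical; the guidance does not say how AnyCompany Retail publishes its lead-time metric. The sketch assumes a custom metric in minutes and uses the boto3 `put_metric_alarm` and `put_rule` calls, with an EventBridge pattern for the "CloudWatch Alarm State Change" event. Attaching the target that creates the OpsItem is left out.

```python
import json

# Hypothetical names; replace with your own metric and alarm identifiers.
NAMESPACE = "AnyCompany/Operations"
METRIC_NAME = "DeploymentLeadTimeMinutes"
ALARM_NAME = "deployment-lead-time-breached"


def alarm_parameters(threshold_minutes: float) -> dict:
    """Keyword arguments for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": ALARM_NAME,
        # Placeholder text: put the real playbook link here for context.
        "AlarmDescription": "Deployment lead time exceeded target. Playbook: <link to playbook>",
        "Namespace": NAMESPACE,
        "MetricName": METRIC_NAME,
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": threshold_minutes,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def alarm_event_pattern() -> dict:
    """EventBridge pattern matching this alarm entering the ALARM state."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": [ALARM_NAME],
            "state": {"value": ["ALARM"]},
        },
    }


def create_alarm_and_rule(threshold_minutes: float) -> None:
    """Create the alarm and the rule; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    boto3.client("cloudwatch").put_metric_alarm(**alarm_parameters(threshold_minutes))
    boto3.client("events").put_rule(
        Name=f"{ALARM_NAME}-to-opsitem",  # hypothetical rule name
        EventPattern=json.dumps(alarm_event_pattern()),
    )
    # A target that creates the OpsItem still has to be attached to the rule.
```

Keeping the parameters in plain functions makes them easy to review and unit test before anything is created in an account.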
Implementation steps
-
If you have not identified operations KPIs, metrics, and activities, work on implementing the preceding best practices for this question (OPS09-BP01 to OPS09-BP05).
-
AWS Support customers with Enterprise Support can request the Operations KPI Workshop from their Technical Account Manager. This collaborative workshop helps you define operations KPIs and metrics aligned to business goals, provided at no additional cost. Contact your Technical Account Manager to learn more.
-
Once you have operations activities, KPIs, and metrics established, configure alerts in your observability platform. Alerts should have an action associated with them, like a playbook or runbook. Alerts without an action should be avoided.
-
Over time, you should evaluate your operations metrics, KPIs, and activities to identify areas of improvement. Capture feedback from operators in runbooks and playbooks to identify areas for improvement in responding to alerts.
-
Alerts should include a mechanism to flag them as a false positive. This should lead to a review of the metric thresholds.
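The false-positive step above could be supported by something like the following sketch, which counts how often each alert fires and how often operators flag it as a false positive, then lists the alerts whose thresholds are due for review. The class name, the 25% review ratio, and the minimum sample size are assumptions for illustration only.

```python
class FalsePositiveTracker:
    """Hypothetical in-memory tracker for false-positive flags per alert."""

    def __init__(self, review_ratio: float = 0.25, min_alerts: int = 4) -> None:
        self.review_ratio = review_ratio  # assumption: review at 25% or more
        self.min_alerts = min_alerts      # assumption: ignore tiny samples
        self._fired: dict[str, int] = {}
        self._false: dict[str, int] = {}

    def record(self, alert_name: str, false_positive: bool = False) -> None:
        """Record one firing of an alert and whether an operator flagged it."""
        self._fired[alert_name] = self._fired.get(alert_name, 0) + 1
        if false_positive:
            self._false[alert_name] = self._false.get(alert_name, 0) + 1

    def needs_threshold_review(self) -> list[str]:
        """Return alert names whose false-positive ratio meets the review bar."""
        return sorted(
            name
            for name, fired in self._fired.items()
            if fired >= self.min_alerts
            and self._false.get(name, 0) / fired >= self.review_ratio
        )


tracker = FalsePositiveTracker()
for flagged in (True, False, True, False):
    tracker.record("deployment-lead-time", false_positive=flagged)
print(tracker.needs_threshold_review())
```

In practice the counts would live in the observability platform or ticketing system rather than in memory; the sketch only shows the feedback loop from operator flag to threshold review.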
Level of effort for the implementation plan: Medium. There are several best practices that must be in place before implementing this best practice. Once operations activities have been identified and operations KPIs established, alerts should be established.
Resources
Related best practices:
-
OPS02-BP03 Operations activities have identified owners responsible for their performance: Every operations activity and outcome should have an identified owner who is responsible for it. This is who should be alerted when outcomes are at risk.
-
OPS03-BP02 Team members are empowered to take action when outcomes are at risk: When alerts are raised, your team should have agency to act to remedy the issue.
-
OPS09-BP01 Identify key performance indicators: Alerting on operations outcomes starts with identifying operations KPIs.
-
OPS09-BP02 Define operations metrics: Establish this best practice before you start generating alerts.
-
OPS09-BP03 Collect and analyze operations metrics: Centrally collecting operations metrics is required to build alerts.
-
OPS09-BP04 Establish operations metrics baselines: Operations metrics baselines provide the ability to tune alerts and avoid alert fatigue.
-
OPS09-BP05 Learn the expected patterns of activity for operations: You can improve the accuracy of your alerts by understanding the activity patterns for operations events.
-
OPS09-BP08 Validate the achievement of outcomes and the effectiveness of KPIs and metrics: Evaluate the achievement of operations outcomes to ensure that your KPIs and metrics are valid.
-
OPS10-BP02 Have a process per alert: Every alert should have an associated runbook or playbook and provide context for the person being alerted.
-
OPS11-BP02 Perform post-incident analysis: Conduct a post-incident analysis after the alert to identify areas for improvement.