REL11-BP06 Send notifications when events impact availability
Notifications are sent upon the detection of significant events, even if the issue caused by the event was automatically resolved.
Automated healing helps keep your workload reliable. However, it can also obscure underlying problems that need to be addressed. Implement appropriate monitoring and events to detect patterns of problems, including those addressed by automated healing, so that you can resolve their root causes. Amazon CloudWatch alarms can be invoked based on failures that occur. They can also be invoked based on automated healing actions that run. CloudWatch alarms can be configured to send emails, or to log incidents in third-party incident tracking systems, through Amazon SNS integration.
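As an illustration of an alarm that fires when automated healing runs, the following is a minimal sketch using the parameter names of the CloudWatch put_metric_alarm API as exposed by boto3. The namespace MyWorkload/Healing, the metric HealingActionsInvoked, and the alarm name are hypothetical; the sketch assumes your healing automation publishes a custom count metric each time it runs. The client is passed in as an argument so the alarm definition can be reviewed and tested without AWS credentials.

```python
from typing import Any, Dict


def healing_alarm_params(topic_arn: str, threshold: int = 1) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm.

    Assumption: the healing automation emits a custom metric (names below
    are hypothetical) with a value of 1 every time it takes an action.
    """
    return {
        "AlarmName": "healing-actions-invoked",   # hypothetical alarm name
        "AlarmDescription": "Automated healing ran; investigate the root cause.",
        "Namespace": "MyWorkload/Healing",        # hypothetical custom namespace
        "MetricName": "HealingActionsInvoked",    # hypothetical custom metric
        "Statistic": "Sum",
        "Period": 300,                            # evaluate 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",       # no data means no healing ran
        "AlarmActions": [topic_arn],              # SNS topic that receives the alarm
    }


def create_healing_alarm(cloudwatch_client: Any, topic_arn: str) -> None:
    """Create or update the alarm with a boto3 CloudWatch client."""
    cloudwatch_client.put_metric_alarm(**healing_alarm_params(topic_arn))
```

With boto3 installed and credentials configured, this could be called as create_healing_alarm(boto3.client("cloudwatch"), topic_arn), where topic_arn is the ARN of an SNS topic whose subscriptions deliver email or forward to an incident tracking system.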
Common anti-patterns:
- Sending alarms that no one acts upon.
- Performing auto healing automation, but not notifying that healing was needed.
Benefits of establishing this best practice: Notifications of recovery events help ensure that you don’t overlook problems that occur infrequently.
Level of risk exposed if this best practice is not established: Medium
Implementation guidance
- Alarm on business Key Performance Indicators (KPIs) when they cross a low threshold: A low-threshold alarm on your business KPIs helps you know when your workload is unavailable or non-functional.
- Alarm on events that invoke healing automation: You can call the Amazon SNS API directly from any automation that you create to send notifications.
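The two items above can be sketched as follows. The first function builds a low-threshold KPI alarm definition, under the assumption that the alarm should fire when the KPI drops below a floor and that missing data also indicates a problem. The second publishes a notification through the SNS Publish API from inside a healing automation. The metric OrdersCompleted, its namespace, the alarm name, and the message fields are hypothetical; only the put_metric_alarm and publish parameter names come from boto3.

```python
import json
from typing import Any, Dict


def kpi_low_threshold_alarm(topic_arn: str, min_orders: float) -> Dict[str, Any]:
    """Keyword arguments for put_metric_alarm on a business KPI.

    Assumption: the workload publishes a custom OrdersCompleted metric
    (hypothetical name) once per minute.
    """
    return {
        "AlarmName": "orders-below-floor",     # hypothetical alarm name
        "Namespace": "MyWorkload/Business",    # hypothetical custom namespace
        "MetricName": "OrdersCompleted",       # hypothetical KPI metric
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 5,                # five consecutive low minutes
        "Threshold": float(min_orders),
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",       # no orders reported counts as low
        "AlarmActions": [topic_arn],
    }


def notify_healing(sns_client: Any, topic_arn: str, resource_id: str,
                   action: str, reason: str) -> Dict[str, Any]:
    """Publish a notification that healing ran, even though it succeeded."""
    message = {
        "resource_id": resource_id,
        "action": action,
        "reason": reason,
        "auto_resolved": True,  # the issue was healed, but someone should know
    }
    return sns_client.publish(
        TopicArn=topic_arn,
        # SNS limits Subject to 100 characters.
        Subject=f"Auto-healing ran: {action} on {resource_id}"[:100],
        Message=json.dumps(message),
    )
```

A healing script might call notify_healing(boto3.client("sns"), topic_arn, "i-0abc", "instance-replaced", "failed health check") right after it completes its recovery step, so that repeated healing of the same resource becomes visible to the team.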
Resources
Related documents: