REL06-BP06 Conduct reviews regularly
Frequently review how workload monitoring is implemented and update it based on significant events and changes.
Effective monitoring is driven by key business metrics. Ensure these metrics are accommodated in your workload as business priorities change.
Auditing your monitoring helps ensure that you know when an application is meeting its availability goals. Root cause analysis requires the ability to discover what happened when failures occur. AWS provides services that allow you to track the state of your services during an incident:
- Amazon CloudWatch Logs: You can store your logs in this service and inspect their contents.
- Amazon CloudWatch Logs Insights: A fully managed service that lets you analyze large volumes of log data in seconds through fast, interactive queries and visualizations.
- AWS Config: You can see what AWS infrastructure was in use at different points in time.
- AWS CloudTrail: You can see which AWS APIs were invoked, at what time, and by which principal.
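As an illustration of inspecting logs during an incident, the following is a minimal sketch of running a CloudWatch Logs Insights query from Python. It assumes a boto3 CloudWatch Logs client is passed in (the `start_query` and `get_query_results` calls are part of that client). The log group name, the query text, and the helper names `last_minutes` and `run_insights_query` are hypothetical and only serve the example.

```python
import time
from datetime import datetime, timedelta, timezone

# Hypothetical query: count log lines containing ERROR in 5-minute buckets.
ERROR_COUNT_QUERY = """
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errors by bin(5m)
| sort errors desc
"""

# Statuses after which polling should stop.
TERMINAL_STATUSES = {"Complete", "Failed", "Cancelled", "Timeout"}


def last_minutes(minutes, now=None):
    """Return a (start, end) window covering the last `minutes` minutes."""
    end = now or datetime.now(timezone.utc)
    return end - timedelta(minutes=minutes), end


def run_insights_query(logs_client, log_group, query, start, end,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query, poll until it finishes, return rows as dicts."""
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(start.timestamp()),  # epoch seconds
        endTime=int(end.timestamp()),
        queryString=query,
    )
    query_id = started["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"query {query_id} did not finish in time")

    if status != "Complete":
        raise RuntimeError(f"query {query_id} ended with status {status}")

    # Each result row is a list of {"field": ..., "value": ...} cells.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in response["results"]]
```

With boto3 installed and credentials configured, the client would be created with `boto3.client("logs")` and passed as `logs_client`, together with the window returned by `last_minutes(30)`. Taking the client as a parameter keeps the function easy to exercise with a stub during reviews of the monitoring code itself.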
At AWS, we conduct a weekly meeting to review operational performance and to share learnings between teams. Because there are so many teams in AWS, we created The Wheel
Common anti-patterns:
- Collecting only default metrics.
- Setting a monitoring strategy and never reviewing it.
- Not discussing monitoring when major changes are deployed.
Benefits of establishing this best practice: Regularly reviewing your monitoring lets you anticipate potential problems, instead of reacting to notifications only after a problem has actually occurred.
Level of risk exposed if this best practice is not established: Medium
Implementation guidance
- Create multiple dashboards for the workload. You must have a top-level dashboard that contains the key business metrics, as well as the technical metrics you have identified as most relevant to the projected health of the workload as usage varies. You should also have dashboards for the individual application tiers and dependencies so that each can be inspected.
- Schedule and conduct regular reviews of the workload dashboards. You may use different cadences depending on the depth of the inspection.
- Inspect for trends in the metrics. Compare the metric values to historic values to see if there are trends that may indicate that something needs investigation. Examples include increasing latency, decreasing primary business function, and increasing failure responses.
- Inspect for outliers and anomalies in your metrics. Averages or medians can mask outliers and anomalies. Look at the highest and lowest values during the time frame and investigate the causes of extreme values. As you continue to eliminate these causes, lowering your threshold for what counts as extreme lets you keep improving the consistency of your workload performance.
- Look for sharp changes in behavior. An immediate change in the quantity or direction of a metric may indicate a change in the application, or external factors that you may need to track with additional metrics.
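The trend, outlier, and sharp-change inspections above can be partly automated before a review meeting. The following is a minimal sketch using only the Python standard library. It assumes the metric has already been exported as an evenly spaced list of numbers. The function names, the median-absolute-deviation rule for outliers, and the default thresholds are illustrative choices, not something this guidance prescribes.

```python
from statistics import mean, median


def trend_slope(values):
    """Least-squares slope per sample; positive means the metric is rising."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def find_outliers(values, k=3.0):
    """Indices farther than k median absolute deviations (MAD) from the median.

    Lowering k tightens the definition of "extreme" as causes are eliminated.
    """
    med = median(values)
    deviations = [abs(v - med) for v in values]
    mad = median(deviations)
    if mad == 0:
        # Series is mostly constant: treat any departure from it as extreme.
        return [i for i, d in enumerate(deviations) if d > 0]
    return [i for i, d in enumerate(deviations) if d > k * mad]


def find_sharp_changes(values, max_ratio=0.5):
    """Indices where a sample differs from the previous one by more than max_ratio."""
    flagged = []
    for i in range(1, len(values)):
        previous, current = values[i - 1], values[i]
        if previous == 0:
            if current != 0:
                flagged.append(i)
        elif abs(current - previous) > max_ratio * abs(previous):
            flagged.append(i)
    return flagged
```

For example, with hypothetical latency samples `[100, 102, 101, 104, 103, 180, 106, 108]`, `trend_slope` returns a positive value, while both `find_outliers` and `find_sharp_changes` flag index 5, the spike that an average over the window would largely hide.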
Resources
Related documents: