OPS08-BP04 Establish workload metrics baselines
Establishing a baseline for workload metrics aids in understanding workload health and performance. Using baselines, you can identify under- and over-performing applications and components. A workload baseline adds to your ability to mitigate issues before they become incidents. Baselines are foundational in developing patterns of activity and implementing anomaly detection when metrics deviate from expected values.
Desired outcome:
-
You have a baseline level of metrics for your workload under normal conditions.
-
You can determine if your workload is functioning normally.
Common anti-patterns:
-
After deploying a new feature, there is a drop in request latency. A baseline was not established for a composite metric of incoming processed requests and overall latency. You are unable to determine if the change caused an improvement or caused a defect.
-
A sudden spike in user activity occurs, but you have not established a metric baseline. The activity spike slowly leads to a memory leak in an application. Eventually this takes your workload offline.
Benefits of establishing this best practice:
-
You understand the normal pattern of activity for your workload using metrics for key components and applications.
-
You can determine whether your workload, its applications, and its components are behaving normally or require intervention.
Level of risk exposed if this best practice is not established: Medium
Implementation guidance
Use historical data to establish a baseline of workload metrics for applications and components in your workload. Leverage the metric baseline in metric review meetings and troubleshooting. Periodically review workload performance and adjust the baseline as the architecture evolves.
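As a minimal sketch of what deriving a baseline from historical data can look like, the following Python summarizes a series of past samples as a mean with a tolerance band and checks new values against it. The sample values, the `Baseline` type, the function names, and the band width of three standard deviations are all assumptions made for illustration; they are not prescribed by this best practice, and a real workload may be better served by percentiles or seasonal models.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class Baseline:
    """Hypothetical summary of a metric under normal conditions."""
    mean: float
    stdev: float
    lower: float
    upper: float


def build_baseline(samples: list[float], width: float = 3.0) -> Baseline:
    """Summarize historical samples as mean +/- width * sample stdev.

    The default width of 3.0 is an assumption, not a recommendation.
    """
    if len(samples) < 2:
        raise ValueError("need at least two historical samples")
    mu = mean(samples)
    sigma = stdev(samples)
    return Baseline(mu, sigma, mu - width * sigma, mu + width * sigma)


def is_within_baseline(value: float, baseline: Baseline) -> bool:
    """Return True when a new observation falls inside the baseline band."""
    return baseline.lower <= value <= baseline.upper


# Made-up latency samples in milliseconds, standing in for historical data.
history = [118.0, 122.0, 120.0, 119.0, 121.0, 120.0]
baseline = build_baseline(history)

print(baseline)
print(is_within_baseline(121.5, baseline))  # inside the band
print(is_within_baseline(180.0, baseline))  # well outside the band
```

A band like this gives metric review meetings and troubleshooting sessions a concrete reference point for what "normal" means for a given metric.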
Customer example
Baselines are established for all components and applications at AnyCompany Retail. Using historical data, AnyCompany Retail developed their workload metric baselines over a two-month metric window. Every two months they re-assess baselines and adjust them based on real-world data.
Implementation steps
-
Working backwards from your workload metrics, establish a metric baseline for key components and applications using historical data. Limit the number of metrics per component or application to avoid monitoring fatigue.
-
You can use Amazon CloudWatch Metrics Insights to query metrics at scale and identify trends and patterns.
-
Amazon CloudWatch anomaly detection uses machine learning algorithms to identify patterns of behavior for metrics, determine baselines, and surface anomalies.
-
Amazon DevOps Guru provides the ability to detect operational issues with your workload using machine learning.
-
Customers with Enterprise Support can request the Building a Monitoring Strategy Workshop from their Technical Account Manager. This workshop will help you build an observability strategy for your workload.
-
Put in place a mechanism to periodically review workload metric baselines, especially before significant business events. At least once a quarter, evaluate your workload metric baseline using historical data. Use the baseline in your metric review meetings.
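The periodic review described in the last step could be supported by a small drift check such as the sketch below. It compares the median of a recent window against the median of the window the baseline was built from and flags the baseline for review when the relative change exceeds a tolerance. The function names, the sample values, the use of the median, and the 10% default tolerance are hypothetical choices for illustration only.

```python
from statistics import median


def baseline_drift(baseline_samples: list[float], recent_samples: list[float]) -> float:
    """Relative change of the recent median versus the baseline median."""
    base = median(baseline_samples)
    if base == 0:
        raise ValueError("baseline median is zero; relative drift is undefined")
    return (median(recent_samples) - base) / base


def needs_rebaseline(
    baseline_samples: list[float],
    recent_samples: list[float],
    tolerance: float = 0.10,  # assumed threshold, tune per metric
) -> bool:
    """Flag the baseline for review when drift exceeds the tolerance."""
    return abs(baseline_drift(baseline_samples, recent_samples)) > tolerance


# Made-up samples for the window the baseline came from and the latest window.
previous_quarter = [200.0, 210.0, 205.0, 195.0, 190.0]
current_quarter = [240.0, 250.0, 245.0, 235.0, 230.0]

print(baseline_drift(previous_quarter, current_quarter))
print(needs_rebaseline(previous_quarter, current_quarter))
```

A check like this does not replace the review itself. It only points reviewers at metrics whose normal level appears to have shifted, for example after an architecture change or ahead of a significant business event.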
Level of effort for the implementation plan: Low. Once workload metrics are established, you may need to collect enough data to identify normal patterns of behavior before you can set baselines.
Resources
Related best practices:
-
OPS08-BP02 Define workload metrics - Workload metrics must be established before determining baselines.
-
OPS08-BP03 Collect and analyze workload metrics - Collecting and analyzing workload metrics is necessary to have in place before establishing metric baselines.
-
OPS08-BP05 Learn expected patterns of activity for workload - This best practice builds on top of the baseline to develop usage trends.
-
OPS08-BP06 Alert when workload outcomes are at risk - Metric baselines are necessary for identifying thresholds and developing alerts.
-
OPS08-BP07 Alert when workload anomalies are detected - Anomaly detection requires the establishment of metric baselines.
Related documents:
Related videos:
-
AWS re:Invent 2020: Monitoring production services at Amazon
-
AWS re:Invent 2021 - Get insights from operational metrics at scale with CloudWatch Metrics Insights
-
AWS re:Invent 2022 - Developing an observability strategy (COP302)
-
AWS Summit DC 2022 - Monitoring and observability for modern applications
-
AWS Summit SF 2022 - Full-stack observability and application monitoring with AWS (COP310)
Related examples:
Related services: