OPS08-BP03 Collect and analyze workload metrics - AWS Well-Architected Framework (2023-04-10)

OPS08-BP03 Collect and analyze workload metrics

Perform regular, proactive reviews of workload metrics to identify trends, determine whether a response is necessary, and validate the achievement of business outcomes. Aggregate metrics from your workload applications and components to a central location. Use dashboards and analytics tools to analyze telemetry and determine workload health. Implement a mechanism to conduct workload health reviews on a periodic basis with stakeholders in your organization.

Desired outcome:

  • Workload metrics are collected in a central location.

  • Dashboards and analytics tools are used to analyze workload health trends.

  • You conduct periodic workload metric reviews with your organization.

Common anti-patterns:

  • Your organization collects metrics from the workload in two different observability platforms. You are unable to determine workload health because the platforms are incompatible.

  • Error rates for a component of your workload are slowly increasing. You fail to notice this trend because your organization does not conduct periodic workload metric reviews. The component fails after a week, impairing your workload.

Benefits of establishing this best practice:

  • You have increased awareness of workload health and the achievement of business outcomes.

  • Workload health trends can be developed over time.

Level of risk exposed if this best practice is not established: High

Implementation guidance

Collect workload metrics in a central location. Using dashboards and analytics tools, analyze workload metrics to gain insight into workload health, develop workload health trends, and validate the achievement of business outcomes. Implement a mechanism to conduct periodic reviews of workload metrics.
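
As an illustration of what developing a workload health trend can mean in practice, the following minimal sketch fits a least-squares line to a short series of metric samples and flags the metric when the slope exceeds a threshold. The function names, the sample error rates, and the threshold value are all hypothetical and chosen only for this example; a real review would draw its data from your central metrics platform and might use a different analysis method.

```python
from typing import Sequence


def linear_trend(values: Sequence[float]) -> float:
    """Return the least-squares slope of values against their index (units per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    # Slope = covariance(x, y) / variance(x), with x = 0, 1, ..., n - 1.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def needs_review(values: Sequence[float], slope_threshold: float) -> bool:
    """Flag a metric whose upward trend exceeds a (hypothetical) threshold."""
    return linear_trend(values) > slope_threshold


# Hypothetical daily error rates (percent) for one component over a week.
daily_error_rate = [0.5, 0.6, 0.8, 0.9, 1.1, 1.4, 1.6]
print(f"slope per day: {linear_trend(daily_error_rate):.3f}")
print("flag for review:", needs_review(daily_error_rate, slope_threshold=0.1))
```

A slope-based check like this is deliberately simple: it surfaces slow, steady drift that a single-point alarm threshold can miss, which makes it a reasonable starting input for a periodic metric review.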

Customer example

AnyCompany Retail conducts workload metric reviews every week on Wednesday. They gather stakeholders from across the company and go through the previous week’s metrics. During the meeting, they highlight trends and insights gleaned from analytics tools. Internal dashboards are published with key workload metrics that any employee can view and search.

Implementation steps

  1. Identify the workload metrics that are tied to workload health. Starting with business KPIs, identify the metrics for applications, components, and platforms that provide an overall view of workload health.

    1. You can publish custom metrics to Amazon CloudWatch, and you can use the Amazon CloudWatch agent to collect metrics and logs from Amazon EC2 instances and on-premises servers.

    2. The AWS Distro for OpenTelemetry can collect metrics from existing applications and be used to add new metrics.

    3. Customers with Enterprise Support can request the Building a Monitoring Strategy Workshop from their Technical Account Manager. This workshop helps you build an observability strategy for your workload.

  2. Collect workload metrics in a central platform. If workload metrics are split across different platforms, it is difficult to analyze them and develop trends. The platform should have dashboard and analytics capabilities.

    1. Amazon CloudWatch can collect and store workload metrics. In multi-account topologies, it is recommended to have a central logging and monitoring account, referred to as a log archive account.

  3. Build a consolidated dashboard of workload metrics. Use this view for metrics reviews and analysis of trends.

    1. You can create custom CloudWatch dashboards to collect your workload metrics in a consolidated view.

  4. Implement a workload metric review process. On a weekly, bi-weekly, or monthly basis, review your workload metrics with stakeholders, including technical and non-technical personnel. Use these review sessions to identify trends and gain insight into workload health.
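
The custom metric and dashboard steps above can be sketched as follows. This is a minimal example, not a definitive implementation: it builds a payload in the shape accepted by the CloudWatch PutMetricData API and a dashboard body for PutDashboard, then sends both through a boto3 CloudWatch client supplied by the caller. The namespace, Region, metric names, the "Service" dimension, and the dashboard name are all hypothetical values chosen for illustration.

```python
import json
from typing import Any, Dict, List

NAMESPACE = "AnyCompanyRetail/Orders"  # hypothetical custom namespace
REGION = "us-east-1"                   # hypothetical Region


def build_metric_datum(name: str, value: float, unit: str, service: str) -> Dict[str, Any]:
    """Build one MetricData entry in the shape PutMetricData expects."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": "Service", "Value": service}],
        "Value": value,
        "Unit": unit,
    }


def build_dashboard_body(metric_names: List[str], service: str) -> str:
    """Build a dashboard body (JSON string) with one metric widget per metric name."""
    widgets = []
    for row, metric_name in enumerate(metric_names):
        widgets.append({
            "type": "metric",
            "x": 0, "y": row * 6, "width": 12, "height": 6,
            "properties": {
                "metrics": [[NAMESPACE, metric_name, "Service", service]],
                "period": 300,
                "stat": "Sum",
                "region": REGION,
                "title": metric_name,
            },
        })
    return json.dumps({"widgets": widgets})


def publish(cloudwatch: Any, datum: Dict[str, Any], dashboard_name: str, body: str) -> None:
    """Send the metric and the dashboard using a boto3 CloudWatch client from the caller."""
    cloudwatch.put_metric_data(Namespace=NAMESPACE, MetricData=[datum])
    cloudwatch.put_dashboard(DashboardName=dashboard_name, DashboardBody=body)


datum = build_metric_datum("FailedCheckouts", 3, "Count", "checkout")
body = build_dashboard_body(["FailedCheckouts", "CompletedOrders"], "checkout")
print(json.dumps(datum, indent=2))
# To send for real (requires boto3 and AWS credentials):
#   import boto3
#   publish(boto3.client("cloudwatch", region_name=REGION), datum, "workload-health", body)
```

Keeping the payload builders separate from the API calls lets you inspect or unit-test the metric and dashboard definitions without AWS credentials, and the same dashboard can then serve as the consolidated view for the periodic review sessions.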

Level of effort for the implementation plan: High. If workload metrics are not centrally collected, it could require significant investment to consolidate them in one platform.
