PERF02-BP03 Collect compute-related metrics
To understand how your compute resources are performing, you must record and track the utilization of various systems. This data can be used to make more accurate determinations about resource requirements.
Workloads can generate large volumes of data such as metrics, logs, and events. Determine whether your existing storage, monitoring, and observability services can manage the data generated. Identify which metrics reflect resource utilization and can be collected, aggregated, and correlated on a single platform. Those metrics should represent all your workload resources, applications, and services, so you can gain system-wide visibility and quickly identify performance issues and improvement opportunities.
Desired outcome: All metrics for compute resources are identified, collected, aggregated, and correlated on a single platform, with retention implemented to support cost and operational goals.
Common anti-patterns:
- You only use manual log file searching for metrics.
- You only publish metrics to internal tools.
- You only use the default metrics recorded by your selected monitoring software.
- You only review metrics when there is an issue.
Benefits of establishing this best practice: To monitor the performance of your workloads, you must record multiple performance metrics over a period of time. These metrics allow you to detect anomalies in performance. They will also help gauge performance against business metrics to ensure that you are meeting your workload needs.
Level of risk exposed if this best practice is not established: High
Implementation guidance
Identify, collect, aggregate, and correlate compute-related metrics. Using a service such as Amazon CloudWatch can make the implementation quicker and easier to maintain. In addition to the default metrics recorded, identify and track additional system-level metrics within your workload. Record data such as CPU utilization, memory, disk I/O, and network inbound and outbound metrics to gain insight into utilization levels or bottlenecks. This data is crucial to understanding how the workload is performing and how the compute solution is utilized. Use these metrics as part of a data-driven approach to actively tune and optimize your workload's resources.
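As one possible illustration of recording system-level metrics beyond the defaults, the sketch below gathers disk usage and, where the operating system supports it, the 1-minute load average using only the Python standard library, then shapes them as entries for the CloudWatch PutMetricData API. The namespace `Example/Compute`, the `HostId` dimension, the metric names, and both helper functions are hypothetical choices made for this example. Publishing assumes the third-party boto3 package and valid AWS credentials.

```python
import os
import shutil
from datetime import datetime, timezone

NAMESPACE = "Example/Compute"  # hypothetical custom namespace


def build_metric_data(host_id, path="/"):
    """Collect a few system-level metrics and format them as MetricData entries."""
    now = datetime.now(timezone.utc)
    dimensions = [{"Name": "HostId", "Value": host_id}]  # hypothetical dimension

    # Percentage of the filesystem at `path` that is in use.
    usage = shutil.disk_usage(path)
    disk_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    data = [{
        "MetricName": "DiskUsedPercent",
        "Dimensions": dimensions,
        "Timestamp": now,
        "Value": disk_pct,
        "Unit": "Percent",
    }]

    # Load average is only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        try:
            load_1min = os.getloadavg()[0]
        except OSError:
            load_1min = None
        if load_1min is not None:
            data.append({
                "MetricName": "LoadAverage1Min",
                "Dimensions": dimensions,
                "Timestamp": now,
                "Value": load_1min,
                "Unit": "None",
            })
    return data


def publish(metric_data, namespace=NAMESPACE):
    """Send the entries to CloudWatch (assumes boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so collection works without it

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace, MetricData=metric_data
    )
```

Calling `publish(build_metric_data("host-1"))` on a schedule would give a time series to correlate with the default metrics. For memory and disk metrics on Amazon EC2, the CloudWatch agent is typically a lower-maintenance alternative to custom scripts like this one.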
Implementation steps:
- Which compute solution metrics are important to track?
- Do I currently have an approved logging and monitoring solution?
- Have I identified and configured my data retention policies to match my security and operational goals?
- How do I deploy my metric and log aggregation agents?
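The retention question above can be made concrete with a small sketch. It assumes logs are stored in Amazon CloudWatch Logs, which accepts only specific retention periods. The helper below rounds a desired retention period up to the nearest accepted value. The list of accepted values reflects the CloudWatch Logs documentation as best known at the time of writing and should be verified before use. The function names and the log group name in the usage note are hypothetical.

```python
import bisect

# Retention periods (in days) accepted by CloudWatch Logs.
# Assumption: verify this list against the current API reference.
ALLOWED_RETENTION_DAYS = (
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
)


def pick_retention_days(desired_days):
    """Return the smallest accepted retention period >= desired_days."""
    if desired_days <= 0:
        raise ValueError("desired_days must be positive")
    index = bisect.bisect_left(ALLOWED_RETENTION_DAYS, desired_days)
    if index == len(ALLOWED_RETENTION_DAYS):
        raise ValueError("desired_days exceeds the longest accepted period")
    return ALLOWED_RETENTION_DAYS[index]


def apply_retention(log_group_name, desired_days):
    """Set retention on a log group (assumes boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily

    boto3.client("logs").put_retention_policy(
        logGroupName=log_group_name,
        retentionInDays=pick_retention_days(desired_days),
    )
```

For example, `apply_retention("/example/app", 45)` would request a 60-day retention period. Rounding up rather than down is a design choice that favors meeting a minimum retention requirement over minimizing storage cost.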
Level of effort for the implementation plan: Medium. Identifying, tracking, collecting, aggregating, and correlating metrics from all compute resources takes a moderate amount of work.