PERF02-BP03 Collect compute-related metrics - Performance Efficiency Pillar

PERF02-BP03 Collect compute-related metrics

Record and track compute-related metrics to better understand how your compute resources are performing and to improve their performance and utilization.

Common anti-patterns:

  • You rely solely on manually searching log files for metrics.

  • You only use the default metrics recorded by your monitoring software.

  • You only review metrics when there is an issue.

Benefits of establishing this best practice: Collecting performance-related metrics helps you align application performance with business requirements so that you meet your workload needs. It also helps you continually improve resource performance and utilization in your workload.

Level of risk exposed if this best practice is not established: High

Implementation guidance

Cloud workloads can generate large volumes of data such as metrics, logs, and events. In the AWS Cloud, collecting metrics is a crucial step toward improving security, cost efficiency, performance, and sustainability. AWS provides a wide range of performance-related metrics through monitoring services such as Amazon CloudWatch. Metrics such as CPU utilization, memory utilization, disk I/O, and inbound and outbound network traffic can reveal utilization levels and performance bottlenecks. Use these metrics as part of a data-driven approach to actively tune and optimize your workload's resources. Ideally, collect all metrics related to your compute resources in a single platform, with retention policies that support your cost and operational goals.
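As an illustration of pulling such metrics from Amazon CloudWatch, the following is a minimal sketch, not a definitive implementation. It assumes an Amazon EC2 instance, the AWS SDK for Python (boto3), and configured credentials. The helper names build_metric_queries and fetch_last_hour, the five-minute period, and the choice of statistics are assumptions made for this example. Memory utilization is not published by EC2 by default and usually requires the CloudWatch agent, so it is left out here.

```python
from datetime import datetime, timedelta, timezone

# Metrics that EC2 publishes to the AWS/EC2 namespace by default.
# The query IDs (keys) and the statistic per metric are choices made for this sketch.
EC2_METRICS = {
    "cpu": ("CPUUtilization", "Average"),
    "net_in": ("NetworkIn", "Sum"),
    "net_out": ("NetworkOut", "Sum"),
}


def build_metric_queries(instance_id, period_seconds=300):
    """Build the MetricDataQueries list for a CloudWatch GetMetricData call."""
    queries = []
    for query_id, (metric_name, stat) in EC2_METRICS.items():
        queries.append({
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                },
                "Period": period_seconds,
                "Stat": stat,
            },
            "ReturnData": True,
        })
    return queries


def fetch_last_hour(instance_id):
    """Fetch the last hour of datapoints; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the query builder works without the SDK

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    return cloudwatch.get_metric_data(
        MetricDataQueries=build_metric_queries(instance_id),
        StartTime=end - timedelta(hours=1),
        EndTime=end,
    )
```

Keeping the query construction separate from the API call makes it possible to review or unit test the metric selection without touching an AWS account.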

Implementation steps

  1. Identify which performance-related metrics are relevant to your workload. Collect metrics on resource utilization and on how your cloud workload is operating (such as response time and throughput).

  2. Choose and set up the right logging and monitoring solution for your workload.

  3. Define the required filter and aggregation for the metrics based on your workload requirements.

  4. Configure data retention policies for your metrics to match your security and operational goals.

  5. If required, create alarms and notifications for your metrics to help you proactively respond to performance-related issues.

  6. Use automation to deploy your metric and log aggregation agents.
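The aggregation and alarm steps above (steps 3 and 5) can be sketched in plain Python. This is a simplified illustration rather than how any particular monitoring service works internally: raw samples are grouped into fixed periods, and an alarm condition holds when a chosen number of consecutive periods exceed a threshold. The function names aggregate and breaches, the sample values, and the 85 percent threshold are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate(samples, period_seconds=60):
    """Group (epoch_seconds, value) samples into fixed periods.

    Returns one dict per period, sorted by period start, with the
    average and maximum of the values that fell into that period.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = int(timestamp) - int(timestamp) % period_seconds
        buckets[bucket_start].append(value)
    return [
        {"start": start, "avg": mean(values), "max": max(values)}
        for start, values in sorted(buckets.items())
    ]


def breaches(datapoints, threshold, evaluation_periods=3, stat="avg"):
    """Return True when the most recent periods all exceed the threshold."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return False  # not enough data to evaluate yet
    return all(point[stat] > threshold for point in recent)


# Hypothetical CPU utilization samples (percent), one every 30 seconds.
cpu_samples = [
    (0, 50.0), (30, 70.0),
    (60, 90.0), (90, 94.0),
    (120, 96.0), (150, 98.0),
    (180, 91.0), (210, 99.0),
]
per_minute = aggregate(cpu_samples, period_seconds=60)
alarm = breaches(per_minute, threshold=85.0, evaluation_periods=3)
```

With these sample values, the last three one-minute averages (92, 97, and 95) all exceed 85, so alarm is True. Requiring four periods instead would include the first minute (average 60) and the condition would not hold. Evaluating several consecutive periods rather than a single datapoint is a common way to avoid alerting on short spikes.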
