PERF04-BP03 Collect and record database performance metrics - AWS Well-Architected Framework (2023-04-10)

PERF04-BP03 Collect and record database performance metrics

To understand how your data management systems are performing, it is important to track relevant metrics. These metrics will help you to optimize your data management resources, to ensure that your workload requirements are met, and that you have a clear overview on how the workload performs. Use tools, libraries, and systems that record performance measurements related to database performance.

There are metrics that are related to the system on which the database is being hosted (for example, CPU, storage, memory, IOPS), and there are metrics for accessing the data itself (for example, transactions per second, queries rates, response times, errors). These metrics should be readily accessible for any support or operational staff, and have sufficient historical record to be able to identify trends, anomalies, and bottlenecks.

Desired outcome: To monitor the performance of your database workloads, you must record multiple performance metrics over a period of time. This allows you to detect anomalies as well as measure performance against business metrics to ensure you are meeting your workload needs.

Common anti-patterns:

  • You only use manual log file searching for metrics.

  • You only publish metrics to internal tools used by your team and don’t have a comprehensive picture of your workload.

  • You only use the default metrics recorded by your selected monitoring software.

  • You only review metrics when there is an issue.

  • You only monitor system level metrics, not capturing data access or usage metrics.

Benefits of establishing this best practice: Establishing a performance baseline helps in understanding normal behavior and requirements of workloads. Abnormal patterns can be identified and debugged faster improving performance and reliability of the database. Database capacity can be configured to ensure optimal cost without compromising performance.

Level of risk exposed if this best practice is not established: High

  • Inability to differentiate out of normal vs. normal performance level will create difficulties in issue identification, and decision making.

  • Potential cost savings may not be identified.

  • Growth patterns will not be identified which might result in reliability or performance degradation.

Implementation guidance

Identify, collect, aggregate, and correlate database-related metrics. Metrics should include both the underlying system that is supporting the database and the database metrics. The underlying system metrics might include CPU utilization, memory, available disk storage, disk I/O, and network inbound and outbound metrics while the database metrics might include transactions per second, top queries, average queries rates, response times, index usage, table locks, query timeouts, and number of connections open. This data is crucial to understand how the workload is performing and how the database solution is used. Use these metrics as part of a data-driven approach to tune and optimize your workload's resources. 

Implementation steps:

  1. Which database metrics are important to track?

  2. Would the database monitoring benefit from a machine learning solution that detects operational anomalies performance issues?

    1. Amazon DevOps Guru for Amazon RDS provides visibility into performance issues and makes recommendations for corrective actions.

  3. Do you need application level details about SQL usage?

    1. AWS X-Ray can be instrumented into the application to gain insights and encapsulate all the data points for single query.

  4. Do you currently have an approved logging and monitoring solution?

    1. Amazon CloudWatch can collect metrics across the resources in your architecture. You can also collect and publish custom metrics to surface business or derived metrics. Use CloudWatch or third-party solutions to set alarms that indicate when thresholds are breached.

  5. You identified and configured your data retention policies to match my security and operational goals?

Level of effort for the implementation plan: There is a medium level of effort to identify, track, collect, aggregate, and correlate metrics from all database resources.

Resources

Related documents:

Related videos:

Related examples: