General best practices - AWS Prescriptive Guidance

General best practices

The following best practices help you gain sufficient visibility into the health of your Amazon RDS workload and take appropriate actions in response to operational events and monitoring data.

  • Identify KPIs. Identify key performance indicators (KPIs) based on desired business outcomes. Evaluate KPIs to determine workload success. For example, if your core business is ecommerce, one of your desired business outcomes could be that your e-shop is available 24/7 for your customers to do their shopping. To achieve that business outcome, you define the availability KPI for the backend Amazon RDS database that your e-shop application uses, and set the baseline KPI to 99.99% on a weekly basis, which allows roughly one minute of downtime per week. Evaluating the actual availability KPI against the baseline value helps you determine whether you are meeting the desired database availability of 99.99% and thus achieving the business outcome of having a 24/7 service.

  • Define workload metrics. Define workload metrics to measure the quantities and qualities of your Amazon RDS workload. Evaluate metrics to determine if the workload is achieving desired outcomes, and to understand the health of the workload. For example, to evaluate the availability KPI for your Amazon RDS DB instance, you should measure metrics such as uptime and downtime for the DB instance. You can then use those metrics to calculate the availability KPI as follows:

    availability = uptime / (uptime + downtime)

    Metrics represent time-ordered sets of data points. Metrics can also include dimensions, which are useful in categorization and analysis.

  • Collect and analyze workload metrics. Amazon RDS generates different metrics and logs, depending on your configuration. Some of these represent DB instance events, counters, or statistics such as db.Cache.innoDB_buffer_pool_hits. Other metrics come from the operating system, such as memory.Total, which measures the total amount of memory of the host Amazon Elastic Compute Cloud (Amazon EC2) instance. Your monitoring tool should analyze the collected metrics regularly and proactively to identify trends and determine whether a response is needed.

  • Establish workload metrics baselines. Establish baselines for metrics to define expected values and to identify good or bad thresholds. For example, you might define the baseline for ReadIOPS to be up to 1,000 under normal database operations. You can then use this baseline for comparison and to identify over-utilization. If your new metrics consistently show that read IOPS are in the range of 2,000‒3,000, you have identified a deviation that could trigger a response for investigation, intervention, and improvement.

  • Alert when workload outcomes are at risk. When you determine that the business outcome is at risk, raise an alert. You can then either address issues proactively, before they affect your customers, or mitigate the impact of the incident in a timely manner.

  • Identify the expected patterns of activity for your workload. Based on your metrics baselines, establish patterns of workload activity to identify unexpected behavior and respond with appropriate actions if necessary. AWS provides monitoring tools that apply statistical and machine learning algorithms to analyze metrics and detect anomalies.

  • Alert when workload anomalies are detected. When anomalies are detected in the operations of Amazon RDS workloads, raise an alert so that you can respond with appropriate actions if necessary.

  • Review and revise KPIs and metrics. Confirm that your Amazon RDS databases meet your defined requirements and identify areas of potential improvement to reach your business goals. Validate the effectiveness of the measured metrics and evaluated KPIs, and revise them if necessary. For example, let's say that you set a KPI for the optimal number of concurrent database connections, and you monitor metrics regarding attempted and failed connections as well as user threads that were created and are running. You might have more database connections than those defined by your KPI baseline. By analyzing your current metrics, you can detect the outcome, but you might not be able to determine the root cause. If so, you should revise your metrics and include additional monitoring measures, such as counters for table locks. The new metrics would help determine whether the increased number of database connections is caused by unexpected table locks.
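
The availability formula and the ReadIOPS baseline example from the practices above can be sketched in a few lines of Python. This is a minimal illustration, not an AWS API: the function names, the sample values, and the min_fraction threshold used to decide what counts as a "consistent" deviation are all hypothetical assumptions. In practice, the uptime, downtime, and ReadIOPS values would come from your monitoring tool.

```python
WEEK_SECONDS = 7 * 24 * 60 * 60  # evaluation window: one week


def availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """availability = uptime / (uptime + downtime)"""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("uptime + downtime must be positive")
    return uptime_seconds / total


def meets_kpi(value: float, baseline: float = 0.9999) -> bool:
    """Compare the measured KPI against the baseline (99.99% in the example)."""
    return value >= baseline


def exceeds_baseline(samples, baseline=1000.0, min_fraction=0.8) -> bool:
    """Return True if most samples are above the baseline.

    min_fraction is an assumed threshold for a "consistent" deviation;
    tune it to your own workload.
    """
    if not samples:
        return False
    above = sum(1 for s in samples if s > baseline)
    return above / len(samples) >= min_fraction


if __name__ == "__main__":
    # Hypothetical week with two minutes of downtime.
    weekly = availability(WEEK_SECONDS - 120, 120)
    print(f"weekly availability: {weekly:.6f}, KPI met: {meets_kpi(weekly)}")

    # Hypothetical ReadIOPS samples in the 2,000-3,000 range.
    read_iops = [2100, 2450, 2980, 2300, 2750]
    print("ReadIOPS deviates from baseline:", exceeds_baseline(read_iops))
```

A result that misses the KPI baseline, or a sustained deviation from a metric baseline, is the kind of condition that would raise an alert and prompt investigation.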