Overview - AWS Prescriptive Guidance

Overview

Monitoring and alerting are included in four pillars of the AWS Well-Architected Framework.

  • The operational excellence pillar prescribes that your workload should be designed to include telemetry and monitoring. AWS services such as Amazon Relational Database Service (Amazon RDS) provide the information necessary for you to understand your workload's internal state (for example, metrics, logs, events, and traces). When you operate your Amazon RDS databases, you will want to understand the health of your DB instances, detect operational events, and be able to respond to both planned and unplanned events. AWS provides monitoring tools that help you determine when organizational and business outcomes are at risk, or potentially might be at risk, so you can take the appropriate actions at the right time.

  • The performance efficiency pillar prescribes that you should monitor the performance of your resources such as Amazon RDS DB instances by gathering, aggregating, and processing performance-related metrics in real time. You can identify performance degradation and remediate the factors that caused it, such as unoptimized SQL queries or inadequate configuration parameters. You can raise alarms automatically when measurements are outside expected boundaries. We recommend that you use alarms not only for notifications, but also to initiate automated actions in response to the detected events. You can evaluate the metrics you collect against predefined thresholds or use machine learning algorithms to identify anomalous behavior. For example, to detect a trend of increased CPU utilization, you can gather and analyze the cpuUtilization.total metric over a period of time. Alerting on that trend proactively, before CPU utilization hits the hard limit, can help you remediate the issue before it impacts your customers.

  • The reliability pillar defines monitoring and alerting as critical to ensure that you are meeting your availability requirements. Your monitoring solution must be able to detect failures effectively. When it detects problems or failures, its primary objective is to alert on those issues. Implementing continuous observability and monitoring practices is imperative for resilient architectures in the cloud. To improve your workloads, you must be able to measure them and understand their state and health. Design principles for automatic recovery from a failure, horizontal scalability, and capacity provisioning depend on accurate monitoring and alerting services.

  • The security pillar discusses detection and prevention of unexpected or unwanted configuration changes, and unexpected behavior. You can configure your Amazon RDS for MySQL and MariaDB DB instances with the MariaDB Audit Plugin to record database activity such as user logins and specific operations run against the database. The plugin stores the record of database activity in a log file, which can be integrated and imported into monitoring and alerting tools. Those tools can then analyze the log file in real time for unexpected or suspicious behavior in your database. Such behavior can indicate that your Amazon RDS DB instance has been compromised, which signals potential risks to your business. If the monitoring tool detects such an event, it activates an alarm to initiate a response to the security incident, which helps address suspicious and malicious activities.
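
The threshold and trend evaluation described under the performance efficiency pillar can be illustrated with a small sketch. It assumes that cpuUtilization.total values have already been collected as percentages at evenly spaced intervals. The function names, the 90 percent limit, and the 12-sample look-ahead are hypothetical values chosen for illustration, not AWS defaults. The trend check fits a plain least-squares line to the samples; it is a simple stand-in for the idea, not the machine learning anomaly detection that a monitoring service might provide.

```python
from statistics import mean


def linear_trend(samples):
    """Return (slope, intercept) of a least-squares line through evenly spaced samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(samples)
    denom = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples)) / denom
    return slope, y_bar - slope * x_bar


def evaluate_cpu(samples, hard_limit=90.0, horizon=12):
    """Classify CPU samples (percent) as 'ALARM', 'WARN', or 'OK'.

    hard_limit and horizon are illustrative assumptions, not AWS defaults.
    """
    # Static threshold: the most recent sample already breaches the limit.
    if samples[-1] >= hard_limit:
        return "ALARM"
    # Trend check: project the fitted line `horizon` samples past the last one.
    slope, intercept = linear_trend(samples)
    projected = intercept + slope * (len(samples) - 1 + horizon)
    if slope > 0 and projected >= hard_limit:
        return "WARN"
    return "OK"


if __name__ == "__main__":
    rising = [40, 45, 50, 55, 60, 65]  # steady upward trend, still below the limit
    print(evaluate_cpu(rising))
```

In this sketch, a "WARN" result corresponds to alerting proactively before the hard limit is reached, and an "ALARM" result corresponds to a measurement that is already outside the expected boundary. Either result could be used to send a notification or to initiate an automated action.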

Targeted business outcomes

Implementing best practices in monitoring and alerting mechanisms helps you ensure a high-performing, resilient, efficient, secure, and cost-optimized infrastructure for your applications and workloads. You can use observability tools that gather, store, and visualize metrics, events, traces, and logs in real time to observe and analyze the bigger picture of the health and performance of your databases, and thus prevent the degradation or disruption of your associated IT services. If unplanned degradation or service disruption still occurs, monitoring and alerting tools help you with timely detection of the problem, escalation, reaction, and quick investigation and resolution. A comprehensive monitoring and alerting solution for your cloud database workloads helps you achieve the following business outcomes:

  • Improve customer experience. Reliable service improves your customers' experiences. Databases are often a key component of digital services such as web and mobile applications, media streaming, payments, business-to-business (B2B) APIs, and integration services. If you can monitor and set up alerts on your databases to detect problems quickly, investigate them efficiently, and remediate them as soon as possible to minimize downtime and other disruptions, you can enhance the availability, security, and performance of the digital service for your customers.

  • Build customer trust. Better performance and a smoother user experience help you win your customers' trust, which can result in more business on your platform. For example, a payment processing service provider that offers a reliable online service can expect high customer trust and loyalty, which results in more customers and better retention, an increase in billable transactions, and new, innovative services that generate more revenue.

  • Avoid financial losses. Any unexpected downtime in your database infrastructure can impact the business transactions that your customers perform by using your application. This can lead to substantial financial losses in some cases. Breaching service-level agreements (SLAs) might result in loss of customer trust and, consequently, a loss of revenue. It can also become the legal basis for costly litigation, in which customers might demand compensation based on your liability and warranty contracts. According to a study by Atlassian Corporation, a software company, the average cost of a service outage is in the range of $140K to $540K per hour, depending on the type and size of the business. A stable database environment is key to preventing long outages and loss of business.

  • Expand value. Monitoring and alerting mechanisms can help you design, develop, and operate a highly available, resilient, reliable, performant, cost-effective, and secure digital service, but it's only the beginning. You will want your organization to scale and thrive over time, enhance existing cloud workloads, and introduce new services. New services provide additional value to your customers and more revenue to your business, creating a flywheel effect on the growth of your business.

  • Improve developer productivity. Developers who are productive and efficient, and who don't encounter issues and bottlenecks in their development tasks, can deliver high-quality products in a shorter time. However, software engineering and IT operations often have complex challenges, and this complexity increases with the scale of workloads and their architectures. To analyze performance and consistency across distributed applications, developers need tools that can provide correlated metrics and traces. These help identify defective code artifacts and infrastructure components as quickly as possible, and help determine impacts on end users. The right suite of monitoring and alerting tools can help developers code and test better and faster.

  • Improve operational effectiveness and efficiency. When you operate cloud workloads at scale, even a small percentage improvement in performance can result in millions of dollars in savings. By monitoring your databases and analyzing metrics, events, logs, and traces, you can understand and predict your future capacity needs, and can take advantage of the cost savings available in the AWS Cloud. Understanding your Amazon RDS workloads and operational health can help you respond to events, fix issues, and plan improvements.