Observability - AWS Cloud Adoption Framework: Operations Perspective

Observability

Gain actionable insights from your infrastructure and application data.

Observability describes how well you can understand what is happening in a system, often by instrumenting it to collect metrics, logs, or traces. In the cloud, observability can be hard to achieve due to sheer system complexity. Whether in data centers or in the cloud, to achieve operational excellence and meet business objectives, you need to understand how your systems are performing. Observability solutions enable you to collect and analyze data from applications and infrastructure so that you can understand their internal states. You can then be alerted to, troubleshoot, and resolve issues with application availability and performance, which improves the end user experience.

The primary purpose of observability is to help you detect and investigate problems. Its secondary purpose is to let you define and measure Key Performance Indicators (KPIs) and Service Level Objectives (SLOs), such as uptime. For most organizations, important operations KPIs include mean time to detect (MTTD) and mean time to recover (MTTR) from an incident.
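As a minimal sketch of how these two KPIs can be derived from incident records, the following computes MTTD and MTTR from timestamps. The `Incident` record and its field names are illustrative assumptions, and the sketch assumes both KPIs are measured from the moment the fault began; some organizations measure MTTR from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime    # when the fault began
    detected: datetime   # when an alarm or a person noticed it
    recovered: datetime  # when service was restored


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average gap between fault start and detection."""
    seconds = mean((i.detected - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mean_time_to_recover(incidents: list[Incident]) -> timedelta:
    """Average gap between fault start and recovery (assumed definition)."""
    seconds = mean((i.recovered - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


# Two made-up incidents, for illustration only.
INCIDENTS = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 4), datetime(2024, 1, 5, 10, 30)),
    Incident(datetime(2024, 2, 9, 14, 0), datetime(2024, 2, 9, 14, 10), datetime(2024, 2, 9, 14, 50)),
]
```

Tracking these values over time shows whether your alerting and troubleshooting investments are shortening incidents.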

As well as providing operations teams with actionable data, observability helps you prioritize your initiatives by enabling you to determine if you are serving the needs of your customers and meeting your business outcomes. Observability can help you find performance improvements in your cloud resources that in turn enable you to reduce costs and improve the customer experience.

“You can’t improve what you don’t measure.” – Peter Drucker

Start

You’ll often see monitoring, tracing, and logging described as the “three pillars of observability”. To start with, you should focus on collecting, visualizing, alerting on, and analyzing logs and metrics. In this section, we’ll also discuss using synthetic transactions to check the availability and performance of your applications because they provide powerful insights with minimal effort.

Use Amazon CloudWatch Metrics to centralize data about the performance of your systems. By default, many AWS services provide free metrics for resources, such as Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Elastic Block Store (Amazon EBS) volumes, and Amazon Relational Database Service (Amazon RDS) instances. Visualizing and alerting on these infrastructure metrics should be your starting point.

Use Amazon CloudWatch Logs to centralize the logs from all of your workloads, in a single, highly-scalable service. You can then easily view logs, search them for specific error codes or patterns, filter them based on specific fields, or archive them securely for future analysis. Many AWS services also provide vended logs that enable you to troubleshoot issues.

If you are at the beginning of your cloud journey or you are migrating to cloud-native observability tools for the first time, you may want to consider your EC2 instances as your starting point, beyond vended metrics and logs provided by AWS. To start collecting logs and metrics from EC2 instances and on-premises servers, install and run the CloudWatch agent using AWS Systems Manager.
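The following sketch shows one way this could look: the parameters for a Systems Manager `send_command` call that installs the agent package, and a small agent configuration. The tag key and value, log file path, and log group name are hypothetical. The builders are plain Python, so you can inspect the payloads without AWS access.

```python
import json


def install_agent_command(tag_key: str, tag_value: str) -> dict:
    """Parameters for ssm.send_command that install the CloudWatch agent
    on every managed node carrying the given tag (tag is an assumption)."""
    return {
        "DocumentName": "AWS-ConfigureAWSPackage",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "Parameters": {"action": ["Install"], "name": ["AmazonCloudWatchAgent"]},
    }


def agent_config(log_file: str, log_group: str) -> str:
    """A small agent configuration: memory and disk metrics plus one log file."""
    config = {
        "metrics": {
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            }
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_file,
                            "log_group_name": log_group,
                            # The agent substitutes the instance ID at runtime.
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
    }
    return json.dumps(config, indent=2)


# With credentials configured, the command could be sent like this:
#   import boto3
#   boto3.client("ssm").send_command(**install_agent_command("Monitoring", "enabled"))
```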

Once you have metrics and logs from managed services and EC2 instances, you can begin to visualize them and use Amazon CloudWatch alarms to trigger alerts when thresholds are breached. It is important to test your applications before you deploy them to production, so that you fully understand which metrics and log events indicate that there could be an issue. Use a combination of load, exception, and smoke testing (using synthetic transactions) to establish a baseline where possible. If it is not possible to establish a baseline, or if your application has a predictable but variable load, use anomaly detection to apply statistical and machine learning algorithms to predict and surface anomalies with minimal user intervention. Use metrics explorer and CloudWatch Metrics Insights to explore your metric data in more detail.
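As an illustration of anomaly detection, the sketch below builds the parameters for a `put_metric_alarm` call that alarms when EC2 CPU utilization leaves the expected band. The alarm name, period, evaluation periods, and band width are assumptions chosen for the example, not recommended values.

```python
def anomaly_alarm_params(instance_id: str, band_width: int = 2) -> dict:
    """Parameters for cloudwatch.put_metric_alarm using an anomaly detection band.

    band_width is the number of standard deviations used for the band;
    a larger value produces a wider band and fewer alarms.
    """
    return {
        "AlarmName": f"cpu-anomaly-{instance_id}",  # hypothetical naming scheme
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",  # compare the metric against the band below
        "TreatMissingData": "missing",
        "Metrics": [
            {
                "Id": "cpu",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EC2",
                        "MetricName": "CPUUtilization",
                        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "ReturnData": True,
                "Expression": f"ANOMALY_DETECTION_BAND(cpu, {band_width})",
                "Label": "Expected CPU range",
            },
        ],
    }


# With credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**anomaly_alarm_params("i-0123456789abcdef0"))
```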

Amazon CloudWatch dashboards are customizable home pages in the CloudWatch console that you can use to monitor your resources in a single view (even those resources that are spread across different Regions). Use CloudWatch dashboards to visualize your infrastructure and applications. Start with AWS managed services and EC2 instances, then consider expanding your visualizations as you mature in your adoption of other managed services, such as serverless services like AWS Lambda and Amazon DynamoDB.
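A dashboard is defined by a JSON body. The sketch below builds a body with one CPU widget per EC2 instance, where each widget names its own Region so that a single dashboard can span Regions. The instance IDs, Regions, and layout values are placeholders.

```python
import json


def build_dashboard_body(instances: list[tuple[str, str]]) -> str:
    """Return a dashboard body with one CPU widget per (instance_id, region) pair."""
    widgets = []
    for row, (instance_id, region) in enumerate(instances):
        widgets.append(
            {
                "type": "metric",
                "x": 0,
                "y": row * 6,  # stack the widgets vertically
                "width": 12,
                "height": 6,
                "properties": {
                    "title": f"CPU utilization: {instance_id} ({region})",
                    "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
                    "stat": "Average",
                    "period": 300,
                    "region": region,  # each widget can read from a different Region
                },
            }
        )
    return json.dumps({"widgets": widgets})


# With credentials configured (dashboard name is hypothetical):
#   import boto3
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="ec2-overview",
#       DashboardBody=build_dashboard_body([("i-0aaa", "us-east-1"), ("i-0bbb", "eu-west-1")]),
#   )
```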

The moment you create your dashboards is also the ideal time to create alerts using CloudWatch alarms, because you will be considering which metrics are important to visualize and measure. A metric alarm watches a single CloudWatch metric, or the result of a math expression based on multiple CloudWatch metrics. The alarm performs one or more actions based on the value of the metric or expression relative to a threshold over a number of time periods. The action can send a notification to an Amazon Simple Notification Service (Amazon SNS) topic, perform an EC2 action or an EC2 Auto Scaling action, or create an operational work item (OpsItem) or an incident in AWS Systems Manager. At this stage, focus on alerting the appropriate people as quickly as possible.
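To make the threshold and time-period settings concrete, here is a sketch of the parameters for a static-threshold metric alarm that notifies an SNS topic. The alarm name, the 80 percent threshold, and the "2 out of 3 periods" setting are illustrative assumptions; the topic ARN is supplied by the caller.

```python
def cpu_alarm_params(instance_id: str, topic_arn: str, threshold: float = 80.0) -> dict:
    """Parameters for cloudwatch.put_metric_alarm on EC2 CPU utilization."""
    return {
        "AlarmName": f"cpu-high-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,              # evaluate 5-minute averages
        "EvaluationPeriods": 3,     # look at the last 3 periods
        "DatapointsToAlarm": 2,     # alarm if 2 of those 3 breach the threshold
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # notify this SNS topic on ALARM
        "TreatMissingData": "missing",
    }


# With credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **cpu_alarm_params("i-0123456789abcdef0", "arn:aws:sns:us-east-1:111122223333:ops-alerts")
#   )
```

Requiring two breaching datapoints out of three is one way to avoid paging on a single short spike.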

If you have migrated from on-premises, it’s also likely that you have important insights in your logs that you may want to know about. Use CloudWatch log metric filters to search for a pattern in logs in near real-time and expose that data as a metric. For example, an error encountered in a log file could be counted, exposed as a metric using a log metric filter, and then alerted upon using CloudWatch alarms.
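The error-counting example above could be sketched as follows. The filter name, metric name, and namespace are hypothetical; the filter pattern `ERROR` matches log events that contain that term.

```python
def error_metric_filter_params(log_group: str, namespace: str) -> dict:
    """Parameters for logs.put_metric_filter that count ERROR log events."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",  # hypothetical filter name
        "filterPattern": "ERROR",     # match events containing the term ERROR
        "metricTransformations": [
            {
                "metricName": "ErrorCount",
                "metricNamespace": namespace,
                "metricValue": "1",    # add 1 to the metric for each matching event
                "defaultValue": 0.0,   # report 0 when nothing matches
            }
        ],
    }


# With credentials configured:
#   import boto3
#   boto3.client("logs").put_metric_filter(**error_metric_filter_params("my-app-logs", "MyApp"))
```

Once the filter is in place, the resulting `ErrorCount` metric can be graphed and alarmed on like any other CloudWatch metric.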

When you have application and infrastructure metrics and logs in place, you can also add synthetic transactions. They are easy to configure and can provide critical data on performance and availability. In the absence of a better metric, availability metrics from synthetic canaries can also be used to measure SLOs or service level agreements (SLAs). You can use Amazon CloudWatch Synthetics to create canaries (configurable scripts that run on a schedule) to monitor your endpoints and application programming interfaces (APIs). Canaries follow the same routes and perform the same actions as a customer or user, which makes it possible for you to continually verify their experience, even when you don’t have any traffic on your workloads.
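The idea behind a canary can be sketched in a few lines of generic Python. This is not the CloudWatch Synthetics runtime or its scripting library; it only illustrates a scheduled availability probe and how its results could be compared with an availability SLO. The function names and the SLO target are assumptions.

```python
import urllib.request
from typing import Iterable


def http_check(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def availability(results: Iterable[bool]) -> float:
    """Percentage of successful checks; 0.0 if there were no checks."""
    outcomes = list(results)
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


def meets_slo(results: Iterable[bool], target_percent: float) -> bool:
    """Compare measured availability with an availability objective."""
    return availability(results) >= target_percent
```

For example, 995 successful runs out of 1,000 gives 99.5 percent availability, which would meet a 99 percent objective but miss a 99.9 percent one.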

Advance

Once you have metrics and alerts using CloudWatch alarms, use composite alarms to reduce noise and alert fatigue. A composite alarm includes a rule expression that considers the alarm states of other alarms that you have created. The composite alarm goes into ALARM state only if all conditions of the rule are met. The alarms specified in a composite alarm's rule expression can include metric alarms and other composite alarms.
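As a sketch, the rule expression for a composite alarm that fires only when every child alarm is in ALARM state can be assembled like this. The alarm names and the topic ARN are hypothetical.

```python
def all_in_alarm(alarm_names: list[str]) -> str:
    """Rule expression that is true only when every named alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_alarm_params(name: str, child_alarms: list[str], topic_arn: str) -> dict:
    """Parameters for cloudwatch.put_composite_alarm."""
    return {
        "AlarmName": name,
        "AlarmRule": all_in_alarm(child_alarms),
        "AlarmActions": [topic_arn],
        "ActionsEnabled": True,
    }


# With credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_composite_alarm(
#       **composite_alarm_params(
#           "checkout-degraded",
#           ["cpu-high", "latency-high"],
#           "arn:aws:sns:us-east-1:111122223333:ops-alerts",
#       )
#   )
```

Attaching the notification to the composite alarm, rather than to each child alarm, is what reduces noise: high CPU alone no longer pages anyone unless latency is also affected.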

As you begin to advance in your observability journey, log analytics becomes increasingly important to enable you to troubleshoot infrastructure and application issues. Use CloudWatch Logs Insights and Contributor Insights for log analytics. With CloudWatch Logs Insights, you can interactively search and analyze your log data in CloudWatch Logs. You can perform queries to help you respond to operational issues more efficiently and effectively, and reduce your MTTR. You can use Contributor Insights to analyze log data and create time series that display contributor data. You can see metrics about the top-N (for example, top 10) contributors, the total number of unique contributors, and their usage. This helps you find top talkers (contributors to traffic) and understand who or what is impacting system performance. For example, you can find bad hosts, identify the heaviest network users, or find the URLs that generate the most errors.
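A Logs Insights query might look like the following sketch, which counts error events in five-minute buckets. The log group name, the `ERROR` pattern, and the one-hour window are assumptions. `run_query` imports boto3 lazily, so the query builder can be used without AWS access.

```python
import time

# Count log events containing ERROR, grouped into 5-minute buckets.
ERROR_TREND_QUERY = """
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as error_count by bin(5m)
| sort error_count desc
| limit 20
""".strip()


def error_trend_query_params(log_group: str, end_time: int, window_seconds: int = 3600) -> dict:
    """Parameters for logs.start_query covering the window that ends at end_time
    (epoch seconds)."""
    return {
        "logGroupName": log_group,
        "startTime": end_time - window_seconds,
        "endTime": end_time,
        "queryString": ERROR_TREND_QUERY,
    }


def run_query(params: dict, poll_seconds: float = 1.0) -> dict:
    """Start the query and poll until it leaves the Scheduled/Running states."""
    import boto3  # imported lazily; requires AWS credentials at call time

    logs = boto3.client("logs")
    query_id = logs.start_query(**params)["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            return response
        time.sleep(poll_seconds)
```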

Now that you have metrics, logs, and alerting in place, it’s time to look at complete tracing of transactions through what are likely to become more distributed systems as your business matures. This stage will complete the three pillars of observability. AWS X-Ray receives traces from your application in addition to AWS services your application uses that are already integrated with X-Ray. Instrumenting your application involves sending trace data for incoming and outbound requests and other events within your application, along with metadata about each request. Many instrumentation scenarios require only configuration changes. For example, you can instrument all incoming HTTP requests and downstream calls to AWS services that your Java application makes.
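In practice, the X-Ray SDK or an OpenTelemetry library generates trace data for you. To show what that data looks like, the sketch below builds a trace ID and a minimal segment document by hand. The service name and timings are made up, and the helper names are assumptions.

```python
import json
import os
import time
from typing import Optional


def new_trace_id(now: Optional[float] = None) -> str:
    """Trace ID: version 1, start time as 8 hex digits, then 24 random hex digits."""
    epoch = int(time.time() if now is None else now)
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def build_segment(name: str, trace_id: str, start: float, end: float) -> dict:
    """Minimal segment document describing one unit of work in a service."""
    return {
        "name": name,                 # logical service name
        "id": os.urandom(8).hex(),    # 16 hex digit segment ID
        "trace_id": trace_id,         # shared by all segments of one request
        "start_time": start,          # epoch seconds, may be fractional
        "end_time": end,
    }


# With credentials configured, a segment could be sent directly:
#   import boto3
#   trace_id = new_trace_id()
#   segment = build_segment("checkout", trace_id, time.time() - 0.25, time.time())
#   boto3.client("xray").put_trace_segments(TraceSegmentDocuments=[json.dumps(segment)])
```

Every service that handles the same request reuses the trace ID, which is what lets X-Ray stitch the segments into one end-to-end trace.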

You can correlate your logs with your traces using software development kits (SDKs) that support this feature, or by injecting your logs with your trace IDs to simplify troubleshooting. View your services and applications in CloudWatch ServiceLens to visualize a complete view of your application with correlated metrics, logs, and traces. Use the visualization and associated traces, metrics, and logs to efficiently pinpoint performance bottlenecks or faults, helping you reduce your MTTR by reducing the time required to understand where an issue occurs. Use ServiceLens metrics, such as latency, to alert you before your customers notice there is a potential issue.
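If your SDK does not inject trace IDs for you, one possible approach is a logging filter that stamps every log record with the current trace ID. The sketch below uses only the Python standard library; the logger name, the JSON field names, and the way the trace ID is stored in a context variable are assumptions.

```python
import contextvars
import io
import json
import logging

# Holds the trace ID of the request currently being handled (None outside a request).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so the trace ID is a searchable field."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "message": record.getMessage(),
                "trace_id": getattr(record, "trace_id", None),
            }
        )


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("orders")  # hypothetical service logger
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter())
    logger.handlers = [handler]
    return logger


# Usage: set the trace ID when a request starts, then log as usual.
buffer = io.StringIO()
log = build_logger(buffer)
current_trace_id.set("1-5759e988-bd862e3fe1be46a994272793")
log.info("payment authorized")
entry = json.loads(buffer.getvalue())
```

With the trace ID present as a structured field, you can search your logs for a single trace ID and see every log line that belongs to that request.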

Excel

By now, you should have the three pillars of observability in place, so you can easily troubleshoot and get to the root cause of an issue quickly and efficiently.

However, there is still more that can be done to measure complete application performance and make better use of existing data. To begin with, use CloudWatch Real User Monitoring (RUM) to monitor your web applications. With Amazon CloudWatch RUM, you can perform real user monitoring to collect client-side data about your web application performance from actual user sessions in near real-time. The data collected includes page load times, client-side errors, and user behavior. You can use the collected data to quickly identify and debug client-side performance issues and understand the user impact, including the number of users impacted, geolocations, and browsers.

As your adoption of AWS matures, you will probably find yourself using more serverless services and containers. CloudWatch Lambda Insights is a monitoring and troubleshooting solution for serverless applications running on AWS Lambda. The solution collects, aggregates, and summarizes system-level metrics, including CPU time, memory, disk, and network. It also collects, aggregates, and summarizes diagnostic information, such as cold starts and Lambda worker shutdowns, to help you isolate issues with your Lambda functions and resolve them quickly.

Use CloudWatch Container Insights to collect, aggregate, and summarize metrics and logs from your containerized applications and microservices. CloudWatch automatically collects metrics for many resources, such as CPU, memory, disk, and network. Container Insights also provides diagnostic information, such as container restart failures, to help you isolate issues and resolve them quickly. You can visualize a map of your container resources, CloudWatch alarms related to metrics that Container Insights collects, and performance metrics.

At this point in your journey, you may be considering how to optimize cost when collecting high cardinality metrics at scale. The CloudWatch embedded metric format enables you to ingest complex high-cardinality application data in the form of logs, and to generate actionable metrics from them. You can embed custom metrics alongside detailed log event data, and CloudWatch automatically extracts the custom metrics so that you can visualize and alarm on them, for near real-time incident detection. This can help to reduce costs by reducing the requirement to call the PutMetricData API, and also help simplify the collection of metrics for Lambda and containers.
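A log event in the embedded metric format is a JSON object with an `_aws` metadata block that tells CloudWatch which fields to extract as metrics. The sketch below builds one such event. The namespace, dimension, metric, and property names are illustrative. High-cardinality values such as a request ID are kept as plain properties, not dimensions, so they stay searchable in the logs without creating a separate metric per value.

```python
import json
import time
from typing import Optional


def emf_event(
    namespace: str,
    dimensions: dict,
    metric_name: str,
    value: float,
    unit: str,
    properties: Optional[dict] = None,
    timestamp_ms: Optional[int] = None,
) -> str:
    """Build one embedded metric format log event as a JSON string."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    event = {
        "_aws": {
            "Timestamp": timestamp_ms,  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [list(dimensions)],  # one set of dimension names
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        metric_name: value,  # the metric value lives at the top level
    }
    event.update(dimensions)          # dimension values also live at the top level
    event.update(properties or {})    # extra context, not extracted as metrics
    return json.dumps(event)


# In AWS Lambda, printing the event to stdout is enough for it to reach CloudWatch Logs:
#   print(emf_event("MyApp", {"Service": "checkout"}, "Latency", 123.0, "Milliseconds",
#                   properties={"requestId": "abc-123"}))
```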

You may also want to consider some open-source options, particularly for the collection of metrics and traces. At the time of writing, logs are not available as part of OpenTelemetry; instead, use the CloudWatch agent or Fluent Bit to send logs to CloudWatch Logs. You can use the AWS Distro for OpenTelemetry to collect metrics and traces in a variety of source formats from Amazon EC2, AWS Lambda, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and on-premises infrastructure, and send them to managed services such as X-Ray and CloudWatch as well as Amazon Managed Service for Prometheus. This gives your developers more options whilst maintaining data in a centrally managed location.

If you want to visualize data from more sources than CloudWatch and X-Ray, or if you want to manage dashboards for non-IAM users, you can use Amazon Managed Grafana to create dashboards from AWS data sources such as Amazon CloudWatch, Amazon OpenSearch Service, AWS X-Ray, AWS IoT SiteWise, Amazon Timestream, and Amazon Managed Service for Prometheus as well as many popular open-source, third-party, and other cloud data sources.