Distributed monitoring - Implementing Microservices on AWS

Distributed monitoring

A microservices architecture consists of many different distributed parts that have to be monitored. You can use Amazon CloudWatch to collect and track metrics, centralize and monitor log files, set alarms, and automatically react to changes in your AWS environment. CloudWatch can monitor AWS resources such as Amazon EC2 instances, DynamoDB tables, and Amazon RDS DB instances, as well as custom metrics generated by your applications and services, and any log files your applications generate.


You can use CloudWatch to gain system-wide visibility into resource utilization, application performance, and operational health. CloudWatch provides a reliable, scalable, and flexible monitoring solution that you can start using within minutes. You no longer need to set up, manage, and scale your own monitoring systems and infrastructure. In a microservices architecture, the capability of monitoring custom metrics using CloudWatch is an additional benefit because developers can decide which metrics should be collected for each service. In addition, dynamic scaling can be implemented based on custom metrics.

In addition to Amazon CloudWatch, you can also use CloudWatch Container Insights to collect, aggregate, and summarize metrics and logs from your containerized applications and microservices. CloudWatch Container Insights automatically collects metrics for many resources, such as CPU, memory, disk, and network and aggregate as CloudWatch metrics at the cluster, node, pod, task, and service level. Using CloudWatch Container Insights, you can gain access to CloudWatch Container Insights dashboard metrics. It also provides diagnostic information, such as container restart failures, to help you isolate issues and resolve them quickly. You can also set CloudWatch alarms on metrics that Container Insights collects.

Container Insights is available for Amazon ECS, Amazon EKS, and Kubernetes platforms on Amazon EC2. Amazon ECS support includes support for Fargate.

Another popular option, especially for Amazon EKS, is to use Prometheus. Prometheus is an open-source monitoring and alerting toolkit that is often used in combination with Grafana to visualize the collected metrics. Many Kubernetes components store metrics at /metrics and Prometheus can scrape these metrics at a regular interval.

Amazon Managed Service for Prometheus (AMP) is a Prometheus-compatible monitoring service that enables you to monitor containerized applications at scale. With AMP, you can use the open-source Prometheus query language (PromQL) to monitor the performance of containerized workloads without having to manage the underlying infrastructure required to manage the ingestion, storage, and querying of operational metrics. You can collect Prometheus metrics from Amazon EKS and Amazon ECS environments, using AWS Distro for OpenTelemetry or Prometheus servers as collection agents.

AMP is often used in combination with Amazon Managed Service for Grafana (AMG). AMG makes it easy to query, visualize, alert on and understand your metrics no matter where they are stored. With AMG, you can analyze your metrics, logs, and traces without having to provision servers, configure and update software, or do the heavy lifting involved in securing and scaling Grafana in production.

Centralizing logs

Consistent logging is critical for troubleshooting and identifying issues. Microservices enable teams to ship many more releases than ever before and encourage engineering teams to run experiments on new features in production. Understanding customer impact is crucial to gradually improving an application.

By default, most AWS services centralize their log files. The primary destinations for log files on AWS are Amazon S3 and Amazon CloudWatch Logs. For applications running on Amazon EC2 instances, a daemon is available to send log files to CloudWatch Logs. Lambda functions natively send their log output to CloudWatch Logs and Amazon ECS includes support for the awslogs log driver that enables the centralization of container logs to CloudWatch Logs. For Amazon EKS, either Fluent Bit or Fluentd can forward logs from the individual instances in the cluster to a centralized logging CloudWatch Logs where they are combined for higher-level reporting using Amazon OpenSearch Service and Kibana. Because of its smaller footprint and performance advantages, Fluent Bit is recommended instead of FluentD.

The following figure illustrates the logging capabilities of some of the services. Teams are then able to search and analyze these logs using tools like Amazon OpenSearch Service and Kibana. Amazon Athena can be used to run a one-time query against centralized log files in Amazon S3.

Logging capabilities of AWS services

Distributed tracing

In many cases, a set of microservices works together to handle a request. Imagine a complex system consisting of tens of microservices in which an error occurs in one of the services in the call chain. Even if every microservice is logging properly and logs are consolidated in a central system, it can be difficult to find all relevant log messages.

The central idea of AWS X-Ray is the use of correlation IDs, which are unique identifiers attached to all requests and messages related to a specific event chain. The trace ID is added to HTTP requests in specific tracing headers named X-Amzn-Trace-Id when the request hits the first X-Ray-integrated service (for example, Application Load Balancer or API Gateway) and included in the response. Through the X-Ray SDK, any microservice can read, but can also add or update this header.

X-Ray works with Amazon EC2, Amazon ECS, Lambda, and AWS Elastic Beanstalk. You can use X-Ray with applications written in Java, Node.js, and .NET that are deployed on these services.

AWS X-Ray service map

Epsagon is fully managed SaaS that includes tracing for all AWS services, third-party APIs (through HTTP calls), and other common services such as Redis, Kafka, and Elastic. The Epsagon service includes monitoring capabilities, alerting to the most common services, and payload visibility into each and every call your code is making.

AWS Distro for OpenTelemetry is a secure, production-ready, AWS-supported distribution of the OpenTelemetry project. Part of the Cloud Native Computing Foundation, AWS Distro for OpenTelemetry provides open-source APIs, libraries, and agents to collect distributed traces and metrics for application monitoring. With AWS Distro for OpenTelemetry, you can instrument your applications just one time to send correlated metrics and traces to multiple AWS and partner monitoring solutions. Use auto-instrumentation agents to collect traces without changing your code. AWS Distro for OpenTelemetry also collects metadata from your AWS resources and managed services to correlate application performance data with underlying infrastructure data, reducing the mean time to problem resolution. Use AWS Distro for OpenTelemetry to instrument your applications running on Amazon EC2, Amazon ECS, Amazon EKS on Amazon EC2, Fargate, and AWS Lambda, as well as on-premises.

Options for log analysis on AWS

Searching, analyzing, and visualizing log data is an important aspect of understanding distributed systems. Amazon CloudWatch Logs Insights enables you to explore, analyze, and visualize your logs instantly. This allows you to troubleshoot operational problems. Another option for analyzing log files is to use Amazon OpenSearch Service together with Kibana.

Amazon OpenSearch Service can be used for full-text search, structured search, analytics, and all three in combination. Kibana is an open-source data visualization plugin that seamlessly integrates with the Amazon OpenSearch Service.

The following figure demonstrates log analysis with Amazon OpenSearch Service and Kibana. CloudWatch Logs can be configured to stream log entries to Amazon OpenSearch Service in near real time through a CloudWatch Logs subscription. Kibana visualizes the data and exposes a convenient search interface to data stores in Amazon OpenSearch Service. This solution can be used in combination with software like ElastAlert to implement an alerting system to send SNS notifications and emails, create JIRA tickets, and so forth, if anomalies, spikes, or other patterns of interest are detected in the data.

Log analysis with Amazon OpenSearch Service and Kibana

Another option for analyzing log files is to use Amazon Redshift with Amazon QuickSight.

QuickSight can be easily connected to AWS data services, including Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon EMR, DynamoDB, Amazon S3, and Amazon Kinesis

CloudWatch Logs can act as a centralized store for log data, and, in addition to only storing the data, it is possible to stream log entries to Amazon Kinesis Data Firehose.

The following figure depicts a scenario where log entries are streamed from different sources to Amazon Redshift using CloudWatch Logs and Kinesis Data Firehose. Amazon QuickSight uses the data stored in Amazon Redshift for analysis, reporting, and visualization.

Log analysis with Amazon Redshift and Amazon QuickSight

The following figure depicts a scenario of log analysis on Amazon S3. When the logs are stored in Amazon S3 buckets, the log data can be loaded in different AWS data services, such as Amazon Redshift or Amazon EMR, to analyze the data stored in the log stream and find anomalies.

Log analysis on Amazon S3