Distributed Monitoring - Implementing Microservices on AWS

Distributed Monitoring

A microservices architecture consists of many different distributed parts that have to be monitored.

You can use Amazon CloudWatch to collect and track metrics, centralize and monitor log files, set alarms, and automatically react to changes in your AWS environment. CloudWatch can monitor AWS resources such as EC2 instances, DynamoDB tables, and RDS DB instances, as well as custom metrics generated by your applications and services, and any log files your applications generate.


You can use CloudWatch to gain system-wide visibility into resource utilization, application performance, and operational health. CloudWatch provides a reliable, scalable, and flexible monitoring solution that you can start using within minutes. You no longer need to set up, manage, and scale your own monitoring systems and infrastructure. In a microservices architecture, the capability of monitoring custom metrics using CloudWatch is an additional benefit because developers can decide which metrics should be collected for each service. In addition to that, dynamic scaling can be implemented based on custom metrics.

Another popular option–especially for Amazon EKS–is to use Prometheus. Prometheus is an open-source monitoring and alerting toolkit that is often used in combination with Grafana to visualize the collected metrics. Many Kubernetes components store metrics at /metrics and Prometheus can scrape these metrics at a regular interval.

Centralizing Logs

Consistent logging is critical for troubleshooting and identifying issues. Microservices allow teams to ship many more releases than ever before and encourage engineering teams to run experiments on new features in production. Understanding customer impact is crucial to improving an application gradually.

Most AWS services centralize their log files by default. The primary destinations for log files on AWS are Amazon S3 and Amazon CloudWatch Logs. For applications running on EC2 instances, a daemon is available to send log files to CloudWatch Logs. Lambda functions natively send their log output to CloudWatch Logs and Amazon ECS includes support for the awslogs log driver that allows the centralization of container logs to CloudWatch Logs. For Amazon EKS, it is necessary to run FluentD which forwards logs from the individual instances in the cluster to a centralized logging CloudWatch Logs where they are combined for higher-level reporting using Elasticsearch and Kibana.

Figure 10 illustrates the logging capabilities of some of the services. Teams are then able to search and analyze these logs using tools like Amazon Elasticsearch Service (Amazon ES) and Kibana. Amazon Athena can be used to run ad hoc queries against centralized log files in Amazon S3.

Figure 10: Logging capabilities of AWS services

Distributed Tracing

In many cases, a set of microservices works together to handle a request. Imagine a complex system consisting of tens of microservices in which an error occurs in one of the services in the call chain. Even if every microservice is logging properly and logs are consolidated in a central system, it can be difficult to find all relevant log messages.

The central idea behind AWS X-Ray is the use of correlation IDs, which are unique identifiers attached to all requests and messages related to a specific event chain. The trace ID is added to HTTP requests in specific tracing headers named X-Amzn-Trace-Id when the request hits the first X-Ray-integrated service (for example, Application Load Balancer or API Gateway) and included in the response. Via the X-Ray SDK, any microservice can read but can also add or update this header.

AWS X-Ray works with Amazon EC2, Amazon ECS, Lambda, and AWS Elastic Beanstalk. You can use X-Ray with applications written in Java, Node.js, and .NET that are deployed on these services.

Figure 11: AWS X-Ray service map

Options for Log Analysis on AWS

Searching, analyzing, and visualizing log data is an important aspect of understanding distributed systems. Amazon CloudWatch Logs Insights is a great service to explore, analyze, and visualize your logs instantly. This allows you to troubleshoot operational problems. Another option for analyzing log files is to use Amazon ES together with Kibana.

Amazon ES can be used for full-text search, structured search, analytics, and all three in combination. Kibana is an open source data visualization plugin for Amazon ES that seamlessly integrates with it.

Figure 12 demonstrates log analysis with Amazon ES and Kibana. CloudWatch Logs can be configured to stream log entries to Amazon ES in near real time through a CloudWatch Logs subscription. Kibana visualizes the data and exposes a convenient search interface to data stores in Amazon ES. This solution can be used in combination with software like ElastAlert to implement an alerting system in order to send SNS notifications, emails, create JIRA tickets, etc., if anomalies, spikes, or other patterns of interest are detected in the data.

Figure 12: Log analysis with Amazon Elasticsearch Service and Kibana

Another option for analyzing log files is to use Amazon Redshift together with Amazon QuickSight.

Amazon QuickSight can be easily connected to AWS data services, including Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon EMR, Amazon DynamoDB, Amazon S3, and Amazon Kinesis.

Amazon CloudWatch Logs can act as a centralized store for log data, and, in addition to only storing the data, it is possible to stream log entries to Amazon Kinesis Data Firehose.

Figure 13 depicts a scenario where log entries are streamed from different sources to Amazon Redshift using CloudWatch Logs and Kinesis Data Firehose. Amazon QuickSight uses the data stored in Amazon Redshift for analysis, reporting, and visualization.

Figure 13: Log analysis with Amazon Redshift and Amazon QuickSight

Figure 14 depicts a scenario of log analysis on Amazon S3. When the logs are stored in S3 buckets, the log data can be loaded in different AWS data services, such as Amazon Redshift or Amazon EMR, to analyze the data stored in the log stream and find anomalies.

Figure 14: Log analysis on Amazon S3