Designing and implementing logging and monitoring with Amazon CloudWatch - AWS Prescriptive Guidance

Khurram Nizami, Amazon Web Services (AWS)

April 2023

This guide helps you design and implement logging and monitoring with Amazon CloudWatch and related Amazon Web Services (AWS) management and governance services for workloads that use Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Lambda, and on-premises servers. The guide is intended for operations teams, DevOps engineers, and application engineers who manage workloads on the AWS Cloud.

Your logging and monitoring approach should be based on the six pillars of the AWS Well-Architected Framework. These pillars are operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. A well-architected monitoring and alarming solution improves reliability and performance by helping you proactively analyze and adjust your infrastructure.

This guide doesn't extensively discuss logging and monitoring for security or cost optimization because these topics require in-depth evaluation of their own. Many AWS services support security logging and monitoring, including AWS CloudTrail, AWS Config, Amazon Inspector, Amazon Detective, Amazon Macie, Amazon GuardDuty, and AWS Security Hub. For cost optimization, you can use AWS Cost Explorer, AWS Budgets, and CloudWatch billing metrics.

The following table outlines the six areas that your logging and monitoring solution should address.

| Area | Description |
| --- | --- |
| Capturing and ingesting log files and metrics | Identify, configure, and send system and application logs and metrics to AWS services from different sources. |
| Searching and analyzing logs | Search and analyze logs for operations management, problem identification, troubleshooting, and applications analysis. |
| Monitoring metrics and alarming | Identify and act on observations and trends in your workloads. |
| Monitoring application and service availability | Reduce downtime and improve your ability to meet service level targets by continuously monitoring service availability. |
| Tracing applications | Trace application requests in systems and external dependencies to fine-tune performance, perform root cause analysis, and troubleshoot issues. |
| Creating dashboards and visualizations | Create dashboards that focus on relevant metrics and observations for your systems and workloads, which helps continuous improvement and proactive discovery of issues. |
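
As one illustration of the "Monitoring metrics and alarming" area, the following Python sketch builds the parameters for a CloudWatch alarm on EC2 CPU utilization. It is a minimal example under stated assumptions, not the approach prescribed by this guide: the function name `build_cpu_alarm_params`, the alarm name prefix, the 80 percent threshold, and the instance ID are all hypothetical. The dictionary keys follow the parameters accepted by the `put_metric_alarm` operation in boto3, but the sketch only constructs the dictionary and does not call AWS.

```python
def build_cpu_alarm_params(instance_id, threshold_percent=80.0, periods=3):
    """Build keyword arguments for a CloudWatch alarm on EC2 CPU utilization.

    The function name, alarm name prefix, and default values are
    illustrative assumptions, not values taken from the guide.
    """
    if not 0 < threshold_percent <= 100:
        raise ValueError("threshold_percent must be in the range (0, 100]")
    return {
        "AlarmName": f"example-high-cpu-{instance_id}",
        "AlarmDescription": "Average CPU utilization above threshold",
        "Namespace": "AWS/EC2",            # namespace for built-in EC2 metrics
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,                      # seconds per evaluated datapoint
        "EvaluationPeriods": periods,       # consecutive periods to evaluate
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
    }


# Hypothetical instance ID, used only for illustration.
params = build_cpu_alarm_params("i-0123456789abcdef0")
print(params["AlarmName"])

# With AWS credentials configured, the dictionary could be passed to boto3:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameter construction separate from the API call makes the alarm definition easy to review and unit test before anything is created in an account.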

CloudWatch can meet most logging and monitoring requirements and provides a reliable, scalable, and flexible solution. Many AWS services automatically provide CloudWatch metrics and integrate with CloudWatch logging for monitoring and analysis. CloudWatch also provides agents and log drivers to support a variety of compute options such as servers (both in the cloud and on premises), containers, and serverless computing. This guide also covers the following AWS services that are used with logging and monitoring:

The AWS compute services that you choose also affect the implementation and configuration of your logging and monitoring solution. For example, the way you implement and configure CloudWatch differs for Amazon EC2, Amazon ECS, Amazon EKS, and Lambda.

Application and workload owners often forget about logging and monitoring, or configure and implement it inconsistently. As a result, workloads enter production with limited observability, which delays the identification of issues and increases the time needed to troubleshoot and resolve them. At a minimum, your logging and monitoring solution must address two layers: the systems layer, for operating system (OS)-level logs and metrics, and the application layer, for application logs and metrics. The guide provides a recommended approach for addressing these two layers across different compute types, including the three compute types outlined in the following table.

| Compute type | What the guide covers |
| --- | --- |
| Long-running and immutable EC2 instances | System and application logs and metrics across multiple operating systems (OSs) in multiple AWS Regions or accounts. |
| Containers | System and application logs and metrics for your Amazon ECS and Amazon EKS clusters, including examples for different configurations. |
| Serverless | System and application logs and metrics for your Lambda functions and considerations for customization. |
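
To make the two layers concrete for an EC2 instance, the following Python sketch generates a CloudWatch agent configuration that collects OS-level metrics and logs (systems layer) alongside one application log file (application layer). It is a minimal sketch under stated assumptions: the function name `build_agent_config`, the log group prefix, the application log path, and the choice of `/var/log/messages` as the system log are hypothetical and would differ by operating system and workload. The keys follow the CloudWatch agent configuration file schema for Linux.

```python
import json


def build_agent_config(app_log_path, log_group_prefix, interval_seconds=60):
    """Return a CloudWatch agent configuration as a dictionary.

    The function name, paths, and log group names are illustrative
    assumptions, not values taken from the guide.
    """
    return {
        "agent": {"metrics_collection_interval": interval_seconds},
        # Systems layer: OS-level metrics.
        "metrics": {
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "cpu": {
                    "measurement": ["cpu_usage_user", "cpu_usage_system"],
                    "totalcpu": True,
                },
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        # Systems layer: OS log file (path assumed).
                        {
                            "file_path": "/var/log/messages",
                            "log_group_name": f"{log_group_prefix}/system",
                            "log_stream_name": "{instance_id}",
                        },
                        # Application layer: application log file.
                        {
                            "file_path": app_log_path,
                            "log_group_name": f"{log_group_prefix}/application",
                            "log_stream_name": "{instance_id}",
                        },
                    ]
                }
            }
        },
    }


# Hypothetical application path and log group prefix.
config = build_agent_config("/opt/example-app/logs/app.log", "/example/workload")
print(json.dumps(config, indent=2))
```

Generating the configuration from code is one way to keep it consistent across instances; the resulting JSON could be stored centrally and distributed to each instance rather than edited by hand.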

This guide provides a logging and monitoring solution that addresses CloudWatch and related AWS services in the following areas:

Implementation examples are used throughout this guide across these areas and are also available from the AWS Samples GitHub repository.