Monitoring and observability - Management and Governance Cloud Environment Guide

Monitoring and observability

Like security, monitoring and observability are required for all teams who operate and administer cloud applications and services. As described in the Operational Excellence Pillar whitepaper, your teams must define, capture, and analyze operations metrics to gain visibility into workload events so that you can take appropriate action. In the management layer, this also means understanding operational metrics as you provide guardrails, network, security, and identity services in your management platform.

All of your teams, whether responsible for many cloud environments or a single application, must be able to understand the health of their operations easily. Your teams will want to use metrics based on operations outcomes to gain useful insights. Use these metrics to make informed decisions and as key inputs into each of the eight M&G Guide capabilities. AWS makes it easier to bring together and analyze your operations logs so that you can generate metrics, know the status of your operations, and gain insight from operations over time. These activities are supported centrally when you provide an observability solution for the consumption, storage, analysis, and presentation of operational data.
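As a rough illustration of generating a metric from operations logs, the following minimal sketch computes a per-service error rate from a handful of records. The `LogRecord` shape, the service names, and the "status 500 or above counts as an error" rule are assumptions made for this example only; they do not correspond to any AWS log format or API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical, simplified log record; real log schemas will differ."""
    service: str
    status: int  # HTTP-style status code


def error_rate_by_service(records):
    """Return the fraction of records with status >= 500 for each service."""
    totals = Counter()
    errors = Counter()
    for record in records:
        totals[record.service] += 1
        if record.status >= 500:  # assumed definition of an error
            errors[record.service] += 1
    return {service: errors[service] / totals[service] for service in totals}


# Illustrative sample data, not real operational logs.
sample = [
    LogRecord("checkout", 200),
    LogRecord("checkout", 503),
    LogRecord("search", 200),
    LogRecord("search", 200),
]
rates = error_rate_by_service(sample)
```

In practice, a centrally provided observability solution would typically perform this kind of aggregation at scale; the sketch only shows the idea of turning raw log events into an outcome-based metric that teams can track over time.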

As described in Responding to Events, you should anticipate both planned operational events (such as sales promotions, deployments, and failure tests) and unplanned ones (such as surges in utilization and component failures). Use simulations, custom runbooks, and playbooks, and iterate to deliver consistent results when you respond to alerts. Each defined alert should be owned by a role or a team that is accountable for the response and escalations. You will also want to know the business impact of your system components and use this to target efforts when needed. Perform a root cause analysis (RCA) after events, and then either introduce the changes and controls needed to prevent failures from recurring or document workarounds.
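One way to check that every alert has an accountable owner is to keep alert definitions as data and validate them. The sketch below is a minimal example under stated assumptions: the `AlertDefinition` fields, the alert names, and the team names are hypothetical and are not part of any AWS service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertDefinition:
    """Hypothetical alert record used only for this illustration."""
    name: str
    owner: Optional[str]       # role or team accountable for the response
    escalation: Optional[str]  # who is engaged if the owner does not respond
    runbook: Optional[str]     # identifier of the runbook or playbook to follow


def unowned_alerts(alerts):
    """Return the names of alerts that lack an owner or an escalation contact."""
    return [a.name for a in alerts if not a.owner or not a.escalation]


# Illustrative definitions; names and teams are made up.
alerts = [
    AlertDefinition("high-error-rate", "checkout-team", "platform-on-call", "rb-001"),
    AlertDefinition("disk-nearly-full", None, "platform-on-call", None),
]
missing = unowned_alerts(alerts)
```

A check like this could run whenever alert definitions change, so that gaps in ownership surface before an event rather than during one.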

In many enterprises, technical teams share integrated systems to monitor the services or infrastructure they manage. Shared observability systems bring together all the performance data for an entire organization, enabling teams to visualize the connections between services and components, collaborate with real-time data, and quickly identify the source of performance or security issues.

Observability systems collect data directly from applications and from AWS logging and service metric capabilities. AWS provides several services that can help improve your monitoring and observability posture. These services include AWS CloudTrail, Amazon CloudWatch, Amazon Managed Service for Prometheus, VPC Flow Logs, AWS X-Ray traces, Amazon EventBridge events, Amazon Managed Grafana, Elastic Load Balancing, and AWS Network Firewall.