FSIREL08: How do you monitor your resources to understand your workload's health? - Financial Services Industry Lens

High availability for applications requires the ability to detect failures and recover quickly. Workloads must be configured to emit the relevant telemetry to detect failures, so that operational processes can capture and react to these events.

FSIREL08-BP01 Use a single pane of glass for monitoring

Amazon CloudWatch provides monitoring across AWS resources and lets you organize the resulting data so that detected issues are escalated as quickly as possible. Without adequate processes in place, you may miss leading indicators of problems. Adopting a single pane of glass and standardizing cloud monitoring practices across your organization helps avoid information silos and simplifies the analysis of monitoring data. Combining AWS system metrics with workload logs lets analysts cross-reference signals and log information across dependent systems. Issues frequently surface in invoking systems, and IT professionals then spend time parsing logs on the invoking systems instead of on the dependent systems where the error originated. Consider embedding metrics in logs with the CloudWatch embedded metric format (EMF), which lets you drill down quickly from the single pane of glass to the most granular entity of your workload. For more information on building efficient dashboards for operational visibility, see the Amazon Builders' Library.
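As an illustration of the EMF approach, the following minimal sketch builds a single structured log line that carries both a metric definition and searchable context. The function name, namespace, dimension, and field names are hypothetical; only the `_aws` metadata layout follows the embedded metric format. In practice you might use an EMF client library rather than assembling the JSON by hand.

```python
import json
import time


def emf_record(namespace, dimensions, metric_name, value, unit="Milliseconds", **context):
    """Build one JSON log line in CloudWatch embedded metric format.

    `dimensions` is a dict of dimension name -> value. Extra keyword
    arguments are hypothetical context fields (for example a request ID)
    that stay searchable in the log but are not extracted as metrics.
    """
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since the epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [list(dimensions)],  # one dimension set
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        metric_name: value,  # the metric value is a top-level member
    }
    record.update(dimensions)  # dimension values are top-level members too
    record.update(context)
    return json.dumps(record)


if __name__ == "__main__":
    # Assumed example values; writing this line to a log group that
    # CloudWatch ingests is what turns it into a metric.
    print(emf_record("Payments/Checkout", {"Service": "ledger-api"},
                     "Latency", 182, requestId="req-123"))
```

Because the request ID travels in the same log event as the metric value, an analyst can move from a dashboard anomaly to the individual log entries behind it.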

FSIREL08-BP02 Alert on the absence of an event

The absence of monitoring data can indicate an underlying issue. Implement controls that alert on missed reporting intervals. Configure alarms to treat missing data as breaching the threshold, so that a data source that goes silent raises an alarm instead of going unnoticed.
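One way to alert on a missed reporting interval is a CloudWatch alarm on a heartbeat metric whose missing-data behavior is set to `breaching`. The sketch below only builds the parameters for the `PutMetricAlarm` API. The helper name, alarm name, namespace, metric name, and SNS topic ARN are all assumptions for illustration, and a single evaluation period may be too sensitive for sources that report with some delay.

```python
def heartbeat_alarm_params(alarm_name, namespace, metric_name, period_seconds, sns_topic_arn):
    """Return keyword arguments for CloudWatch PutMetricAlarm.

    The alarm fires when fewer than one heartbeat data point arrives in a
    period, and also when no data arrives at all.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "SampleCount",        # count data points, ignore their values
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",   # no data counts as a breach
        "AlarmActions": [sns_topic_arn],
    }


if __name__ == "__main__":
    params = heartbeat_alarm_params(
        "ledger-heartbeat-missing",        # hypothetical alarm name
        "Payments/Ledger",                 # hypothetical namespace
        "Heartbeat",                       # hypothetical metric name
        60,
        "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # placeholder ARN
    )
    # With boto3 installed and credentials configured, the alarm could be
    # created with: boto3.client("cloudwatch").put_metric_alarm(**params)
    print(params)
```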

FSIREL08-BP03 Identify metrics and validate alerts through load testing

Workloads must be load-tested regularly to validate scaling and resilience. During these load tests, identify the key metrics that correlate with capacity constraints and customer outages, both for components that scale automatically with demand and for static resources such as relational databases.

As part of your load testing, validate these metrics and their associated alerts to confirm that alerts are issued as expected. Perform load tests in lower environments to identify indicators for alerting and automated remediation. Validating your indicators and alerts through load testing minimizes your mean time to detection (MTTD), which gives your recovery mechanisms more time to respond and increases the workload's availability.
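To make the MTTD measurement concrete, the sketch below computes it from load-test results recorded as pairs of timestamps: when a fault or saturation point was introduced, and when the corresponding alert fired. The function name and the data layout are assumptions for illustration, not a prescribed format. Alerts that never fired are counted separately, because they indicate a gap in alert coverage.

```python
from datetime import datetime, timedelta


def summarize_detection(events):
    """Summarize alert validation results from a load test.

    `events` is a list of (fault_injected_at, alert_fired_at) tuples, where
    alert_fired_at is None if no alert was raised. Returns a tuple of
    (mean time to detection, number of missed alerts); the mean is None
    when nothing was detected.
    """
    delays = [fired - injected for injected, fired in events if fired is not None]
    missed = len(events) - len(delays)
    mttd = sum(delays, timedelta()) / len(delays) if delays else None
    return mttd, missed


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, 0)  # made-up timestamps
    results = [
        (start, start + timedelta(seconds=60)),
        (start, start + timedelta(seconds=120)),
        (start, None),  # alert never fired
    ]
    mttd, missed = summarize_detection(results)
    print(f"MTTD: {mttd}, missed alerts: {missed}")
```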

FSIREL08-BP04 Use distributed tracing tools for service-oriented architectures

As systems become more distributed with the adoption of microservices architectures, identifying performance bottlenecks becomes more challenging. Use distributed tracing tools such as AWS X-Ray and OpenTelemetry to trace requests and provide telemetry across multiple systems on a transaction-by-transaction basis, following each transaction as it spans multiple services.
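Distributed tracing depends on every service passing along a shared trace identifier with each call. The sketch below illustrates that idea by hand, using the W3C Trace Context `traceparent` header layout (version, trace ID, parent span ID, flags). The function names are hypothetical, and this is not the X-Ray or OpenTelemetry SDK; those tools generate and propagate this context for you.

```python
import secrets


def new_traceparent():
    """Start a new trace: return a traceparent header value for the root span."""
    trace_id = secrets.token_hex(16)  # 32 hex characters, shared by the whole transaction
    span_id = secrets.token_hex(8)    # 16 hex characters, unique to this span
    return f"00-{trace_id}-{span_id}-01"  # version 00, flags 01 (sampled)


def child_traceparent(parent_header):
    """Derive the header a service sends on an outgoing downstream call.

    The trace ID is preserved so that spans from different services can be
    stitched together; only the span ID changes.
    """
    version, trace_id, _parent_span_id, flags = parent_header.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    incoming = new_traceparent()            # created at the edge service
    outgoing = child_traceparent(incoming)  # sent to a dependent service
    print(incoming)
    print(outgoing)
```

Because both headers carry the same trace ID, a tracing backend can reconstruct the path of a single transaction across the invoking and dependent services.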