Stage 4: Operate

This stage focuses on the operational practices your system uses to maintain and improve resilience. Building a culture of operational excellence helps establish standards and consistency for these practices.

Observability

To measure resilience effectively, monitor your applications from both the server side and the client side.

For server-side monitoring, use your preferred infrastructure monitoring tools. With these tools, set up dashboards, alarms, and notifications that fire when a key performance indicator (KPI) for your infrastructure or applications is breached. If you are using Amazon CloudWatch, configure dashboards and alarms to alert you to infrastructure or application failures.
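As one illustration of a KPI alarm, the following minimal sketch builds the arguments for the CloudWatch `put_metric_alarm` call through boto3. The choice of metric (target 5XX responses on an Application Load Balancer), the threshold, the evaluation windows, and the function names `build_error_alarm` and `create_alarm` are assumptions made for this example, not recommendations from this guide. Adjust them to the KPIs that matter for your workload.

```python
from typing import Any, Dict


def build_error_alarm(alarm_name: str, load_balancer: str,
                      threshold: int = 10) -> Dict[str, Any]:
    """Return keyword arguments for CloudWatch put_metric_alarm.

    The metric and the numbers below are illustrative assumptions.
    """
    return {
        "AlarmName": alarm_name,
        "AlarmDescription": "Target 5XX responses exceeded the KPI threshold",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",
        "Period": 60,                  # evaluate one-minute sums
        "EvaluationPeriods": 3,        # look at the last three periods
        "DatapointsToAlarm": 2,        # alarm if two of them breach
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no traffic is not a failure
    }


def create_alarm(params: Dict[str, Any], sns_topic_arn: str) -> None:
    """Create or update the alarm and notify the given SNS topic."""
    import boto3  # imported lazily so the builder works without the SDK

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(AlarmActions=[sns_topic_arn], **params)
```

Keeping the parameter builder separate from the API call makes the alarm definition easy to review and unit test before it is applied to an account.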

Use Service Quotas, an AWS service that helps you manage your quotas for many AWS services, to prevent quota issues when you scale production workloads on AWS. You can also create CloudWatch alarms that alert you when usage approaches the quota for services that are critical to your workload.
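A quota alarm of this kind can be sketched with CloudWatch metric math, where the `SERVICE_QUOTA()` function returns the quota that applies to a usage metric. The example below assumes the EC2 On-Demand vCPU usage metric in the `AWS/Usage` namespace and an 80 percent threshold. The dimension values, the threshold, and the function name `build_quota_alarm` are illustrative assumptions, so verify them against the usage metrics that your account publishes.

```python
from typing import Any, Dict


def build_quota_alarm(alarm_name: str,
                      percent_threshold: float = 80.0) -> Dict[str, Any]:
    """Return put_metric_alarm arguments that alarm on quota utilization."""
    return {
        "AlarmName": alarm_name,
        "AlarmDescription": "Usage is approaching the service quota",
        "Metrics": [
            {
                # Raw usage metric; not alarmed on directly.
                "Id": "usage",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/Usage",
                        "MetricName": "ResourceCount",
                        "Dimensions": [
                            {"Name": "Service", "Value": "EC2"},
                            {"Name": "Resource", "Value": "vCPU"},
                            {"Name": "Type", "Value": "Resource"},
                            {"Name": "Class", "Value": "Standard/OnDemand"},
                        ],
                    },
                    "Period": 300,
                    "Stat": "Maximum",
                },
                "ReturnData": False,
            },
            {
                # Utilization as a percentage of the applied quota.
                "Id": "pct",
                "Expression": "(usage / SERVICE_QUOTA(usage)) * 100",
                "Label": "Quota utilization (%)",
                "ReturnData": True,
            },
        ],
        "EvaluationPeriods": 1,
        "Threshold": percent_threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }
```

The returned dictionary can be passed to `boto3.client("cloudwatch").put_metric_alarm(**params)`, optionally with an `AlarmActions` list for notifications.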

For client-side monitoring, set up synthetic monitoring by creating scripts or canaries that monitor your application for metrics such as availability, page load times, or broken links. Synthetic monitoring continually verifies your customers' experience by following the same routes and actions as customers. You can use Amazon CloudWatch Synthetics canaries to visually monitor your applications. The visual monitoring blueprint helps you quickly create or update canaries that compare screenshots of your running application against baseline screenshots.

You can also use CloudWatch RUM for real user monitoring, which collects client-side data about your web application's performance from actual user sessions in near real time. You can then visualize and analyze data such as page load times, client-side errors, and user behavior.
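To make the idea of a synthetic check concrete, the following minimal sketch probes a URL for availability and latency using only the Python standard library. It is a generic illustration and does not use the CloudWatch Synthetics runtime. The names `ProbeResult`, `http_get_status`, and `probe`, and the 2000 ms latency budget, are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    url: str
    ok: bool
    status: Optional[int]  # None when no HTTP response was received
    latency_ms: float


def http_get_status(url: str, timeout: float = 5.0) -> int:
    """Fetch the URL and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses still carry a status code


def probe(url: str,
          fetch: Callable[[str], int] = http_get_status,
          max_latency_ms: float = 2000.0) -> ProbeResult:
    """Run one availability check and time it."""
    start = time.monotonic()
    try:
        status: Optional[int] = fetch(url)
    except OSError:  # DNS failure, refused connection, or timeout
        status = None
    latency_ms = (time.monotonic() - start) * 1000.0
    ok = (status is not None
          and 200 <= status < 400
          and latency_ms <= max_latency_ms)
    return ProbeResult(url, ok, status, latency_ms)
```

Running `probe()` on a schedule against the same routes that customers use, and publishing the results as metrics, gives a simple availability signal. The injectable `fetch` argument lets you test the check logic without network access.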

Continuous resilience

Periodically review the resilience posture of your mission-critical workloads by using the AWS Well-Architected Tool. Also consider running a game day to simulate a known event against which you have established resilience mechanisms. For example, you might simulate an Availability Zone impairment and run a Multi-AZ failover. Although these activities can require significant effort, both practices build confidence that your workload is resilient against the failure modes that you've designed it to withstand.
