Understanding Workload Health - Operational Excellence Pillar

Understanding Workload Health

Define, capture, and analyze workload metrics to gain visibility into workload events so that you can take appropriate action.

Your team should be able to understand the health of your workload easily. Use metrics based on workload outcomes to gain useful insights, and use those metrics to build dashboards with business and technical viewpoints that help team members make informed decisions.

AWS makes it easy to bring together and analyze your workload logs so that you can generate metrics, understand the health of your workload, and gain insight from operations over time.

Identify key performance indicators: Identify key performance indicators (KPIs) based on desired business outcomes (for example, order rate, customer retention rate, and profit versus operating expense) and customer outcomes (for example, customer satisfaction). Evaluate KPIs to determine workload success.

Define workload metrics: Define workload metrics to measure the achievement of KPIs (for example, abandoned shopping carts, orders placed, cost, price, and allocated workload expense). Define workload metrics to measure the health of the workload (for example, interface response time, error rate, requests made, requests completed, and utilization). Evaluate metrics to determine if the workload is achieving desired outcomes, and to understand the health of the workload.
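As an illustration of the health metrics listed above, the following minimal sketch derives error rate, completion rate, and average response time from raw counts for one reporting window. The RequestWindow structure, its field names, and the sample numbers are assumptions made for this example, not part of any AWS API.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Raw counts for one reporting window (hypothetical structure)."""
    requests_made: int
    requests_completed: int
    errors: int
    total_response_ms: float


def health_metrics(window: RequestWindow) -> dict:
    """Derive workload health metrics from raw counts.

    Rates are fractions between 0 and 1. Empty windows yield 0.0
    instead of raising ZeroDivisionError.
    """
    made = window.requests_made
    completed = window.requests_completed
    return {
        "error_rate": window.errors / made if made else 0.0,
        "completion_rate": completed / made if made else 0.0,
        "avg_response_ms": (
            window.total_response_ms / completed if completed else 0.0
        ),
    }


# Illustrative numbers only.
window = RequestWindow(
    requests_made=200,
    requests_completed=190,
    errors=10,
    total_response_ms=28500.0,
)
metrics = health_metrics(window)
```

Keeping the raw counts separate from the derived ratios makes it possible to re-aggregate across windows later, which averaging precomputed rates would not allow.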

You should send log data to a service like CloudWatch Logs, and generate metrics from observations of necessary log content.
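One way to generate a metric from log content is a CloudWatch Logs metric filter. The sketch below builds the request parameters for the CloudWatch Logs PutMetricFilter operation and, optionally, sends them with boto3. It assumes JSON-formatted log events that carry a "level" field; the log group name, filter name, metric name, and the "ExampleWorkload" namespace are hypothetical.

```python
def error_metric_filter(log_group: str) -> dict:
    """Build PutMetricFilter parameters that count ERROR-level events.

    Assumes log events are JSON objects with a "level" field.
    """
    return {
        "logGroupName": log_group,
        "filterName": "ErrorCount",  # hypothetical name
        "filterPattern": '{ $.level = "ERROR" }',
        "metricTransformations": [
            {
                "metricName": "ErrorCount",             # hypothetical name
                "metricNamespace": "ExampleWorkload",   # hypothetical namespace
                "metricValue": "1",    # add 1 for each matching event
                "defaultValue": 0.0,   # report 0 when nothing matches
            }
        ],
    }


def create_metric_filter(params: dict) -> None:
    """Send the request. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("logs").put_metric_filter(**params)


params = error_metric_filter("/example/app")
```

Setting a default value of 0 means the metric still reports data points during periods with no matching events, which tends to make later alarming on the metric more predictable.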

CloudWatch has specialized features, such as Amazon CloudWatch Application Insights for .NET and SQL Server and Container Insights, that can help you identify and set up key metrics, logs, and alarms across the supported application resources and technology stacks.

Collect and analyze workload metrics: Perform regular proactive reviews of metrics to identify trends and determine where appropriate responses are needed.

You should aggregate log data from your application, workload components, services, and API calls in a service like CloudWatch Logs. Generate metrics from observations of necessary log content to gain insight into the performance of operations activities.

In the AWS shared responsibility model, portions of monitoring are delivered to you through the AWS Personal Health Dashboard. This dashboard provides alerts and remediation guidance when AWS is experiencing events that might affect you. Customers with Business and Enterprise Support subscriptions also get access to the AWS Health API, which lets them integrate these events with their event management systems.

On AWS, you can export your log data to Amazon S3 or send logs directly to Amazon S3 for long-term storage. Using AWS Glue, you can discover and prepare your log data in Amazon S3 for analytics, storing associated metadata in the AWS Glue Data Catalog. Amazon Athena, through its native integration with AWS Glue, can then be used to analyze your log data, querying it using standard SQL. Using a business intelligence tool like Amazon QuickSight, you can visualize, explore, and analyze your data.
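To make the Athena step concrete, the following sketch builds the parameters for the Athena StartQueryExecution operation with a standard SQL query over log data. The table name app_logs, the columns status_code and request_date, the date literal, the database name, and the S3 output location are all hypothetical; they depend on how your logs are cataloged in the AWS Glue Data Catalog.

```python
# Hypothetical table and column names; adjust to your cataloged schema.
STATUS_BREAKDOWN_SQL = """
SELECT status_code, COUNT(*) AS requests
FROM app_logs
WHERE request_date = DATE '2024-01-01'
GROUP BY status_code
ORDER BY requests DESC
""".strip()


def athena_request(database: str, output_location: str) -> dict:
    """Build StartQueryExecution parameters for the query above."""
    return {
        "QueryString": STATUS_BREAKDOWN_SQL,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(params: dict) -> str:
    """Submit the query. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    response = boto3.client("athena").start_query_execution(**params)
    return response["QueryExecutionId"]


request = athena_request("logs_db", "s3://example-bucket/athena-results/")
```

Athena runs queries asynchronously, so the returned execution ID would be used afterward to check the query status and fetch the results.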

An alternative solution would be to use the Amazon OpenSearch Service and OpenSearch Dashboards to collect, analyze, and display logs on AWS across multiple accounts and AWS Regions.

Establish workload metrics baselines: Establish baselines for metrics to provide expected values as the basis for comparison and for identifying underperforming and overperforming components. Identify thresholds for improvement, investigation, and intervention.
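A baseline with investigation and intervention thresholds can be sketched as follows. This is one simple approach, assuming the metric is roughly stable and that thresholds at two and three sample standard deviations above the mean are acceptable; both multipliers and the sample response times are illustrative choices, not recommendations from this guidance.

```python
import statistics


def build_baseline(samples, investigate_sigma=2.0, intervene_sigma=3.0):
    """Summarize historical samples into a mean and two upper thresholds."""
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation
    return {
        "mean": mean,
        "stdev": stdev,
        "investigate_above": mean + investigate_sigma * stdev,
        "intervene_above": mean + intervene_sigma * stdev,
    }


def classify(value, baseline):
    """Compare a new observation with the baseline thresholds."""
    if value >= baseline["intervene_above"]:
        return "intervene"
    if value >= baseline["investigate_above"]:
        return "investigate"
    return "normal"


# Illustrative response times in milliseconds.
baseline = build_baseline([100, 110, 90, 105, 95])
labels = [classify(v, baseline) for v in (102, 118, 130)]
```

For metrics where low values also signal trouble, such as orders placed, the same idea applies with lower thresholds added below the mean.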

Learn expected patterns of activity for your workload: Establish patterns of workload activity to identify anomalous behavior so that you can respond appropriately if necessary.

CloudWatch, through the CloudWatch Anomaly Detection feature, applies statistical and machine learning algorithms to generate a range of expected values that represent normal metric behavior.
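The idea of a range of expected values can be illustrated with a much simpler stand-in than the algorithms CloudWatch uses. The sketch below is not the CloudWatch Anomaly Detection algorithm; it assumes activity follows an hour-of-day pattern and builds a band of the mean plus or minus two population standard deviations for each hour. The function names and sample data are hypothetical.

```python
import statistics
from collections import defaultdict


def expected_bands(history, width=2.0):
    """Build an expected (low, high) range for each hour of day.

    history is an iterable of (hour_of_day, value) pairs.
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)

    bands = {}
    for hour, values in by_hour.items():
        mean = statistics.fmean(values)
        spread = statistics.pstdev(values)  # population standard deviation
        bands[hour] = (mean - width * spread, mean + width * spread)
    return bands


def is_anomalous(hour, value, bands):
    """Return True when the value falls outside the band for that hour."""
    low, high = bands[hour]
    return value < low or value > high


# Illustrative request counts: quiet at 03:00, busy at 14:00.
history = [(3, 10), (3, 12), (3, 11), (14, 200), (14, 220), (14, 210)]
bands = expected_bands(history)
```

With this data, 200 requests is within the expected range at 14:00 but well outside it at 03:00, which is the kind of context a single static threshold cannot express.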

Alert when workload outcomes are at risk: Raise an alert when workload outcomes are at risk so that you can respond appropriately if necessary.

Ideally, you have previously identified a metric threshold that you are able to alarm upon or an event that you can use to trigger an automated response.
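A threshold alarm of this kind could be defined with the CloudWatch PutMetricAlarm operation. The sketch below builds the request parameters and, optionally, sends them with boto3. The alarm name, the "ExampleWorkload" namespace, the ErrorCount metric, the threshold of 5, and the Amazon SNS topic ARN are assumptions for illustration.

```python
def error_count_alarm(topic_arn: str, threshold: float = 5.0) -> dict:
    """Build PutMetricAlarm parameters for a static threshold alarm."""
    return {
        "AlarmName": "example-workload-high-error-count",  # hypothetical
        "Namespace": "ExampleWorkload",                    # hypothetical
        "MetricName": "ErrorCount",                        # hypothetical
        "Statistic": "Sum",
        "Period": 300,             # evaluate 5-minute periods
        "EvaluationPeriods": 3,    # look at the last 3 periods
        "DatapointsToAlarm": 2,    # alarm when 2 of those 3 breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],  # for example, an SNS topic to notify
    }


def create_alarm(params: dict) -> None:
    """Send the request. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)


alarm = error_count_alarm(
    "arn:aws:sns:us-east-1:123456789012:example-ops-alerts"  # placeholder ARN
)
```

Requiring two breaching data points out of three is one way to reduce alerts from a single noisy period; the right values depend on how quickly you need to respond.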

You can also use CloudWatch Logs Insights to interactively search and analyze your log data using a purpose-built query language. CloudWatch Logs Insights automatically discovers fields in logs from AWS services, and custom log events in JSON. It scales with your log volume and query complexity and gives you answers in seconds, helping you to search for the contributing factors of an incident.
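As a sketch of such a query, the following example counts log events containing "ERROR" in 5-minute buckets using the CloudWatch Logs StartQuery and GetQueryResults operations. The log group name, the one-hour lookback, and the assumption that errors contain the text "ERROR" are illustrative.

```python
import time

# Count events that contain "ERROR", grouped into 5-minute buckets.
ERRORS_BY_5_MINUTES = "\n".join(
    [
        "fields @timestamp, @message",
        "| filter @message like /ERROR/",
        "| stats count(*) as error_count by bin(5m)",
        "| sort error_count desc",
    ]
)


def insights_request(log_group, lookback_seconds=3600, now=None):
    """Build StartQuery parameters covering the most recent lookback window."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - lookback_seconds,  # epoch seconds
        "endTime": end,
        "queryString": ERRORS_BY_5_MINUTES,
        "limit": 20,
    }


def run_query(params, poll_seconds=1.0):
    """Run the query and wait for it to finish. Requires boto3 and credentials."""
    import boto3  # imported lazily so the builder works without boto3

    logs = boto3.client("logs")
    query_id = logs.start_query(**params)["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            return response
        time.sleep(poll_seconds)


request = insights_request("/example/app", now=1_700_000_000)
```

Sorting the buckets by error count surfaces the time windows most likely to contain the contributing factors of an incident.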

Alert when workload anomalies are detected: Raise an alert when workload anomalies are detected so that you can respond appropriately if necessary.

Your analysis of your workload metrics over time may establish patterns of behavior that you can quantify sufficiently to define an event or raise an alarm in response.

Once trained, the CloudWatch Anomaly Detection feature can be used to alarm on detected anomalies or can provide overlaid expected values onto a graph of metric data for ongoing comparison.
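An alarm based on the anomaly detection band might be defined as in the sketch below, which builds PutMetricAlarm parameters that use the ANOMALY_DETECTION_BAND metric math function. The alarm name, the "ExampleWorkload" namespace, the RequestsCompleted metric, the band width of 2, and the topic ARN are assumptions for illustration.

```python
def anomaly_alarm(topic_arn: str, band_width: int = 2) -> dict:
    """Build PutMetricAlarm parameters for an anomaly detection alarm.

    The alarm fires when the metric leaves the expected band in
    either direction.
    """
    return {
        "AlarmName": "example-workload-requests-anomaly",  # hypothetical
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",  # refers to the expression below
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "ExampleWorkload",     # hypothetical
                        "MetricName": "RequestsCompleted",  # hypothetical
                    },
                    "Period": 300,
                    "Stat": "Sum",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected range",
                "ReturnData": True,
            },
        ],
        "AlarmActions": [topic_arn],
    }


def create_alarm(params: dict) -> None:
    """Send the request. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)


alarm = anomaly_alarm(
    "arn:aws:sns:us-east-1:123456789012:example-ops-alerts"  # placeholder ARN
)
```

A larger band width widens the expected range and produces fewer alerts, so it is a natural tuning point once you have seen how the alarm behaves against real traffic.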

Validate the achievement of outcomes and the effectiveness of KPIs and metrics: Create a business-level view of your workload operations to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals. Validate the effectiveness of KPIs and metrics and revise them if necessary.

AWS also supports third-party log analysis systems and business intelligence tools (for example, Grafana, Kibana, and Logstash) through the AWS service APIs and SDKs.