
Advanced Multi-AZ Resilience Patterns

This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.


Appendix B – Example chi-squared calculation

The following is an example of collecting error metrics and performing a chi-squared test on the data. The code is not production ready and does not perform the necessary error handling, but it does provide a proof of concept of how the logic works. You should update this example to fit your needs.
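Before walking through the Lambda function, the following minimal sketch shows what the chi-squared test itself does with per-AZ error counts. The counts are invented for illustration: `chisquare` compares the observed counts against a uniform expectation and returns a p-value, and a p-value at or below 0.05 indicates statistically significant skew.

```python
from scipy.stats import chisquare

# Hypothetical 5xx error counts for three AZs over one minute
balanced = [20, 21, 19]  # errors spread roughly evenly
skewed = [60, 2, 1]      # one AZ producing most of the errors

# chisquare tests observed counts against a uniform expected distribution
stat_balanced, p_balanced = chisquare(balanced)
stat_skewed, p_skewed = chisquare(skewed)

print(p_balanced)  # well above 0.05: no significant skew
print(p_skewed)    # below 0.05: one AZ is a statistical outlier
```

This is the same decision the function below makes when it compares the returned p-value against 0.05 before publishing a metric value of 1 for the outlier Availability Zone.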

First, a Lambda function is invoked each minute by an Amazon EventBridge scheduled event. The content of the event is configured with the following data:

{
    "timestamp": "2023-03-15T15:26:37.527Z",
    "namespace": "multi-az/frontend",
    "metricName": "5xx",
    "dimensions": [
        { "Name": "Region", "Value": "us-east-1" },
        { "Name": "Controller", "Value": "Home" },
        { "Name": "Action", "Value": "Index" }
    ],
    "period": 60,
    "stat": "Sum",
    "unit": "Count",
    "chiSquareMetricName": "multi-az/chi-squared",
    "azs": [ "use1-az2", "use1-az4", "use1-az6" ]
}

The data specifies the common inputs needed to retrieve the appropriate CloudWatch metrics (like namespace, metric name, and dimensions) and then publish the chi-squared results for each Availability Zone. The Lambda function code, written for Python 3.9, looks like the following. At a high level, it collects the specified CloudWatch metrics for the previous minute, runs the chi-squared test on that data, and then publishes CloudWatch metrics with the result of the test for each Availability Zone specified.

import os
import boto3
import datetime
import copy
import json
from datetime import timedelta
from scipy.stats import chisquare
from aws_embedded_metrics import metric_scope

cw_client = boto3.client("cloudwatch", os.environ.get("AWS_REGION", "us-east-1"))

@metric_scope
def handler(event, context, metrics):
    metrics.set_property("Event", json.loads(json.dumps(event, default=str)))

    time = datetime.datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")

    # Round down to the previous minute
    end: datetime = roundTime(time)

    # Subtract a minute for the start
    start: datetime = end - timedelta(minutes=1)

    # Get all the metrics that match the query
    results = get_all_metrics(event, start, end, metrics)
    metrics.set_property("MetricCounts", results)

    # Calculate the chi squared result
    chi_sq_result = chisquare(list(results.values()))
    expected = sum(list(results.values())) / len(results.values())
    metrics.set_property("ChiSquaredResult", chi_sq_result)

    # Put the chi square metrics into CloudWatch
    put_all_metrics(event, results, chi_sq_result[1], expected, start, metrics)

def get_all_metrics(detail: dict, start: datetime, end: datetime, metrics):
    """Gets all of the error metrics for each AZ specified"""
    metric_query = {
        "MetricDataQueries": [],
        "StartTime": start,
        "EndTime": end
    }

    for az in detail["azs"]:
        dim = copy.deepcopy(detail["dimensions"])
        dim.append({"Name": "AZ-ID", "Value": az})

        query = {
            "Id": az.replace("-", "_"),
            "MetricStat": {
                "Metric": {
                    "Namespace": detail["namespace"],
                    "MetricName": detail["metricName"],
                    "Dimensions": dim
                },
                "Period": int(detail["period"]),
                "Stat": detail["stat"],
                "Unit": detail["unit"]
            },
            "Label": az,
            "ReturnData": True
        }

        metric_query["MetricDataQueries"].append(query)

    metrics.set_property("GetMetricRequest", json.loads(json.dumps(metric_query, default=str)))

    next_token: str = None
    results = {}

    while True:
        if next_token is not None:
            metric_query["NextToken"] = next_token

        data = cw_client.get_metric_data(**metric_query)

        if next_token is not None:
            metrics.set_property("GetMetricResult::" + next_token, json.loads(json.dumps(data, default=str)))
        else:
            metrics.set_property("GetMetricResult", json.loads(json.dumps(data, default=str)))

        for item in data["MetricDataResults"]:
            key = item["Id"].replace("_", "-")
            if key not in results:
                results[key] = 0
            results[key] += sum(item["Values"])

        # Stop paginating when there is no NextToken in the response
        next_token = data.get("NextToken")
        if next_token is None:
            break

    return results

def put_all_metrics(detail: dict, results: dict, chi_sq_value: float, expected: float, timestamp: datetime, metrics):
    """Adds the chi squared metric for all AZs to CloudWatch"""
    farthest_from_expected = None

    if len(results) > 0:
        keys = list(results.keys())
        farthest_from_expected = keys[0]

        for key in keys:
            if abs(results[key] - expected) > abs(results[farthest_from_expected] - expected):
                farthest_from_expected = key

    metric_query = {
        "Namespace": detail["namespace"],
        "MetricData": []
    }

    for az in detail["azs"]:
        dim = copy.deepcopy(detail["dimensions"])
        dim.append({"Name": "AZ-ID", "Value": az})

        query = {
            "MetricName": detail["chiSquareMetricName"],
            "Dimensions": dim,
            "Timestamp": timestamp,
        }

        # Publish 1 for the AZ farthest from the expected value when the
        # chi-squared test indicates statistically significant skew
        if chi_sq_value <= 0.05 and az == farthest_from_expected:
            query["Value"] = 1
        else:
            query["Value"] = 0

        metric_query["MetricData"].append(query)

    metrics.set_property("PutMetricRequest", json.loads(json.dumps(metric_query, default=str)))
    cw_client.put_metric_data(**metric_query)

def roundTime(dt=None, roundTo=60):
    """Round a datetime object to any time lapse in seconds
    dt : datetime.datetime object, default now.
    roundTo : Closest number of seconds to round to, default 1 minute.
    """
    if dt is None:
        dt = datetime.datetime.now()
    seconds = (dt.replace(tzinfo=None) - dt.min).seconds
    rounding = (seconds + roundTo / 2) // roundTo * roundTo
    return dt + datetime.timedelta(0, rounding - seconds, -dt.microsecond)

You can then create an alarm per AZ. The following example is for use1-az2 and alarms when three consecutive one-minute data points have a maximum value equal to 1 (1 is the metric value published when the chi-squared test determines there is statistically significant skew in the error rate).

{
    "Type": "AWS::CloudWatch::Alarm",
    "Properties": {
        "AlarmName": "use1-az2-chi-squared",
        "ActionsEnabled": true,
        "OKActions": [],
        "AlarmActions": [],
        "InsufficientDataActions": [],
        "MetricName": "multi-az/chi-squared",
        "Namespace": "multi-az/frontend",
        "Statistic": "Maximum",
        "Dimensions": [
            { "Name": "AZ-ID", "Value": "use1-az2" },
            { "Name": "Action", "Value": "Index" },
            { "Name": "Region", "Value": "us-east-1" },
            { "Name": "Controller", "Value": "Home" }
        ],
        "Period": 60,
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "missing"
    }
}

You can also create an m-of-n alarm and combine these two alarms together with a composite alarm. You would also need to create the same alarms for each Controller/Action combination or microservice you have in each Availability Zone. Finally, you can add the chi-squared composite alarm to the Availability Zone-specific alarm for each Controller/Action combination as shown in Failure detection using outlier detection.
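As a sketch of that composite step, a composite alarm combining the chi-squared alarm above with a hypothetical per-AZ availability alarm might look like the following CloudFormation (the `use1-az2-availability` alarm name is a placeholder for an alarm you would define separately):

```json
{
    "Type": "AWS::CloudWatch::CompositeAlarm",
    "Properties": {
        "AlarmName": "use1-az2-isolated-impact",
        "ActionsEnabled": true,
        "AlarmRule": "ALARM(use1-az2-chi-squared) AND ALARM(use1-az2-availability)"
    }
}
```

The `AlarmRule` expression only moves the composite alarm into the ALARM state when both child alarms are in the ALARM state, which helps confirm that elevated errors and the statistical outlier signal point at the same Availability Zone.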
