Appendix B – Example chi-squared calculation - Advanced Multi-AZ Resilience Patterns

Appendix B – Example chi-squared calculation

The following is an example of collecting error metrics and performing a chi-squared test on the data. The code is not production ready and does not perform necessary error handling, but does provide a proof of concept on how the logic works. You should update this example to fit your needs.

First, a Lambda function is invoked each minute by an Amazon EventBridge scheduled event. The content of the event is configured with the following data:

{ "timestamp": "2023-03-15T15:26:37.527Z", "namespace": "multi-az/frontend", "metricName": "5xx", "dimensions": [ { "Name": "Region", "Value": "us-east-1" }, { "Name": "Controller", "Value": "Home" }, { "Name": "Action", "Value": "Index" } ], "period": 60, "stat": "Sum", "unit": "Count", "chiSquareMetricName": "multi-az/chi-squared", "azs": [ "use1-az2", "use1-az4", "use1-az6" ] }

The data is used to specify the common data needed to retrieve the appropriate CloudWatch metrics (like namespace, metric name, and dimensions) and then publish the chi-squared results for each Availability Zone. The code in the Lambda function looks like the following using Python 3.9. At a high level, it collects the specified CloudWatch metrics for the previous minute, runs the chi-squared test on that data, and then publishes CloudWatch metrics about the result of the test for each Availability Zone specified.

import os import boto3 import datetime import copy import json from datetime import timedelta from scipy.stats import chisquare from aws_embedded_metrics import metric_scope cw_client = boto3.client("cloudwatch", os.environ.get("AWS_REGION", "us-east-1")) @metric_scope def handler(event, context, metrics): metrics.set_property("Event", json.loads(json.dumps(event, default = str))) time = datetime.datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ") # Round down to the previous minute end: datetime = roundTime(time) # Subtract a minute for the start start: datetime = end - timedelta(minutes = 1) # Get all the metrics that match the query results = get_all_metrics(event, start, end, metrics) metrics.set_property("MetricCounts", results) # Calculate the chi squared result chi_sq_result = chisquare(list(results.values())) expected = sum(list(results.values())) / len(results.values()) metrics.set_property("ChiSquaredResult", chi_sq_result) # Put the chi square metrics into CloudWatch put_all_metrics(event, results, chi_sq_result[1], expected, start, metrics) def get_all_metrics(detail: dict, start: datetime, end: datetime, metrics): """ Gets all of the error metrics for each AZ specified """ metric_query = { "MetricDataQueries": [ ], "StartTime": start, "EndTime": end } for az in detail["azs"]: dim = copy.deepcopy(detail["dimensions"]) dim.append({"Name": "AZ-ID", "Value": az}) query = { "Id": az.replace("-", "_"), "MetricStat": { "Metric": { "Namespace": detail["namespace"], "MetricName": detail["metricName"], "Dimensions": dim }, "Period": int(detail["period"]), "Stat": detail["stat"], "Unit": detail["unit"] }, "Label": az, "ReturnData": True } metric_query["MetricDataQueries"].append(query) metrics.set_property("GetMetricRequest", json.loads(json.dumps(metric_query, default=str))) next_token: str = None results = {} while True: if next_token is not None: metric_query["NextToken"] = next_token data = cw_client.get_metric_data(**metric_query) if next_token is not None: metrics.set_property("GetMetricResult::" + next_token, json.loads(json.dumps(data, default = str))) else: metrics.set_property("GetMetricResult", json.loads(json.dumps(data, default = str))) for item in data["MetricDataResults"]: key = item["Id"].replace("_", "-") if key not in results: results[key] = 0 results[key] += sum(item["Values"]) if "NextToken" in data: next_token = data["NextToken"] if next_token is None: break return results def put_all_metrics(detail: dict, results: dict, chi_sq_value: float, expected: float, timestamp: datetime, metrics): """ Adds the chi squared metric for all AZs to CloudWatch """ farthest_from_expected = None if len(results) > 0: keys = list(results.keys()) farthest_from_expected = keys[0] for key in keys: if abs(results[key] - expected) > abs(results[farthest_from_expected] - expected): farthest_from_expected = key metric_query = { "Namespace": detail["namespace"], "MetricData": [] } for az in detail["azs"]: dim = copy.deepcopy(detail["dimensions"]) dim.append({"Name": "AZ-ID", "Value": az}) query = { "MetricName": detail["chiSquareMetricName"], "Dimensions": dim, "Timestamp": timestamp, } if chi_sq_value <= 0.05 and az == farthest_from_expected: query["Value"] = 1 else: query["Value"] = 0 metric_query["MetricData"].append(query) metrics.set_property("PutMetricRequest", json.loads(json.dumps(metric_query, default = str))) cw_client.put_metric_data(**metric_query) def roundTime(dt=None, roundTo=60): """Round a datetime object to any time lapse in seconds dt : datetime.datetime object, default now. roundTo : Closest number of seconds to round to, default 1 minute. """ if dt == None : dt = datetime.datetime.now() seconds = (dt.replace(tzinfo=None) - dt.min).seconds rounding = (seconds+roundTo/2) // roundTo * roundTo return dt + datetime.timedelta(0,rounding-seconds,-dt.microsecond)

You can then create an alarm per AZ. The following example is for use1-az2 and alarms for three, one-minute data points in a row that have a maximum value equal to 1 (1 is the metric being published when the chi-squared test determines statistically significant skew in the error rate).

{ "Type": "AWS::CloudWatch::Alarm", "Properties": { "AlarmName": "use1-az2-chi-squared", "ActionsEnabled": true, "OKActions": [], "AlarmActions": [], "InsufficientDataActions": [], "MetricName": "multi-az/chi-squared", "Namespace": "multi-az/frontend", "Statistic": "Maximum", "Dimensions": [ { "Name": "AZ-ID", "Value": "use1-az2" }, { "Name": "Action", "Value": "Index" }, { "Name": "Region", "Value": "us-east-1" }, { "Name": "Controller", "Value": "Home" } ], "Period": 60, "EvaluationPeriods": 3, "DatapointsToAlarm": 3, "Threshold": 1, "ComparisonOperator": "GreaterThanOrEqualToThreshold", "TreatMissingData": "missing" } }

You can also create an m-of-n alarm and combine these two alarms together with a composite alarm. You would also need to create the same alarms for each Controller/Action combination or microservice you have in each Availability Zone. Finally, you can add the chi-squared composite alarm to the Availability Zone-specific alarm for each Controller/Action combination as shown in Failure detection using outlier detection.