Converting an aggregate dashboard to a scheduled query
Assume you are computing fleet-wide statistics, such as host counts broken down by the five microservices and by the six regions where your service is deployed. From the snapshot below, you can see there are 500K servers emitting metrics, and some of the bigger regions (e.g., us-east-1) have >200K servers.
Computing these aggregates, which requires counting distinct instance names over hundreds of gigabytes of data, can result in query latencies of tens of seconds, in addition to the cost of scanning the data.
Original dashboard query
The aggregate shown in the dashboard panel is computed from the raw data using the query below. The query uses multiple SQL constructs, such as distinct counts and multiple aggregation functions.
SELECT CASE WHEN microservice_name = 'apollo' THEN num_instances ELSE NULL END AS apollo,
    CASE WHEN microservice_name = 'athena' THEN num_instances ELSE NULL END AS athena,
    CASE WHEN microservice_name = 'demeter' THEN num_instances ELSE NULL END AS demeter,
    CASE WHEN microservice_name = 'hercules' THEN num_instances ELSE NULL END AS hercules,
    CASE WHEN microservice_name = 'zeus' THEN num_instances ELSE NULL END AS zeus
FROM (
    SELECT microservice_name, SUM(num_instances) AS num_instances
    FROM (
        SELECT microservice_name, COUNT(DISTINCT instance_name) AS num_instances
        FROM "raw_data"."devops"
        WHERE time BETWEEN from_milliseconds(1636526171043) AND from_milliseconds(1636612571043)
            AND measure_name = 'metrics'
        GROUP BY region, cell, silo, availability_zone, microservice_name
    )
    GROUP BY microservice_name
)
Converting to a scheduled query
The previous query can be converted into a scheduled query as follows. You first compute the distinct host names within a given deployment in a region, cell, silo, availability zone, and microservice. You then add up these hosts to compute an hourly host count per microservice. By using the @scheduled_runtime parameter supported by scheduled queries, you can recompute the host counts for the past hour each time the query is invoked. The bin(@scheduled_runtime, 1h) in the WHERE clause of the inner query ensures that, even if the query is scheduled at a time in the middle of the hour, you still get the data for the full hour.
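To make that windowing behavior concrete, the following standalone query is a minimal sketch you could run directly; the timestamp literal is a hypothetical trigger time standing in for @scheduled_runtime and is not part of the scheduled query itself.

SELECT
    bin(TIMESTAMP '2021-12-01 21:30:00', 1h) - 1h AS window_start, -- 2021-12-01 20:00:00
    TIMESTAMP '2021-12-01 21:30:00' AS window_end                  -- the trigger time itself

Even though the trigger fires at 21:30, the window from 20:00 to 21:30 fully covers the previous hour, so the hourly aggregate for 20:00-21:00 is computed over complete data.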
Even though the query computes hourly aggregates, as you will see in the scheduled computation configuration below, it is set up to refresh every half hour so that you get updates in your derived table sooner. You can tune that refresh rate based on your freshness requirements, for example recomputing the aggregates every 15 minutes or only at the hour boundaries.
SELECT microservice_name, hour, SUM(num_instances) AS num_instances
FROM (
    SELECT microservice_name, bin(time, 1h) AS hour,
        COUNT(DISTINCT instance_name) AS num_instances
    FROM raw_data.devops
    WHERE time BETWEEN bin(@scheduled_runtime, 1h) - 1h AND @scheduled_runtime
        AND measure_name = 'metrics'
    GROUP BY region, cell, silo, availability_zone, microservice_name, bin(time, 1h)
)
GROUP BY microservice_name, hour
{ "Name": "MultiPT30mHostCountMicroservicePerHr", "QueryString": "SELECT microservice_name, hour, SUM(num_instances) AS num_instances FROM ( SELECT microservice_name, bin(time, 1h) AS hour, COUNT(DISTINCT instance_name) as num_instances FROM raw_data.devops WHERE time BETWEEN bin(@scheduled_runtime, 1h) - 1h AND @scheduled_runtime AND measure_name = 'metrics' GROUP BY region, cell, silo, availability_zone, microservice_name, bin(time, 1h) ) GROUP BY microservice_name, hour", "ScheduleConfiguration": { "ScheduleExpression": "cron(0/30 * * * ? *)" }, "NotificationConfiguration": { "SnsConfiguration": { "TopicArn": "******" } }, "TargetConfiguration": { "TimestreamConfiguration": { "DatabaseName": "derived", "TableName": "host_count_pt1h", "TimeColumn": "hour", "DimensionMappings": [ { "Name": "microservice_name", "DimensionValueType": "VARCHAR" } ], "MultiMeasureMappings": { "TargetMultiMeasureName": "num_instances", "MultiMeasureAttributeMappings": [ { "SourceColumn": "num_instances", "MeasureValueType": "BIGINT" } ] } } }, "ErrorReportConfiguration": { "S3Configuration" : { "BucketName" : "******", "ObjectKeyPrefix": "errors", "EncryptionOption": "SSE_S3" } }, "ScheduledQueryExecutionRoleArn": "******" }
Using the pre-computed results in a new dashboard
You will now see how to create your aggregate view dashboard using the derived table populated by the scheduled query you created. From the dashboard snapshot, you will also be able to validate that the aggregates computed from the derived table match those computed from the base table. Once you create the dashboards using the derived tables, you will notice significantly faster load times and lower costs compared to computing these aggregates from the raw data. Below is a snapshot of the dashboard using the pre-computed data, along with the query used to render this panel from the table "derived"."host_count_pt1h". Note that the structure of the query is very similar to the query used in the dashboard on raw data, except that it reads from the derived table, which already contains the distinct counts that this query aggregates.
SELECT CASE WHEN microservice_name = 'apollo' THEN num_instances ELSE NULL END AS apollo,
    CASE WHEN microservice_name = 'athena' THEN num_instances ELSE NULL END AS athena,
    CASE WHEN microservice_name = 'demeter' THEN num_instances ELSE NULL END AS demeter,
    CASE WHEN microservice_name = 'hercules' THEN num_instances ELSE NULL END AS hercules,
    CASE WHEN microservice_name = 'zeus' THEN num_instances ELSE NULL END AS zeus
FROM (
    SELECT microservice_name, AVG(num_instances) AS num_instances
    FROM (
        SELECT microservice_name, bin(time, 1h), SUM(num_instances) AS num_instances
        FROM "derived"."host_count_pt1h"
        WHERE time BETWEEN from_milliseconds(1636567785421) AND from_milliseconds(1636654185421)
            AND measure_name = 'num_instances'
        GROUP BY microservice_name, bin(time, 1h)
    )
    GROUP BY microservice_name
)
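If you want to spot-check that match beyond comparing the two dashboards visually, a query along the following lines can help. This is only a sketch: the table and column names come from the examples above, but the choice of time window (the hour ending two hours before now, so that the scheduled query has certainly materialized it) is an assumption you may need to adjust.

WITH raw_counts AS (
    -- Recompute the distinct-host count per microservice from the raw data for one hour
    SELECT microservice_name, SUM(num_instances) AS raw_instances
    FROM (
        SELECT microservice_name, COUNT(DISTINCT instance_name) AS num_instances
        FROM "raw_data"."devops"
        WHERE time >= bin(now(), 1h) - 2h AND time < bin(now(), 1h) - 1h
            AND measure_name = 'metrics'
        GROUP BY region, cell, silo, availability_zone, microservice_name
    )
    GROUP BY microservice_name
), derived_counts AS (
    -- The same hour as materialized into the derived table by the scheduled query
    SELECT microservice_name, SUM(num_instances) AS derived_instances
    FROM "derived"."host_count_pt1h"
    WHERE time = bin(now(), 1h) - 2h
        AND measure_name = 'num_instances'
    GROUP BY microservice_name
)
SELECT r.microservice_name, r.raw_instances, d.derived_instances
FROM raw_counts r JOIN derived_counts d ON r.microservice_name = d.microservice_name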