Amazon Redshift
Database Developer Guide (API Version 2012-12-01)

WLM Query Monitoring Rules

In Amazon Redshift workload management (WLM), query monitoring rules define metrics-based performance boundaries for WLM queues and specify what action to take when a query goes beyond those boundaries. For example, for a queue dedicated to short running queries, you might create a rule that aborts queries that run for more than 60 seconds. To track poorly designed queries, you might have another rule that logs queries that contain nested loops.

You define query monitoring rules as part of your workload management (WLM) configuration. You can define up to eight rules for each queue, with a limit of eight rules for all queues. Each rule includes up to three conditions, or predicates, and one action. A predicate consists of a metric, a comparison condition (=, <, or > ), and a value. If all of the predicates for any rule are met, that rule's action is triggered. Possible rule actions are log, hop, and abort, as discussed following.

The rules in a given queue apply only to queries running in that queue. A rule is independent of other rules.

WLM evaluates metrics every 10 seconds. If more than one rule is triggered during the same period, WLM initiates the most severe action—abort, then hop, then log. If the action is hop or abort, the action is logged and the query is evicted from the queue. If the action is log, the query continues to run in the queue. WLM initiates only one log action per query per rule. If the queue contains other rules, those rules remain in effect. If the action is hop and the query is routed to another queue, the rules for the new queue apply.

When all of a rule's predicates are met, WLM writes a row to the STL_WLM_RULE_ACTION system table. In addition, Amazon Redshift records query metrics for currently running queries to STV_QUERY_METRICS. Metrics for completed queries are stored in STL_QUERY_METRICS.

Defining a Query Monitoring Rule

You create query monitoring rules as part of your WLM configuration, which you define as part of your cluster's parameter group definition.

You can create rules using the AWS Management Console or programmatically using JSON.

Note

If you choose to create rules programmatically, we strongly recommend using the console to generate the JSON that you include in the parameter group definition. For more information, see Creating or Modifying a Query Monitoring Rule Using the Console and Configuring Parameter Values Using the AWS CLI in the Amazon Redshift Cluster Management Guide.

To define a query monitoring rule, you specify the following elements:

  • A rule name – Rule names must be unique within the WLM configuration. Rule names can be up to 32 alphanumeric characters or underscores, and can't contain spaces or quotation marks. You can have up to eight rules per queue, and the total limit for all queues is eight rules.

  • One or more predicates – You can have up to three predicates per rule. If all the predicates for any rule are met, the associated action is triggered. A predicate is defined by a metric name, an operator ( =, <, or > ), and a value. An example is query_cpu_time > 100000. For a list of metrics and examples of values for different metrics, see Query Monitoring Metrics following in this section.

  • An action – If more than one rule is triggered, WLM chooses the rule with the most severe action. Possible actions, in ascending order of severity, are:

    • Log – Record information about the query in the STL_WLM_RULE_ACTION system table. Use the Log action when you want to only write a log record. WLM creates at most one log per query, per rule. Following a log action, other rules remain in force and WLM continues to monitor the query.

    • Hop – Log the action, terminate the query, and restart it in the next matching queue. If there is not another matching queue, the query is canceled. For more information, see WLM Query Queue Hopping.

    • Abort – Log the action and terminate the query.

For steps to create or modify a query monitoring rule, see Creating or Modifying a Query Monitoring Rule Using the Console and Properties in the wlm_json_configuration Parameter in the Amazon Redshift Cluster Management Guide.
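As an illustrative sketch, a queue definition in the wlm_json_configuration parameter value might carry a rule like the following. The predicate reuses the query_cpu_time example from above; the rule name and the query_concurrency setting are assumptions added for context, not a complete configuration.

```json
[
  {
    "query_concurrency": 5,
    "rules": [
      {
        "rule_name": "high_cpu_log",
        "predicate": [
          { "metric_name": "query_cpu_time", "operator": ">", "value": 100000 }
        ],
        "action": "log"
      }
    ]
  }
]
```

As the note above recommends, generate this JSON from the console rather than writing it by hand, then include it in the parameter group definition.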

You can find more information about query monitoring rules in the following topics:

Query Monitoring Metrics

The following list describes the metrics used in query monitoring rules. (These metrics are distinct from the metrics stored in the STV_QUERY_METRICS and STL_QUERY_METRICS system tables.)

For a given metric, the performance threshold is tracked either at the query level or the segment level. For more information about segments and steps, see Query Planning and Execution Workflow.

Note

The WLM Timeout parameter is distinct from query monitoring rules.

  • Query CPU time (query_cpu_time) – CPU time used by the query, in seconds. CPU time is distinct from query execution time. Valid values are 0 to 10^6.

  • Blocks read (query_blocks_read) – Number of 1 MB data blocks read by the query. Valid values are 0 to 1024^2.

  • Scan row count (scan_row_count) – The number of rows in a scan step. Valid values are 0 to 1024^4.

  • Query execution time (query_execution_time) – Elapsed execution time for a query, in seconds. Execution time doesn't include time spent waiting in a queue. Valid values are 0 to 86399.

  • CPU usage (query_cpu_usage_percent) – Percent of CPU capacity used by the query. Valid values are 0 to 6399.

  • Memory to disk (query_temp_blocks_to_disk) – Temporary disk space used to write intermediate results, in 1 MB blocks. Valid values are 0 to 31981567.

  • CPU skew (cpu_skew) – The ratio of maximum CPU usage for any slice to average CPU usage for all slices. This metric is defined at the segment level. Valid values are 0 to 99.

  • I/O skew (io_skew) – The ratio of maximum blocks read (I/O) for any slice to average blocks read for all slices. This metric is defined at the segment level. Valid values are 0 to 99.

  • Rows joined (join_row_count) – The number of rows processed in a join step. Valid values are 0 to 10^15.

  • Nested loop join row count (nested_loop_join_row_count) – The number of rows in a nested loop join. Valid values are 0 to 10^15.

  • Return row count (return_row_count) – The number of rows returned by the query. Valid values are 0 to 10^15.

  • Segment execution time (segment_execution_time) – Elapsed execution time for a single segment, in seconds. Valid values are 0 to 86388.

  • Spectrum scan row count (spectrum_scan_row_count) – The number of rows of data in Amazon S3 scanned by an Amazon Redshift Spectrum query. Valid values are 0 to 10^15.

  • Spectrum scan size (spectrum_scan_size_mb) – The size of data in Amazon S3, in MB, scanned by an Amazon Redshift Spectrum query. Valid values are 0 to 10^15.
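To connect a metric to a rule, consider the 60-second example from the introduction: a queue for short running queries can abort anything that runs longer than a minute by using the query_execution_time metric. A sketch of that rule in JSON (the rule name is hypothetical):

```json
{
  "rule_name": "abort_over_60s",
  "predicate": [
    { "metric_name": "query_execution_time", "operator": ">", "value": 60 }
  ],
  "action": "abort"
}
```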

Query Monitoring Rules Templates

When you add a rule using the Amazon Redshift console, you can choose to create a rule from a predefined template. Amazon Redshift creates a new rule with a set of predicates and populates the predicates with default values. The default action is log. You can modify the predicates and action to meet your use case.

The following list describes the available templates and their default predicates.

  • Nested loop join – Predicate: nested_loop_join_row_count > 100. A nested loop join might indicate an incomplete join predicate, which often results in a very large return set (a Cartesian product). Use a low row count to find a potentially runaway query early.

  • Query returns a high number of rows – Predicate: return_row_count > 1000000. If you dedicate a queue to simple, short running queries, you might include a rule that finds queries returning a high row count. The template uses a default of 1 million rows. For some systems, you might consider one million rows to be high; for a larger system, a billion or more rows might be high.

  • Join with a high number of rows – Predicate: join_row_count > 1000000000. A join step that involves an unusually high number of rows might indicate a need for more restrictive filters. The template uses a default of 1 billion rows. For an ad hoc queue that's intended for quick, simple queries, you might use a lower number.

  • High disk usage when writing intermediate results – Predicate: query_temp_blocks_to_disk > 100000. When currently executing queries use more than the available system RAM, the query execution engine writes intermediate results to disk (spilled memory). Typically, this condition is the result of a rogue query, which usually is also the query that uses the most disk space. The acceptable threshold for disk usage varies based on the cluster node type and number of nodes. The template uses a default of 100,000 blocks, or 100 GB. For a small cluster, you might use a lower number.

  • Long running query with high I/O skew – Predicates: segment_execution_time > 120 and io_skew > 1.30. I/O skew occurs when one node slice has a much higher I/O rate than the other slices. As a starting point, a skew of 1.30 (1.3 times average) is considered high. High I/O skew is not always a problem, but when combined with a long running query time, it might indicate a problem with the distribution style or sort key.
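For example, the last template above combines two predicates. If you generated it from the console and kept the defaults, the resulting rule might look like this sketch (the rule name is assumed; the default action for a template-generated rule is log):

```json
{
  "rule_name": "long_running_io_skew",
  "predicate": [
    { "metric_name": "segment_execution_time", "operator": ">", "value": 120 },
    { "metric_name": "io_skew", "operator": ">", "value": 1.3 }
  ],
  "action": "log"
}
```

Because both predicates must be met before the action is triggered, a query with high I/O skew that finishes quickly does not match this rule.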

System Tables for Query Monitoring Rules

When all of a rule's predicates are met, WLM writes a row to the STL_WLM_RULE_ACTION system table with details for the query that triggered the rule and the resulting action.

In addition, Amazon Redshift records query metrics to two system tables: metrics for currently running queries are stored in STV_QUERY_METRICS, and metrics for completed queries are stored in STL_QUERY_METRICS.