Aggregation analysis rule - AWS Clean Rooms

Aggregation analysis rule

Note

Applies to: AWS Clean Rooms SQL analytics engine

In AWS Clean Rooms, an aggregation analysis rule generates aggregate statistics using COUNT, SUM, and/or AVG functions along optional dimensions. When the aggregation analysis rule is added to a configured table, it enables the member who can query to run queries on the configured table.

The aggregation analysis rule supports uses cases such as campaign planning, media reach, frequency measurement, and attribution.

The supported query structure and syntax are defined in Aggregation query structure and syntax.

The parameters of the analysis rule, defined in Aggregation analysis rule - query controls, include query controls and query results controls. Its query controls include the ability to require that a configured table is joined to at least one configured table owned by the member who can query, either directly or transitively. This requirement allows you to ensure that the query is run on the intersection (INNER JOIN) of your table and theirs.

Aggregation query structure and syntax

Queries on tables that have an aggregation analysis rule must adhere to the following syntax.

--select_aggregate_function_expression SELECT aggregation_function(column_name) [[AS] column_alias ] [, ...] --select_grouping_column_expression [, {column_name|scalar_function(arguments)} [[AS] column_alias ]][, ...] --table_expression FROM table_name [[AS] table_alias ] [[INNER] JOIN table_name [[AS] table_alias] ON join_condition] [...] --where_expression [WHERE where_condition] --group_by_expression [GROUP BY {column_name|scalar_function(arguments)}, ...]] --having_expression [HAVING having_condition] --order_by_expression [ORDER BY {column_name|scalar_function(arguments)} [{ASC|DESC}]] [,...]]

The following table explains each expression listed in the preceding syntax.

Expression Definition Examples
select_aggregate_function_expression

A comma-separated list containing the following expressions:

  • select_aggregation_function_expression

  • select_aggregate_expression

Note

There must be at least one select_aggregation_function_expression in the select_aggregate_expression.

SELECT SUM(PRICE), user_segment

select_aggregation_function_expression

One or more supported aggregation functions applied to one or more columns. Only columns are allowed as arguments of aggregation functions.

Note

There must be at least one select_aggregation_function_expression in the select_aggregate_expression.

AVG(PRICE)

COUNT(DISTINCT user_id)

select_grouping_column_expression

An expression that can contain any expression using the following:

  • Table column names

  • Supported scalar functions

  • String literals

  • Numerical literals

Note

select_aggregate_expression can alias columns with or without the AS parameter. For more information, see the AWS Clean Rooms SQL Reference.

TRUNC(timestampColumn)

UPPER(campaignName)

table_expression

A table, or join of tables, connecting join conditional expressions with join_condition.

join_condition returns a Boolean.

The table_expression supports:

  • A specific JOIN type (INNER JOIN)

  • The equality comparison condition within a join_condition (=)

  • Logical operators (AND, OR).

FROM consumer_table INNER JOIN provider_table ON consumer_table.identifier1 = provider_table.identifier1 AND consumer_table.identifier2 = provider_table.identifier2
where_expression

A conditional expression that returns a Boolean. It may be comprised of the following:

  • Table column names

  • Supported scalar functions

  • Mathematical operators

  • String literals

  • Numerical literals

Supported comparison conditions are (=, >, <, <=, >=, <>, !=, NOT, IN, NOT IN, LIKE, IS NULL, IS NOT NULL).

Supported logical operators are (AND, OR).

The where_expression is optional.

WHERE where_condition

WHERE price > 100

WHERE TRUNC(timestampColumn) = '1/1/2022'

WHERE timestampColumn = timestampColumn2 - 14

group_by_expression

A comma-separated list of expressions that match the requirements for the select_grouping_column_expression.

GROUP BY TRUNC(timestampColumn), UPPER(campaignName), segment

having_expression

A conditional expression that returns an Boolean. They have a supported aggregation function applied to a single column (for example, SUM(price)) and are compared to a numerical literal.

Supported conditions are (=, >, <, <=, >=, <>, !=).

Supported logical operators are (AND, OR).

The having_expression is optional.

HAVING SUM(SALES) > 500

order_by_expression

A comma-separated list of expressions that is compatible with the same requirements defined in select_aggregate_expression defined earlier.

The order_by_expression is optional.

Note

order_by_expression permits ASC and DESC parameters. For more information, see ASC DESC parameters in the AWS Clean Rooms SQL Reference.

ORDER BY SUM(SALES), UPPER(campaignName)

For aggregation query structure and syntax, be aware of the following:

  • SQL commands other than SELECT are not supported.

  • Sub-queries and common table expressions (for example, WITH) are not supported.

  • Operators that combine multiple queries (for example, UNION) are not supported.

  • TOP, LIMIT, and OFFSET parameters are not supported.

Aggregation analysis rule - query controls

With aggregation query controls, you can control how the columns in your table are used to query the table. For example, you can control which column is used for joining, which column can be counted, or which column can be used in WHERE statements.

The following sections explain each control.

Aggregation controls

By using aggregation controls, you can define which aggregation functions to allow, and what columns they must to be applied to. Aggregation functions can be used in the SELECT, HAVING, and ORDER BY expressions.

Control Definition Usage
aggregateColumns Columns of configured table columns you allow for use within aggregation functions.

aggregateColumns can be used inside an aggregation function in the SELECT, HAVING, and ORDER BY expressions.

Some aggregateColumns can also be categorized as a joinColumn (defined later).

Given aggregateColumn can't also be categorized as a dimensionColumn (defined later).

function The COUNT, SUM, and AVG functions you allow for use on top of the aggregateColumns.

function can be applied to an aggregateColumns that is associated to it.

Join controls

A JOIN clause is used to combine rows from two or more tables, based on a related column between them.

You can use Join controls to control how your table can be joined to other tables in the table_expression. AWS Clean Rooms only supports INNER JOIN. INNER JOIN statements can only use columns that have been explicitly categorized as a joinColumn in your analysis rule, subject to the controls that you define.

The INNER JOIN must operate on a joinColumn from your configured table and a joinColumn from another configured table in the collaboration. You decide which columns from your table can be used as joinColumn.

Each match condition within the ON clause is required to use the equality comparison condition (=) between two columns.

Multiple match conditions within an ON clauses can be:

  • Combined using the AND logical operator

  • Separated using the OR logical operator

Note

All JOIN match conditions must match one row from each side of the JOIN. All conditionals connected by an OR or an AND logical operator must adhere to this requirement as well.

The following is an example of a query with an AND logical operator.

SELECT some_col, other_col FROM table1 JOIN table2 ON table1.id = table2.id AND table1.name = table2.name

The following is an example of a query with an OR logical operator.

SELECT some_col, other_col FROM table1 JOIN table2 ON table1.id = table2.id OR table1.name = table2.name
Control Definition Usage
joinColumns The columns (if any) that you want to allow the member who can query to use in the INNER JOIN statement.

A specific joinColumn can also be categorized as a aggregateColumn (see Aggregation controls).

The same column can't be used both as joinColumn and dimensionColumns (see later).

Unless it has also been categorized as an aggregateColumn, a joinColumn can't be used in any other parts of the query other than the INNER JOIN.

joinRequired Control whether you require an INNER JOIN with a configured table from the member who can query.

If you enable this parameter, an INNER JOIN is required. If you don't enable this parameter, an INNER JOIN is optional.

Assuming you enable this parameter, the member who can query is required to include a table they own in the INNER JOIN. They must JOIN your table with theirs, either directly or transitively (that is, join their table to another table, which itself is joined to your table).

Following is an example of transitivity.

ON my_table.identifer = third_party_table.identifier .... ON third_party_table.identifier = member_who_can_query_table.id
Note

The member who can query can also use the joinRequired parameter. In that case, the query must join their table with at least one other table.

Dimension controls

Dimension controls control the column along which the aggregation columns can be filtered, grouped, or aggregated.

Control Definition Usage
dimensionColumns

The columns (if any) that you allow the member who can query to use in SELECT, WHERE, GROUP BY, and ORDER BY.

A dimensionColumn can be used in SELECT (select_grouping_column_expression), WHERE, GROUP BY, and ORDER BY.

The same column can't be both a dimensionColumn, a joinColumn, and/or an aggregateColumn.

Scalar functions

Scalar functions control which scalar functions can be used on dimension columns.

Control Definition Usage
scalarFunctions

The scalar functions that can be used on dimensionColumns in the query.

Specifies the scalar functions (if any) that you allow (for example, CAST) to be applied on dimensionColumns.

Scalar functions can't be used on top of other functions or within other functions. Arguments of scalar functions can be columns, string literals, or numeric literals.

The following scalar functions are supported:

  • Math functions – ABS, CEILING, FLOOR, LOG, LN, ROUND, SQRT

  • Data type formatting functions – CAST, CONVERT, TO_CHAR, TO_DATE, TO_NUMBER, TO_TIMESTAMP

  • String functions – LOWER, UPPER, TRIM, RTRIM, SUBSTRING

    • For RTRIM, custom character sets to trim aren't allowed.

  • Conditional expressions – COALESCE

  • Date functions – EXTRACT, GETDATE, CURRENT_DATE, DATEADD

  • Other functions – TRUNC

For more details, see the AWS Clean Rooms SQL Reference.

Aggregation analysis rule - query results controls

With the aggregation query results controls, you can control which results are returned by specifying one or more conditions that each output row must meet for it to be returned. AWS Clean Rooms supports aggregation constraints in the form of COUNT (DISTINCT column) >= X. This form requires that each row aggregates at least X distinct values of a choice from your configured table (for example, a minimum number of distinct user_id values). This minimum threshold is automatically enforced, even if the submitted query itself does not use the specified column. They are enforced collectively across each configured table in the query from the configured tables from each member in the collaboration.

Each configured table must have at least one aggregation constraint in their analysis rule. Configured table owners can add multiple columnName and associated minimum and they are enforced collectively.

Aggregation constraints

Aggregation constraints control which rows in the query results are returned. To be returned, a row must meet the specified minimum number of distinct values in each column specified in the aggregation constraint. This requirement applies even if the column isn't explicitly mentioned in the query or in other parts of the analysis rule.

Control Definition Usage
columnName

The aggregateColumn that is used in the condition that each output row must meet.

Can be any column in the configured table.

minimum

The minimum number of distinct values for the associated aggregateColumn that the output row must have (for example, COUNT DISTINCT) for it to be returned in the query results.

The minimum must be at least value of 2.

Aggregation analysis rule structure

The following example shows a predefined structure for an aggregation analysis rule.

In the following example, MyTable refers to your data table. You can replace each user input placeholder with your own information.

{ "aggregateColumns": [ { "columnNames": [MyTable column names], "function": [Allowed Agg Functions] }, ], "joinRequired": ["QUERY_RUNNER"], "joinColumns": [MyTable column names], "dimensionColumns": [MyTable column names], "scalarFunctions": [Allowed Scalar functions], "outputConstraints": [ { "columnName": [MyTable column names], "minimum": [Numeric value] }, ] }

Aggregation analysis rule - example

The following example demonstrates how two companies can collaborate in AWS Clean Rooms using aggregation analysis.

Company A has customer and sales data. Company A is interested in understanding product return activity. Company B is one of Company A's retailers and has returns data. Company B also has segment attributes on customers that are useful to Company A (for example, purchased related products, uses customer service from the retailer). Company B doesn't want to provide row-level customer return data and attribute information. Company B only wants to enable a set of queries for Company A to obtain aggregate statistics on overlapping customers at a minimum aggregation threshold.

Company A and Company B decide to collaborate so that Company A can understand product return activity and deliver better products at Company B and other channels.

To create the collaboration and run an aggregation analysis, the companies do the following:

  1. Company A creates a collaboration and creates a membership. The collaboration has Company B as another member in the collaboration. Company A enables query logging in the collaboration, and it enables query logging in their account.

  2. Company B creates a membership in the collaboration. It enables query logging in its account.

  3. Company A creates a sales configured table.

  4. Company A adds the following aggregation analysis rule to the sales configured table.

    { "aggregateColumns": [ { "columnNames": [ "identifier" ], "function": "COUNT_DISTINCT" }, { "columnNames": [ "purchases" ], "function": "AVG" }, { "columnNames": [ "purchases" ], "function": "SUM" } ], "joinColumns": [ "hashedemail" ], "dimensionColumns": [ "demoseg", "purchasedate", "productline" ], "scalarFunctions": [ "CAST", "COALESCE", "TRUNC" ], "outputConstraints": [ { "columnName": "hashedemail", "minimum": 2, "type": "COUNT_DISTINCT" }, ] }

    aggregateColumns – Company A wants to count the number of unique customers in the overlap between sales data and returns data. Company A also wants to sum the number of purchases made to compare to number of returns.

    joinColumns – Company A wants to use identifier to match customers from sales data to customers from returns data. This will help Company A match returns to the right purchases. It also helps Company A segment overlapping customers.

    dimensionColumns – Company A uses dimensionColumns to filter by the specific product, compare purchases and returns over a certain period of time, ensure the return date is after the product date, and help segment overlapping customers.

    scalarFunctions – Company A selects CAST scalar function to help update data type formats if needed based on the configured table Company A associates to the collaboration. It also adds scalar functions to help formatting columns if needed.

    outputConstraints – Company A sets minimum output constraints. It doesn't need to constrain the results because the analyst is allowed to see row-level data from their sales table

    Note

    Company A doesn't include joinRequired in the analysis rule. It provides flexibility for their analyst to query the sales table alone.

  5. Company B creates a returns configured table.

  6. Company B adds the following aggregation analysis rule to the returns configured table.

    { "aggregateColumns": [ { "columnNames": [ "identifier" ], "function": "COUNT_DISTINCT" }, { "columnNames": [ "returns" ], "function": "AVG" }, { "columnNames": [ "returns" ], "function": "SUM" } ], "joinColumns": [ "hashedemail" ], "joinRequired": [ "QUERY_RUNNER" ], "dimensionColumns": [ "state", "popularpurchases", "customerserviceuser", "productline", "returndate" ], "scalarFunctions": [ "CAST", "LOWER", "UPPER", "TRUNC" ], "outputConstraints": [ { "columnName": "hashedemail", "minimum": 100, "type": "COUNT_DISTINCT" }, { "columnName": "producttype", "minimum": 2, "type": "COUNT_DISTINCT" } ] }

    aggregateColumns – Company B enables Company A to sum returns to compare to the number of purchases. They have at least one aggregate column because they are enabling an aggregate query.

    joinColumns – Company B enables Company A to join on identifier to match customers from return data to customers from sales data. identifier data is particularly sensitive and having it as a joinColumn ensures that the data will never be outputted in a query.

    joinRequired – Company B requires queries on the return data to be overlapped with the sales data. They don’t want to enable Company A to query all individuals in their dataset. They also agreed on that restriction in their collaboration agreement.

    dimensionColumns – Company B enables Company A to filter and group by state, popularpurchases, and customerserviceuser which are unique attributes that could help make the analysis for Company A. Company B enables Company A to use returndate to filter output on returndate that occurs after purchasedate. With this filtering, the output is more accurate for evaluating the impact of the product change.

    scalarFunctions – Company B enables the following:

    • TRUNC for dates

    • LOWER and UPPER in case the producttype is entered in a different format in their data

    • CAST if Company A needs to convert data types in sales to be the same as data types in returns

    Company A doesn't enable other scalar functions because they don’t believe they are required for queries.

    outputConstraints – Company B sets minimum output constraints on hashedemail to help reduce the ability to re-identify customers. It also adds minimum output constraint on producttype to reduce the ability to re-identify specific products that were returned. Certain product types could be more dominant based on dimensions of the output (for example, state). Their output constraints will always be enforced regardless of output constraints added by Company A to their data.

  7. Company A creates a sales table association to collaboration.

  8. Company B creates a returns table association to collaboration.

  9. Company A runs queries, such as the following example, to better understand the quantity of returns in Company B as compared to total purchases by location in 2022.

    SELECT companyB.state, SUM(companyB.returns), COUNT(DISTINCT companyA.hashedemail) FROM sales companyA INNER JOIN returns companyB ON companyA.identifier = companyB.identifier WHERE companyA.purchasedate BETWEEN '2022-01-01' AND '2022-12-31' AND TRUNC(companyB.returndate) > companyA.purchasedate GROUP BY companyB.state;
  10. Company A and Company B review query logs. Company B verifies that the query aligns with what was agreed upon in the collaboration agreement.

Troubleshooting aggregation analysis rule issues

Use the information here to help you diagnose and fix common issues when you work with aggregation analysis rules.

My query didn’t return any results

This can happen when there are no matching results or when the matching results don’t meet one or more minimum aggregation thresholds.

For more information about minimum aggregation thresholds, see Aggregation analysis rule - example.