Best practices for designing queries - AWS Prescriptive Guidance

Best practices for designing queries

This section provides an overview of best practices for designing queries. We recommend that you follow the best practices in this section to achieve optimal query performance and efficiency.

Avoid using the SELECT * FROM statement

We recommend that you avoid using the SELECT * FROM statement. Instead, always list out columns for analysis. This reduces query execution time and scans costs for Amazon Redshift Spectrum queries.

Example of what to avoid

select * from sales;

Best practice example

select sales_date, sales_amt from sales;

Identify query issues

We recommend that you check the STL_ALERT_EVENT_LOG view to identify and correct possible issues with your query.

Get summary information on your query

We recommend that you use the SVL_QUERY_SUMMARY and SVL_QUERY_REPORT views to get summary information on your queries. You can use this information to optimize your queries.

Avoid cross-joins

We recommend that you avoid using cross-joins unless absolutely necessary. Without a join condition, cross-joins result in the Cartesian product of two tables. Cross-joins are typically run as nested-loop joins (the slowest of the possible join types).

Example of what to avoid

select c.c_name, n.n_name from tpch.customer c, tpch.nation n;

Best practice example

select c.c_name, n.n_name from tpch.customer c, join tpch.nation n on n.n_nationkey = c.c_nationkey;

Avoid functions in query predicates

We recommend that you avoid using functions in query predicates. Using functions in query predicates can negatively impact performance because functions typically add extra processing overhead to each row and slow down the overall execution of the query.

Example of what to avoid

select sum(o_totalprice) from tpch.orders where datepart(year, o_orderdate) = 1992;

Best practice example

select sum(o_totalprice) from tpch.orders where o_orderdate between '1992-01-01' and '1992-12-31';

Avoid unnecessary cast conversions

We recommend that you avoid using unnecessary cast conversion on the queries because casting data types takes time and resources and slows down query execution.

Example of what to avoid

select sum(o_totalprice) from tpch.orders where o_ordertime::date = '1992-01-01';

Best practice example

select sum(o_totalprice) from tpch.orders where o_ordertime between '1992-01-01 00:00:00' and '1992-12-31 23:59:59';

Use CASE expressions for complex aggregations

We recommend that you use a CASE expression to perform complex aggregations instead of selecting from the same table multiple times.

Example of what to avoid

select sum(sales_amt) as us_sales from sales where country = 'US'; select sum(sales_amt) as ca_sales from sales where country = 'CA';

Best practice example

select sum(case when country = 'US' then sales_amt end) as us_sales sum(case when country = 'CA' then sales_amt end) as ca_sales from sales;

Use subqueries

We recommend that you use subqueries in cases where one table in the query is used only for predicate conditions and the subquery returns a small number of rows (less than about 200).

Example of what to avoid

If a subquery returns less than 200 rows:

select sum(order_amt) as total_sales from sales where region_key IN (select region_key from regions where state = 'CA');

Best practice example

If a subquery returns greater than or equal to 200 rows:

select sum(o.order_amt) as total_sales from sales o join regions r on r.region_key = o.region_key and r.state = 'CA';

Use predicates

We recommend that you use predicates to restrict the dataset as much as possible. Predicates are used in SQL to filter and restrict the data that's returned in a query. By specifying conditions in a predicate, you can specify which rows must be included in the query results based on specified conditions. This allows you to retrieve only the data that you're interested in and improves the efficiency and accuracy of your queries. For more information, see Conditions in the Amazon Redshift documentation.

Add predicates to filter tables with joins

We recommend that you add predicates to filter tables that participate in joins, even if the predicates apply the same filters. Using predicates to filter tables with joins in SQL can improve query performance by reducing the amount of data that must be processed and reducing the size of the intermediate result set. By specifying the conditions for the join operation in the WHERE clause, the query execution engine can eliminate rows that don't match the conditions before they are joined. This results in a smaller result set and faster query execution.

Example of what to avoid

select p.product_name, sum(o.order_amt) from sales o join product p on r.product_key = o.product_key where o.order_date > '2022-01-01';

Best practice example

select p.product_name, sum(o.order_amt) from sales o join product p on p.product_key = o.product_key and p.added_date > '2022-01-01' where o.order_date > '2022-01-01';

Use the least expensive operators for predicates

In the predicate, use the least expensive operators that you can. Comparison condition operators are preferable to LIKE operators. LIKE operators are still preferable to SIMILAR TO or POSIX operators.

Use sort keys in GROUP BY clauses

Use sort keys in the GROUP BY clause so that the query planner can use more efficient aggregation. A query might qualify for one-phase aggregation when its GROUP BY list contains only sort key columns, one of which is also the distribution key. The sort key columns in the GROUP BY list must include the first sort key, followed by other sort keys that you want to use in sort key order.

Take advantage of materialized views

If possible, rewrite the query by replacing the complex code with a materialized view, which will significantly improve the performance of the query. For more information, see Creating materialized views in Amazon Redshift in the Amazon Redshift documentation.

Be careful with columns in GROUP BY and ORDER BY clauses

If you use both GROUP BY and ORDER BY clauses, make sure that you put the columns in the same order in both GROUP BY and ORDER BY clauses. GROUP BY implicitly requires the data to be sorted. If your ORDER BY clause is different, then the data must be sorted twice.

Example of what to avoid

select a, b, c, sum(d) from a_table group by b, c, a order by a, b, c

Best practice example

select a, b, c, sum(d) from a_table group by a, b, c order by a, b, c