Best practices for designing Amazon Redshift queries
This section provides an overview of best practices for designing queries. We recommend that you follow the best practices in this section to achieve optimal query performance and efficiency.
Avoid using the SELECT * FROM statement
We recommend that you avoid using the SELECT * FROM
statement. Instead,
always list out columns for analysis. This reduces query execution time and scans costs for
Amazon Redshift Spectrum queries.
Example of what to avoid
select * from sales;
Best practice example
select sales_date, sales_amt from sales;
Identify query issues
We recommend that you check the STL_ALERT_EVENT_LOG view to identify and correct possible issues with your query.
Get summary information on your query
We recommend that you use the SVL_QUERY_SUMMARY and SVL_QUERY_REPORT views to get summary information on your queries. You can use this information to optimize your queries.
Avoid cross-joins
We recommend that you avoid using cross-joins unless absolutely necessary. Without a join condition, cross-joins result in the Cartesian product of two tables. Cross-joins are typically run as nested-loop joins (the slowest of the possible join types).
Example of what to avoid
select c.c_name, n.n_name from tpch.customer c, tpch.nation n;
Best practice example
select c.c_name, n.n_name from tpch.customer c, join tpch.nation n on n.n_nationkey = c.c_nationkey;
Avoid functions in query predicates
We recommend that you avoid using functions in query predicates. Using functions in query predicates can negatively impact performance because functions typically add extra processing overhead to each row and slow down the overall execution of the query.
Example of what to avoid
select sum(o_totalprice) from tpch.orders where datepart(year, o_orderdate) = 1992;
Best practice example
select sum(o_totalprice) from tpch.orders where o_orderdate between '1992-01-01' and '1992-12-31';
Avoid unnecessary cast conversions
We recommend that you avoid using unnecessary cast conversion on the queries because casting data types takes time and resources and slows down query execution.
Example of what to avoid
select sum(o_totalprice) from tpch.orders where o_ordertime::date = '1992-01-01';
Best practice example
select sum(o_totalprice) from tpch.orders where o_ordertime between '1992-01-01 00:00:00' and '1992-12-31 23:59:59';
Use CASE expressions for complex aggregations
We recommend that you use a CASE expression to perform complex aggregations instead of selecting from the same table multiple times.
Example of what to avoid
select sum(sales_amt) as us_sales from sales where country = 'US'; select sum(sales_amt) as ca_sales from sales where country = 'CA';
Best practice example
select sum(case when country = 'US' then sales_amt end) as us_sales, sum(case when country = 'CA' then sales_amt end) as ca_sales from sales;
Use subqueries
We recommend that you use subqueries in cases where one table in the query is used only for predicate conditions and the subquery returns a small number of rows (less than about 200).
Example of what to avoid
If a subquery returns less than 200 rows:
select sum(order_amt) as total_sales from sales where region_key IN (select region_key from regions where state = 'CA');
Best practice example
If a subquery returns greater than or equal to 200 rows:
select sum(o.order_amt) as total_sales from sales o join regions r on r.region_key = o.region_key and r.state = 'CA';
Use predicates
We recommend that you use predicates to restrict the dataset as much as possible. Predicates are used in SQL to filter and restrict the data that's returned in a query. By specifying conditions in a predicate, you can specify which rows must be included in the query results based on specified conditions. This allows you to retrieve only the data that you're interested in and improves the efficiency and accuracy of your queries. For more information, see Conditions in the Amazon Redshift documentation.
Add predicates to filter tables with joins
We recommend that you add predicates to filter tables that participate in joins, even if
the predicates apply the same filters. Using predicates to filter tables with joins in SQL can
improve query performance by reducing the amount of data that must be processed and reducing
the size of the intermediate result set. By specifying the conditions for the join operation
in the WHERE
clause, the query execution engine can eliminate rows that don't
match the conditions before they are joined. This results in a smaller result set and faster
query execution.
Example of what to avoid
select p.product_name, sum(o.order_amt) from sales o join product p on r.product_key = o.product_key where o.order_date > '2022-01-01';
Best practice example
select p.product_name, sum(o.order_amt) from sales o join product p on p.product_key = o.product_key and p.added_date > '2022-01-01' where o.order_date > '2022-01-01';
Use the least expensive operators for predicates
In the predicate, use the least expensive operators that you can. Comparison
condition operators are preferable to LIKE operators.
LIKE
operators are still preferable to SIMILAR TO or
POSIX operators.
Use sort keys in GROUP BY clauses
Use sort keys in the GROUP BY
clause so that the query planner can use more
efficient aggregation. A query might qualify for one-phase aggregation when its GROUP
BY
list contains only sort key columns, one of which is also the distribution key.
The sort key columns in the GROUP BY
list must include the first sort key,
followed by other sort keys that you want to use in sort key order.
Take advantage of materialized views
If possible, rewrite the query by replacing the complex code with a materialized view, which will significantly improve the performance of the query. For more information, see Creating materialized views in Amazon Redshift in the Amazon Redshift documentation.
Be careful with columns in GROUP BY and ORDER BY clauses
If you use both GROUP BY
and ORDER BY
clauses, make sure that
you put the columns in the same order in both GROUP BY
and ORDER BY
clauses. GROUP BY
implicitly requires the data to be sorted. If your ORDER
BY
clause is different, then the data must be sorted twice.
Example of what to avoid
select a, b, c, sum(d) from a_table group by b, c, a order by a, b, c
Best practice example
select a, b, c, sum(d) from a_table group by a, b, c order by a, b, c