Working with sort keys - Amazon Redshift

Working with sort keys

Note

We recommend that you create your tables with SORTKEY AUTO. If you do so, then Amazon Redshift uses automatic table optimization to choose the sort key. For more information, see Working with automatic table optimization. The rest of this section provides details about the sort order.

When you create a table, you can alternatively define one or more of its columns as sort keys. When data is initially loaded into the empty table, the rows are stored on disk in sorted order. Information about sort key columns is passed to the query planner, and the planner uses this information to construct plans that exploit the way that the data is sorted. For more information, see CREATE TABLE. For information on best practices when creating a sort key, see Choose the best sort key.

Sorting enables efficient handling of range-restricted predicates. Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If a query uses a range-restricted predicate, the query processor can use the min and max values to rapidly skip over large numbers of blocks during table scans. For example, suppose that a table stores five years of data sorted by date and a query specifies a date range of one month. In this case, you can remove up to 98 percent of the disk blocks from the scan. If the data is not sorted, more of the disk blocks (possibly all of them) have to be scanned.

You can specify either a compound or interleaved sort key. A compound sort key is more efficient when query predicates use a prefix, which is a subset of the sort key columns in order. An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.

To understand the impact of the chosen sort key on query performance, use the EXPLAIN command. For more information, see Query planning and execution workflow.

To define a sort type, use either the INTERLEAVED or COMPOUND keyword with your CREATE TABLE or CREATE TABLE AS statement. The default is COMPOUND. COMPOUND is recommended when you update your tables regularly with INSERT, UPDATE, or DELETE operations. An INTERLEAVED sort key can use a maximum of eight columns. Depending on your data and cluster size, VACUUM REINDEX takes significantly longer than VACUUM FULL because it makes an additional pass to analyze the interleaved sort keys. The sort and merge operation can take longer for interleaved tables because the interleaved sort might have to rearrange more rows than a compound sort.

To view the sort keys for a table, query the SVV_TABLE_INFO system view.

Multidimensional data layout sorting (preview)

The following is prerelease documentation for the multidimensional data layout sorting of tables, which is in preview release. The documentation and the feature are both subject to change. We recommend that you use this feature only with test clusters, and not in production environments. For preview terms and conditions, see Beta Service Participation in AWS Service Terms.
Note

This feature is only available using a preview cluster or preview workgroup. To create a preview cluster, see Creating a preview cluster in the Amazon Redshift Management Guide. To create a preview workgroup, see Creating a preview workgroup in the Amazon Redshift Management Guide.

A multidimensional data layout sort key is a type of AUTO sort key that is based on repetitive predicates found in a workload. If your workload has repetitive predicates, then Amazon Redshift can improve table scan performance by colocating data rows that satisfy the repetitive predicates. Instead of storing data of a table in strict column order, a multidimensional data layout sort key stores data by analyzing repetitive predicates that appear in a workload. More than one repetitive predicate can be found in a workload. Depending on your workload, this kind of sort key can improve performance of many predicates. Amazon Redshift automatically determines if this sort key method should be used for tables that are defined with an AUTO sort key.

For example, suppose you have a table that has data sorted in column order. Many data blocks might need to be examined to determine if they satisfy the predicates in your workload. But, if the data is stored on disk in a predicate order, then fewer blocks need to be scanned to satisfy the query. Using a multidimensional data layout sort key is beneficial in this case.

To view whether a query is using a multidimensional data layout key, see the step_attribute column of the SYS_QUERY_DETAIL view. When the value is multi-dimensional then multidimensional data layout was used for the query. To view whether a table defined with the AUTO sort key is using a multidimensional data layout, see the sortkey1 column of the SVV_TABLE_INFO view. When the value is padb_internal_mddl_key_col then multidimensional data layout was used for the table sort key.

To prevent Amazon Redshift from using a multidimensional data layout sort key, choose a different table sort key option other than SORTKEY AUTO. For more information on SORTKEY options, see CREATE TABLE.

Compound sort key

A compound key is made up of all of the columns listed in the sort key definition, in the order they are listed. A compound sort key is most useful when a query's filter applies conditions, such as filters and joins, that use a prefix of the sort keys. The performance benefits of compound sorting decrease when queries depend only on secondary sort columns, without referencing the primary columns. COMPOUND is the default sort type.

Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY and ORDER BY. For example, a merge join, which is often faster than a hash join, is feasible when the data is distributed and presorted on the joining columns. Compound sort keys also help improve compression.

As you add rows to a sorted table that already contains data, the unsorted region grows, which has a significant effect on performance. The effect is greater when the table uses interleaved sorting, especially when the sort columns include data that increases monotonically, such as date or timestamp columns. Run a VACUUM operation regularly, especially after large data loads, to re-sort and re-analyze the data. For more information, see Managing the size of the unsorted region. After vacuuming to resort the data, it's a good practice to run an ANALYZE command to update the statistical metadata for the query planner. For more information, see Analyzing tables.

Interleaved sort key

An interleaved sort gives equal weight to each column, or subset of columns, in the sort key. If multiple queries use different columns for filters, then you can often improve performance for those queries by using an interleaved sort style. When a query uses restrictive predicates on secondary sort columns, interleaved sorting significantly improves query performance as compared to compound sorting.

Important

Don't use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.

The performance improvements you gain by implementing an interleaved sort key should be weighed against increased load and vacuum times.

Interleaved sorts are most effective with highly selective queries that filter on one or more of the sort key columns in the WHERE clause, for example select c_name from customer where c_region = 'ASIA'. The benefits of interleaved sorting increase with the number of sorted columns that are restricted.

An interleaved sort is more effective with large tables. Sorting is applied on each slice. Thus, an interleaved sort is most effective when a table is large enough to require multiple 1 MB blocks per slice. Here, the query processor can skip a significant proportion of the blocks using restrictive predicates. To view the number of blocks a table uses, query the STV_BLOCKLIST system view.

When sorting on a single column, an interleaved sort might give better performance than a compound sort if the column values have a long common prefix. For example, URLs commonly begin with "http://www". Compound sort keys use a limited number of characters from the prefix, which results in a lot of duplication of keys. Interleaved sorts use an internal compression scheme for zone map values that enables them to better discriminate among column values that have a long common prefix.

When migrating Amazon Redshift provisioned clusters to Amazon Redshift Serverless, Redshift converts tables with interleaved sort keys and DISTSTYLE KEY to compound sort keys. The DISTSTYLE doesn't change. For more information on distribution styles, see Working with data distribution styles.

VACUUM REINDEX

As you add rows to a sorted table that already contains data, performance might deteriorate over time. This deterioration occurs for both compound and interleaved sorts, but it has a greater effect on interleaved tables. A VACUUM restores the sort order, but the operation can take longer for interleaved tables because merging new interleaved data might involve modifying every data block.

When tables are initially loaded, Amazon Redshift analyzes the distribution of the values in the sort key columns and uses that information for optimal interleaving of the sort key columns. As a table grows, the distribution of the values in the sort key columns can change, or skew, especially with date or timestamp columns. If the skew becomes too large, performance might be affected. To re-analyze the sort keys and restore performance, run the VACUUM command with the REINDEX key word. Because it must take an extra analysis pass over the data, VACUUM REINDEX can take longer than a standard VACUUM for interleaved tables. To view information about key distribution skew and last reindex time, query the SVV_INTERLEAVED_COLUMNS system view.

For more information about how to determine how often to run VACUUM and when to run a VACUUM REINDEX, see Deciding whether to reindex.