Query performance factors for Amazon Redshift
A number of factors can affect query performance. The following aspects of your data, cluster, and database operations all play a part in how quickly your queries are processed:
- Table properties
  - Sort keys (Amazon Redshift Advisor)
  - Data compression (automated)
  - Data distribution (automated)
- Table maintenance (automated)
- Cluster configuration
- Workload management (automated)
- Short query acceleration (automated)
- SQL query
Table properties
Amazon Redshift tables are the fundamental units for storing data in Amazon Redshift, and each table has a set of properties that determine its behavior and accessibility. These properties include sorting, distribution style, compression encoding, and many others. Understanding these properties is crucial for optimizing the performance, security, and cost-effectiveness of Amazon Redshift tables.
Sort keys
Amazon Redshift stores data on disk in sorted order according to a table's sort keys. The
query optimizer and the query processor use the information about where the data is
located within a compute node to reduce the number of blocks that must be scanned.
This improves query speed significantly by reducing the amount of data to process.
We recommend that you use sort keys to facilitate filters in the WHERE
clause. For more information, see Working with sort keys in
the Amazon Redshift documentation.
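The following sketch shows how a sort key might be declared and used. The sales table, its columns, and the date filter are hypothetical examples chosen for illustration, not part of the Amazon Redshift documentation.
```sql
-- Hypothetical table sorted by the column that queries filter on most often.
CREATE TABLE sales (
    sale_id   BIGINT,
    sold_date DATE,
    store_id  INTEGER,
    amount    DECIMAL(12,2)
)
SORTKEY (sold_date);

-- A range filter on the sort key lets Amazon Redshift skip blocks whose
-- stored minimum and maximum values fall outside the predicate.
SELECT store_id, SUM(amount) AS total_amount
FROM sales
WHERE sold_date BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY store_id;
```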
Data compression
Data compression reduces storage requirements, which reduces disk I/O and improves
query performance. When you run a query, the compressed data is read into memory and
then uncompressed during query execution. By loading less data into memory, Amazon Redshift can
allocate more memory to analyzing the data. Because columnar storage stores similar
data sequentially, Amazon Redshift can apply adaptive compression encodings specifically tied
to columnar data types. The best way to enable data compression on table columns is
by using the AUTO
option in Amazon Redshift to apply optimal compression encodings
when you load the table with data. To learn more about using automatic data
compression, see Loading tables with
automatic compression in the Amazon Redshift documentation.
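As an illustration, the following sketch creates a hypothetical table with the ENCODE AUTO table property and then checks the encodings that were applied. The table and column names are assumptions made for the example.
```sql
-- Hypothetical table that lets Amazon Redshift choose and adjust compression
-- encodings automatically.
CREATE TABLE web_events (
    event_id   BIGINT,
    event_time TIMESTAMP,
    user_id    INTEGER,
    page_url   VARCHAR(2048)
)
ENCODE AUTO;

-- Review the encoding currently assigned to each column.
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'web_events';
```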
Data distribution
Amazon Redshift stores data on the compute nodes according to a table's distribution style. When you run a query, the query optimizer redistributes the data to the compute nodes as needed to perform any joins and aggregations. Choosing the right distribution style for a table helps minimize the impact of the redistribution step by locating the data where it needs to be before the joins are performed. We recommend that you use distribution keys to facilitate the most common joins. For more information, see Working with data distribution styles in the Amazon Redshift documentation.
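For example, the hypothetical tables below are distributed on a shared join key so that the join can be resolved without redistributing rows. The table definitions are illustrative only.
```sql
-- Hypothetical tables distributed on the column they are most often joined on.
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id INTEGER,
    order_date  DATE
)
DISTSTYLE KEY
DISTKEY (customer_id);

CREATE TABLE customers (
    customer_id INTEGER,
    region      VARCHAR(64)
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- Because both tables are distributed on customer_id, matching rows are already
-- co-located on the same slices when this join runs.
SELECT c.region, COUNT(*) AS order_count
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.region;
```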
Table maintenance
Although Amazon Redshift provides industry-leading performance out of the box for most workloads, keeping Amazon Redshift clusters running well requires maintenance. Updating and deleting data creates dead rows that must be vacuumed, and even append-only tables must be resorted if the append order isn't consistent with the sort key.
Vacuum
The vacuuming process in Amazon Redshift is essential for the health and maintenance of
your Amazon Redshift cluster. It also affects the performance of queries. Because deletes
and updates both flag the old data but don't actually remove it, you must use
vacuuming to reclaim the disk space that was occupied by table rows that were
marked for deletion by the previous UPDATE
and DELETE
operations. Amazon Redshift can automatically sort and perform a VACUUM DELETE
operation on tables in the background.
To clean up tables after a load or a series of incremental updates, you can
also run the VACUUM
command, either against the entire database or
against individual tables. If tables have sort keys and table loads aren't
optimized to sort as they insert, then you must use vacuums to resort the data
(which can be crucial for performance). For more information, see Vacuuming
tables in the Amazon Redshift documentation.
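The following commands are a minimal sketch of manual vacuum operations against a hypothetical sales table; automatic vacuum continues to run in the background regardless.
```sql
-- Reclaim space and re-sort rows in one pass.
VACUUM FULL sales;

-- Reclaim space from deleted rows without re-sorting.
VACUUM DELETE ONLY sales;

-- Re-sort rows only, for example after loads that don't follow the sort key order.
VACUUM SORT ONLY sales;
```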
Analyze
The ANALYZE
operation updates statistical metadata on the tables
in an Amazon Redshift database. Keeping statistics current improves query performance by
enabling the query planner to choose optimal plans. Amazon Redshift continuously monitors
your database and automatically performs analyze operations in the background.
To minimize impact to your system performance, the ANALYZE
operation automatically runs during periods when workloads are light. If you
choose to explicitly run ANALYZE, do the following:
- Run the ANALYZE command before running queries.
- Run the ANALYZE command on the database routinely at the end of every regular load or update cycle.
- Run the ANALYZE command on new tables that you create and on existing tables or columns that undergo significant change.
- Consider running ANALYZE operations on different schedules for different types of tables and columns, depending on their use in queries and their propensity to change.
- To save time and cluster resources, use the PREDICATE COLUMNS clause when you run the ANALYZE command, as shown in the example after this list.
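For example, the following statements run ANALYZE against a hypothetical sales table, first limited to likely predicate columns and then limited to specific columns. The table and column names are assumptions.
```sql
-- Update statistics only for columns that are likely to be used as predicates.
ANALYZE sales PREDICATE COLUMNS;

-- Update statistics for specific columns that have changed significantly.
ANALYZE sales (sold_date, store_id);
```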
Cluster configuration
A cluster is a collection of nodes that store and process your data. Setting up your Amazon Redshift cluster the right way is critical if you want to achieve the following:
- High scalability and concurrency
- Efficient use of Amazon Redshift
- Better performance
- Lower cost
Node type
An Amazon Redshift cluster can use one of several node types (RA3, DC2, and DS2). Each node type offers different sizes and limits to help you scale your cluster appropriately. The node size determines the storage capacity, memory, CPU, and price of each node in the cluster. Cost and performance optimization starts with choosing the right node type and size. For more information about node types, see Overview of Amazon Redshift clusters in the Amazon Redshift documentation.
Node size, number of nodes, and slices
A compute node is partitioned into slices. More nodes means more processors and slices, which enables your queries to process faster by running portions of the query concurrently across the slices. However, more nodes also means greater expense. This means that you must find the balance of cost and performance that is appropriate for your system. For more information on Amazon Redshift cluster architecture, see Data warehouse system architecture in the Amazon Redshift documentation.
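One way to see how much parallelism a cluster provides is to count its slices. The following sketch queries the STV_SLICES system view; the grouping and column aliases are illustrative.
```sql
-- Count the data slices on each compute node; more slices means more portions
-- of a query can run in parallel.
SELECT node, COUNT(slice) AS slices_per_node
FROM stv_slices
GROUP BY node
ORDER BY node;
```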
Workload management
Amazon Redshift workload management (WLM) enables users to flexibly manage workload queues with priorities so that short, fast-running queries won't get stuck in queues behind long-running queries. Automatic WLM uses machine learning (ML) algorithms to profile queries and place them in the appropriate queue with the appropriate resources, while managing query concurrency and memory allocation. For more information about WLM, see Implementing workload management in the Amazon Redshift documentation.
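To see how queues are currently configured, you can inspect the WLM service classes. The following sketch queries the STV_WLM_SERVICE_CLASS_CONFIG system view; the column selection is illustrative.
```sql
-- Show each WLM queue (service class) with its concurrency slots and
-- working memory per slot.
SELECT service_class, name, num_query_tasks, query_working_mem
FROM stv_wlm_service_class_config
ORDER BY service_class;
```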
Short query acceleration
Short query acceleration (SQA) prioritizes short-running queries ahead of long-running queries. SQA runs queries in a dedicated space so that SQA queries aren't forced to wait in queues behind longer queries. SQA only prioritizes queries that are short-running and are in a user-defined queue. If you use SQA, short-running queries begin running more quickly and you can see results sooner. If you enable SQA, you can reduce or eliminate WLM queues that are dedicated to short-running queries. In addition, long-running queries don't need to contend for slots in a WLM queue. This means that you can configure your WLM queues to use fewer query slots. If you use lower concurrency, query throughput is increased and overall system performance is improved for most workloads. For more information about SQA, see Working with short query acceleration in the Amazon Redshift documentation.
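To confirm that short queries are being picked up by SQA, you can look at recent activity in the SQA queue. In this sketch, service class 14 is assumed to be the SQA queue, which is the common assignment but worth verifying on your cluster.
```sql
-- Recent queries handled by the SQA queue (commonly service class 14),
-- with their queue and execution times in microseconds.
SELECT query, service_class, total_queue_time, total_exec_time
FROM stl_wlm_query
WHERE service_class = 14
ORDER BY exec_start_time DESC
LIMIT 20;
```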
SQL query
A database query is a request for data from a database. In Amazon Redshift, you express that request in SQL and send it to the cluster. Amazon Redshift supports SQL client tools that connect through Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC). You can use most SQL client tools that support JDBC or ODBC drivers.
Query structure
How your query is written greatly affects its performance. We recommend that you write queries to process and return as little data as necessary to meet your needs. For more information on how to structure your queries, see the Best practices for designing Amazon Redshift queries section of this guide.
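As a simple illustration of that guideline, the query below names only the columns it needs and filters on a selective predicate. The sales table and its assumed sort key on sold_date are hypothetical.
```sql
-- Avoid SELECT *; request only the columns and rows the analysis needs.
SELECT sold_date, store_id, amount
FROM sales
WHERE sold_date >= '2023-06-01'
  AND sold_date <  '2023-07-01'
  AND store_id = 42;
```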
Code compilation
Amazon Redshift generates and compiles code for each query execution plan. The compiled code runs faster because it removes the overhead of using an interpreter. You generally have some overhead cost the first time that code is generated and compiled. As a result, the performance of a query the first time you run it can be misleading. The overhead cost could be especially noticeable when you run one-off queries. We recommend that you run the query a second time to determine its typical performance.
Amazon Redshift uses a serverless compilation service to scale query compilations beyond the compute resources of an Amazon Redshift cluster. The compiled code segments are cached locally on the cluster and in a virtually unlimited cache. This cache persists after cluster reboots. Subsequent invocations of the same query run faster because they can skip the compilation phase. The cache is not compatible across Amazon Redshift versions, so the code is recompiled when queries run after a version upgrade. By using a scalable compilation service, Amazon Redshift can compile code in parallel to provide consistently fast performance. The magnitude of workload speed-up depends on the complexity and concurrency of queries.
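To see how much of a first run was spent compiling, you can check the SVL_COMPILE system view. The query ID in this sketch is a placeholder.
```sql
-- Per-segment compilation record for one query: compile = 1 means the segment
-- was compiled on this run rather than served from the cache.
SELECT query, segment, compile,
       DATEDIFF(millisecond, starttime, endtime) AS compile_ms
FROM svl_compile
WHERE query = 12345  -- placeholder: use a query ID from SVL_QLOG or STL_QUERY
ORDER BY segment;
```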