Factors affecting query performance - Amazon Redshift

Factors affecting query performance

A number of factors can affect query performance. The following aspects of your data, cluster, and database operations all play a part in how quickly your queries process.

  • Number of nodes, processors, or slices – A compute node is partitioned into slices. More nodes means more processors and more slices, which enables your queries to process faster by running portions of the query concurrently across the slices. However, more nodes also means greater expense, so you need to find the balance of cost and performance that is appropriate for your system. For more information on Amazon Redshift cluster architecture, see Data warehouse system architecture.

  • Node types – An Amazon Redshift cluster can use one of several node types. Each node type offers different sizes and limits to help you scale your cluster appropriately. The node size determines the storage capacity, memory, CPU, and price of each node in the cluster. For more information about node types, see Overview of Amazon Redshift clusters in the Amazon Redshift Management Guide.

  • Data distribution – Amazon Redshift stores table data on the compute nodes according to a table's distribution style. When you run a query, the query optimizer redistributes the data to the compute nodes as needed to perform any joins and aggregations. Choosing the right distribution style for a table helps minimize the impact of the redistribution step by locating the data where it needs to be before the joins are performed. For more information, see Working with data distribution styles.

  • Data sort order – Amazon Redshift stores table data on disk in sorted order according to a table’s sort keys. The query optimizer and the query processor use the information about where the data is located to reduce the number of blocks that need to be scanned and thereby improve query speed. For more information, see Working with sort keys.

  • Dataset size – A higher volume of data in the cluster can slow query performance for queries, because more rows need to be scanned and redistributed. You can mitigate this effect by regular vacuuming and archiving of data, and by using a predicate to restrict the query dataset.

  • Concurrent operations – Running multiple operations at once can affect query performance. Each operation takes one or more slots in an available query queue and uses the memory associated with those slots. If other operations are running, enough query queue slots might not be available. In this case, the query has to wait for slots to open before it can begin processing. For more information about creating and configuring query queues, see Implementing workload management.

  • Query structure – How your query is written affects its performance. As much as possible, write queries to process and return as little data as meets your needs. For more information, see Amazon Redshift best practices for designing queries.

  • Code compilation – Amazon Redshift generates and compiles code for each query execution plan.

    The compiled code runs faster because it removes the overhead of using an interpreter. You generally have some overhead cost the first time code is generated and compiled. As a result, the performance of a query the first time you run it can be misleading. The overhead cost might be especially noticeable when you run one-off queries. Run the query a second time to determine its typical performance. Amazon Redshift uses a serverless compilation service to scale query compilations beyond the compute resources of an Amazon Redshift cluster. The compiled code segments are cached locally on the cluster and in a virtually unlimited cache. This cache persists after cluster reboots. Subsequent executions of the same query run faster because they can skip the compilation phase.

    The cache is not compatible across Amazon Redshift versions, so the compilation cache is flushed and the code is recompiled when queries run after a version upgrade. If your queries have strict SLAs, we recommend you pre-run query segments that scan data from cluster tables. This lets Amazon Redshift cache the base table data, reducing the planning time for queries after a version upgrade. By using a scalable compilation service, Amazon Redshift can compile code in parallel to provide consistently fast performance. The magnitude of workload speed-up depends on the complexity and concurrency of queries.