Amazon Redshift
Database Developer Guide (API Version 2012-12-01)

Analyzing Tables

You should, at regular intervals, update the statistical metadata that the query planner uses to build and choose optimal plans. To do so, you analyze your tables.

You can analyze a table explicitly by running the ANALYZE command. When you load data with the COPY command, you can perform an analysis automatically by setting the STATUPDATE option to ON. By default, the COPY command performs an analysis after it loads data into an empty table. You can force an analysis regardless of whether a table is empty by setting STATUPDATE ON. If you specify STATUPDATE OFF, no analysis is performed.
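
For example, a COPY command that forces an analysis might look like the following sketch; the Amazon S3 path and IAM role are placeholder values for illustration only, not values from this guide:

copy listing
from 's3://mybucket/data/listing/'                        -- placeholder S3 path
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'  -- placeholder IAM role
statupdate on;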

Only the table owner or a superuser can run the ANALYZE command or run the COPY command with STATUPDATE set to ON.

If you run a query against a new table that was not analyzed after its data was initially loaded, a warning message is displayed; however, no warning occurs when you query a table after a subsequent update or load. The same behavior occurs when you run the EXPLAIN command on a query that contains tables that have not been analyzed.

Whenever adding data to a nonempty table significantly changes the size of the table, we recommend that you update statistics either by running an ANALYZE command or by using the STATUPDATE ON option with the COPY command.

If performance degradation occurs that might be the result of inefficient data storage or a significant change in the statistical profile of the data, run ANALYZE to see whether the updated statistics solve the problem.

To build or update statistics, run the ANALYZE command against:

  • The entire current database

  • A single table

  • One or more specific columns in a single table

The ANALYZE command obtains a sample of rows from the table, does some calculations, and saves resulting column statistics. By default, Amazon Redshift runs a sample pass for the DISTKEY column and another sample pass for all of the other columns in the table. If you want to generate statistics for a subset of columns, you can specify a comma-separated column list.
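
For example, the following commands cover the three scopes; the table and column names are taken from the TICKIT LISTING table used later in this section, and which columns you list is purely illustrative:

analyze;                             -- the entire current database
analyze listing;                     -- a single table
analyze listing(listid, sellerid);   -- specific columns in a single table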

ANALYZE operations are resource intensive, so run them only on tables and columns that actually require statistics updates. You do not need to analyze all columns in all tables regularly or on the same schedule. If the data changes substantially, analyze the columns that are frequently used in the following:

  • Sorting and grouping operations

  • Joins

  • Query predicates

Columns that are less likely to require frequent analysis are those that represent facts and measures and any related attributes that are never actually queried, such as large VARCHAR columns. For example, consider the LISTING table in the TICKIT database:

select "column", type, encoding, distkey, sortkey
from pg_table_def where tablename = 'listing';

column         |        type        | encoding | distkey | sortkey 
---------------+--------------------+----------+---------+---------
listid         | integer            | none     | t       | 1       
sellerid       | integer            | none     | f       | 0       
eventid        | integer            | mostly16 | f       | 0       
dateid         | smallint           | none     | f       | 0       
numtickets     | smallint           | mostly8  | f       | 0       
priceperticket | numeric(8,2)       | bytedict | f       | 0       
totalprice     | numeric(8,2)       | mostly32 | f       | 0       
listtime       | timestamp with...  | none     | f       | 0       

If this table is loaded every day with a large number of new records, the LISTID column, which is frequently used in queries as a join key, would need to be analyzed regularly. If TOTALPRICE and LISTTIME are frequently used as constraints in queries, you could analyze those columns and the distribution key every weekday:

analyze listing(listid, totalprice, listtime);

If the sellers and events in the application are much more static, and the date IDs refer to a fixed set of days covering only two or three years, the unique values for these columns will not change significantly, although the number of instances of each unique value will increase steadily. In addition, if the NUMTICKETS and PRICEPERTICKET measures are queried infrequently compared to the TOTALPRICE column, you could run the ANALYZE command on the whole table once every weekend to update statistics for the five columns that are not analyzed daily:

analyze listing;

To maintain current statistics for tables, do the following:

  • Run the ANALYZE command before running queries.

  • Run the ANALYZE command against the database routinely at the end of every regular load or update cycle.

  • Run the ANALYZE command against any new tables that you create and any existing tables or columns that undergo significant change.

  • Consider running ANALYZE operations on different schedules for different types of tables and columns, depending on their use in queries and their propensity to change.