We recommend that at regular intervals you update the statistical metadata that the query planner uses to build and choose optimal plans. To do so, you analyze your tables.
You can analyze a table explicitly by running the ANALYZE command.
When you load data with the COPY command, you can perform an analysis automatically by setting the STATUPDATE option to ON. By default, the COPY command performs an analysis after it loads data into an empty table. You can force an analysis regardless of whether a table is empty by setting STATUPDATE ON. If you specify STATUPDATE OFF, no analysis is performed.
Only the table owner or a superuser can run the ANALYZE command or run the COPY command with STATUPDATE set to ON.
If you run a query against a new table that was not analyzed after its data was initially loaded, a warning message is displayed; however, no warning occurs when you query a table after a subsequent update or load. The same behavior occurs when you run the EXPLAIN command on a query that contains tables that have not been analyzed.
Whenever adding data to a nonempty table significantly changes the size of the table, we recommend that you update statistics either by running an ANALYZE command or by using the STATUPDATE ON option with the COPY command. To view details about the number of rows that have been inserted or deleted since the last ANALYZE, query the PG_STATISTIC_INDICATOR system catalog table.
If performance degradation occurs that might be the result of inefficient data storage or a significant change in the statistical profile of the data, run the analysis to see if the updated statistics solve the problem.
To build or update statistics, run the ANALYZE command against one of the following:
The entire current database
A single table
One or more specific columns in a single table
The ANALYZE command obtains a sample of rows from the table, does some calculations, and saves resulting column statistics. By default, Amazon Redshift runs a sample pass for the DISTKEY column and another sample pass for all of the other columns in the table. If you want to generate statistics for a subset of columns, you can specify a comma-separated column list.
ANALYZE operations are resource intensive, so run them only on tables and columns that actually require statistics updates. You don't need to analyze all columns in all tables regularly or on the same schedule. If the data changes substantially, analyze the columns that are frequently used in the following:
Sorting and grouping operations
To reduce processing time and improve overall system performance, Amazon Redshift skips analyze for any table that has a low percentage of changed rows, as determined by the analyze_threshold_percent parameter. By default, the analyze threshold is set to 10 percent. You can change the analyze threshold for the current session by executing a SET command.
Columns that are less likely to require frequent analysis are those that represent facts and measures and any related attributes that are never actually queried, such as large VARCHAR columns. For example, consider the LISTING table in the TICKIT database:
select "column", type, encoding, distkey, sortkey from pg_table_def where tablename = 'listing'; column | type | encoding | distkey | sortkey ---------------+--------------------+----------+---------+--------- listid | integer | none | t | 1 sellerid | integer | none | f | 0 eventid | integer | mostly16 | f | 0 dateid | smallint | none | f | 0 numtickets | smallint | mostly8 | f | 0 priceperticket | numeric(8,2) | bytedict | f | 0 totalprice | numeric(8,2) | mostly32 | f | 0 listtime | timestamp with... | none | f | 0
If this table is loaded every day with a large number of new records, the LISTID column, which is frequently used in queries as a join key, would need to be analyzed regularly. If TOTALPRICE and LISTTIME are the frequently used constraints in queries, you could analyze those columns and the distribution key on every weekday:
analyze listing(listid, totalprice, listtime);
If the sellers and events in the application are much more static, and the date IDs refer to a fixed set of days covering only two or three years, the unique values for these columns will not change significantly, although the number of instances of each unique value will increase steadily. In addition, if the NUMTICKETS and PRICEPERTICKET measures are queried infrequently compared to the TOTALPRICE column, you could run the ANALYZE command on the whole table once every weekend to update statistics for the five columns that are not analyzed daily:
To maintain current statistics for tables, do the following:
Run the ANALYZE command before running queries.
Run the ANALYZE command against the database routinely at the end of every regular load or update cycle.
Run the ANALYZE command against any new tables that you create and any existing tables or columns that undergo significant change.
Consider running ANALYZE operations on different schedules for different types of tables and columns, depending on their use in queries and their propensity to change.