To clean up tables after a bulk delete, a load, or a series of incremental updates, you need to run the VACUUM command, either against the entire database or against individual tables.
Only the table owner or a superuser can effectively vacuum a table. If you do not have owner or superuser privileges for a table, a VACUUM that specifies a single table will fail. If you run a VACUUM of the entire database, without specifying a table name, the operation completes successfully but has no effect on tables for which you do not have owner or superuser privileges. For this reason, and because vacuuming the entire database is potentially an expensive operation, we recommend vacuuming individual tables as needed.
Amazon Redshift does not automatically reclaim and reuse space that is freed when you delete rows and update rows. To perform an update, Amazon Redshift deletes the original row and appends the updated row, so every update is effectively a delete followed by an insert. When you perform a delete, the rows are marked for deletion, but not removed. The query processor needs to scan the deleted rows as well as undeleted rows, so too many deleted rows can cost unnecessary processing. You should vacuum following a significant number of deletes or updates to reclaim space and improve query performance.
For tables with a sort key, the VACUUM command ensures that new data in tables is fully sorted on disk. When data is initially loaded into a table that has a sort key, the data is sorted according to the SORTKEY specification in the CREATE TABLE statement. However, when you update the table, using COPY, INSERT, or UPDATE statements, new rows are stored in a separate unsorted region on disk, then sorted on demand for queries as required. If large numbers of rows remain unsorted on disk, query performance might be degraded for operations that rely on sorted data, such as range-restricted scans or merge joins. The VACUUM command merges new rows with existing sorted rows, so range-restricted scans are more efficient and the execution engine doesn't need to sort rows on demand during query execution.
When a table is sorted using an interleaved sort key, Amazon Redshift analyzes the distribution of values in the sort key columns to determine the optimal sort strategy. Over time, that distribution can change, or skew, which might degrade performance. Run a VACUUM REINDEX to re-analyze the sort key distribution and restore performance. For more information, see Interleaved Sort Key.
You should vacuum as often as you need to in order to maintain consistent query performance. Consider these factors when determining how often to run your VACUUM command.
Run VACUUM during time periods when you expect minimal activity on the cluster, such as evenings or during designated database administration windows.
A large unsorted region results in longer vacuum times. If you delay vacuuming, the vacuum will take longer because more data has to be reorganized.
VACUUM is an I/O intensive operation, so the longer it takes for your vacuum to complete, the more impact it will have on concurrent queries and other database operations running on your cluster.
VACUUM takes longer for tables that use interleaved sorting. To evaluate whether interleaved tables need to be resorted, query the SVV_INTERLEAVED_COLUMNS view.
Sort Stage and Merge Stage
Amazon Redshift performs a vacuum operation in two stages: first, it sorts the rows in the unsorted region, then, if necessary, it merges the newly sorted rows at the end of the table with the existing rows. When vacuuming a large table, the vacuum operation proceeds in a series of steps consisting of incremental sorts followed by merges. If the operation fails or if Amazon Redshift goes off line during the vacuum, the partially vacuumed table or database will be in a consistent state, but you will need to manually restart the vacuum operation. Incremental sorts are lost, but merged rows that were committed before the failure do not need to be vacuumed again. If the unsorted region is large, the lost time might be significant. For more information about the sort and merge stages, see Managing the Volume of Merged Rows.
Users can access tables while they are being vacuumed. You can perform queries and write operations while a table is being vacuumed, but when DML and a vacuum run concurrently, both might take longer. If you execute UPDATE and DELETE statements during a vacuum, system performance might be reduced. Incremental merges temporarily block concurrent UPDATE and DELETE operations, and UPDATE and DELETE operations in turn temporarily block incremental merge steps on the affected tables. DDL operations, such as ALTER TABLE, are blocked until the vacuum operation finishes with the table.
By default, VACUUM skips the sort phase for any table where more than 95 percent of the table's rows are already sorted. Skipping the sort phase can significantly improve VACUUM performance. In addition, in the delete phase VACUUM reclaims space such that at least 95 percent of the remaining rows are not marked for deletion. Because VACUUM can often skip rewriting many blocks that contain only a few rows marked for deletion, it usually needs much less time for the delete phase compared to reclaiming 100 percent of deleted rows. To change the default sort threshold for a single table, include the table name and the TO threshold PERCENT parameter when you run the VACUUM command.
You can run a full vacuum, a delete only vacuum, a sort only vacuum, or a reindex with full vacuum.
We recommend a full vacuum for most applications where reclaiming space and resorting rows are equally important. It's more efficient to run a full vacuum than to run back-to-back DELETE ONLY and SORT ONLY vacuum operations. VACUUM FULL is the same as VACUUM. Full vacuum is the default vacuum operation.
VACUUM DELETE ONLY
A DELETE ONLY vacuum is the same as a full vacuum except that it skips the sort. A DELETE ONLY vacuum saves time when reclaiming disk space is important but resorting new rows is not. For example, you might perform a DELETE ONLY vacuum operation if you don't need to resort rows to optimize query performance.
VACUUM SORT ONLY
A SORT ONLY vacuum saves some time by not reclaiming disk space, but in most cases there is little benefit compared to a full vacuum.
Use VACUUM REINDEX for tables that use interleaved sort keys. REINDEX reanalyzes the distribution of the values in the table's sort key columns, then performs a full VACUUM operation. VACUUM REINDEX takes significantly longer than VACUUM FULL because it needs to take an extra analysis pass over the data, and because merging in new interleaved data can involve touching all the data blocks.
If a VACUUM REINDEX operation terminates before it completes, the next VACUUM resumes the reindex operation before performing the vacuum.