Reducing bloat in tables and indexes with the pg_repack extension - Amazon Relational Database Service

Reducing bloat in tables and indexes with the pg_repack extension

You can use the pg_repack extension to remove bloat from tables and indexes as an alternative to VACUUM FULL. This extension is supported on RDS for PostgreSQL versions 9.6.3 and higher. For more information on the pg_repack extension and the full table repack, see the GitHub project documentation.

Unlike VACUUM FULL, the pg_repack extension requires an exclusive lock (AccessExclusiveLock) only for a short period of time during the table rebuild operation in the following cases:

  • Initial creation of log table – A log table is created to record changes that occur during initial copy of the data, as shown in the following example:

    postgres=>\dt+ repack.log_* List of relations -[ RECORD 1 ]-+---------- Schema | repack Name | log_16490 Type | table Owner | postgres Persistence | permanent Access method | heap Size | 65 MB Description |
  • Final swap-and-drop phase.

For the rest of the rebuild operation, it only needs an ACCESS SHARE lock on the original table to copy rows from it to the new table. This helps the INSERT, UPDATE, and DELETE operations to proceed as usual.

Recommendations

The following recommendations apply when you remove bloat from the tables and indexes using the pg_repack extension:

  • Perform repack during non-business hours or over a maintenance window to minimize its impact on performance of other database activities.

  • Closely monitor blocking sessions during the rebuild activity and ensure that there is no activity on the original table that could potentially block pg_repack, specifically during the final swap-and-drop phase when it needs an exclusive lock on the original table. For more information, see Identifying what is blocking a query.

    When you see a blocking session, you can terminate it using the following command after careful consideration. This helps in the continuation of pg_repack to finish the rebuild:

    SELECT pg_terminate_backend(pid);
  • While applying the accrued changes from the pg_repack's log table on systems with a very high transaction rate, the apply process might not be able to keep up with the rate of changes. In such cases, pg_repack would not be able to complete the apply process. For more information, see Monitoring the new table during the repack. If indexes are severely bloated, an alternative solution is to perform an index only repack. This also helps VACUUM's index cleanup cycles to finish faster.

    You can skip the index cleanup phase using manual VACUUM from PostgreSQL version 12, and it is skipped automatically during emergency autovacuum from PostgreSQL version 14. This helps VACUUM complete faster without removing the index bloat and is only meant for emergency situations such as preventing wraparound VACUUM. For more information, see Avoiding bloat in indexes in the Amazon Aurora User Guide.

Pre-requisites

  • The table must have PRIMARY KEY or not-null UNIQUE constraint.

  • The extension version must be the same for both the client and the server.

  • Ensure that the RDS instance has more FreeStorageSpace than the total size of the table without the bloat. As an example, consider the total size of the table including TOAST and indexes as 2TB, and total bloat in the table as 1TB. The required FreeStorageSpace must be more than value returned by the following calculation:

    2TB (Table size) - 1TB (Table bloat) = 1TB

    You can use the following query to check the total size of the table and use pgstattuple to derive bloat. For more information, see Diagnosing table and index bloat in the Amazon Aurora User Guide

    SELECT pg_size_pretty(pg_total_relation_size('table_name')) AS total_table_size;

    This space is reclaimed after the completion of the activity.

  • Ensure that the RDS instance has enough compute and IO capacity to handle the repack operation. You might consider to scale up the instance class for optimal balance of performance.

To use the pg_repack extension
  1. Install the pg_repack extension on your RDS for PostgreSQL DB instance by running the following command.

    CREATE EXTENSION pg_repack;
  2. Run the following commands to grant write access to temporary log tables created by pg_repack.

    ALTER DEFAULT PRIVILEGES IN SCHEMA repack GRANT INSERT ON TABLES TO PUBLIC; ALTER DEFAULT PRIVILEGES IN SCHEMA repack GRANT USAGE, SELECT ON SEQUENCES TO PUBLIC;
  3. Connect to the database using the pg_repack client utility. Use an account that has rds_superuser privileges. As an example, assume that rds_test role has rds_superuser privileges. The following syntax performs pg_repack for full tables including all the table indexes in the postgres database.

    pg_repack -h db-instance-name.111122223333.aws-region.rds.amazonaws.com -U rds_test -k postgres
    Note

    You must connect using the -k option. The -a option is not supported.

    The response from the pg_repack client provides information on the tables on the DB instance that are repacked.

    INFO: repacking table "pgbench_tellers" INFO: repacking table "pgbench_accounts" INFO: repacking table "pgbench_branches"
  4. The following syntax repacks a single table orders including indexes in postgres database.

    pg_repack -h db-instance-name.111122223333.aws-region.rds.amazonaws.com -U rds_test --table orders -k postgres

    The following syntax repacks only indexes for orders table in postgres database.

    pg_repack -h db-instance-name.111122223333.aws-region.rds.amazonaws.com -U rds_test --table orders --only-indexes -k postgres

Monitoring the new table during the repack

  • The size of the database is increased by the total size of the table minus bloat, until swap-and-drop phase of repack. You can monitor the growth rate of the database size, calculate the speed of the repack, and roughly estimate the time it takes to complete initial data transfer.

    As an example, consider the total size of the table as 2TB, the size of the database as 4TB, and total bloat in the table as 1TB. The database total size value returned by the calculation at the end of the repack operation is the following:

    2TB (Table size) + 4 TB (Database size) - 1TB (Table bloat) = 5TB

    You can roughly estimate the speed of the repack operation by sampling the growth rate in bytes between two points in time. If the growth rate is 1GB per minute, it can take 1000 minutes or 16.6 hours approximately to complete the initial table build operation. In addition to the initial table build, pg_repack also needs to apply accrued changes. The time it takes depends on the rate of applying ongoing changes plus accrued changes.

    Note

    You can use pgstattuple extension to calculate the bloat in the table. For more information, see pgstattuple .

  • The number of rows in the pg_repack's log table, under the repack schema represents the volume of changes pending to be applied to the new table after the initial load.

    You can check the pg_repack's log table in pg_stat_all_tables to monitor the changes applied to the new table. pg_stat_all_tables.n_live_tup indicates the number of records that are pending to be applied to the new table. For more information, see pg_stat_all_tables.

    postgres=>SELECT relname,n_live_tup FROM pg_stat_all_tables WHERE schemaname = 'repack' AND relname ILIKE '%log%'; -[ RECORD 1 ]--------- relname | log_16490 n_live_tup | 2000000
  • You can use the pg_stat_statements extension to find out the time taken by each step in the repack operation. This is helpful in preparation for applying the same repack operation in a production environment. You may adjust the LIMIT clause for extending the output further.

    postgres=>SELECT SUBSTR(query, 1, 100) query, round((round(total_exec_time::numeric, 6) / 1000 / 60),4) total_exec_time_in_minutes FROM pg_stat_statements WHERE query ILIKE '%repack%' ORDER BY total_exec_time DESC LIMIT 5; query | total_exec_time_in_minutes -----------------------------------------------------------------------+---------------------------- CREATE UNIQUE INDEX index_16493 ON repack.table_16490 USING btree (a) | 6.8627 INSERT INTO repack.table_16490 SELECT a FROM ONLY public.t1 | 6.4150 SELECT repack.repack_apply($1, $2, $3, $4, $5, $6) | 0.5395 SELECT repack.repack_drop($1, $2) | 0.0004 SELECT repack.repack_swap($1) | 0.0004 (5 rows)

Repacking is completely an out-of-place operation so the original table is not impacted and we do not anticipate any unexpected challenges that require recovery of the original table. If repack fails unexpectedly, you must inspect the cause of the error and resolve it.

After the issue is resolved, drop and recreate the pg_repack extension in the database where the table exists, and retry the pg_repack step. In addition, the availability of compute resources and concurrent accessibility of the table plays a crucial role in the timely completion of the repack operation.