Managing statistics - SQL Server to Aurora PostgreSQL Migration Playbook

Managing statistics

This topic provides reference information about statistics and query optimization in SQL Server and PostgreSQL databases. You can understand how these database systems use statistics to improve query performance and how they differ in their approach to collecting and managing statistical data. The topic compares the methods for creating, viewing, and updating statistics in SQL Server with similar functionality in PostgreSQL.

Feature compatibility AWS SCT / AWS DMS automation level AWS SCT action code index Key differences

Three star feature compatibility

N/A

N/A

Syntax and option differences, similar functionality.

SQL Server Usage

Statistics objects in SQL Server are designed to support SQL Server cost-based query optimizer. It uses statistics to evaluate the various plan options and choose an optimal plan for optimal query performance.

Statistics are stored as BLOBs in system tables and contain histograms and other statistical information about the distribution of values in one or more columns. A histogram is created for the first column only and samples the occurrence frequency of distinct values. Statistics and histograms are collected by either scanning the entire table or by sampling only a percentage of the rows.

You can view Statistics manually using the DBCC SHOW_STATISTICS statement or the more recent sys.dm_db_stats_properties and sys.dm_db_stats_histogram system views.

SQL Server provides the capability to create filtered statistics containing a WHERE predicate. Filtered statistics are useful for optimizing histogram granularity by eliminating rows whose values are of less interest, for example NULLs.

SQL Server can manage the collection and refresh of statistics automatically (the default). Use the AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS database options to change the defaults.

When a query is submitted with AUTO_CREATE_STATISTICS on and the query optimizer may benefit from a statistics that don’t yet exist, SQL Server creates the statistics automatically. You can use the AUTO_UPDATE_STATISTICS_ASYNC database property to set new statistics creation to occur immediately (causing queries to wait) or to run asynchronously. When run asynchronously, the triggering run can’t benefit from optimizations the optimizer may derive from it.

After creation of a new statistics object, either automatically or explicitly using the CREATE STATISTICS statement, the refresh of the statistics is controlled by the AUTO_UPDATE_STATISTICS database option. When set to ON, statistics are recalculated when they are stale, which happens when significant data modifications have occurred since the last refresh.

Syntax

CREATE STATISTICS <Statistics Name>
ON <Table Name> (<Column> [,...])
[WHERE <Filter Predicate>]
[WITH <Statistics Options>;

Examples

The following example creates new statistics on multiple columns. Set to use a full scan and to not refresh.

CREATE STATISTICS MyStatistics
ON MyTable (Col1, Col2)
WITH FULLSCAN, NORECOMPUTE;

The following example updates statistics with a 50% sampling rate.

UPDATE STATISTICS MyTable(MyStatistics)
WITH SAMPLE 50 PERCENT;

View the statistics histogram and data.

DBCC SHOW_STATISTICS ('MyTable','MyStatistics');

Turn off automatic statistics creation for a database.

ALTER DATABASE MyDB SET AUTO_CREATE_STATS OFF;

For more information, see Statistics, CREATE STATISTICS (Transact-SQL), and DBCC SHOW_STATISTICS (Transact-SQL) in the SQL Server documentation.

PostgreSQL Usage

Use the ANALYZE command to collect statistics about a database, a table, or a specific table column. The PostgreSQL ANALYZE command collects table statistics that support the generation of efficient query run plans by the query planner.

  • HistogramsANALYZE collects statistics on table column values and creates a histogram of the approximate data distribution in each column.

  • Pages and RowsANALYZE collects statistics on the number of database pages and rows from which each table is comprised.

  • Data Sampling — For large tables, the ANALYZE command takes random samples of values rather than examining each row. This allows the ANALYZE command to scan very large tables in a relatively small amount of time.

  • Statistic Collection Granularity — Running the ANALYZE command without parameters instructs PostgreSQL to examine every table in the current schema. Supplying the table name or column name to ANALYZE instructs the database to examine a specific table or table column.

Automatic Statistics Collection

By default, PostgreSQL is configured with an AUTOVACUUM daemon which automates the run of statistics collection by using the ANALYZE commands (in addition to automation of the VACUUM command). The AUTOVACUUM daemon scans for tables that show signs of large modifications in data to collect the current statistics. AUTOVACUUM is controlled by several parameters.

Individual tables have several storage parameters which can trigger AUTOVACUUM process sooner or later. You can set or change such parameters as autovacuum_enabled, autovacuum_vacuum_threshold, and others, using CREATE TABLE or ALTER TABLE statements.

ALTER TABLE custom_autovaccum SET (autovacuum_enabled = true, autovacuum_vacuum_cost_delay = 10ms, autovacuum_vacuum_scale_factor = 0.01, autovacuum_analyze_scale_factor = 0.005);

The preceding command enables AUTOVACUUM for the custom_autovaccum table and specifies the AUTOVACUUM process to sleep for 10 milliseconds each run.

It also specifies a 1% of the table size to be added to autovacuum_vacuum_threshold and 0.5% of the table size to be added to autovacuum_analyze_threshold when deciding whether to trigger a VACUUM.

For more information, see Automatic Vacuuming in the PostgreSQL documentation.

Manual Statistics Collection

In PostgreSQL, you can collect statistics on-demand using the ANALYZE command at the database level, table level, or column level.

  • ANALYZE on indexes isn’t currently supported.

  • ANALYZE requires only a read-lock on the target table. It can run in parallel with other activity on the table.

  • For large tables, ANALYZE takes a random sample of the table contents. It is configured by the show default_statistics_target parameter. The default value is 100 entries. Raising the limit might allow more accurate planner estimates to be made at the price of consuming more space in the pg_statistic table.

Starting from PostgreSQL 10, there is a new command CREATE STATISTICS, which creates a new extended statistics object tracking data about the specified table.

The STATISTICS object tells the server to collect more detailed statistics.

Examples

The following example gathers statistics for the entire database.

ANALYZE;

The following example gathers statistics for a specific table. The VERBOSE keyword displays progress.

ANALYZE VERBOSE EMPLOYEES;

The following example gathers statistics for a specific column.

ANALYZE EMPLOYEES (HIRE_DATE);

Specify the default_statistics_target parameter for an individual table column and reset it back to default.

ALTER TABLE EMPLOYEES ALTER COLUMN SALARY SET STATISTICS 150;

ALTER TABLE EMPLOYEES ALTER COLUMN SALARY SET STATISTICS -1;

Larger values increase the time needed to complete an ANALYZE, but improve the quality of the collected planner’s statistics, which can potentially lead to better run plans.

View the current (session or global) default_statistics_target, modify it to 150, and analyze the EMPLOYEES table:

SHOW default_statistics_target ;
SET default_statistics_target to 150;
ANALYZE EMPLOYEES ;

View the last time statistics were collected for a table.

select relname, last_analyze from pg_stat_all_tables;

Summary

Feature SQL Server PostgreSQL

Analyze a specific database table

CREATE STATISTICS MyStatistics
ON MyTable (Col1, Col2)
ANALYZE EMPLOYEES;

Analyze a database table while only sampling certain rows

UPDATE STATISTICS MyTable(MyStatistics)
WITH SAMPLE 50 PERCENT;

Configure the number of entries for the table:

SET default_statistics_target to 150;
ANALYZE EMPLOYEES ;

View last time statistics were collected

DBCC SHOW_STATISTICS ('MyTable','MyStatistics');
select relname, last

For more information, see ANALYZE and The Autovacuum Daemon in the PostgreSQL documentation.