Managing statistics
This topic provides reference information about statistics and query optimization in SQL Server and PostgreSQL databases. You can understand how these database systems use statistics to improve query performance and how they differ in their approach to collecting and managing statistical data. The topic compares the methods for creating, viewing, and updating statistics in SQL Server with similar functionality in PostgreSQL.
Feature compatibility | AWS SCT / AWS DMS automation level | AWS SCT action code index | Key differences |
---|---|---|---|
|
N/A |
N/A |
Syntax and option differences, similar functionality. |
SQL Server Usage
Statistics objects in SQL Server are designed to support SQL Server cost-based query optimizer. It uses statistics to evaluate the various plan options and choose an optimal plan for optimal query performance.
Statistics are stored as BLOBs in system tables and contain histograms and other statistical information about the distribution of values in one or more columns. A histogram is created for the first column only and samples the occurrence frequency of distinct values. Statistics and histograms are collected by either scanning the entire table or by sampling only a percentage of the rows.
You can view Statistics manually using the DBCC SHOW_STATISTICS
statement or the more recent sys.dm_db_stats_properties
and sys.dm_db_stats_histogram
system views.
SQL Server provides the capability to create filtered statistics containing a WHERE predicate. Filtered statistics are useful for optimizing histogram granularity by eliminating rows whose values are of less interest, for example NULLs.
SQL Server can manage the collection and refresh of statistics automatically (the default). Use the AUTO_CREATE_STATISTICS
and AUTO_UPDATE_STATISTICS
database options to change the defaults.
When a query is submitted with AUTO_CREATE_STATISTICS
on and the query optimizer may benefit from a statistics that don’t yet exist, SQL Server creates the statistics automatically. You can use the AUTO_UPDATE_STATISTICS_ASYNC
database property to set new statistics creation to occur immediately (causing queries to wait) or to run asynchronously. When run asynchronously, the triggering run can’t benefit from optimizations the optimizer may derive from it.
After creation of a new statistics object, either automatically or explicitly using the CREATE STATISTICS
statement, the refresh of the statistics is controlled by the AUTO_UPDATE_STATISTICS
database option. When set to ON
, statistics are recalculated when they are stale, which happens when significant data modifications have occurred since the last refresh.
Syntax
CREATE STATISTICS <Statistics Name> ON <Table Name> (<Column> [,...]) [WHERE <Filter Predicate>] [WITH <Statistics Options>;
Examples
The following example creates new statistics on multiple columns. Set to use a full scan and to not refresh.
CREATE STATISTICS MyStatistics ON MyTable (Col1, Col2) WITH FULLSCAN, NORECOMPUTE;
The following example updates statistics with a 50% sampling rate.
UPDATE STATISTICS MyTable(MyStatistics) WITH SAMPLE 50 PERCENT;
View the statistics histogram and data.
DBCC SHOW_STATISTICS ('MyTable','MyStatistics');
Turn off automatic statistics creation for a database.
ALTER DATABASE MyDB SET AUTO_CREATE_STATS OFF;
For more information, see Statistics
PostgreSQL Usage
Use the ANALYZE
command to collect statistics about a database, a table, or a specific table column. The PostgreSQL ANALYZE
command collects table statistics that support the generation of efficient query run plans by the query planner.
-
Histograms —
ANALYZE
collects statistics on table column values and creates a histogram of the approximate data distribution in each column. -
Pages and Rows —
ANALYZE
collects statistics on the number of database pages and rows from which each table is comprised. -
Data Sampling — For large tables, the
ANALYZE
command takes random samples of values rather than examining each row. This allows theANALYZE
command to scan very large tables in a relatively small amount of time. -
Statistic Collection Granularity — Running the
ANALYZE
command without parameters instructs PostgreSQL to examine every table in the current schema. Supplying the table name or column name toANALYZE
instructs the database to examine a specific table or table column.
Automatic Statistics Collection
By default, PostgreSQL is configured with an AUTOVACUUM
daemon which automates the run of statistics collection by using the ANALYZE commands (in addition to automation of the VACUUM command). The AUTOVACUUM
daemon scans for tables that show signs of large modifications in data to collect the current statistics. AUTOVACUUM
is controlled by several parameters.
Individual tables have several storage parameters which can trigger AUTOVACUUM
process sooner or later. You can set or change such parameters as autovacuum_enabled
, autovacuum_vacuum_threshold
, and others, using CREATE TABLE
or ALTER TABLE
statements.
ALTER TABLE custom_autovaccum SET (autovacuum_enabled = true, autovacuum_vacuum_cost_delay = 10ms, autovacuum_vacuum_scale_factor = 0.01, autovacuum_analyze_scale_factor = 0.005);
The preceding command enables AUTOVACUUM
for the custom_autovaccum
table and specifies the AUTOVACUUM
process to sleep for 10 milliseconds each run.
It also specifies a 1% of the table size to be added to autovacuum_vacuum_threshold
and 0.5% of the table size to be added to autovacuum_analyze_threshold
when deciding whether to trigger a VACUUM
.
For more information, see Automatic Vacuuming
Manual Statistics Collection
In PostgreSQL, you can collect statistics on-demand using the ANALYZE
command at the database level, table level, or column level.
-
ANALYZE
on indexes isn’t currently supported. -
ANALYZE
requires only a read-lock on the target table. It can run in parallel with other activity on the table. -
For large tables,
ANALYZE
takes a random sample of the table contents. It is configured by the showdefault_statistics_target
parameter. The default value is 100 entries. Raising the limit might allow more accurate planner estimates to be made at the price of consuming more space in thepg_statistic
table.
Starting from PostgreSQL 10, there is a new command CREATE STATISTICS
, which creates a new extended statistics object tracking data about the specified table.
The STATISTICS
object tells the server to collect more detailed statistics.
Examples
The following example gathers statistics for the entire database.
ANALYZE;
The following example gathers statistics for a specific table. The VERBOSE
keyword displays progress.
ANALYZE VERBOSE EMPLOYEES;
The following example gathers statistics for a specific column.
ANALYZE EMPLOYEES (HIRE_DATE);
Specify the default_statistics_target parameter for an individual table column and reset it back to default.
ALTER TABLE EMPLOYEES ALTER COLUMN SALARY SET STATISTICS 150; ALTER TABLE EMPLOYEES ALTER COLUMN SALARY SET STATISTICS -1;
Larger values increase the time needed to complete an ANALYZE, but improve the quality of the collected planner’s statistics, which can potentially lead to better run plans.
View the current (session or global) default_statistics_target
, modify it to 150, and analyze the EMPLOYEES
table:
SHOW default_statistics_target ; SET default_statistics_target to 150; ANALYZE EMPLOYEES ;
View the last time statistics were collected for a table.
select relname, last_analyze from pg_stat_all_tables;
Summary
Feature | SQL Server | PostgreSQL |
---|---|---|
Analyze a specific database table |
CREATE STATISTICS MyStatistics ON MyTable (Col1, Col2) |
ANALYZE EMPLOYEES; |
Analyze a database table while only sampling certain rows |
UPDATE STATISTICS MyTable(MyStatistics) WITH SAMPLE 50 PERCENT; |
Configure the number of entries for the table: SET default_statistics_target to 150; ANALYZE EMPLOYEES ; |
View last time statistics were collected |
DBCC SHOW_STATISTICS ('MyTable','MyStatistics'); |
select relname, last |
For more information, see ANALYZE