Generating column statistics - AWS Glue

Generating column statistics

Follow these steps to manage statistics generation in the Data Catalog using the AWS Glue console or the AWS CLI.

Console
To generate column statistics using the console
  1. Sign in to the AWS Glue console at https://console.aws.amazon.com/glue/.

  2. Choose Data Catalog tables.

  3. Choose a table from the list.

  4. Choose Generate statistics from the Actions menu.

    You can also choose the Generate statistics button on the Column statistics tab in the lower section of the Tables page.

  5. On the Generate statistics page, specify the following options:

    • Table (all columns) – Choose this option to generate statistics for all columns in the table.

    • Selected columns – Choose this option to generate statistics for specific columns. You can select the columns from the drop-down list.

    • All rows – Choose all rows from the table to generate accurate statistics.

    • Sample rows – Choose this option to generate statistics from only a specified percentage of the rows in the table. The default is all rows. Use the up and down arrows to increase or decrease the percentage.

      Note

      We recommend including all rows in the table to compute accurate statistics. Use sample rows to generate column statistics only when approximate values are acceptable.

  6. (Optional) Choose a security configuration to enable at-rest encryption for logs.

  7. Choose Generate statistics to run the process.

AWS CLI

In the following example, replace the values for DatabaseName, TableName, and ColumnNameList with your actual database, table, and column names. Replace the account ID with a valid AWS account ID, and the role name with the name of the IAM role that you're using to generate statistics.

aws glue start-column-statistics-task-run --cli-input-json file://input.json

Contents of input.json:

{
    "DatabaseName": "<test-db>",
    "TableName": "<test-table>",
    "ColumnNameList": [
        "<column1>",
        "<column2>"
    ],
    "Role": "arn:aws:iam::<123456789012>:role/<Stats-Role>",
    "SampleSize": 10.0
}
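Hand-written JSON is easy to get subtly wrong, so one option is to generate the input file with a short script. The following is a minimal sketch, not an official tool: the helper name `build_task_run_input` and the sample values are hypothetical, and the keys are the ones used in the example above. It assumes that leaving out ColumnNameList corresponds to the Table (all columns) option, that leaving out SampleSize corresponds to All rows, and that SampleSize is a percentage between 0 and 100.

```python
import json


def build_task_run_input(database, table, role_arn, columns=None, sample_size=None):
    """Build the request body for start-column-statistics-task-run as a dict.

    columns=None      -> ColumnNameList is omitted (assumed: all columns)
    sample_size=None  -> SampleSize is omitted (assumed: all rows)
    """
    body = {
        "DatabaseName": database,
        "TableName": table,
        "Role": role_arn,
    }
    if columns:
        body["ColumnNameList"] = list(columns)
    if sample_size is not None:
        # Assumption: SampleSize is a percentage of rows, so keep it in 0..100.
        if not 0 <= sample_size <= 100:
            raise ValueError("sample_size must be a percentage between 0 and 100")
        body["SampleSize"] = float(sample_size)
    return body


if __name__ == "__main__":
    # Placeholder values; replace them with your own names and role ARN.
    body = build_task_run_input(
        database="test-db",
        table="test-table",
        role_arn="arn:aws:iam::123456789012:role/Stats-Role",
        columns=["column1", "column2"],
        sample_size=10.0,
    )
    # json.dump always emits valid JSON, so the CLI can read the file as-is.
    with open("input.json", "w") as f:
        json.dump(body, f, indent=4)
```

Running the script writes input.json to the current directory, which can then be passed to the CLI command with file://input.json.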

You can also generate column statistics by calling the StartColumnStatisticsTaskRun operation.
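As an illustration, the operation can be called from Python through the boto3 Glue client's `start_column_statistics_task_run` method. This is a sketch under stated assumptions rather than a reference implementation: the wrapper name `start_column_statistics` and its `client` parameter are hypothetical, credentials and Region are assumed to come from your environment, and the placeholder values must be replaced with your own.

```python
def start_column_statistics(database, table, role_arn,
                            columns=None, sample_size=None, client=None):
    """Start a column statistics task run and return its run ID.

    `client` is a hypothetical hook for passing in an existing (or fake)
    Glue client; if omitted, a boto3 Glue client is created.
    """
    if client is None:
        import boto3  # imported lazily so the function is testable without AWS
        client = boto3.client("glue")

    params = {
        "DatabaseName": database,
        "TableName": table,
        "Role": role_arn,
    }
    if columns:
        # Omitting ColumnNameList is assumed to mean "all columns".
        params["ColumnNameList"] = list(columns)
    if sample_size is not None:
        # Percentage of rows to sample; omitting it is assumed to mean "all rows".
        params["SampleSize"] = float(sample_size)

    response = client.start_column_statistics_task_run(**params)
    return response["ColumnStatisticsTaskRunId"]


if __name__ == "__main__":
    run_id = start_column_statistics(
        database="test-db",
        table="test-table",
        role_arn="arn:aws:iam::123456789012:role/Stats-Role",
        columns=["column1", "column2"],
        sample_size=10.0,
    )
    print(run_id)
```

The arguments mirror the fields in the CLI input file, so the same values can be reused for either approach.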