Storage optimizations for governed tables - AWS Lake Formation

Storage optimizations for governed tables

Governed tables are created with all storage optimization features enabled by default. These include data compaction and garbage collection.

Data compaction

An important use case for governed tables is streaming data, or other applications in which small chunks of data arrive continuously in the Amazon S3 data lake. An individual table could grow to thousands of Amazon S3 objects. Each governed table maintains a manifest that identifies all Amazon S3 objects that the table comprises. This manifest is versioned and updated atomically to ensure that you always see a consistent view of the table.

Note

The data compaction optimizer continuously monitors your table partitions and starts a compaction run when a partition exceeds the thresholds for number of files and file sizes. Lake Formation performs compaction without interfering with concurrent queries. Compaction is currently supported only for partitioned tables in the Parquet format.
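The trigger described above can be pictured with a small sketch. This is not Lake Formation's implementation: the thresholds are not stated here, so `MAX_SMALL_FILES`, `SMALL_FILE_BYTES`, the `Partition` class, and `needs_compaction` are all hypothetical names and values invented for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds, invented for illustration only.
MAX_SMALL_FILES = 50                  # small-object count that would trigger a run
SMALL_FILE_BYTES = 64 * 1024 * 1024   # objects below this size count as "small"


@dataclass
class Partition:
    name: str
    file_sizes: List[int]  # size in bytes of each Amazon S3 object in the partition


def needs_compaction(partition: Partition) -> bool:
    """Return True when the partition holds more small objects than the threshold."""
    small = [size for size in partition.file_sizes if size < SMALL_FILE_BYTES]
    return len(small) > MAX_SMALL_FILES


# A streaming partition with many 1 MB objects, and a batch partition with a few large ones.
streaming = Partition("dt=2021-09-22", [1_000_000] * 200)
batch = Partition("dt=2021-09-21", [512 * 1024 * 1024] * 4)

candidates = [p.name for p in (streaming, batch) if needs_compaction(p)]
print(candidates)
```

Under these assumed thresholds, only the streaming partition is selected, which matches the small-chunk use case described earlier.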

Garbage collection

Another storage optimization feature of governed tables helps decrease storage costs by deleting Amazon S3 objects that are no longer part of the governed table. When a transaction is canceled while objects are being added to the manifest for a governed table, the objects are not automatically cleaned up. This allows transactions to be retried without regenerating the data. In some cases, it is desirable to remove objects from canceled transactions.

To use this feature, you must call DeleteObjectsOnCancel before calling Amazon S3 PutObject. This tells Lake Formation to asynchronously delete these files to help save costs. Calling DeleteObjectsOnCancel authorizes Lake Formation to remove the objects from Amazon S3 if the transaction aborts. This feature cannot be manually disabled. For more information about aborted transactions and removing unneeded objects, see Rolling back Amazon S3 writes.
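The call order described above can be sketched as follows. This is a minimal sketch, assuming the boto3 Lake Formation and Amazon S3 clients and an already started transaction. The helper name `put_with_cleanup` and its parameters are hypothetical. The clients are passed in as arguments so the function can be exercised without AWS credentials.

```python
def put_with_cleanup(lf, s3, transaction_id, database, table, bucket, key, body):
    """Register an object for cleanup on cancel, then write it to Amazon S3.

    lf and s3 are assumed to behave like boto3.client("lakeformation")
    and boto3.client("s3").
    """
    uri = f"s3://{bucket}/{key}"

    # First, authorize Lake Formation to delete the object if the
    # transaction is later canceled.
    lf.delete_objects_on_cancel(
        DatabaseName=database,
        TableName=table,
        TransactionId=transaction_id,
        Objects=[{"Uri": uri}],
    )

    # Only after registering the object, write the data.
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return uri


# Usage with real clients (assumes boto3 is installed and credentials are configured):
#   import boto3
#   lf = boto3.client("lakeformation")
#   s3 = boto3.client("s3")
#   put_with_cleanup(lf, s3, transaction_id, "database-name", "table-name",
#                    "bucket-name", "prefix/part-0000.parquet", data)
```

Adding the object to the table and committing the transaction are left out of this sketch.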

Note

Data compaction only works for Parquet partitioned tables.

Prerequisites for using Storage Optimization

Before you can use data compaction, you must complete the setup instructions described in Prepare for using automatic data compaction with governed tables.

Disabling and Re-enabling Data Compaction for Governed Tables

For better performance by ETL jobs and analytics services, Lake Formation automatically compacts small Amazon S3 objects of governed tables into larger objects. Data compaction is enabled for governed tables by default. You can disable compaction for individual governed tables, and re-enable compaction at a later time.

You can enable and disable data compaction by using either the Lake Formation console or the AWS CLI.

Console

To disable or re-enable data compaction for a governed table

  1. Open the Lake Formation console at https://console.aws.amazon.com/lakeformation/.

    Sign in as a data lake administrator, the table creator, or a user who has been granted the glue:UpdateTable permission and the Lake Formation ALTER permission on the table.

  2. In the navigation pane, choose Tables.

  3. Choose the option button next to a table name, and on the Actions menu, choose Edit.

    Note

    Ensure that you choose a governed table. Governed tables have Enabled in the Governance column.

  4. On the Edit table page, under Data management and security, select or clear the Automatic compaction option.

  5. Choose Save.

AWS CLI

For example, to disable compaction, you can use the following AWS CLI command.

aws lakeformation update-table-storage-optimizer --database-name database-name --table-name table-name --storage-optimizer-config '{"compaction": {"is_enabled": "false"}}'

To re-enable data compaction for a table, use similar code, but set the value of is_enabled to true.
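The same toggle can be scripted with an SDK. The following is a minimal sketch, assuming the boto3 Lake Formation client and its `update_table_storage_optimizer` method. The helper name `set_compaction` is hypothetical. The lowercase `compaction` key is copied from the CLI example above, and the API may expect a different casing, so check the API reference for your SDK version.

```python
def set_compaction(lf, database, table, enabled):
    """Enable or disable automatic compaction for one governed table.

    lf is assumed to behave like boto3.client("lakeformation").
    """
    # The service takes the flag as a string, as in the CLI example.
    config = {"compaction": {"is_enabled": "true" if enabled else "false"}}
    return lf.update_table_storage_optimizer(
        DatabaseName=database,
        TableName=table,
        StorageOptimizerConfig=config,
    )


# Usage with a real client (assumes boto3 is installed and credentials are configured):
#   import boto3
#   lf = boto3.client("lakeformation")
#   set_compaction(lf, "database-name", "table-name", enabled=False)  # disable
#   set_compaction(lf, "database-name", "table-name", enabled=True)   # re-enable
```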

Checking Governed Table Compaction Status

For a governed table, you can check the status of two optimizers, data compaction and garbage collection (the deletion of objects from canceled transactions), by viewing the table in the console or by running an AWS CLI command.

Console

To check the compaction status for a governed table

  1. Open the Lake Formation console at https://console.aws.amazon.com/lakeformation/.

    Sign in as a data lake administrator, the table creator, or a user who has been granted the glue:GetTable permission and any Lake Formation permission on the table.

  2. In the navigation pane, choose Tables.

  3. On the Tables page, choose the table name.

  4. Under Table details, scroll down to the Governance and acceleration details section.

    
    (Screenshot: The Governance and acceleration details section has two columns. The left column shows Compaction, status Success, a grayed-out Show warning message, and Results "Status=completed, RunTime=1005715, StartTime=2021-09-22T18:12:47.740Z, CompactedFiles=3982, CompactedBytes=1119504421". The right column shows Garbage collection, status Success, a grayed-out Show warning message, and Results "Status=completed, RunTime=20258, StartTime=2021-09-21T00:03:13.246Z, FilesReceived=100, FilesRemoved=100, FilesUnableToRemove=0".)

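The Results strings shown in the console are comma-separated key=value pairs, so they are easy to turn into structured data. The following is a small sketch that parses the compaction example from the console. The function name `parse_results` is hypothetical, and the format is inferred only from that one example.

```python
def parse_results(results: str) -> dict:
    """Split a 'Key=value, Key=value' Results string into a dictionary."""
    fields = {}
    for pair in results.split(","):
        key, _, value = pair.strip().partition("=")
        # Keep timestamps and statuses as text, and convert plain counters to int.
        fields[key] = int(value) if value.isdigit() else value
    return fields


compaction = parse_results(
    "Status=completed, RunTime=1005715, StartTime=2021-09-22T18:12:47.740Z, "
    "CompactedFiles=3982, CompactedBytes=1119504421"
)

# Average size in bytes of the objects that were compacted in this run.
avg_bytes = compaction["CompactedBytes"] / compaction["CompactedFiles"]
print(compaction["Status"], compaction["CompactedFiles"], round(avg_bytes))
```
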
AWS CLI

Use a command similar to the following to view the configuration and last run status of all the storage optimizers associated with a specific table.

aws lakeformation list-table-storage-optimizers --database-name database-name --table-name table-name

The following is an example of the response for this command.

[
  {
    "StorageOptimizerType": "compaction",
    "config": { "state": "enabled" },
    "errorMessage": "",
    "lastRunDetails": "lastRunTime: December 14, 2021. Compacted 1000 objects"
  },
  {
    "StorageOptimizerType": "garbage_collection",
    "config": { "state": "disabled" },
    "errorMessage": "IAM role is missing DeleteObject permissions",
    "lastRunDetails": "lastRunTime: December 14, 2021. Collected 1000 objects"
  }
]
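A script that monitors tables might want to flag optimizers that report an error. The following is a minimal sketch that assumes the response has exactly the shape of the example above. The field names are copied from that example and may differ in the actual API or SDK response. The helper name `failing_optimizers` is hypothetical.

```python
# Sample data copied from the example response above.
optimizers = [
    {
        "StorageOptimizerType": "compaction",
        "config": {"state": "enabled"},
        "errorMessage": "",
        "lastRunDetails": "lastRunTime: December 14, 2021. Compacted 1000 objects",
    },
    {
        "StorageOptimizerType": "garbage_collection",
        "config": {"state": "disabled"},
        "errorMessage": "IAM role is missing DeleteObject permissions",
        "lastRunDetails": "lastRunTime: December 14, 2021. Collected 1000 objects",
    },
]


def failing_optimizers(entries):
    """Map each optimizer type that reports an error to its error message."""
    return {
        entry["StorageOptimizerType"]: entry["errorMessage"]
        for entry in entries
        if entry.get("errorMessage")  # an empty string means no error
    }


for optimizer_type, message in failing_optimizers(optimizers).items():
    print(f"{optimizer_type}: {message}")
```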