Governed tables in Lake Formation

The metadata tables in the AWS Glue Data Catalog store information about data sources and targets—including schema information, partition information, data location, and more.

The Data Catalog supports two types of metadata tables: governed tables and non-governed tables. Governed tables are unique to AWS Lake Formation. When you create a table, you can specify whether the table is governed.

Governed tables offer the following advanced features:

ACID transactions

ACID (atomic, consistent, isolated, and durable) transactions protect the integrity of Data Catalog operations such as creating or updating a table. They also let multiple users concurrently and reliably add and delete objects in the Amazon S3 data lake while other users run analytical queries and machine learning (ML) models against the same datasets and still get consistent, up-to-date results. Any read from or write to the Amazon S3 data lake that involves a governed table occurs within a transaction.

Transactions protect the integrity of governed table metadata, including the manifest—the metadata that defines the Amazon S3 objects in the table's underlying data. Integrated AWS services such as Amazon Athena support governed tables to provide consistent reads in queries. To use transactions in your AWS Glue ETL jobs, you begin a transaction before you perform any reads from or writes to the data lake, and you commit the transaction upon completion.
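
The begin-then-commit pattern described above can be sketched in Python as a context manager. This is a minimal sketch, not the AWS Glue ETL API: it assumes a client object that exposes `start_transaction`, `commit_transaction`, and `cancel_transaction` in the shape of the boto3 Lake Formation client, and the helper name `lake_transaction` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def lake_transaction(client, read_only=False):
    """Begin a transaction, yield its ID, then commit on success or cancel on error.

    `client` is assumed to behave like boto3.client("lakeformation").
    """
    tx_type = "READ_ONLY" if read_only else "READ_AND_WRITE"
    tx_id = client.start_transaction(TransactionType=tx_type)["TransactionId"]
    try:
        # The caller performs its reads and writes here, passing tx_id along.
        yield tx_id
    except Exception:
        # Any failure inside the block cancels the transaction.
        client.cancel_transaction(TransactionId=tx_id)
        raise
    else:
        # No error: commit the transaction.
        client.commit_transaction(TransactionId=tx_id)
```

A caller might write `with lake_transaction(client) as tx_id:` and pass `tx_id` to each read or write inside the block, so that a failure part-way through cancels the transaction instead of committing partial work.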

For more information about transactions, see Reading from and writing to the data lake within transactions.

Automatic data compaction

For better performance by ETL jobs and analytics services such as Athena, Lake Formation automatically compacts the small Amazon S3 objects of governed tables into larger objects.

Compaction is enabled for governed tables by default. You can disable compaction for individual governed tables. For more information, see Storage optimizations for governed tables.

Time-travel queries

Each governed table maintains a versioned manifest of the Amazon S3 objects that it comprises. Previous versions of the manifest can be used for time-travel queries. Queries against governed tables in Athena and in AWS Glue ETL jobs can include a timestamp to retrieve the state of the data as of a particular date and time.
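
To make the idea of a versioned manifest concrete, the following toy model records a snapshot of object keys at each commit and answers "as of" lookups with a binary search. It is an illustration only and does not describe how Lake Formation stores manifests internally; the class name `VersionedManifest` and the S3 keys are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


class VersionedManifest:
    """Toy model: each commit records the full set of object keys at that time."""

    def __init__(self):
        self._times = []      # commit timestamps, kept in ascending order
        self._snapshots = []  # one frozenset of object keys per version

    def commit(self, when, object_keys):
        if self._times and when <= self._times[-1]:
            raise ValueError("commits must have strictly increasing timestamps")
        self._times.append(when)
        self._snapshots.append(frozenset(object_keys))

    def as_of(self, when):
        """Return the object set of the latest version committed at or before `when`."""
        idx = bisect_right(self._times, when)
        if idx == 0:
            raise LookupError("no manifest version exists at or before that time")
        return self._snapshots[idx - 1]


manifest = VersionedManifest()
manifest.commit(datetime(2021, 9, 29, 8, 0), {"s3://bucket/a.parquet"})
manifest.commit(
    datetime(2021, 9, 30, 12, 0),
    {"s3://bucket/a.parquet", "s3://bucket/b.parquet"},
)

# A lookup between the two commits sees only the first version.
snapshot = manifest.as_of(datetime(2021, 9, 30, 10, 0))
```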

To submit a time-travel query in Athena, use the syntax FOR SYSTEM_TIME AS OF timestamp or FOR SYSTEM_VERSION AS OF version.

SELECT * FROM cloudtraildb.cloudtraildata FOR SYSTEM_TIME AS OF TIMESTAMP '2021-09-30 10:00:00'
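
When the timestamp is computed in code rather than typed by hand, the query text can be assembled from a `datetime` value. The helper below is a small sketch; the function name `time_travel_query` is hypothetical, and it only builds the SQL string in the form shown above without submitting it to Athena.

```python
from datetime import datetime


def time_travel_query(database, table, as_of):
    """Build a SELECT that reads `database.table` as of the given datetime.

    Identifiers are interpolated without escaping, so pass only trusted names.
    """
    stamp = as_of.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"SELECT * FROM {database}.{table} "
        f"FOR SYSTEM_TIME AS OF TIMESTAMP '{stamp}'"
    )


query = time_travel_query(
    "cloudtraildb", "cloudtraildata", datetime(2021, 9, 30, 10, 0, 0)
)
```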

For more examples of Athena time-travel queries against governed tables, see Querying Governed Tables in the Amazon Athena User Guide.

In your ETL job script, to read data into a dynamic frame using time travel, include code similar to the following.

Python:
dynamic_frame = glueContext.create_dynamic_frame_from_catalog(database = 'cloudtraildb', table_name = 'cloudtraildata', additional_options = {"asOfTime": "2021-09-30 10:00:00"})
Scala:
val persons: DynamicFrame = glueContext.getCatalogSource(database = "cloudtraildb", tableName = "cloudtraildata", additionalOptions = JsonOptions("""{"asOfTime": "2021-09-30 10:00:00"}""")).getDynamicFrame()

Lake Formation permissions are not versioned. Time-travel queries always honor the current permissions, not the permissions that were in effect at the queried time. For example, if permissions at time T1 limited access to certain table columns and the current permissions (at time T2) grant access to all columns, a time-travel query against the data at time T1 returns all columns.
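
The T1 and T2 example can be expressed as a toy model in which data is versioned but the column filter is not. This is an illustration of the rule only, not of Lake Formation's implementation; the function name `time_travel_read`, the integer timestamps, and the sample rows are all hypothetical.

```python
def time_travel_read(snapshots, as_of, current_columns):
    """Read the latest snapshot at or before `as_of`, filtered by CURRENT permissions.

    `snapshots` maps a commit time to a list of row dicts. Permissions carry
    no history here, so only the current column set is ever consulted.
    """
    eligible = [t for t in snapshots if t <= as_of]
    if not eligible:
        raise LookupError("no data version exists at or before that time")
    rows = snapshots[max(eligible)]
    return [
        {col: value for col, value in row.items() if col in current_columns}
        for row in rows
    ]


T1, T2 = 1, 2
snapshots = {
    T1: [{"user": "alice", "ip": "10.0.0.1"}],
    T2: [
        {"user": "alice", "ip": "10.0.0.1"},
        {"user": "bob", "ip": "10.0.0.2"},
    ],
}

# Suppose permissions at T1 exposed only "user"; that history is not kept.
current_columns = {"user", "ip"}  # permissions at T2 grant all columns

result = time_travel_read(snapshots, T1, current_columns)
```

The query sees the data as it was at T1 (one row) but with every column, because the column filter comes from the permissions in force now.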