Transactional data operations - AWS Lake Formation

Transactional data operations

Lake Formation ACID transactions provide snapshot isolation, which enables multiple jobs to concurrently and reliably update governed tables by adding and removing Amazon S3 objects, while maintaining read consistency for each query against the data lake.

An example of the read consistency that ACID transactions provide is when a user starts a query in Amazon Athena that scans all partitions of a table to aggregate sales figures. While the scan is proceeding and before it completes, an ETL job adds a partition to the table. If the user performed the query within a transaction, they will see query results that reflect the state of the table at the start of the query. The results won't include data from the new partition created by the job.

Lake Formation operations that take part in transactions access and modify governed tables in the AWS Glue Data Catalog. A transaction can involve multiple governed tables.

Transactions also provide ACID properties for changes involving the metadata that defines the Amazon S3 objects that make up a governed table (the manifest), as well as the table schema. For more information about governed tables, see Governed tables in Lake Formation.

Integrated AWS services such as Athena automatically perform queries within transactions when governed tables are involved. To use transactions in your AWS Glue ETL jobs, the job script begins a transaction before performing any reads or writes to the data lake, references the transaction ID for any operation within the transaction, and commits the transaction upon completion. If a read or write operation fails, the job script can retry the operation, or optionally cancel the transaction. You can also use transactions in your queries by specifying the past point in time to read from.

Lake Formation might automatically cancel a transaction if Lake Formation detects certain failures, such as conflicting writes. Canceling a transaction causes all operations to be rolled back, as described in Reading from and writing to the data lake within transactions.

To enable Lake Formation to distinguish long-running transactions (for example, Spark ETL jobs that run for many hours) from transactions that are abandoned due to crashes, your long-running write transactions should call the heartbeat API operation ExtendTransaction regularly. This isn't necessary for read-only transactions. Lake Formation automatically cancels transactions that are idle for too long.