Transactional Data Operations

ACID transactions enable multiple jobs to concurrently and reliably update governed tables by adding and removing Amazon S3 objects, while maintaining read consistency for each query against the data lake. This makes it easy to keep your data lakes continuously up to date while simultaneously running analytics queries and machine learning (ML) transforms that return consistent results. Each transaction gets its own consistent view of the data lake.

An example of the read consistency that ACID transactions provide is when a user starts a query in Amazon Athena that scans all partitions of a table to aggregate sales figures. While the scan is proceeding and before it completes, an ETL job adds a partition to the table. If the user performed the query within a transaction, they will see query results that reflect the state of the table at the start of the query. The results won't include data from the new partition created by the job.

Lake Formation operations that take part in transactions access and modify governed tables in the AWS Glue Data Catalog. A transaction can involve multiple governed tables.

Transactions also provide ACID properties for changes involving the metadata that defines the Amazon S3 objects that make up a governed table (the manifest), as well as the table schema. For more information about governed tables, see Governed Tables in Lake Formation.

Integrated AWS services such as Athena automatically perform queries within transactions when governed tables are involved. To use transactions in your AWS Glue ETL jobs, the job script begins a transaction before performing any reads or writes to the data lake, references the transaction ID for any operation within the transaction, and commits the transaction upon completion. If a read or write operation fails, the job script can retry the operation, or optionally cancel the transaction. You can also use transactions in your queries by specifying the past point in time to read from.
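The begin/use/commit flow described above can be sketched as a small helper. This is a sketch of the pattern, not AWS Glue library code: `glue_context` stands in for the GlueContext object available in a Glue job script, and the `run_in_transaction` helper and its `work` callback are hypothetical names introduced here for illustration.

```python
def run_in_transaction(glue_context, work):
    """Run `work(transaction_id)` inside a Lake Formation transaction:
    begin before any reads or writes, pass the transaction ID to every
    operation inside `work`, commit on success, and cancel on failure.
    """
    transaction_id = glue_context.start_transaction(False)  # not read-only
    try:
        result = work(transaction_id)
        glue_context.commit_transaction(transaction_id)
        return result
    except Exception:
        # The caller can also retry the failed operation instead of
        # canceling; canceling rolls back all operations in the transaction.
        glue_context.cancel_transaction(transaction_id)
        raise
```

A job script would pass a `work` function that performs its reads and writes with the supplied transaction ID, as in the write example at the end of this page.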

Lake Formation might automatically cancel a transaction if Lake Formation detects certain failures, such as conflicting writes. Canceling a transaction causes all operations to be rolled back, as described in Reading from and Writing to the Data Lake Within Transactions.

To enable Lake Formation to distinguish long-running transactions (for example, Spark ETL jobs that run for many hours) from transactions that are abandoned due to crashes, your long-running write transactions should call the heartbeat API operation ExtendTransaction regularly. This isn't necessary for read-only transactions. Lake Formation automatically cancels transactions that are idle for too long.
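A long-running write transaction can keep itself alive with a background heartbeat thread. The following is a minimal sketch, assuming `lf_client` is a boto3 `lakeformation` client (whose `extend_transaction` method invokes the ExtendTransaction operation); the 60-second default interval is an arbitrary choice for illustration, not a documented value.

```python
import threading

def start_heartbeat(lf_client, transaction_id, interval_seconds=60):
    """Periodically call ExtendTransaction so that Lake Formation does
    not treat a long-running write transaction as abandoned.
    Returns an Event; set it to stop the heartbeat."""
    stop = threading.Event()

    def beat():
        # wait() returns False on timeout, True once stop is set
        while not stop.wait(interval_seconds):
            lf_client.extend_transaction(TransactionId=transaction_id)

    threading.Thread(target=beat, daemon=True).start()
    return stop
```

The job would call `start_heartbeat` right after beginning the transaction and set the returned event after committing or canceling it.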

Note

Modifications to governed tables must be made within the context of a transaction. If your ETL job performs an operation on a governed table without explicitly providing a transaction ID, Lake Formation automatically starts a transaction and commits (or cancels) the transaction at the end of the operation. This is referred to as a single-statement transaction.

For a transaction that includes write operations, a call to CommitTransaction moves the transaction to the COMMIT_IN_PROGRESS state. A subsequent read operation might or might not reflect the result of the writes. To deterministically read the result of the writes, wait until the transaction status changes to COMMITTED. You can check the status by calling either the CommitTransaction or DescribeTransaction API operation. Read operations in a single-statement transaction exhibit the same behavior.
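A caller that needs to read its own writes can poll until the commit completes. The following is a sketch, assuming `lf_client` is a boto3 `lakeformation` client; the `wait_until_committed` name, poll interval, and timeout are choices made here for illustration.

```python
import time

def wait_until_committed(lf_client, transaction_id,
                         poll_seconds=1.0, timeout_seconds=300):
    """Poll DescribeTransaction until the transaction leaves the
    COMMIT_IN_PROGRESS state. Returns the final status string
    (for example, COMMITTED or ABORTED)."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        desc = lf_client.describe_transaction(TransactionId=transaction_id)
        status = desc["TransactionDescription"]["TransactionStatus"]
        if status != "COMMIT_IN_PROGRESS":
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Transaction {transaction_id} is still committing")
```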

Rolling Back Amazon S3 Writes

When a transaction is canceled, either automatically or by a call to CancelTransaction, Lake Formation never deletes data that was written to Amazon S3 without your permission. To grant permission to Lake Formation to roll back writes made during a transaction, your code must call the DeleteObjectsOnCancel API operation, which lists the Amazon S3 objects that can be deleted if the transaction is canceled. It's recommended that you call DeleteObjectsOnCancel before making the writes.
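Outside of the AWS Glue ETL library, the same call can be made directly before writing. The following is a sketch, assuming `lf_client` is a boto3 `lakeformation` client (whose `delete_objects_on_cancel` method invokes the DeleteObjectsOnCancel operation); the `register_writes_for_rollback` helper and the database, table, and object URI values are placeholders introduced here.

```python
def register_writes_for_rollback(lf_client, database, table,
                                 transaction_id, s3_uris):
    """Call DeleteObjectsOnCancel for the Amazon S3 objects we are
    about to write, granting Lake Formation permission to delete them
    if the transaction is canceled."""
    return lf_client.delete_objects_on_cancel(
        DatabaseName=database,
        TableName=table,
        TransactionId=transaction_id,
        Objects=[{"Uri": uri} for uri in s3_uris],
    )
```

The job would call this before performing the writes, listing every object it plans to create under the transaction.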

The AWS Glue ETL library function write_dynamic_frame.from_catalog() includes an option to automatically call DeleteObjectsOnCancel before writes. In the following example, the callDeleteObjectsOnCancel option is included in the additional_options argument. Because the value False is passed to the read_only argument of start_transaction, the transaction is not a read-only transaction.

transactionId = glueContext.start_transaction(False)
try:
    datasink0 = glueContext.write_dynamic_frame.from_catalog(
        frame=datasource0,
        database="MyDatabase",
        table_name="MyGovernedTable",
        additional_options={
            "partitionKeys": ["key1", "key2"],
            "transactionId": transactionId,
            "callDeleteObjectsOnCancel": "true",
        },
    )
    glueContext.commit_transaction(transactionId)
except Exception:
    glueContext.cancel_transaction(transactionId)
    raise