Reading from and writing to the data lake within transactions - AWS Lake Formation

AWS Lake Formation supports ACID (atomic, consistent, isolated, and durable) transactions when reading and writing governed tables composed of Amazon S3 objects, and when creating and updating table metadata in your Data Catalog. Transactions maintain the integrity of governed table manifests (transaction data operations) and of other table metadata such as schema (transactional metadata operations). The following are typical use cases for transactions against governed tables:

  • ETL into new tables – In this use case, you might have an AWS Glue extract, transform, and load (ETL) job that starts a transaction, reads from a data source, writes to a data sink that is a registered Amazon S3 location in the data lake, and creates a governed table in the Data Catalog for the data sink. If the ETL script detects a failure at some point, the script can cancel the transaction, and as a result, the following occurs:

    • The governed table is deleted from the catalog.

    • If the script invokes the Lake Formation DeleteObjectsOnCancel API operation before each new object is written to Amazon S3, then Lake Formation also deletes all the objects that were written to Amazon S3 within the transaction. For more information, see Rolling back Amazon S3 writes.

  • Table updates – Assume that for an existing governed table, your ETL job starts a transaction, writes new objects to Amazon S3 and updates the table manifest by using the UpdateTableObjects API operation. If the script detects a failure, it can cancel the transaction, and as a result, the following occurs:

    • The table manifest is restored to its state before the transaction started.

    • If the script invokes the Lake Formation DeleteObjectsOnCancel API operation before each new object is written to Amazon S3, Lake Formation also deletes all the objects that were written to Amazon S3 within the transaction.

  • Schema updates – For an existing governed table with a streaming data sink in Amazon S3, if the streaming ETL job determines that there are additional table columns in the data, it can update the table schema within a transaction. If a failure occurs, the job can cancel the transaction, in which case the table schema is restored to its state before the transaction started.

  • Time-travel queries – Lake Formation maintains multiple versions (snapshots) of table metadata as the data in the data lake changes. You can query the data as it existed at an earlier point in time, even if the schema has changed since then.

For more information about governed tables, see Governed tables in Lake Formation.
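
The table-update use case above can be sketched outside of AWS Glue with the Lake Formation API directly. The following is a minimal illustration, not a definitive implementation. It assumes that lf is a boto3 "lakeformation" client (or any object exposing the same methods) and that write_object is a hypothetical helper you supply, which writes one object to Amazon S3 and returns its ETag and size in bytes. The function name update_governed_table is also illustrative.

```python
def update_governed_table(lf, database, table, uris, write_object):
    """Add new S3 objects to a governed table inside one transaction.

    lf           -- assumed to be a boto3 "lakeformation" client
    write_object -- hypothetical callable: writes one object to Amazon S3
                    and returns (etag, size_in_bytes)
    """
    txid = lf.start_transaction(TransactionType="READ_AND_WRITE")["TransactionId"]
    try:
        for uri in uris:
            # Register the object before writing it, so that Lake Formation
            # is permitted to delete it if the transaction is canceled.
            lf.delete_objects_on_cancel(
                DatabaseName=database,
                TableName=table,
                TransactionId=txid,
                Objects=[{"Uri": uri}],
            )
            etag, size = write_object(uri)
            # Record the new object in the table manifest.
            lf.update_table_objects(
                DatabaseName=database,
                TableName=table,
                TransactionId=txid,
                WriteOperations=[
                    {"AddObject": {"Uri": uri, "ETag": etag, "Size": size}}
                ],
            )
        lf.commit_transaction(TransactionId=txid)
    except Exception:
        # Canceling restores the manifest to its state before the transaction
        # and lets Lake Formation delete the registered objects.
        lf.cancel_transaction(TransactionId=txid)
        raise
    return txid
```

Passing the client and the write helper as arguments keeps the sketch independent of credentials and makes the commit-or-cancel logic easy to exercise with a stub.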

Commit process in governed tables

Modifications to governed tables must be made within the context of a transaction. If your ETL job performs an operation on a governed table without explicitly providing a transaction ID, Lake Formation automatically starts a transaction and commits (or cancels) the transaction at the end of the operation. This is referred to as a single-statement transaction.

For a transaction that includes write operations, a call to CommitTransaction moves the transaction to the COMMIT_IN_PROGRESS state. An internal background process then applies the changes in the transaction to the governed table before moving the transaction to the COMMITTED state. As a result, a read operation made immediately after calling CommitTransaction might not reflect the result of the write operation. To read the result of the write operation deterministically, wait until the transaction status changes to COMMITTED. You can check the status by calling either the CommitTransaction or the DescribeTransaction API operation. Read operations that follow a single-statement transaction exhibit the same behavior.
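
One way to wait for the COMMITTED state is to poll DescribeTransaction. The following is a minimal sketch under stated assumptions: lf is a boto3 "lakeformation" client (or a compatible object), and the function name wait_until_committed, the timeout, and the polling interval are illustrative choices rather than values prescribed by Lake Formation.

```python
import time


def wait_until_committed(lf, transaction_id, timeout_s=60.0, poll_s=1.0,
                         sleep=time.sleep):
    """Block until the transaction reaches COMMITTED.

    lf is assumed to be a boto3 "lakeformation" client. Raises RuntimeError
    if the transaction was aborted and TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        response = lf.describe_transaction(TransactionId=transaction_id)
        status = response["TransactionDescription"]["TransactionStatus"]
        if status == "COMMITTED":
            return
        if status == "ABORTED":
            raise RuntimeError(f"transaction {transaction_id} was aborted")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"transaction {transaction_id} still {status} after {timeout_s}s"
            )
        # Status is still ACTIVE or COMMIT_IN_PROGRESS; check again shortly.
        sleep(poll_s)
```

A caller would typically invoke this right after CommitTransaction and only then issue reads that depend on the committed writes.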

Rolling back Amazon S3 writes

When a transaction is canceled, either automatically or by a call to CancelTransaction, Lake Formation never deletes data that was written to Amazon S3 without your permission. To grant permission to Lake Formation to roll back writes made during a transaction, your code must call the DeleteObjectsOnCancel API operation, which lists the Amazon S3 objects that can be deleted if the transaction is canceled. It's recommended that you call DeleteObjectsOnCancel before making the writes.

The AWS Glue ETL library function write_dynamic_frame.from_catalog() includes an option to automatically call DeleteObjectsOnCancel before writes. In the following example, the callDeleteObjectsOnCancel option is included in the additional_options argument. Because the value False is passed to the read_only argument of start_transaction, the transaction is not a read-only transaction.

transactionId = glueContext.start_transaction(False)
try:
    datasink0 = glueContext.write_dynamic_frame.from_catalog(
        frame = datasource0,
        database = "MyDatabase",
        table_name = "MyGovernedTable",
        additional_options = {
            "partitionKeys": ["key1", "key2"],
            "transactionId": transactionId,
            "callDeleteObjectsOnCancel": "true"
        }
    )
    glueContext.commit_transaction(transactionId)
except:
    glueContext.cancel_transaction(transactionId)