Using the Delta Lake framework in AWS Glue

AWS Glue 3.0 and later supports the Linux Foundation Delta Lake framework. Delta Lake is an open-source data lake storage framework that helps you perform ACID transactions, scale metadata handling, and unify streaming and batch data processing. This topic covers available features for using your data in AWS Glue when you transport or store your data in a Delta Lake table. To learn more about Delta Lake, see the official Delta Lake documentation.

You can use AWS Glue to perform read and write operations on Delta Lake tables in Amazon S3, or work with Delta Lake tables using the AWS Glue Data Catalog. Additional operations such as insert, update, and table batch reads and writes are also supported. When you use Delta Lake tables, you can also use methods from the Delta Lake Python library, such as DeltaTable.forPath. For more information about the Delta Lake Python library, see Welcome to Delta Lake’s Python documentation page.
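As an illustration of the Delta Lake Python library, the following is a minimal sketch of how DeltaTable.forPath might be used to update rows in an existing table. The function name rename_event_type, the column event_type, and its values are hypothetical and are not part of AWS Glue or Delta Lake. The sketch assumes that Delta Lake is enabled for the job and that spark is the job's SparkSession.

```python
def rename_event_type(spark, table_path):
    """Update rows in place in the Delta Lake table stored at table_path."""
    # Imported lazily: the delta package is available only when the job
    # runs with Delta Lake on its Python path.
    from delta.tables import DeltaTable

    # Bind to an existing Delta Lake table by its storage path.
    delta_table = DeltaTable.forPath(spark, table_path)

    # Hypothetical column and values: correct a misspelled event type.
    delta_table.update(
        condition="event_type = 'clck'",
        set={"event_type": "'click'"},
    )

    # Return the table's current contents as a Spark DataFrame.
    return delta_table.toDF()
```

In a job script, you might call rename_event_type(spark, "s3://<s3Path>") and then inspect the returned DataFrame.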

The following table lists the version of Delta Lake included in each AWS Glue version.

AWS Glue version    Supported Delta Lake version
4.0                 2.1.0
3.0                 1.0.0

To learn more about the data lake frameworks that AWS Glue supports, see Using data lake frameworks with AWS Glue ETL jobs.

Enabling Delta Lake for AWS Glue

To enable Delta Lake for AWS Glue, complete the following tasks:

  • Specify delta as a value for the --datalake-formats job parameter. For more information, see AWS Glue job parameters.

  • Create a key named --conf for your AWS Glue job, and set it to the following value. Alternatively, you can set the following configuration using SparkConf in your script. These settings help Apache Spark correctly handle Delta Lake tables.

    spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
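The two options above can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the names DELTA_SPARK_SETTINGS, delta_job_arguments, and delta_spark_conf are hypothetical helpers. The first function assembles the job parameters as a dictionary and shows why the --conf value embeds a second " --conf ": job parameters are key-value pairs, so each further Spark setting is chained inside the single --conf value. The second function applies the same settings with SparkConf and needs a Spark runtime.

```python
# Spark settings taken from the configuration value shown above.
DELTA_SPARK_SETTINGS = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def delta_job_arguments(settings=DELTA_SPARK_SETTINGS):
    """Build job parameters that enable Delta Lake (hypothetical helper)."""
    pairs = [f"{key}={value}" for key, value in settings.items()]
    # Only one --conf key exists, so later settings are chained inside its
    # value, each introduced by a literal " --conf ".
    return {
        "--datalake-formats": "delta",
        "--conf": " --conf ".join(pairs),
    }


def delta_spark_conf(settings=DELTA_SPARK_SETTINGS):
    """Alternative: apply the same settings with SparkConf in the script."""
    from pyspark import SparkConf  # imported lazily; needs a Spark runtime

    conf = SparkConf()
    for key, value in settings.items():
        conf.set(key, value)
    return conf
```

The dictionary returned by delta_job_arguments has the same shape as the key-value job parameters that you enter for a job. The SparkConf returned by delta_spark_conf would be passed to SparkContext when the script creates it.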

Using a different Delta Lake version

To use a version of Delta Lake that AWS Glue doesn't support, specify your own Delta Lake JAR files using the --extra-jars job parameter. Do not include delta as a value for the --datalake-formats job parameter. To use the Delta Lake Python library in this case, you must also specify the library JAR files using the --extra-py-files job parameter, because the Python library comes packaged in the Delta Lake JAR files.
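The following sketch shows one way the job parameters for this case might be assembled. The helper name custom_delta_job_arguments and the JAR location in the usage note are hypothetical. The sketch assumes that multiple paths are joined with commas.

```python
def custom_delta_job_arguments(jar_paths, py_file_paths=()):
    """Job parameters for supplying your own Delta Lake version (hypothetical helper)."""
    if not jar_paths:
        raise ValueError("at least one Delta Lake JAR path is required")
    arguments = {
        # Amazon S3 paths to your own Delta Lake JAR files.
        "--extra-jars": ",".join(jar_paths),
    }
    if py_file_paths:
        # Needed only when the script uses the Delta Lake Python library,
        # which is packaged inside the same JAR files.
        arguments["--extra-py-files"] = ",".join(py_file_paths)
    # "--datalake-formats" is deliberately left out so that the bundled
    # Delta Lake version is not loaded as well.
    return arguments
```

For example, with a placeholder location such as jar = "s3://<your_bucket>/<delta_jar>", calling custom_delta_job_arguments([jar], [jar]) passes the same JAR file to both parameters.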

Example: Write a Delta Lake table to Amazon S3 and register it to the AWS Glue Data Catalog

The following AWS Glue ETL script demonstrates how to write a Delta Lake table to Amazon S3 and register the table to the AWS Glue Data Catalog.

Python

    # Example: Create a Delta Lake table from a DataFrame
    # and register the table to Glue Data Catalog
    additional_options = {
        "path": "s3://<s3Path>"
    }
    dataFrame.write \
        .format("delta") \
        .options(**additional_options) \
        .mode("append") \
        .partitionBy("<your_partitionkey_field>") \
        .saveAsTable("<your_database_name>.<your_table_name>")

Scala

    // Example: Create a Delta Lake table from a DataFrame
    // and register the table to Glue Data Catalog
    val additional_options = Map(
      "path" -> "s3://<s3Path>"
    )
    dataFrame.write.format("delta")
      .options(additional_options)
      .mode("append")
      .partitionBy("<your_partitionkey_field>")
      .saveAsTable("<your_database_name>.<your_table_name>")

Example: Read a Delta Lake table from Amazon S3 using the AWS Glue Data Catalog

The following AWS Glue ETL script reads the Delta Lake table that you created in Example: Write a Delta Lake table to Amazon S3 and register it to the AWS Glue Data Catalog.

Python

For this example, use the create_data_frame.from_catalog method.

    # Example: Read a Delta Lake table from Glue Data Catalog
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    sc = SparkContext()
    glueContext = GlueContext(sc)

    # additional_options is assumed to be a dict defined earlier in the script
    df = glueContext.create_data_frame.from_catalog(
        database="<your_database_name>",
        table_name="<your_table_name>",
        additional_options=additional_options
    )

Scala

For this example, use the getCatalogSource method.

    // Example: Read a Delta Lake table from Glue Data Catalog
    import com.amazonaws.services.glue.GlueContext
    import org.apache.spark.SparkContext

    object GlueApp {
      def main(sysArgs: Array[String]): Unit = {
        val spark: SparkContext = new SparkContext()
        val glueContext: GlueContext = new GlueContext(spark)
        // additionalOptions must be defined earlier in the script
        // (assumed here to be a JsonOptions value)
        val df = glueContext.getCatalogSource("<your_database_name>", "<your_table_name>",
          additionalOptions = additionalOptions)
          .getDataFrame()
      }
    }

Example: Insert a DataFrame into a Delta Lake table in Amazon S3 using the AWS Glue Data Catalog

This example inserts data into the Delta Lake table that you created in Example: Write a Delta Lake table to Amazon S3 and register it to the AWS Glue Data Catalog.

Note

This example requires you to set the --enable-glue-datacatalog job parameter in order to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. To learn more, see AWS Glue job parameters.
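As a rough illustration, the following sketch checks a dictionary of job parameters for the settings that this example relies on. The helper name missing_requirements is hypothetical. The sketch assumes that the job uses the Delta Lake version bundled with AWS Glue, that --datalake-formats accepts a comma-separated list, and that --enable-glue-datacatalog is switched on with the value true.

```python
def missing_requirements(job_arguments):
    """Return the settings that the insert example needs but job_arguments lacks."""
    problems = []

    # Assumption: several data lake formats may be listed, separated by commas.
    formats = job_arguments.get("--datalake-formats", "")
    if "delta" not in [item.strip() for item in formats.split(",")]:
        problems.append("--datalake-formats must include delta")

    # Assumption: the parameter is switched on with the value "true".
    if job_arguments.get("--enable-glue-datacatalog", "").lower() != "true":
        problems.append("--enable-glue-datacatalog must be set to true")

    return problems
```

An empty list means that both settings are present. Each string in a non-empty list names a setting to add before you run the job.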

Python

For this example, use the write_data_frame.from_catalog method.

    # Example: Insert into a Delta Lake table in S3 using Glue Data Catalog
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    sc = SparkContext()
    glueContext = GlueContext(sc)

    # dataFrame and additional_options are assumed to be defined earlier in the script
    glueContext.write_data_frame.from_catalog(
        frame=dataFrame,
        database="<your_database_name>",
        table_name="<your_table_name>",
        additional_options=additional_options
    )

Scala

For this example, use the getCatalogSink method.

    // Example: Insert into a Delta Lake table in S3 using Glue Data Catalog
    import com.amazonaws.services.glue.GlueContext
    import org.apache.spark.SparkContext

    object GlueApp {
      def main(sysArgs: Array[String]): Unit = {
        val spark: SparkContext = new SparkContext()
        val glueContext: GlueContext = new GlueContext(spark)
        // dataFrame and additionalOptions must be defined earlier in the script
        glueContext.getCatalogSink("<your_database_name>", "<your_table_name>",
          additionalOptions = additionalOptions)
          .writeDataFrame(dataFrame, glueContext)
      }
    }

Example: Read a Delta Lake table from Amazon S3 using the Spark API

This example reads a Delta Lake table from Amazon S3 using the Spark API.

Python

    # Example: Read a Delta Lake table from S3 using a Spark DataFrame
    # Assumes spark is the job's SparkSession
    dataFrame = spark.read.format("delta").load("s3://<s3Path>/")

Scala

    // Example: Read a Delta Lake table from S3 using a Spark DataFrame
    // Assumes spark is the job's SparkSession
    val dataFrame = spark.read.format("delta").load("s3://<s3Path>/")

Example: Write a Delta Lake table to Amazon S3 using Spark

This example writes a Delta Lake table to Amazon S3 using Spark.

Python

    # Example: Write a Delta Lake table to S3 using a Spark DataFrame
    # Assumes dataFrame and additional_options are defined earlier in the script
    dataFrame.write.format("delta") \
        .options(**additional_options) \
        .mode("overwrite") \
        .partitionBy("<your_partitionkey_field>") \
        .save("s3://<s3Path>")

Scala

    // Example: Write a Delta Lake table to S3 using a Spark DataFrame
    // Assumes dataFrame and additionalOptions are defined earlier in the script
    dataFrame.write.format("delta")
      .options(additionalOptions)
      .mode("overwrite")
      .partitionBy("<your_partitionkey_field>")
      .save("s3://<s3Path>/")