Using Delta Lake OSS with EMR Serverless
Amazon EMR versions 6.9.0 and higher
Note
Amazon EMR 7.0.0 and higher uses Delta Lake 3.0.0, which renames the delta-core.jar file to delta-spark.jar. If you use Amazon EMR 7.0.0 or higher, make sure to specify delta-spark.jar in your configurations.
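For example, on Amazon EMR 7.0.0 and higher the spark.jars setting shown in the steps below would reference the renamed file. The path here is an assumption based on the delta-core.jar location on earlier releases; verify it on your release before relying on it.

  --conf spark.jars=/usr/share/aws/delta/lib/delta-spark.jar,/usr/share/aws/delta/lib/delta-storage.jar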
Amazon EMR 6.9.0 and higher includes Delta Lake, so you no longer have to package Delta Lake yourself or provide the --packages flag with your EMR Serverless jobs.
- When you submit EMR Serverless jobs, make sure that you have the following configuration properties and include the following parameters in the sparkSubmitParameters field.

  --conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
- Create a local delta_sample.py to test creating and reading a Delta table.

  # delta_sample.py
  from pyspark.sql import SparkSession
  import uuid

  url = "s3://DOC-EXAMPLE-BUCKET/delta-lake/output/%s/" % str(uuid.uuid4())
  spark = SparkSession.builder.appName("DeltaSample").getOrCreate()

  ## creates a Delta table and outputs to target S3 bucket
  spark.range(5).write.format("delta").save(url)

  ## reads a Delta table and outputs to target S3 bucket
  spark.read.format("delta").load(url).show()
- Using the AWS CLI, upload the delta_sample.py file to your Amazon S3 bucket. Then use the start-job-run command to submit a job to an existing EMR Serverless application.

  aws s3 cp delta_sample.py s3://DOC-EXAMPLE-BUCKET/code/

  aws emr-serverless start-job-run \
      --application-id application-id \
      --execution-role-arn job-role-arn \
      --name emr-delta \
      --job-driver '{
          "sparkSubmit": {
              "entryPoint": "s3://DOC-EXAMPLE-BUCKET/code/delta_sample.py",
              "sparkSubmitParameters": "--conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
          }
      }'
To use Python libraries with Delta Lake, you can add the delta-core library by packaging it as a dependency or by using it as a custom image. Alternatively, you can use SparkContext.addPyFile to add the Python libraries from the delta-core JAR file:

  import glob
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  spark.sparkContext.addPyFile(glob.glob("/usr/share/aws/delta/lib/delta-core_*.jar")[0])
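Once the JAR is added, the Delta Lake Python API that it bundles becomes importable. The following is a minimal sketch of using it; the table path is a placeholder and assumes a table like the one created in the earlier sample.

  import glob
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  spark.sparkContext.addPyFile(glob.glob("/usr/share/aws/delta/lib/delta-core_*.jar")[0])

  # Import the bundled Delta Lake Python API and open a table by path.
  # The S3 path is a placeholder; point it at a Delta table you created earlier.
  from delta.tables import DeltaTable

  delta_table = DeltaTable.forPath(spark, "s3://DOC-EXAMPLE-BUCKET/delta-lake/output/example/")
  delta_table.toDF().show()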
Amazon EMR versions 6.8.0 and lower
If you're using Amazon EMR 6.8.0 or lower, follow these steps to use Delta Lake OSS with your EMR Serverless applications.
- To build an open source version of Delta Lake that’s compatible with the version of Spark on your Amazon EMR Serverless application, navigate to the Delta GitHub and follow the instructions.
- Upload the Delta Lake libraries to an Amazon S3 bucket in your AWS account (for an example AWS CLI command, see the sketch after these steps).
- When you submit EMR Serverless jobs in the application configuration, include the Delta Lake JAR files that are now in your bucket.

  --conf spark.jars=s3://DOC-EXAMPLE-BUCKET/jars/delta-core_2.12-1.1.0.jar
- To ensure that you can read from and write to a Delta table, run a sample PySpark test.

  from pyspark import SparkConf, SparkContext
  from pyspark.sql import HiveContext, SparkSession
  import uuid

  conf = SparkConf()
  sc = SparkContext(conf=conf)
  sqlContext = HiveContext(sc)

  # SparkSession used for the read/write test
  session = SparkSession.builder.getOrCreate()

  url = "s3://DOC-EXAMPLE-BUCKET/delta-lake/output/1.0.1/%s/" % str(uuid.uuid4())

  ## creates a Delta table and outputs to target S3 bucket
  session.range(5).write.format("delta").save(url)

  ## reads a Delta table and outputs to target S3 bucket
  session.read.format("delta").load(url).show()
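The upload in step 2 can be done with the AWS CLI. A minimal sketch, assuming the JAR you built is the delta-core_2.12-1.1.0.jar referenced in the configuration above and that you keep it under the jars/ prefix:

  aws s3 cp delta-core_2.12-1.1.0.jar s3://DOC-EXAMPLE-BUCKET/jars/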