Using a cluster with Delta Lake installed
Using a Delta Lake cluster with Spark
Starting with Amazon EMR version 6.9.0, you can use Delta Lake with your Spark cluster without requiring bootstrap actions. For Amazon EMR release 6.8.0 and earlier, you can use bootstrap actions to pre-install all necessary dependencies.
In this tutorial, we will use the AWS CLI to work with Delta Lake on an Amazon EMR Spark cluster.
To use Delta Lake on Amazon EMR with the AWS Command Line Interface, first create a cluster with the following steps. For information on specifying the Delta Lake classification with the AWS Command Line Interface, see Supply a configuration using the AWS Command Line Interface when you create a cluster or Supply a configuration using the Java SDK when you create a cluster.
- Create a file, delta_configurations.json, with the following content:

  [{"Classification":"delta-defaults", "Properties":{"delta.enabled":"true"}}]
- Create a cluster with the following configuration, replacing the example Amazon S3 bucket path and the subnet ID with your own.

  aws emr create-cluster \
  --release-label emr-6.9.0 \
  --applications Name=Spark \
  --configurations file://delta_configurations.json \
  --region us-east-1 \
  --name My_Spark_Delta_Cluster \
  --log-uri s3://DOC-EXAMPLE-BUCKET/ \
  --instance-type m5.xlarge \
  --instance-count 2 \
  --service-role EMR_DefaultRole_V2 \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-1234567890abcdef0
Alternatively, you can create an Amazon EMR cluster with the Spark application and include the following files as JAR dependencies in a Spark job:

  /usr/share/aws/delta/lib/delta-core.jar
  /usr/share/aws/delta/lib/delta-storage.jar
  /usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar
For more information, see Submitting Applications. To include a JAR dependency in the Spark job, you can add the following configuration property to the Spark application:

  --conf "spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar"
For more information about Spark job dependencies, see Dependency Management.
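For example, a spark-submit invocation that supplies these JARs might look like the following sketch; the application class and application JAR path are placeholders for your own job:

  spark-submit \
  --conf "spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar" \
  --class com.example.MyDeltaApp \
  s3://DOC-EXAMPLE-BUCKET/jars/my-delta-app.jar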
Initialize a Spark session for Delta Lake
The following examples show how to launch the interactive Spark shell, use Spark submit, or use Amazon EMR Notebooks to work with Delta Lake on Amazon EMR.
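For example, from the primary node you might launch the Spark shell with the Delta Lake session extension and catalog implementation enabled. The two Spark properties below are the standard Delta Lake session settings; you can pass the same properties to spark-submit or set them in a notebook session instead:

  spark-shell \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"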
Write to a Delta Lake table
The following example shows how to create a DataFrame and write it as a Delta Lake dataset. The example works with the dataset in the Spark shell while connected to the primary node over SSH as the default hadoop user.
To paste code samples into the Spark shell, type :paste at the prompt, paste the example, and then press CTRL + D.
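A minimal sketch follows; the S3 output path is a placeholder and the column names are illustrative:

  // Write a small DataFrame as a Delta Lake dataset to a placeholder S3 path.
  import spark.implicits._

  val data = Seq((1, "a", "c1"), (2, "b", "c2"), (3, "c", "c3"))
    .toDF("id", "data", "category")

  data.write
    .format("delta")
    .mode("overwrite")
    .save("s3://DOC-EXAMPLE-BUCKET/delta-table/")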
Read from a Delta Lake table
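The following sketch reads the dataset back from the same placeholder S3 path used in the write example:

  // Load the Delta Lake dataset and display its contents.
  val df = spark.read.format("delta").load("s3://DOC-EXAMPLE-BUCKET/delta-table/")
  df.show()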
Use a Delta Lake cluster with Spark and AWS Glue
To use the AWS Glue Catalog as the metastore for Delta Lake tables, create a cluster with the following steps. For information on specifying the Delta Lake classification using the AWS Command Line Interface, see Supply a configuration using the AWS Command Line Interface when you create a cluster or Supply a configuration using the Java SDK when you create a cluster.
Create a Delta Lake cluster
- Create a file, delta_configurations.json, with the following content:

  [{"Classification":"delta-defaults", "Properties":{"delta.enabled":"true"}},
   {"Classification":"spark-hive-site", "Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]
- Create a cluster with the following configuration, replacing the example Amazon S3 bucket path and the subnet ID with your own.

  aws emr create-cluster \
  --release-label emr-6.9.0 \
  --applications Name=Spark \
  --configurations file://delta_configurations.json \
  --region us-east-1 \
  --name My_Spark_Delta_Cluster \
  --log-uri s3://DOC-EXAMPLE-BUCKET/ \
  --instance-type m5.xlarge \
  --instance-count 2 \
  --service-role EMR_DefaultRole_V2 \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-1234567890abcdef0
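Once the cluster is running, tables that you save through Spark are registered in the AWS Glue Catalog. The following is a minimal sketch, assuming the Spark session was initialized with the Delta Lake settings shown earlier and that the default database exists in your catalog; the table name is illustrative:

  // Save a DataFrame as a Delta table; with the spark-hive-site classification
  // above, the table metadata is stored in the AWS Glue Catalog.
  import spark.implicits._

  val data = Seq((1, "a", "c1"), (2, "b", "c2")).toDF("id", "data", "category")
  data.write.format("delta").mode("overwrite").saveAsTable("default.delta_table_glue")

  spark.sql("SELECT * FROM default.delta_table_glue").show()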
Use a Delta Lake cluster with Trino
Starting with Amazon EMR release 6.9.0, you can use Delta Lake with your Trino cluster.
In this tutorial, we will use the AWS CLI to work with Delta Lake on an Amazon EMR Trino cluster.
Create a Delta Lake cluster
- Create a file, delta_configurations.json, and set values for your chosen catalog. For example, if you want to use the Hive metastore as your catalog, your file should have the following content:

  [{"Classification":"delta-defaults", "Properties":{"delta.enabled":"true"}},
   {"Classification":"trino-connector-delta", "Properties":{"hive.metastore.uri":"thrift://localhost:9083"}}]

  If you want to use the AWS Glue Catalog as your catalog, your file should have the following content:

  [{"Classification":"delta-defaults", "Properties":{"delta.enabled":"true"}},
   {"Classification":"trino-connector-delta", "Properties":{"hive.metastore":"glue"}}]
- Create a cluster with the following configuration, replacing the example Amazon S3 bucket path and the subnet ID with your own.

  aws emr create-cluster \
  --release-label emr-6.9.0 \
  --applications Name=Trino \
  --configurations file://delta_configurations.json \
  --region us-east-1 \
  --name My_Trino_Delta_Cluster \
  --log-uri s3://DOC-EXAMPLE-BUCKET/ \
  --instance-type m5.xlarge \
  --instance-count 2 \
  --service-role EMR_DefaultRole_V2 \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-1234567890abcdef0
Initialize Trino session for Delta Lake
To initialize a Trino session, run the following command:
trino-cli --catalog delta
Write to a Delta Lake table
Create and write to your table with the following SQL commands:
SHOW SCHEMAS;

CREATE TABLE default.delta_table (id int, data varchar, category varchar)
WITH (location = 's3://DOC-EXAMPLE-BUCKET/<prefix>');

INSERT INTO default.delta_table VALUES (1,'a','c1'), (2,'b','c2'), (3,'c','c3');
Read from a Delta Lake table
Read from your table with the following SQL command:
SELECT * FROM default.delta_table;