Use a cluster with Iceberg installed

Use an Iceberg cluster with Spark

Starting with Amazon EMR version 6.5.0, you can use Iceberg with your Spark cluster without requiring bootstrap actions. For Amazon EMR versions 6.4.0 and earlier, you can use a bootstrap action to pre-install all necessary dependencies.

In this tutorial, we'll use the AWS CLI to work with Iceberg on an Amazon EMR Spark cluster. To use the console to create a cluster with Iceberg installed, follow the steps in Build an Apache Iceberg data lake using Amazon Athena, Amazon EMR, and AWS Glue.

Create an Iceberg cluster

You can create a cluster with Iceberg installed using the AWS Management Console, the AWS CLI, or the Amazon EMR API.

To use Iceberg on Amazon EMR with the AWS CLI, first create a cluster with the following steps. For information on specifying the Iceberg classification using the AWS CLI, see Supply a configuration using the AWS CLI when you create a cluster or Supply a configuration using the Java SDK when you create a cluster.

  1. Create a file, configurations.json, with the following content:

    [{ "Classification":"iceberg-defaults", "Properties":{"iceberg.enabled":"true"} }]
  2. Next, create a cluster with the following configuration, replacing the example Amazon S3 bucket path and the subnet ID with your own. An equivalent call using the AWS SDK for Python follows this list.

    aws emr create-cluster --release-label emr-6.5.0 \
    --applications Name=Spark \
    --configurations file://configurations.json \
    --region us-east-1 \
    --name My_Spark_Iceberg_Cluster \
    --log-uri s3://DOC-EXAMPLE-BUCKET/ \
    --instance-type m5.xlarge \
    --instance-count 2 \
    --service-role EMR_DefaultRole \
    --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-1234567890abcdef0
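If you prefer to create the cluster programmatically, the following is a minimal sketch of the same call using the AWS SDK for Python (Boto3). The SDK example is our illustration rather than part of the CLI walkthrough; it reuses the placeholder values above, which you should replace with your own.

import boto3

# Create the same Iceberg-enabled Spark cluster through the EMR API.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="My_Spark_Iceberg_Cluster",
    ReleaseLabel="emr-6.5.0",
    Applications=[{"Name": "Spark"}],
    # Same classification as configurations.json above
    Configurations=[{
        "Classification": "iceberg-defaults",
        "Properties": {"iceberg.enabled": "true"},
    }],
    LogUri="s3://DOC-EXAMPLE-BUCKET/",
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 1},
        ],
        "Ec2SubnetId": "subnet-1234567890abcdef0",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
)
print(response["JobFlowId"])  # ID of the new cluster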

Alternatively, you can create an Amazon EMR cluster that includes the Spark application, then include the file /usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar as a JAR dependency in a Spark job. For more information, see Submitting Applications.

To include the jar as a dependency in a Spark job, you can add the following configuration property to the Spark application:

--conf "spark.jars=/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar"

For more information about managing Spark job dependencies, see the Apache Spark documentation.

Initialize a Spark session for Iceberg

The following examples demonstrate how to launch the interactive Spark shell, use spark-submit, or use EMR Studio notebooks to work with Iceberg on Amazon EMR.

spark-shell
  1. Connect to the master node using SSH. For more information, see Connect to the master node using SSH in the Amazon EMR Management Guide.

  2. Enter the following command to launch the Spark shell. To use the PySpark shell, replace spark-shell with pyspark. A quick way to verify the configuration follows these steps.

    spark-shell \
    --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
    --conf "spark.sql.catalog.dev=org.apache.iceberg.spark.SparkCatalog" \
    --conf "spark.sql.catalog.dev.type=hadoop" \
    --conf "spark.sql.catalog.dev.warehouse=s3://DOC-EXAMPLE-BUCKET/example-prefix/"
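Once the shell is running, you can confirm that the session picked up the Iceberg settings. The following is a minimal check for the pyspark shell (the same spark.conf.get calls work from spark-shell in Scala):

# Print the catalog settings back from the live session.
print(spark.conf.get("spark.sql.extensions"))
print(spark.conf.get("spark.sql.catalog.dev"))
print(spark.conf.get("spark.sql.catalog.dev.warehouse"))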
spark-submit
  1. Connect to the master node using SSH. For more information, see Connect to the master node using SSH in the Amazon EMR Management Guide.

  2. Enter the following command to launch the Spark session for Iceberg.

    spark-submit \
    --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
    --conf "spark.sql.catalog.dev=org.apache.iceberg.spark.SparkCatalog" \
    --conf "spark.sql.catalog.dev.type=hadoop" \
    --conf "spark.sql.catalog.dev.warehouse=s3://DOC-EXAMPLE-BUCKET/example-prefix/"
EMR Studio notebooks

To initialize a Spark session using EMR Studio notebooks, configure your Spark session using the %%configure magic command in your Amazon EMR notebook, as in the following example. For more information, see Use EMR Notebooks magics in the Amazon EMR Management Guide.

%%configure -f
{
  "conf": {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.dev": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.dev.type": "hadoop",
    "spark.sql.catalog.dev.warehouse": "s3://DOC-EXAMPLE-BUCKET/example-prefix/"
  }
}
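Cells that run after the %%configure cell use the reconfigured session. As a quick check, you can issue Iceberg DDL against the dev catalog from a later cell; the namespace and table names below are placeholders, and the example assumes the warehouse location above is writable.

# In a subsequent notebook cell: the "dev" catalog defined above
# should now accept Iceberg DDL.
spark.sql("CREATE TABLE IF NOT EXISTS dev.db.smoke_test (id string) USING iceberg")
spark.sql("SHOW TABLES IN dev.db").show()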

Write to an Iceberg table

The following example shows how to create a DataFrame and write it as an Iceberg dataset. The examples demonstrate working with datasets using the Spark shell while connected to the master node using SSH as the default hadoop user.

Note

To paste code samples into the Spark shell, type :paste at the prompt, paste the example, and then press CTRL+D.

PySpark

Spark includes a Python-based shell, pyspark, that you can use to prototype Spark programs written in Python. Just as with spark-shell, invoke pyspark on the master node.

## Create a DataFrame
data = spark.createDataFrame([
    ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
    ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
    ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
    ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z")
], ["id", "creation_date", "last_update_time"])

## Write a DataFrame as an Iceberg dataset to the S3 location
spark.sql("""CREATE TABLE IF NOT EXISTS dev.db.iceberg_table (id string, creation_date string, last_update_time string)
USING iceberg
location 's3://DOC-EXAMPLE-BUCKET/example-prefix/db/iceberg_table'""")

data.writeTo("dev.db.iceberg_table").append()
Scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

// Create a DataFrame
val data = Seq(
    ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
    ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
    ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
    ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z")
).toDF("id", "creation_date", "last_update_time")

// Write a DataFrame as an Iceberg dataset to the S3 location
spark.sql("""CREATE TABLE IF NOT EXISTS dev.db.iceberg_table (id string, creation_date string, last_update_time string)
USING iceberg
location 's3://DOC-EXAMPLE-BUCKET/example-prefix/db/iceberg_table'""")

data.writeTo("dev.db.iceberg_table").append()

Read from an Iceberg table

PySpark
df = spark.read.format("iceberg").load("dev.db.iceberg_table")
df.show()
Scala
val df = spark.read.format("iceberg").load("dev.db.iceberg_table")
df.show()
Spark SQL
SELECT * FROM dev.db.iceberg_table LIMIT 10

Configure Spark properties to use the AWS Glue Data Catalog as the metastore for Iceberg tables

To use the AWS Glue Data Catalog as the metastore for Iceberg tables, set the Spark configuration properties as follows:

spark-submit \
    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.my_catalog.warehouse=s3://<bucket>/<prefix> \
    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.sql.catalog.my_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager \
    --conf spark.sql.catalog.my_catalog.lock.table=myGlueLockTable
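With a session launched this way, tables created under my_catalog are tracked in the AWS Glue Data Catalog. The following PySpark sketch assumes that a database named db already exists in the Glue Data Catalog; the table name is a placeholder.

# Create and query an Iceberg table whose metadata lives in the Glue Data Catalog.
spark.sql("""CREATE TABLE IF NOT EXISTS my_catalog.db.glue_example
    (id string, creation_date string)
    USING iceberg""")
spark.sql("SELECT * FROM my_catalog.db.glue_example LIMIT 10").show()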

Use an Iceberg cluster with Trino

Starting with Amazon EMR version 6.6.0, you can use Iceberg with your Trino cluster.

In this tutorial, we'll use the AWS CLI to work with Iceberg on an Amazon EMR Trino cluster. To use the console to create a cluster with Iceberg installed, follow the steps in Build an Apache Iceberg data lake using Amazon Athena, Amazon EMR, and AWS Glue.

Create an Iceberg cluster

To use Iceberg on Amazon EMR with the AWS CLI, first create a cluster with the following steps. For information on specifying the Iceberg classification using the AWS CLI, see Supply a configuration using the AWS CLI when you create a cluster or Supply a configuration using the Java SDK when you create a cluster.

  1. Create a file, iceberg.properties, and set a value for your chosen catalog. For example, if you want to use the Hive metastore as your catalog, your file should have the following content.

    connector.name=iceberg
    hive.metastore.uri=thrift://localhost:9083

    If you want to use the AWS Glue Data Catalog as your catalog, your file should have the following content.

    connector.name=iceberg
    iceberg.catalog.type=glue
  2. Create a bootstrap action that copies iceberg.properties from Amazon S3 to /etc/trino/conf/catalog/iceberg.properties, as in the following example. For information on bootstrap actions, see Create bootstrap actions to install additional software.

    set -ex
    sudo aws s3 cp s3://DOC-EXAMPLE-BUCKET/iceberg.properties /etc/trino/conf/catalog/iceberg.properties
  3. Create a cluster with the following configuration, replacing the example bootstrap actions script path and key name with your own.

    aws emr create-cluster --release-label emr-6.7.0 \
    --applications Name=Trino \
    --region us-east-1 \
    --name My_Trino_Iceberg_Cluster \
    --bootstrap-actions '[{"Path":"s3://DOC-EXAMPLE-BUCKET","Name":"Add iceberg.properties"}]' \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c3.4xlarge InstanceGroupType=CORE,InstanceCount=3,InstanceType=c3.4xlarge \
    --use-default-roles \
    --ec2-attributes KeyName=<key-name>

Initialize a Trino session for Iceberg

To initialize a Trino session, run the following command.

trino-cli --catalog iceberg

Write to an Iceberg table

Create and write to your table with the following SQL commands.

trino> SHOW SCHEMAS;
trino> CREATE TABLE default.iceberg_table (
           id int,
           data varchar,
           category varchar)
       WITH (
           format = 'PARQUET',
           partitioning = ARRAY['category', 'bucket(id, 16)'],
           location = 's3://DOC-EXAMPLE-BUCKET/<prefix>');
trino> INSERT INTO default.iceberg_table VALUES (1,'a','c1'), (2,'b','c2'), (3,'c','c3');

Read from an Iceberg table

To read from your Iceberg table, run the following command.

trino> SELECT * FROM default.iceberg_table;