Querying Amazon S3 tables with Apache Spark
Apache Spark is an open-source analytics engine for processing large amounts of data. You can use the Amazon S3 Tables Catalog for Apache Iceberg client catalog to query tables from open-source applications using Spark.

When you initialize a Spark session for Apache Iceberg, you include the client catalog as a package and set your table bucket as the warehouse location.
To initialize a Spark session for working with S3 tables

- Initialize Spark using the following command. To use the command, replace the Amazon S3 Tables Catalog for Apache Iceberg version number with the latest version from the AWS Labs GitHub repository, and the table bucket ARN with your own table bucket ARN.

  ```
  spark-shell \
    --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.4 \
    --conf spark.sql.catalog.s3tablesbucket=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.s3tablesbucket.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog \
    --conf spark.sql.catalog.s3tablesbucket.warehouse=arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-table-bucket \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  ```
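The `warehouse` value in the command above is a table bucket ARN with the form `arn:aws:s3tables:<region>:<account-id>:bucket/<name>`. As a minimal pure-Scala sketch (the object and method names here are illustrative helpers, not part of Spark or the S3 Tables catalog), the ARN and the matching `--conf` arguments can be assembled programmatically, which is handy when scripting the launch command:

```scala
object S3TablesConf {
  // Build a table bucket ARN from its parts, following the format shown
  // in the example above: arn:aws:s3tables:<region>:<account-id>:bucket/<name>
  def tableBucketArn(region: String, accountId: String, bucket: String): String =
    s"arn:aws:s3tables:$region:$accountId:bucket/$bucket"

  // Produce the --conf argument pairs used when launching spark-shell.
  // The catalog name "s3tablesbucket" matches the session configuration above.
  def sparkConfs(warehouseArn: String): Seq[String] = Seq(
    "spark.sql.catalog.s3tablesbucket=org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.s3tablesbucket.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog",
    s"spark.sql.catalog.s3tablesbucket.warehouse=$warehouseArn"
  ).flatMap(conf => Seq("--conf", conf))
}
```

For example, `S3TablesConf.tableBucketArn("us-east-1", "111122223333", "amzn-s3-demo-table-bucket")` reproduces the ARN used in the command above.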
Querying S3 tables with Spark
Using Spark, you can run DQL, DML, and DDL operations on S3 tables. The following example queries show some ways you can interact with S3 tables. To use these example queries in your query engine, replace the user input placeholder values with your own:
To query tables with Spark

- Create a namespace

  ```
  spark.sql("CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.my_namespace")
  ```
- Create a table

  ```
  spark.sql(
    """CREATE TABLE IF NOT EXISTS s3tablesbucket.my_namespace.`my_table` (
         id INT,
         name STRING,
         value INT
       )
       USING iceberg
    """)
  ```
- Query a table

  ```
  spark.sql("SELECT * FROM s3tablesbucket.my_namespace.`my_table`").show()
  ```
- Insert data into a table

  ```
  spark.sql(
    """INSERT INTO s3tablesbucket.my_namespace.my_table
       VALUES (1, 'ABC', 100), (2, 'XYZ', 200)
    """)
  ```
- Load an existing data file into a table

  Read the data into Spark.

  ```
  val data_file_location = "Path such as S3 URI to data file"
  val data_file = spark.read.parquet(data_file_location)
  ```

  Write the data into an Iceberg table.

  ```
  data_file.writeTo("s3tablesbucket.my_namespace.my_table")
    .using("Iceberg")
    .tableProperty("format-version", "2")
    .createOrReplace()
  ```
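The examples above always address a table with the three-part identifier `catalog.namespace.table`, backtick-quoting the table name where needed. As a small pure-Scala sketch (this helper object is illustrative, not part of Spark or the S3 Tables catalog), the qualified name can be built once and interpolated into each `spark.sql` string, which avoids repeating the catalog and namespace by hand:

```scala
object TableIds {
  // Quote an identifier part with backticks, escaping any embedded
  // backticks by doubling them, as Spark SQL expects.
  def quote(part: String): String =
    "`" + part.replace("`", "``") + "`"

  // Fully qualified identifier: catalog.namespace.`table`
  def qualified(catalog: String, namespace: String, table: String): String =
    s"$catalog.$namespace.${quote(table)}"
}
```

For example, `TableIds.qualified("s3tablesbucket", "my_namespace", "my_table")` yields ``s3tablesbucket.my_namespace.`my_table` ``, which can be interpolated into a query such as `s"SELECT * FROM ${TableIds.qualified(...)}"`.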