Querying Amazon S3 tables with Apache Spark

Apache Spark is an open-source analytics engine for processing large amounts of data. You can use the Amazon S3 Tables Catalog for Apache Iceberg client catalog to query tables from open-source applications using Spark. You can also use the client catalog to query tables using Spark on Amazon EMR. For more information, see Accessing Amazon S3 tables with Amazon EMR.

When you initialize a Spark session for Apache Iceberg, you add the Amazon S3 Tables Catalog for Apache Iceberg client catalog as a package dependency and set your table bucket as the warehouse location.

To initialize a Spark session for working with S3 tables
  • Initialize Spark using the following command. To use the command, replace the Amazon S3 Tables Catalog for Apache Iceberg version number with the latest version from the AWS Labs GitHub repository, and the table bucket ARN with your own table bucket ARN.

    spark-shell \
    --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.4 \
    --conf spark.sql.catalog.s3tablesbucket=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.s3tablesbucket.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog \
    --conf spark.sql.catalog.s3tablesbucket.warehouse=arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-table-bucket \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
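
After the shell starts, you can confirm that the session can reach your table bucket by listing the namespaces in the catalog. This is a minimal check that assumes the catalog name s3tablesbucket configured in the command above.

    // List the namespaces in the s3tablesbucket catalog to verify connectivity.
    spark.sql("SHOW NAMESPACES IN s3tablesbucket").show()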

Querying S3 tables with Spark

Using Spark, you can run DQL, DML, and DDL operations on S3 tables. The following examples show some common ways to interact with S3 tables. To use them, replace the user input placeholder values with your own. An additional row-level DML sketch follows the list.

To query tables with Spark
  • Create a namespace

    spark.sql(" CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.my_namespace")
  • Create a table

    spark.sql(" CREATE TABLE IF NOT EXISTS s3tablesbucket.my_namespace.`my_table` ( id INT, name STRING, value INT ) USING iceberg ")
  • Query a table

    spark.sql(" SELECT * FROM s3tablesbucket.my_namespace.`my_table` ").show()
  • Insert data into a table

    spark.sql( """ INSERT INTO s3tablesbucket.my_namespace.my_table VALUES (1, 'ABC', 100), (2, 'XYZ', 200) """)
  • Load an existing data file into a table

    1. Read the data into Spark.

      val data_file_location = "Path such as S3 URI to data file"
      val data_file = spark.read.parquet(data_file_location)
    2. Write the data into an Iceberg table.

      data_file.writeTo("s3tablesbucket.my_namespace.my_table")
        .using("iceberg")
        .tableProperty("format-version", "2")
        .createOrReplace()
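
Because the initialization command enables the Iceberg Spark session extensions, you can also run row-level DML statements against a table. The following is a minimal sketch that assumes the my_table example from the previous steps; the column values are illustrative only.

    // Update one row and delete another, then confirm the result.
    // These statements rely on IcebergSparkSessionExtensions being set
    // in spark.sql.extensions, as in the initialization command above.
    spark.sql("UPDATE s3tablesbucket.my_namespace.my_table SET value = 150 WHERE id = 1")
    spark.sql("DELETE FROM s3tablesbucket.my_namespace.my_table WHERE id = 2")
    spark.sql("SELECT * FROM s3tablesbucket.my_namespace.my_table").show()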