
Working with Apache Iceberg in AWS Glue

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. One of the core capabilities of AWS Glue is its ability to perform extract, transform, and load (ETL) operations in a simple and cost-effective manner. This helps categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams. 

AWS Glue jobs encapsulate scripts that define transformation logic by using an Apache Spark or Python runtime. AWS Glue jobs can be run in both batch and streaming mode. 

When you create Iceberg jobs in AWS Glue, depending on the version of AWS Glue, you can use either native Iceberg integration or a custom Iceberg version to attach Iceberg dependencies to the job.

Using native Iceberg integration

AWS Glue versions 3.0 and 4.0 natively support transactional data lake formats such as Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake in AWS Glue for Spark. This integration feature simplifies the configuration steps required to start using these frameworks in AWS Glue.

To enable Iceberg support for your AWS Glue job, set a job parameter: choose the Job details tab for your AWS Glue job, scroll to Job parameters under Advanced properties, set the key to --datalake-formats, and set its value to iceberg.
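If you create jobs programmatically instead of through the console, you can pass the same parameter in the job's default arguments. The following is a minimal sketch that uses the AWS SDK for Python (Boto3); the job name, IAM role, script location, and bucket names are hypothetical placeholders.

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="iceberg-etl-job",                                  # hypothetical job name
    Role="arn:aws:iam::111122223333:role/GlueJobRole",       # hypothetical IAM role
    GlueVersion="4.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/iceberg_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Turns on the native Iceberg integration for this job.
        "--datalake-formats": "iceberg",
    },
)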

If you are authoring a job by using a notebook, you can configure the parameter in the first notebook cell by using the %%configure magic as follows:

%%configure
{
    "--conf" : <job-specific Spark configuration discussed later>,
    "--datalake-formats" : "iceberg"
}

Using a custom Iceberg version

In some situations, you might want to retain control over the Iceberg version for the job and upgrade it at your own pace. For example, upgrading to a later version can unlock access to new features and performance enhancements. To use a specific Iceberg version with AWS Glue, you can use a custom connector or your own JAR file.

Using a custom connector

AWS Glue supports connectors, which are optional code packages that assist with accessing data stores in AWS Glue Studio. You can subscribe to a connector in AWS Marketplace, or you can create a custom connector.

Note

AWS Marketplace offers the Apache Iceberg connector for AWS Glue. However, we recommend that you use a custom connector instead to retain control over Iceberg versions.

For example, to create a custom connector for Iceberg version 0.13.1, follow these steps:

  1. Upload the files iceberg-spark-runtime-3.1_2.12-0.13.1.jar, bundle-2.17.161.jar, and url-connection-client-2.17.161.jar to an Amazon S3 bucket. You can download these files from their respective Apache Maven repositories.

  2. On the AWS Glue Studio console, create a custom Spark connector:

    1. In the navigation pane, choose Data connections. (If you're using the older navigation, choose Connectors, Create custom connector.)

    2. In the Connectors box, choose Create custom connector.

    3. On the Create custom connector page:

      • Specify the path to the JAR files in Amazon S3.

      • Enter a name for the connector.

      • Choose Spark as the connector type.

      • For Class name, specify the fully qualified data source class name (or its alias) that you use when loading the Spark data source with the format operator.

      • (Optional) Provide a description of the connector.

  3. Choose Create connector.

When you work with connectors in AWS Glue, you must create a connection for the connector. A connection contains the properties that are required to connect to a particular data store. You use the connection with your data sources and data targets in the ETL job. Connectors and connections work together to facilitate access to the data stores.

To create a connection by using the custom Iceberg connector you created:

  1. On the AWS Glue Studio console, select your custom Iceberg connector.

  2. Follow the prompts to supply the details, such as your VPC and other network configurations required by the job, and then choose Create connection.

You can now use the connection in your AWS Glue ETL job. Depending on how you create the job, there are different ways to attach the connection to your job:

  • If you create a visual job by using AWS Glue Studio, you can select the connection from the Connection list on the Data source properties – Connector tab.

  • If you develop the job in a notebook, use the %connections magic to set the connection name:

    %glue_version 3.0
    %connections <name-of-the-iceberg-connection>
    %%configure
    {
        "--conf" : "job-specific Spark configurations, to be discussed later",
        "--datalake-formats" : "iceberg"
    }
  • If you author the job by using the script editor, specify the connection on the Job details tab, under Advanced properties, Additional network connections.

For more information about the procedures in this section, see Using connectors and connections with AWS Glue Studio in the AWS Glue documentation.

Bringing your own JAR files

In AWS Glue, you can also work with Iceberg without having to use a connector. This approach is useful when you want to retain control over the Iceberg version and quickly update it. To use this option, upload the required Iceberg JAR files into an S3 bucket of your choice and reference the files in your AWS Glue job. For example, if you're working with Iceberg 1.0.0, the required JAR files are iceberg-spark-runtime-3.0_2.12-1.0.0.jar, url-connection-client-2.15.40.jar, and bundle-2.15.40.jar. You can also prioritize the additional JAR files in the class path by setting the --user-jars-first parameter to true for the job.
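As an illustration, the following notebook cell is a minimal sketch that attaches these files through the --extra-jars job parameter (a comma-separated list of Amazon S3 paths) and turns on --user-jars-first; the bucket and key names are hypothetical placeholders. If you aren't working in a notebook, you can set the same keys and values in the Job parameters section of the console.

%%configure
{
    "--extra-jars" : "s3://amzn-s3-demo-bucket/jars/iceberg-spark-runtime-3.0_2.12-1.0.0.jar,s3://amzn-s3-demo-bucket/jars/url-connection-client-2.15.40.jar,s3://amzn-s3-demo-bucket/jars/bundle-2.15.40.jar",
    "--user-jars-first" : "true"
}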

Spark configurations for Iceberg in AWS Glue

This section discusses the Spark configurations required to author an AWS Glue ETL job for an Iceberg dataset. You can set these configurations by using the --conf Spark key with a comma-separated list of all Spark configuration keys and values. You can use the %%configure magic in a notebook, or the Job parameters section of the AWS Glue Studio console.

%glue_version 3.0
%connections <name-of-the-iceberg-connection>
%%configure
{
    "--conf" : "spark.sql.extensions=org.apache.iceberg.spark.extensions...",
    "--datalake-formats" : "iceberg"
}

Configure the Spark session with the following properties:

  • <catalog_name> is your Iceberg Spark session catalog name. Replace it with the name of your catalog, and remember to change the references throughout all configurations that are associated with this catalog. In your code, you should then refer to your Iceberg tables with the fully qualified table name, including the Spark session catalog name, as follows: <catalog_name>.<database_name>.<table_name>.

  • <catalog_name>.<warehouse> points to the Amazon S3 path where you want to store your data and metadata.

  • To use the AWS Glue Data Catalog as your Iceberg catalog, set <catalog_name>.catalog-impl to org.apache.iceberg.aws.glue.GlueCatalog. This key is required to point to an implementation class for any custom catalog implementation. For catalogs supported by Iceberg, see the General best practices section later in this guide.

  • Use org.apache.iceberg.aws.s3.S3FileIO as the <catalog_name>.io-impl in order to take advantage of Amazon S3 multipart upload for high parallelism.

For example, if you have a catalog called glue_iceberg, you can configure your job by using multiple --conf keys as follows:

%%configure
{
    "--datalake-formats" : "iceberg",
    "--conf" : "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "--conf" : "spark.sql.catalog.glue_iceberg=org.apache.iceberg.spark.SparkCatalog",
    "--conf" : "spark.sql.catalog.glue_iceberg.warehouse=s3://<your-warehouse-dir>/",
    "--conf" : "spark.sql.catalog.glue_iceberg.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog",
    "--conf" : "spark.sql.catalog.glue_iceberg.io-impl=org.apache.iceberg.aws.s3.S3FileIO"
}

Alternatively, you can use code to add the above configurations to your Spark script as follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_iceberg", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_iceberg.warehouse", "s3://<your-warehouse-dir>/") \
    .config("spark.sql.catalog.glue_iceberg.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_iceberg.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .getOrCreate()
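After the session is configured, you refer to Iceberg tables by their fully qualified names in the form <catalog_name>.<database_name>.<table_name>. The following is a minimal sketch that continues the glue_iceberg example; the database name db and the table name customer_events are hypothetical, so replace them with your own.

from pyspark.sql.functions import to_timestamp

# Create an Iceberg table in the glue_iceberg catalog (hypothetical database and table names).
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_iceberg.db.customer_events (
        event_id string,
        event_ts timestamp,
        payload  string)
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Append rows by using the fully qualified table name.
df = (spark.createDataFrame(
          [("e-001", "2023-01-01 00:00:00", "signup")],
          ["event_id", "event_ts", "payload"])
      .withColumn("event_ts", to_timestamp("event_ts")))
df.writeTo("glue_iceberg.db.customer_events").append()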

Best practices for AWS Glue jobs

This section provides general guidelines for tuning Spark jobs in AWS Glue to optimize reading and writing data to Iceberg tables. For Iceberg-specific best practices, see the Best practices section later in this guide.

  • Use the latest version of AWS Glue and upgrade whenever possible – New versions of AWS Glue provide performance improvements, reduced startup times, and new features. They also support newer Spark versions that might be required for the latest Iceberg versions. For a list of available AWS Glue versions and the Spark versions they support, see the AWS Glue documentation.

  • Optimize AWS Glue job memory – Follow the recommendations in the AWS blog post Optimize memory management in AWS Glue.

  • Use AWS Glue Auto Scaling – When you enable Auto Scaling, AWS Glue automatically adjusts the number of AWS Glue workers dynamically based on your workload. This helps reduce the cost of your AWS Glue job during peak loads, because AWS Glue scales down the number of workers when the workload is small and workers are sitting idle. To use AWS Glue Auto Scaling, you specify a maximum number of workers that your AWS Glue job can scale to. For more information, see Using auto scaling for AWS Glue in the AWS Glue documentation.

  • Use custom connectors or add library dependencies – AWS Glue native integration for Iceberg is best for getting started with Iceberg. However, for production workloads, we recommend that you use custom connectors or add library dependencies (as discussed earlier in this guide) to get full control over the Iceberg version. This approach helps you benefit from the latest Iceberg features and performance improvements in your AWS Glue jobs.

  • Enable the Spark UI for monitoring and debugging – You can also use the Spark UI in AWS Glue to inspect your Iceberg job by visualizing the different stages of a Spark job in a directed acyclic graph (DAG) and monitoring the jobs in detail. The Spark UI provides an effective way to both troubleshoot and optimize Iceberg jobs. For example, you can spot bottleneck stages that have large shuffles or disk spill and use them to identify tuning opportunities. For more information, see Monitoring jobs using the Apache Spark web UI in the AWS Glue documentation.
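The Spark UI is enabled through job parameters. The following notebook cell is a minimal sketch, assuming a hypothetical S3 path for the Spark event logs; you can set the same keys in the Job parameters section of the console.

%%configure
{
    "--enable-spark-ui" : "true",
    "--spark-event-logs-path" : "s3://amzn-s3-demo-bucket/spark-ui-logs/"
}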