
Migrating AWS Glue for Spark jobs to AWS Glue version 3.0

This topic describes the changes between AWS Glue versions 0.9, 1.0, and 2.0 and AWS Glue version 3.0 so that you can migrate your Spark applications and ETL jobs to AWS Glue 3.0.

To use this feature with your AWS Glue ETL jobs, choose 3.0 for the Glue version when creating your jobs.

New features supported

This section describes new features and advantages of AWS Glue version 3.0.

  • It is based on Apache Spark 3.1.1, which includes optimizations from open-source Spark as well as optimizations developed by the AWS Glue and EMR services, such as adaptive query execution, vectorized readers, and optimized shuffles and partition coalescing.

  • Upgraded JDBC drivers for all AWS Glue native sources, including MySQL, Microsoft SQL Server, Oracle, PostgreSQL, and MongoDB, as well as upgraded Spark libraries and dependencies brought in by Spark 3.1.1.

  • Optimized Amazon S3 access with upgraded EMRFS and enabled Amazon S3 optimized output committers by default.

  • Optimized Data Catalog access with partition indexes, push-down predicates, partition listing, and an upgraded Hive metastore client; a pushdown sketch follows this list.

  • Integration with Lake Formation for governed catalog tables with cell-level filtering and data lake transactions.

  • Improved Spark UI experience with Spark 3.1.1, including new Spark executor memory metrics and Spark Structured Streaming metrics.

  • Reduced startup latency improving overall job completion times and interactivity, similar to AWS Glue 2.0.

  • Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration—from a 10-minute minimum to a 1-minute minimum, similar to AWS Glue 2.0.
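For example, the Data Catalog pushdown support can be used from a job script as in the following minimal sketch. The database and table names are hypothetical; push_down_predicate prunes partitions before reading, and the commented-out catalogPartitionPredicate option performs server-side pruning backed by partition indexes.

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read only the matching partitions instead of listing the whole table.
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="example_db",       # hypothetical name
        table_name="example_table",  # hypothetical name
        push_down_predicate="year == '2021' and month == '06'",
        # Alternative server-side pruning that uses partition indexes:
        # additional_options={"catalogPartitionPredicate": "year='2021'"},
    )
    print(dyf.count())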

Actions to migrate to AWS Glue 3.0

For existing jobs, change the Glue version from the previous version to Glue 3.0 in the job configuration.

  • In the console, choose Spark 3.1, Python 3 (Glue Version 3.0) or Spark 3.1, Scala 2 (Glue Version 3.0) in Glue version.

  • In AWS Glue Studio, choose Glue 3.0 - Supports spark 3.1, Scala 2, Python 3 in Glue version.

  • In the API, set the GlueVersion parameter to 3.0 in the UpdateJob API, as in the sketch following this list.
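For example, a minimal Boto3 sketch of switching an existing job to Glue 3.0; the job name is a placeholder, and because UpdateJob replaces the job definition, the current definition is fetched first:

    import boto3

    glue = boto3.client("glue")

    # UpdateJob requires the full JobUpdate structure, so start from the
    # current job definition and change only the Glue version.
    job = glue.get_job(JobName="example-job")["Job"]  # hypothetical job name
    glue.update_job(
        JobName="example-job",
        JobUpdate={
            "Role": job["Role"],
            "Command": job["Command"],
            "DefaultArguments": job.get("DefaultArguments", {}),
            "GlueVersion": "3.0",
        },
    )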

For new jobs, choose Glue 3.0 when you create a job.

  • In the console, choose Spark 3.1, Python 3 (Glue Version 3.0) or Spark 3.1, Scala 2 (Glue Version 3.0) in Glue version.

  • In AWS Glue Studio, choose Glue 3.0 - Supports spark 3.1, Scala 2, Python 3 in Glue version.

  • In the API, set the GlueVersion parameter to 3.0 in the CreateJob API, as in the sketch following this list.
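For example, a minimal Boto3 sketch of creating a Glue 3.0 job; the job name, role, and script location are placeholders:

    import boto3

    glue = boto3.client("glue")
    glue.create_job(
        Name="example-job",                                          # hypothetical
        Role="arn:aws:iam::123456789012:role/example-glue-role",     # hypothetical
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://example-bucket/scripts/job.py",  # hypothetical
            "PythonVersion": "3",
        },
        GlueVersion="3.0",
        WorkerType="G.1X",
        NumberOfWorkers=10,
    )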

To view Spark event logs of AWS Glue 3.0, launch an upgraded Spark history server for Glue 3.0 using CloudFormation or Docker.

Migration checklist

Review this checklist for migration.

  • Does your job depend on HDFS? If yes, try replacing HDFS with Amazon S3.

    • Search your job script code for file system paths that start with hdfs:// or that use a bare / as a DFS path.

    • Check whether your default file system is configured to use HDFS. If fs.defaultFS is configured explicitly, you need to remove that configuration.

    • Check whether your job contains any dfs.* parameters. If it does, verify that it is okay to disable those parameters.

  • Does your job depend on YARN? If yes, verify the impact by checking whether your job contains any of the following parameters. If it does, verify that it is okay to disable them; a configuration-inspection sketch follows this checklist.

    • spark.yarn.*

      For example:

      spark.yarn.executor.memoryOverhead
      spark.yarn.driver.memoryOverhead
      spark.yarn.scheduler.reporterThread.maxFailures
    • yarn.*

      For example:

      yarn.scheduler.maximum-allocation-mb
      yarn.nodemanager.resource.memory-mb
  • Does your job depend on Spark 2.2.1 or Spark 2.4.3? If yes, verify the impact by checking whether your job uses features that changed in Spark 3.1.1.

    • https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-22-to-23

      For example, the percentile_approx function, or SparkSession.builder.getOrCreate() when there is an existing SparkContext.

    • https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24

      For example, the array_contains function, or the CURRENT_DATE and CURRENT_TIMESTAMP functions with spark.sql.caseSensitive=true.

  • Do your job's extra jars conflict in Glue 3.0?

    • From AWS Glue 0.9/1.0: Extra jars supplied in existing AWS Glue 0.9/1.0 jobs may bring in classpath conflicts due to upgraded or new dependencies available in Glue 3.0. You can avoid classpath conflicts in AWS Glue 3.0 with the --user-jars-first AWS Glue job parameter or by shading your dependencies.

    • From AWS Glue 2.0: You can still avoid classpath conflicts in AWS Glue 3.0 with the --user-jars-first AWS Glue job parameter or by shading your dependencies.

  • Do your jobs depend on Scala 2.11?

    • AWS Glue 3.0 uses Scala 2.12 so you need to rebuild your libraries with Scala 2.12 if your libraries depend on Scala 2.11.

  • Do your job's external Python libraries depend on Python 2.7/3.6?

    • Use the --additional-python-modules parameter instead of setting the egg/wheel/zip files in the Python library path.

    • Update the dependent libraries from Python 2.7/3.6 to Python 3.7, because Spark 3.1.1 removed Python 2.7 support.
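As referenced in the checklist, one way to surface HDFS and YARN dependencies is to print the relevant Spark configuration from inside the job. This is a minimal sketch; the key prefixes checked are the ones called out above.

    from pyspark.context import SparkContext

    sc = SparkContext.getOrCreate()

    # Print any explicitly set configuration that ties the job to YARN or HDFS.
    # Hadoop settings such as dfs.* and fs.defaultFS usually appear under the
    # spark.hadoop. prefix when set through SparkConf.
    for key, value in sc.getConf().getAll():
        if key.startswith(("spark.yarn.", "yarn.", "spark.hadoop.dfs.")) or key.endswith("fs.defaultFS"):
            print(key, "=", value)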

Migrating from AWS Glue 0.9 to AWS Glue 3.0

Note the following changes when migrating:

  • AWS Glue 0.9 uses open-source Spark 2.2.1 and AWS Glue 3.0 uses EMR-optimized Spark 3.1.1.

    • Several Spark changes alone may require revision of your scripts to ensure removed features are not being referenced.

    • For example, Spark 3.1.1 does not allow Scala untyped UDFs by default, but Spark 2.2 does allow them.

  • All jobs in AWS Glue 3.0 execute with significantly improved startup times. Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration, because startup latency goes from a maximum of 10 minutes to a maximum of 1 minute.

  • Logging behavior has changed since AWS Glue 2.0.

  • Several dependency updates, highlighted in Appendix A: notable dependency upgrades.

  • Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backwards compatible with Scala 2.11.

  • Python 3.7 is also the default version used for Python scripts; AWS Glue 0.9 used only Python 2.7.

    • Python 2.7 is not supported with Spark 3.1.1.

    • A new mechanism for installing additional Python modules is available.

  • AWS Glue 3.0 does not run on Apache YARN, so YARN settings do not apply.

  • AWS Glue 3.0 does not have a Hadoop Distributed File System (HDFS).

  • Any extra jars supplied in existing AWS Glue 0.9 jobs may bring in conflicting dependencies, because several dependencies were upgraded in 3.0 from 0.9. You can avoid classpath conflicts in AWS Glue 3.0 with the --user-jars-first AWS Glue job parameter; see the sketch following this list.

  • AWS Glue 3.0 does not yet support dynamic allocation, hence the ExecutorAllocationManager metrics are not available.

  • In AWS Glue version 3.0 jobs, you specify the number of workers and worker type, but do not specify a maxCapacity.

  • AWS Glue 3.0 does not yet support machine learning transforms.

  • AWS Glue 3.0 does not yet support development endpoints.
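A minimal Boto3 sketch of the --user-jars-first parameter mentioned above, passed for a single run; the job name is a placeholder:

    import boto3

    glue = boto3.client("glue")
    glue.start_job_run(
        JobName="example-job",  # hypothetical job name
        Arguments={
            # Place customer-supplied extra jars ahead of Glue's own classpath.
            "--user-jars-first": "true",
        },
    )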

Refer to the Spark migration documentation at https://spark.apache.org/docs/latest/migration-guide.html.

Migrating from AWS Glue 1.0 to AWS Glue 3.0

Note the following changes when migrating:

  • AWS Glue 1.0 uses open-source Spark 2.4 and AWS Glue 3.0 uses EMR-optimized Spark 3.1.1.

    • Several Spark changes alone may require revision of your scripts to ensure removed features are not being referenced.

    • For example, Spark 3.1.1 does not allow Scala untyped UDFs by default, but Spark 2.4 does allow them.

  • All jobs in AWS Glue 3.0 execute with significantly improved startup times. Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration, because startup latency goes from a maximum of 10 minutes to a maximum of 1 minute.

  • Logging behavior has changed since AWS Glue 2.0.

  • Several dependency updates, highlighted in Appendix A: notable dependency upgrades.

  • Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backwards compatible with Scala 2.11.

  • Python 3.7 is also the default version used for Python scripts; AWS Glue 1.0 used Python 2.7 and Python 3.6.

    • Python 2.7 is not supported with Spark 3.1.1.

    • A new mechanism for installing additional Python modules is available; see the sketch following this list.

  • AWS Glue 3.0 does not run on Apache YARN, so YARN settings do not apply.

  • AWS Glue 3.0 does not have a Hadoop Distributed File System (HDFS).

  • Any extra jars supplied in existing AWS Glue 1.0 jobs may bring in conflicting dependencies since there were upgrades in several dependencies in 3.0 from 1.0. You can avoid classpath conflicts in AWS Glue 3.0 with the --user-jars-first AWS Glue job parameter.

  • AWS Glue 3.0 does not yet support dynamic allocation, hence the ExecutorAllocationManager metrics are not available.

  • In AWS Glue version 3.0 jobs, you specify the number of workers and worker type, but do not specify a maxCapacity.

  • AWS Glue 3.0 does not yet support machine learning transforms.

  • AWS Glue 3.0 does not yet support development endpoints.
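The new module-installation mechanism noted above is the --additional-python-modules job parameter. A minimal Boto3 sketch, with a hypothetical job name and illustrative module pins:

    import boto3

    glue = boto3.client("glue")
    glue.start_job_run(
        JobName="example-job",  # hypothetical job name
        Arguments={
            # Install these modules with pip at job start instead of supplying
            # egg/wheel/zip files on the Python library path.
            "--additional-python-modules": "pyarrow==2.0.0,awswrangler==2.9.0",
        },
    )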

Refer to the Spark migration documentation at https://spark.apache.org/docs/latest/migration-guide.html.

Migrating from AWS Glue 2.0 to AWS Glue 3.0

Note the following changes when migrating:

  • All existing job parameters and major features of AWS Glue 2.0 also exist in AWS Glue 3.0.

    • The EMRFS S3-optimized committer for writing Parquet data into Amazon S3 is enabled by default in AWS Glue 3.0. However, you can still disable it by setting --enable-s3-parquet-optimized-committer to false.

  • AWS Glue 2.0 uses open-source Spark 2.4 and AWS Glue 3.0 uses EMR-optimized Spark 3.1.1.

    • Several Spark changes alone may require revision of your scripts to ensure removed features are not being referenced.

    • For example, Spark 3.1.1 does not allow Scala untyped UDFs by default, but Spark 2.4 does allow them.

  • AWS Glue 3.0 also features an updated EMRFS, upgraded JDBC drivers, and additional optimizations to Spark itself provided by AWS Glue.

  • All jobs in AWS Glue 3.0 execute with significantly improved startup times. Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration, because startup latency goes from a maximum of 10 minutes to a maximum of 1 minute.

  • Python 2.7 is not supported with Spark 3.1.1.

  • Several dependency updates, highlighted in Appendix A: notable dependency upgrades.

  • Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backwards compatible with Scala 2.11.

  • Any extra jars supplied in existing AWS Glue 2.0 jobs may bring in conflicting dependencies since there were upgrades in several dependencies in 3.0 from 2.0. You can avoid classpath conflicts in AWS Glue 3.0 with the --user-jars-first AWS Glue job parameter.

  • AWS Glue 3.0 configures Spark task parallelism for the driver and executors differently than AWS Glue 2.0, which improves performance and makes better use of the available resources. Both spark.driver.cores and spark.executor.cores are set to the number of cores on the worker (4 on the Standard and G.1X workers, and 8 on the G.2X worker). These configurations do not change the worker type or hardware for the AWS Glue job. You can use them to calculate the number of partitions or splits that matches the Spark task parallelism in your Spark application.

    In general, jobs will see either similar or improved performance compared to AWS Glue 2.0. If jobs run slower, you can increase the task parallelism by passing the following job argument:

    • Key: --executor-cores
      Value: <desired number of tasks that can run in parallel>

    • The value should not exceed 2x the number of vCPUs on the worker type: 8 on G.1X, 16 on G.2X, 32 on G.4X, and 64 on G.8X. Exercise caution when updating this configuration, because the increased parallelism causes memory and disk pressure and can throttle the source and target systems.

  • AWS Glue 3.0 uses Spark 3.1, which changes the behavior of loading and saving timestamps from and to Parquet files. For more details, see Upgrading from Spark SQL 3.0 to 3.1.

    We recommend setting the following parameters when reading or writing Parquet data that contains timestamp columns. Setting these parameters resolves the calendar incompatibility issue that occurs during the Spark 2 to Spark 3 upgrade, for both the AWS Glue DynamicFrame and the Spark DataFrame. Use the CORRECTED option to read the datetime values as they are, and the LEGACY option to rebase the datetime values with regard to the calendar difference during reading. A sketch of setting these values in a job script follows the parameter below.

    • Key: --conf
      Value: spark.sql.legacy.parquet.int96RebaseModeInRead=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=[CORRECTED|LEGACY]
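Besides passing these settings through the --conf job parameter, a minimal sketch of applying them inside the job script itself (assuming the usual Glue script setup) is:

    from pyspark import SparkConf
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    # Apply the Spark 3 rebase settings before the SparkContext is created.
    conf = (
        SparkConf()
        .set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
        .set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
        .set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
    )
    sc = SparkContext.getOrCreate(conf)
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session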

Refer to the Spark migration documentation at https://spark.apache.org/docs/latest/migration-guide.html.

Appendix A: notable dependency upgrades

The following are dependency upgrades:

Dependency              | AWS Glue 0.9 | AWS Glue 1.0 | AWS Glue 2.0 | AWS Glue 3.0
------------------------|--------------|--------------|--------------|-------------
Spark                   | 2.2.1        | 2.4.3        | 2.4.3        | 3.1.1-amzn-0
Hadoop                  | 2.7.3-amzn-6 | 2.8.5-amzn-1 | 2.8.5-amzn-5 | 3.2.1-amzn-3
Scala                   | 2.11         | 2.11         | 2.11         | 2.12
Jackson                 | 2.7.x        | 2.7.x        | 2.7.x        | 2.10.x
Hive                    | 1.2          | 1.2          | 1.2          | 2.3.7-amzn-4
EMRFS                   | 2.20.0       | 2.30.0       | 2.38.0       | 2.46.0
Json4s                  | 3.2.x        | 3.5.x        | 3.5.x        | 3.6.6
Arrow                   | N/A          | 0.10.0       | 0.10.0       | 2.0.0
AWS Glue Catalog client | N/A          | N/A          | 1.10.0       | 3.0.0

Appendix B: JDBC driver upgrades

The following are JDBC driver upgrades:

Driver               | JDBC driver version in past AWS Glue versions | JDBC driver version in AWS Glue 3.0
---------------------|-----------------------------------------------|------------------------------------
MySQL                | 5.1                                           | 8.0.23
Microsoft SQL Server | 6.1.0                                         | 7.0.0
Oracle Database      | 11.2                                          | 21.1
PostgreSQL           | 42.1.0                                        | 42.2.18
MongoDB              | 2.0.0                                         | 4.0.0
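These upgraded drivers load automatically for AWS Glue native sources. As a minimal sketch with hypothetical connection details, a PostgreSQL read on Glue 3.0 picks up the 42.2.18 driver without supplying a driver jar:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read through the built-in PostgreSQL JDBC driver; no extra jars needed.
    dyf = glue_context.create_dynamic_frame.from_options(
        connection_type="postgresql",
        connection_options={
            "url": "jdbc:postgresql://example-host:5432/exampledb",  # hypothetical
            "dbtable": "public.example_table",                       # hypothetical
            "user": "example_user",                                  # hypothetical
            "password": "example_password",                          # hypothetical
        },
    )
    dyf.printSchema()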