AWS Glue versions

You can configure the AWS Glue version parameter when you add or update a job. The AWS Glue version determines the versions of Apache Spark and Python that are available to the job. The Python version listed is the one supported for jobs of type Spark. The following table lists the available AWS Glue versions, the corresponding Spark and Python versions, and other changes in functionality.
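
As a rough illustration of setting the version parameter when adding a job, the sketch below assembles the keyword arguments for a CreateJob request. It assumes the boto3 SDK for Python, whose Glue client accepts a `GlueVersion` string; the job name, role ARN, and script location are hypothetical placeholders, and `build_job_request` is a helper invented for this example.

```python
from typing import Any, Dict


def build_job_request(
    name: str,
    role_arn: str,
    script_location: str,
    glue_version: str = "4.0",
) -> Dict[str, Any]:
    """Assemble keyword arguments for a Glue CreateJob request (illustrative helper)."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # job of type Spark
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        # The AWS Glue version selects the Spark and Python runtime for the job.
        "GlueVersion": glue_version,
    }


# Hypothetical values, for illustration only.
request = build_job_request(
    name="example-etl-job",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/job.py",
)
print(request["GlueVersion"])

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("glue").create_job(**request)
```

Keeping the request as a plain dictionary makes it easy to review or test the chosen version before any call is made to the service.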

AWS Glue version | Supported Spark and Python versions | Changes in functionality
AWS Glue 4.0
  • Spark 3.3.0

  • Python 3.10

AWS Glue 4.0 is the latest version of AWS Glue. There are several optimizations and upgrades built into this AWS Glue release, such as:

  • Many Spark functionality upgrades from Spark 3.1 to Spark 3.3:

    • Several functionality improvements when paired with Pandas. For more information, see What's New in Spark 3.3.

    • Additional optimizations developed on Amazon EMR.

    • Upgrade to EMR File System (EMRFS) 2.53.

  • Migration from Log4j 1.x to Log4j 2.

  • Several Python module updates from AWS Glue 3.0, such as an upgraded version of Boto.

  • Upgrade of several connectors, including the default Amazon Redshift connector. See Appendix C: Connector upgrades.

  • Upgrade of several JDBC drivers. See Appendix B: JDBC driver upgrades.

  • Updated with a new Amazon Redshift connector and JDBC driver.

  • Native support for open data lake frameworks: Apache Hudi, Delta Lake, and Apache Iceberg.

  • Native support for the Amazon S3-based Cloud Shuffle Storage Plugin (an Apache Spark plugin) to use Amazon S3 for shuffling and elastic storage capacity.
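
The last two items are typically switched on through job parameters. The following is a minimal sketch that builds a job's default arguments; it assumes the `--datalake-formats` and `--write-shuffle-files-to-s3` job parameters, neither of which is named on this page, so check both against the job parameters reference before relying on them. The helper name and its validation are hypothetical.

```python
from typing import Dict, Iterable

# Frameworks named in the AWS Glue 4.0 feature list above.
SUPPORTED_FORMATS = ("hudi", "delta", "iceberg")


def build_default_arguments(
    formats: Iterable[str] = (),
    shuffle_to_s3: bool = False,
) -> Dict[str, str]:
    """Build DefaultArguments for a Glue 4.0 job (illustrative helper)."""
    chosen = [f.lower() for f in formats]
    unknown = [f for f in chosen if f not in SUPPORTED_FORMATS]
    if unknown:
        raise ValueError(f"Unsupported data lake format(s): {unknown}")

    args: Dict[str, str] = {}
    if chosen:
        # Assumed parameter: comma-separated list of data lake frameworks.
        args["--datalake-formats"] = ",".join(chosen)
    if shuffle_to_s3:
        # Assumed parameter: enables writing shuffle files to Amazon S3.
        args["--write-shuffle-files-to-s3"] = "true"
    return args


default_arguments = build_default_arguments(["iceberg"], shuffle_to_s3=True)
print(default_arguments)
```

The resulting dictionary would be passed as the job's default arguments when the job is created or updated.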

Limitations

The following are limitations with AWS Glue 4.0:

  • AWS Glue streaming jobs and AWS Glue interactive sessions are not yet available in AWS Glue 4.0.

  • AWS Glue machine learning and personally identifiable information (PII) transforms are not yet available in AWS Glue 4.0.

For more information about migrating to AWS Glue version 4.0, see Migrating AWS Glue jobs to AWS Glue version 4.0.

AWS Glue 3.0
  • Spark 3.1.1

  • Python 3.7

In addition to the Spark engine upgrade to the 3.x line (Spark 3.1.1), there are optimizations and upgrades built into this AWS Glue release, such as:

  • Builds the AWS Glue ETL Library against Spark 3.0, which is a major release for Spark.

  • Streaming jobs are supported on AWS Glue 3.0.

  • Includes new AWS Glue Spark runtime optimizations for performance and reliability:

    • Faster in-memory columnar processing based on Apache Arrow for reading CSV data.

    • SIMD-based execution for vectorized reads with CSV data.

    • The Spark upgrade also includes additional optimizations developed on Amazon EMR.

    • Upgraded EMRFS from 2.38 to 2.46, enabling new features and bug fixes for Amazon S3 access.

  • Upgraded several dependencies that were required for the new Spark version. See Appendix A: notable dependency upgrades.

  • Upgraded JDBC drivers for our natively supported data sources. See Appendix B: JDBC driver upgrades.

Limitations

The following are limitations with AWS Glue 3.0:

  • AWS Glue machine learning transforms are not yet available in AWS Glue 3.0.

  • Some custom Spark connectors do not work with AWS Glue 3.0 if they depend on Spark 2.4 and are not compatible with Spark 3.1.

For more information about migrating to AWS Glue version 3.0, see Migrating AWS Glue jobs to AWS Glue version 3.0.

AWS Glue 2.0
  • Spark 2.4.3

  • Python 3.7

In addition to the features provided in AWS Glue version 1.0, AWS Glue version 2.0 also provides:

  • An upgraded infrastructure for running Apache Spark ETL jobs in AWS Glue with reduced startup times.

  • Real-time default logging, with separate streams for drivers and executors, and for outputs and errors.

  • Support for specifying additional Python modules or different versions at the job level.
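
The job-level module support in the last item can be sketched as follows. The example assumes the `--additional-python-modules` job parameter, which takes a comma-separated list of pip-style requirements; the parameter is not named on this page, and the helper and the module versions shown are hypothetical.

```python
from typing import Dict, Mapping, Optional


def additional_python_modules(modules: Mapping[str, Optional[str]]) -> Dict[str, str]:
    """Render module names and optional pinned versions as one job argument."""
    specs = [
        name if version is None else f"{name}=={version}"
        for name, version in modules.items()
    ]
    # Assumed parameter name; the value is a comma-separated requirement list.
    return {"--additional-python-modules": ",".join(specs)}


# Pin one module to a specific version and request another without pinning it.
module_args = additional_python_modules({"scikit-learn": "1.1.3", "pyarrow": None})
print(module_args)
```

Pinning versions explicitly helps keep job runs reproducible, because an unpinned module resolves to whatever version is current when the job starts.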

Note

AWS Glue version 2.0 differs from AWS Glue version 1.0 for some dependencies and versions due to underlying architectural changes. Validate your AWS Glue jobs before migrating across major AWS Glue version releases.

For more information about AWS Glue version 2.0 features and limitations, see Running Spark ETL jobs with reduced startup times.

AWS Glue 1.0
  • Spark 2.4.3

  • Python 2.7

  • Python 3.6

With AWS Glue version 1.0, you can maintain job bookmarks for Parquet and ORC formats in AWS Glue ETL jobs. Previously, you could bookmark only common Amazon S3 source formats such as JSON, CSV, Apache Avro, and XML.

With AWS Glue version 1.0, when you set format options for ETL inputs and outputs, you can select Apache Avro reader/writer format 1.8 to support reading and writing Avro logical types. Previously, only version 1.7 of the Avro reader/writer format was supported.
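
A minimal sketch of selecting the Avro 1.8 reader/writer format is shown below. It assumes the `version` key of the Avro `format_options` in the AWS Glue ETL library; the helper name and the S3 path are hypothetical, and the options are built as a plain dictionary so the example runs without a Glue environment.

```python
from typing import Any, Dict


def avro_sink_options(path: str, avro_version: str = "1.8") -> Dict[str, Any]:
    """Build keyword arguments for writing Avro output (illustrative helper)."""
    if avro_version not in ("1.7", "1.8"):
        raise ValueError("Only Avro reader/writer formats 1.7 and 1.8 are described here")
    return {
        "connection_type": "s3",
        "connection_options": {"path": path},
        "format": "avro",
        # Version 1.8 enables Avro logical type reading and writing.
        "format_options": {"version": avro_version},
    }


avro_options = avro_sink_options("s3://example-bucket/output/")
print(avro_options["format_options"])

# Inside a Glue job script, the options could be applied as:
#   glueContext.write_dynamic_frame.from_options(frame=dyf, **avro_options)
```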

The DynamoDB connection type supports a writer option (using AWS Glue version 1.0).
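
As a hedged illustration of the writer option, the sketch below builds sink options for a DynamoDB table. The connection option keys `dynamodb.output.tableName` and `dynamodb.throughput.write.percent` are assumptions taken from the AWS Glue connection types reference rather than from this page, and the table name and helper are hypothetical.

```python
from typing import Any, Dict


def dynamodb_sink_options(table_name: str, write_percent: float = 0.5) -> Dict[str, Any]:
    """Build keyword arguments for writing to a DynamoDB table (illustrative helper)."""
    return {
        "connection_type": "dynamodb",
        "connection_options": {
            # Assumed option keys; values are passed as strings.
            "dynamodb.output.tableName": table_name,
            "dynamodb.throughput.write.percent": str(write_percent),
        },
    }


ddb_options = dynamodb_sink_options("example-table")
print(ddb_options["connection_options"])

# Inside a Glue job script, the options could be applied as:
#   glueContext.write_dynamic_frame_from_options(frame=dyf, **ddb_options)
```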

AWS Glue 0.9
  • Spark 2.2.1

  • Python 2.7

Jobs that were created without specifying an AWS Glue version default to AWS Glue 0.9.
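
The table above can be captured as a small lookup, which is one way to see which runtime components change before migrating a job. This is a sketch only: the dictionary mirrors the Spark and Python versions listed on this page, and the job dictionary shape (a `GlueVersion` key that may be absent) and both helper functions are assumptions made for illustration.

```python
from typing import Any, Dict, Tuple

# Spark and Python versions per AWS Glue version, as listed in the table above.
GLUE_RUNTIMES: Dict[str, Dict[str, Any]] = {
    "4.0": {"spark": "3.3.0", "python": ("3.10",)},
    "3.0": {"spark": "3.1.1", "python": ("3.7",)},
    "2.0": {"spark": "2.4.3", "python": ("3.7",)},
    "1.0": {"spark": "2.4.3", "python": ("2.7", "3.6")},
    "0.9": {"spark": "2.2.1", "python": ("2.7",)},
}

# Jobs created without an explicit version default to AWS Glue 0.9.
DEFAULT_GLUE_VERSION = "0.9"


def effective_glue_version(job: Dict[str, Any]) -> str:
    """Return the job's AWS Glue version, applying the 0.9 default."""
    return job.get("GlueVersion") or DEFAULT_GLUE_VERSION


def runtime_changes(current: str, target: str) -> Dict[str, Tuple[Any, Any]]:
    """List the runtime components that differ between two AWS Glue versions."""
    old, new = GLUE_RUNTIMES[current], GLUE_RUNTIMES[target]
    return {key: (old[key], new[key]) for key in old if old[key] != new[key]}


legacy_job = {"Name": "legacy-job"}  # hypothetical job with no version set
current = effective_glue_version(legacy_job)
print(current, runtime_changes(current, "4.0"))
```

A difference in either component is a signal to validate the job's dependencies before changing its version.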