AWS Glue
Developer Guide

Document History for AWS Glue

The following table describes important changes to the documentation for AWS Glue.

  • Latest API version: 2019-12-09

  • Latest documentation update: December 9, 2019

Change Description Date

Various corrections and clarifications

Added corrections and clarifications throughout. Removed entries from the Known Issues chapter. Added warnings that AWS Glue supports only symmetric customer master keys (CMKs) when you specify Data Catalog encryption settings and create security configurations. Added a note that AWS Glue does not support writing to Amazon DynamoDB.

December 9, 2019

Support for custom JDBC drivers

Added information about connecting to data sources and targets with JDBC drivers that AWS Glue does not natively support, such as MySQL version 8 and Oracle Database version 18. For more information, see JDBC connectionType Values.

November 25, 2019

Support for connecting Amazon SageMaker notebooks to different development endpoints

Added information about how you can connect an Amazon SageMaker notebook to different development endpoints. Updated the documentation to describe the new console action for switching to a new development endpoint and the new Amazon SageMaker IAM policy. For more information, see Working with Notebooks on the AWS Glue Console and Create an IAM Policy for Amazon SageMaker Notebooks.

November 21, 2019

Support for Glue version in machine learning transforms

Added information about defining the Glue version in a machine learning transform to indicate which version of AWS Glue the transform is compatible with. For more information, see Working with Machine Learning Transforms on the AWS Glue Console.

November 21, 2019

Support for rewinding your job bookmarks

Added information about rewinding your job bookmarks to any previous job run, resulting in the subsequent job run reprocessing data only from the bookmarked job run. Described two new suboptions for the job-bookmark-pause option that allow you to run a job between two bookmarks. For more information, see Tracking Processed Data Using Job Bookmarks and Special Parameters Used by AWS Glue.

October 22, 2019

Support for custom JDBC certificates for connecting to a data store

Added information about AWS Glue support of custom JDBC certificates for SSL connections to AWS Glue data sources or targets. For more information, see Working with Connections on the AWS Glue Console.

October 10, 2019

Support for Python wheel

Added information about AWS Glue support of wheel files (along with egg files) as dependencies for Python shell jobs. For more information, see Providing Your Own Python Library.

September 26, 2019

Support for versioning of development endpoints in AWS Glue

Added information about defining the Glue version in development endpoints. Glue version determines the versions of Apache Spark and Python that AWS Glue supports. For more information, see Adding a Development Endpoint.

September 19, 2019

Support for monitoring AWS Glue using Spark UI

Added information about using Apache Spark UI to monitor and debug AWS Glue ETL jobs running on the AWS Glue job system, and Spark applications on AWS Glue development endpoints. For more information, see Monitoring AWS Glue Using Spark UI.

September 19, 2019

Enhancement of support for local ETL script development using the public AWS Glue ETL library

Updated the AWS Glue ETL library content to reflect that AWS Glue version 1.0 is now supported. For more information, see Developing and Testing ETL Scripts Locally Using the AWS Glue ETL Library.

September 18, 2019

Support for excluding Amazon S3 storage classes when running jobs

Added information about excluding Amazon S3 storage classes when running AWS Glue ETL jobs that read files or partitions from Amazon S3. For more information, see Excluding Amazon S3 Storage Classes.

August 29, 2019

Support for local ETL script development using the public AWS Glue ETL library

Added information about how to develop and test Python and Scala ETL scripts locally without the need for a network connection. For more information, see Developing and Testing ETL Scripts Locally Using the AWS Glue ETL Library.

August 28, 2019

Known Issues

Added information about known issues in AWS Glue. For more information, see Known Issues for AWS Glue.

August 28, 2019

Support for machine learning transforms in AWS Glue

Added information about machine learning capabilities provided by AWS Glue to create custom transforms. You can create these transforms when you create a job. For more information, see Machine Learning Transforms in AWS Glue.

August 8, 2019

Support for shared Amazon Virtual Private Cloud

Added information about AWS Glue support for shared Amazon Virtual Private Cloud. For more information, see Shared Amazon VPCs.

August 6, 2019

Support for versioning in AWS Glue

Added information about defining the Glue version in job properties. Glue version determines the versions of Apache Spark and Python that AWS Glue supports. For more information, see Adding Jobs in AWS Glue.

July 24, 2019

Support for additional configuration options for development endpoints

Added information about configuration options for development endpoints that have memory-intensive workloads. You can choose from two new configurations that provide more memory per executor. For more information, see Working with Development Endpoints on the AWS Glue Console.

July 24, 2019

Support for performing extract, transform, and load (ETL) activities using workflows

Added information about using a new construct called a workflow to design a complex multi-job extract, transform, and load (ETL) activity that AWS Glue can execute and track as a single entity. For more information, see Performing Complex ETL Activities Using Workflows in AWS Glue.

June 20, 2019

Support for Python 3.6 in Python shell jobs

Added information about support for Python 3.6 in Python shell jobs. You can specify either Python 2.7 or Python 3.6 as a job property. For more information, see Adding Python Shell Jobs in AWS Glue.

June 5, 2019

Support for virtual private cloud (VPC) endpoints

Added information about connecting directly to AWS Glue through an interface endpoint in your VPC. When you use a VPC interface endpoint, communication between your VPC and AWS Glue is conducted entirely and securely within the AWS network. For more information, see Using AWS Glue with VPC Endpoints.

June 4, 2019

Support for real-time, continuous logging for AWS Glue jobs

Added information about enabling and viewing real-time Apache Spark job logs in CloudWatch, including the driver logs, each of the executor logs, and a Spark job progress bar. For more information, see Continuous Logging for AWS Glue Jobs.

May 28, 2019

Support for existing Data Catalog tables as crawler sources

Added information about specifying a list of existing Data Catalog tables as crawler sources. Crawlers can then detect changes to table schemas, update table definitions, and register new partitions as new data becomes available. For more information, see Crawler Properties.

May 10, 2019

Support for additional configuration options for memory-intensive jobs

Added information about configuration options for Apache Spark jobs with memory-intensive workloads. You can choose from two new configurations that provide more memory per executor. For more information, see Adding Jobs in AWS Glue.

April 5, 2019

Support for CSV custom classifiers

Added information about using a custom CSV classifier to infer the schema of various types of CSV data. For more information, see Writing Custom Classifiers.

March 26, 2019

Support for AWS Resource Tags

Added information about using AWS resource tags to help you manage and control access to your AWS Glue resources. You can assign AWS resource tags to jobs, triggers, endpoints, and crawlers in AWS Glue. For more information, see AWS Tags in AWS Glue.

March 20, 2019

Support of AWS Glue Data Catalog for Spark SQL jobs

Added information about configuring your AWS Glue jobs and development endpoints to use the AWS Glue Data Catalog as an external Apache Hive Metastore. This allows jobs and development endpoints to directly run Apache Spark SQL queries against the tables stored in the AWS Glue Data Catalog. For more information, see AWS Glue Data Catalog Support for Spark SQL Jobs.

March 14, 2019

Support for Python shell jobs

Added information about Python shell jobs and the new field Maximum capacity. For more information, see Adding Python Shell Jobs in AWS Glue.

January 18, 2019

Support for notifications when there are changes to databases and tables

Added information about events that are generated for changes to database, table, and partition API calls. You can configure actions in CloudWatch Events to respond to these events. For more information, see Automating AWS Glue with CloudWatch Events.

January 16, 2019

Support for encrypting connection passwords

Added information about encrypting passwords used in connection objects. For more information, see Encrypting Connection Passwords.

December 11, 2018

Support for resource-level permission and resource-based policies

Added information about using resource-level permissions and resource-based policies with AWS Glue. For more information, see the topics within Security in AWS Glue.

October 15, 2018

Support for Amazon SageMaker notebooks

Added information about using Amazon SageMaker notebooks with AWS Glue development endpoints. For more information, see Managing Notebooks.

October 5, 2018

Support for encryption

Added information about using encryption with AWS Glue. For more information, see Encryption at Rest, Encryption in Transit, and Setting Up Encryption in AWS Glue.

August 24, 2018

Support for Apache Spark job metrics

Added information about the use of Apache Spark metrics for better debugging and profiling of ETL jobs. You can easily track runtime metrics such as bytes read and written, memory usage and CPU load of the driver and executors, and data shuffles among executors from the AWS Glue console. For more information, see Monitoring AWS Glue Using CloudWatch Metrics, Job Monitoring and Debugging, and Working with Jobs on the AWS Glue Console.

July 13, 2018

Support of DynamoDB as a data source

Added information about crawling DynamoDB and using it as a data source for ETL jobs. For more information, see Cataloging Tables with a Crawler and Connection Parameters.

July 10, 2018

Updates to create notebook server procedure

Updated information about how to create a notebook server on an Amazon EC2 instance associated with a development endpoint. For more information, see Creating a Notebook Server Associated with a Development Endpoint.

July 9, 2018

Updates now available over RSS

You can now subscribe to an RSS feed to receive notifications about updates to the AWS Glue Developer Guide.

June 25, 2018

Support for delay notifications for jobs

Added information about configuring a delay threshold when a job runs. For more information, see Adding Jobs in AWS Glue.

May 25, 2018

Configure a crawler to append new columns

Added information about the new configuration option for crawlers, MergeNewColumns. For more information, see Configuring a Crawler.

May 7, 2018

Support for job timeouts

Added information about setting a timeout threshold when a job runs. For more information, see Adding Jobs in AWS Glue.

April 10, 2018

Support for Scala ETL scripts and for triggering jobs based on additional run states

Added information about using Scala as the ETL programming language. In addition, the trigger API now supports firing when any condition is met (in addition to when all conditions are met). Also, jobs can be triggered based on a "failed" or "stopped" job run (in addition to a "succeeded" job run).

January 12, 2018

Earlier Updates

The following table describes the important changes in each release of the AWS Glue Developer Guide before January 2018.

Change Description Date

Support for XML data sources and a new crawler configuration option

Added information about classifying XML data sources and a new crawler option for partition changes.

November 16, 2017

New transforms, support for additional Amazon RDS database engines, and development endpoint enhancements

Added information about the map and filter transforms, support for Amazon RDS Microsoft SQL Server and Amazon RDS Oracle, and new features for development endpoints.

September 29, 2017

AWS Glue initial release

This is the initial release of the AWS Glue Developer Guide.

August 14, 2017
