Running Spark ETL Jobs with Reduced Startup Times - AWS Glue

Running Spark ETL Jobs with Reduced Startup Times

AWS Glue version 2.0 provides an upgraded infrastructure for running Apache Spark ETL (extract, transform, and load) jobs in AWS Glue with reduced startup times. With the reduced wait times, data engineers can be more productive and increase their interactivity with AWS Glue. The reduced variance in job start times can help you meet or exceed your SLAs of making data available for analytics.

To use this feature with your AWS Glue ETL jobs, choose 2.0 for the Glue version when creating your jobs.

New Features Supported

This section describes new features supported with AWS Glue version 2.0.

Support for Specifying Additional Python Modules at the Job Level

AWS Glue version 2.0 also lets you provide additional Python modules or different versions at the job level. You can use the --additional-python-modules option with a list of comma-separated Python modules to add a new module or change the version of an existing module.

For example to update or to add a new scikit-learn module use the following key/value: "--additional-python-modules", "scikit-learn==0.21.3".

Also, within the --additional-python-modules option you can specify an Amazon S3 path to a Python wheel module. For example:

--additional-python-modules s3://aws-glue-native-spark/tests/j4.2/ephem-3.7.7.1-cp37-cp37m-linux_x86_64.whl,s3://aws-glue-native-spark/tests/j4.2/fbprophet-0.6-py3-none-any.whl,scikit-learn==0.21.3

AWS Glue uses the Python Package Installer (pip3) to install the additional modules. You can pass additional options specified by the python-modules-installer-option to pip3 for installing the modules. Any incompatibly or limitations from pip3 will apply.

Python Modules Already Provided in AWS Glue Version 2.0

AWS Glue version 2.0 supports the following python modules out of the box:

  • setuptools—45.2.0

  • setuptools—45.2.0

  • subprocess32—3.5.4

  • ptvsd—4.3.2

  • pydevd—1.9.0

  • PyMySQL—0.9.3

  • docutils—0.15.2

  • jmespath—0.9.4

  • six—1.14.0

  • python_dateutil—2.8.1

  • urllib3—1.25.8

  • botocore—1.15.4

  • s3transfer—0.3.3

  • boto3—1.12.4

  • certifi—2019.11.28

  • chardet—3.0.4

  • idna—2.9

  • requests—2.23.0

  • pyparsing—2.4.6

  • enum34—1.1.9

  • pytz—2019.3

  • numpy—1.18.1

  • cycler—0.10.0

  • kiwisolver—1.1.0

  • scipy—1.4.1

  • pandas—1.0.1

  • pyarrow—0.16.0

  • matplotlib—3.1.3

  • pyhocon—0.3.54

  • mpmath—1.1.0

  • sympy—1.5.1

  • patsy—0.5.1

  • statsmodels—0.11.1

  • fsspec—0.6.2

  • s3fs—0.4.0

  • Cython—0.29.15

  • joblib—0.14.1

  • pmdarima—1.5.3

  • scikit-learn—0.22.1

  • tbats—1.0.9

Logging Behavior

AWS Glue version 2.0 supports different default logging behavior. The differences include:

  • Logging occurs in realtime.

  • There are separate streams for drivers and executors.

  • For each driver and executor there are two streams, the output stream and the error stream.

Driver and Executor Streams

Driver streams are identified by the job run ID. Executor streams are identified by the job <run id>_<executor task id>. For example:

  • "logStreamName": "jr_8255308b426fff1b4e09e00e0bd5612b1b4ec848d7884cebe61ed33a31789..._g-f65f617bd31d54bd94482af755b6cdf464542..."

Output and Errors Streams

The output stream has the standard output (stdout) from your code. The error stream has logging messages from the your code/library.

  • Log streams:

    • Driver log streams have <jr>, where <jr> is the job run ID.

    • Executor log streams have <jr>_<g>, where <g> is the task ID for the executor. You can look up the executor task ID in the driver error log.

The default log groups for AWS Glue version 2.0 are as follows:

  • /aws-glue/jobs/logs/output for output

  • /aws-glue/jobs/logs/error for errors

When a security configuration is provided, the log group names change to:

  • /aws-glue/jobs/<security configuration>-role/<Role Name>/output

  • /aws-glue/jobs/<security configuration>-role/<Role Name>/error

On the console the Logs link points to the output log group and the Error link points to the error log group. When continuous logging is enabled, the Logs links points to the continuous log group, and the Output link points to the output log group.

Logging Rules

Note

In AWS Glue version 2.0, an issue with continuous logging requires that you create the log group before using the continuous logging feature. To work around this issue, you must create the default logging groups or customer log groups specified by --continuous-log-logGroup, otherwise the continuous log streams will not show up. AWS is working to address this issue.

The default log groupname for continuous logging is /aws-glue/jobs/logs-v2.

In AWS Glue version 2.0, continuous logging has the same behavior as in AWS Glue version 1.0:

  • Default log group: /aws-glue/jobs/logs-v2

  • Driver log stream: <jr>-driver

  • Executor log stream: <jr>-<executor ID>

    The log group name can be changed by setting --continuous-log-logGroupName

    The log streams name can be prefixed by setting --continous-log-logStreamPrefix

Features Not Supported

The following AWS Glue features are not supported:

  • Development endpoints

  • FindMatches machine learning transforms

  • AWS Glue streaming ETL jobs

  • AWS Glue version 2.0 does not run on Apache YARN, so YARN settings do not apply

  • AWS Glue version 2.0 does not have a Hadoop Distributed File System (HDFS)

  • AWS Glue version 2.0 does not use dynamic allocation, hence the ExecutorAllocationManager metrics are not available

  • For AWS Glue version 2.0 jobs, you specify the number of workers and worker type, but do not specify a maxCapacity.