Running Spark ETL jobs with reduced startup times
AWS Glue versions 2.0 and later provide an upgraded infrastructure for running Apache Spark ETL (extract, transform, and load) jobs in AWS Glue with reduced startup times. With the reduced wait times, data engineers can be more productive and increase their interactivity with AWS Glue. The reduced variance in job start times can help you meet or exceed your SLAs for making data available for analytics.
To use this feature with your AWS Glue ETL jobs, choose 2.0 or a later version for the Glue version when creating your jobs.
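For example, you can set the Glue version when creating a job with the AWS SDK for Python (Boto3). The following is a minimal sketch; the job name, IAM role, and script location are placeholders for illustration:

    import boto3

    glue = boto3.client("glue")

    # Placeholder job name, role, and script location for illustration.
    glue.create_job(
        Name="my-etl-job",
        Role="MyGlueJobRole",
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/my_etl_job.py",
            "PythonVersion": "3",
        },
        GlueVersion="2.0",  # choose 2.0 or a later version for reduced startup times
        WorkerType="G.1X",
        NumberOfWorkers=10,
    )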
New features supported
This section describes new features supported with AWS Glue versions 2.0 and later.
Support for specifying additional Python modules at the job level
AWS Glue versions 2.0 and later also let you provide additional Python modules, or different versions of existing modules, at the job level. You can use the --additional-python-modules option with a list of comma-separated Python modules to add a new module or change the version of an existing module. For example, to update or add a new scikit-learn module, use the following key/value pair: "--additional-python-modules", "scikit-learn==0.21.3".
Also, within the --additional-python-modules option you can specify an Amazon S3 path to a Python wheel module. For example:
--additional-python-modules s3://aws-glue-native-spark/tests/j4.2/ephem-3.7.7.1-cp37-cp37m-linux_x86_64.whl,s3://aws-glue-native-spark/tests/j4.2/fbprophet-0.6-py3-none-any.whl,scikit-learn==0.21.3
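As a sketch of how this option is passed in practice, the following Boto3 call supplies the module list as a job argument at run time; the job name is a placeholder:

    import boto3

    glue = boto3.client("glue")

    # Placeholder job name; --additional-python-modules takes a
    # comma-separated list of modules and versions.
    glue.start_job_run(
        JobName="my-etl-job",
        Arguments={
            "--additional-python-modules": "scikit-learn==0.21.3",
        },
    )

The same key/value pair can also be set in DefaultArguments when creating the job, so that it applies to every run.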
AWS Glue uses the Python Package Installer (pip3) to install the additional modules. You can pass additional options to pip3 by specifying the --python-modules-installer-option parameter. Any incompatibilities or limitations of pip3 apply.
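For example, a standard pip3 flag such as --upgrade can be passed through this parameter. The following is a minimal sketch with a placeholder job name:

    import boto3

    glue = boto3.client("glue")

    # Pass extra pip3 options alongside the module list; "--upgrade" is a
    # standard pip3 flag shown for illustration.
    glue.start_job_run(
        JobName="my-etl-job",
        Arguments={
            "--additional-python-modules": "scikit-learn==0.21.3",
            "--python-modules-installer-option": "--upgrade",
        },
    )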
Python modules already provided in AWS Glue version 2.0
AWS Glue version 2.0 supports the following Python modules out of the box:
setuptools—45.2.0
subprocess32—3.5.4
ptvsd—4.3.2
pydevd—1.9.0
PyMySQL—0.9.3
docutils—0.15.2
jmespath—0.9.4
six—1.14.0
python_dateutil—2.8.1
urllib3—1.25.8
botocore—1.15.4
s3transfer—0.3.3
boto3—1.12.4
certifi—2019.11.28
chardet—3.0.4
idna—2.9
requests—2.23.0
pyparsing—2.4.6
enum34—1.1.9
pytz—2019.3
numpy—1.18.1
cycler—0.10.0
kiwisolver—1.1.0
scipy—1.4.1
pandas—1.0.1
pyarrow—0.16.0
matplotlib—3.1.3
pyhocon—0.3.54
mpmath—1.1.0
sympy—1.5.1
patsy—0.5.1
statsmodels—0.11.1
fsspec—0.6.2
s3fs—0.4.0
Cython—0.29.15
joblib—0.14.1
pmdarima—1.5.3
scikit-learn—0.22.1
tbats—1.0.9
Logging behavior
AWS Glue versions 2.0 and later use different default logging behavior than earlier versions. The differences include:
Logging occurs in real time.
There are separate streams for drivers and executors.
For each driver and executor there are two streams, the output stream and the error stream.
Driver and executor streams
Driver streams are identified by the job run ID. Executor streams are identified by <job run ID>_<executor task ID>. For example:
"logStreamName": "jr_8255308b426fff1b4e09e00e0bd5612b1b4ec848d7884cebe61ed33a31789..._g-f65f617bd31d54bd94482af755b6cdf464542..."
Output and error streams
The output stream has the standard output (stdout) from your code. The error stream has logging messages from your code or library.
Log streams:
Driver log streams have <jr>, where <jr> is the job run ID.
Executor log streams have <jr>_<g>, where <g> is the task ID for the executor. You can look up the executor task ID in the driver error log.
The default log groups for AWS Glue version 2.0 are as follows:
/aws-glue/jobs/logs/output for output
/aws-glue/jobs/logs/error for errors
When a security configuration is provided, the log group names change to:
/aws-glue/jobs/<security configuration>-role/<Role Name>/output
/aws-glue/jobs/<security configuration>-role/<Role Name>/error
On the console, the Logs link points to the output log group and the Error link points to the error log group. When continuous logging is enabled, the Logs link points to the continuous log group, and the Output link points to the output log group.
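As a sketch of how this stream naming can be used, the following Boto3 call lists the driver and executor streams for a single run by filtering the output log group on the job run ID prefix; the run ID shown is a placeholder:

    import boto3

    logs = boto3.client("logs")

    # Placeholder job run ID; real IDs are returned by start_job_run as JobRunId.
    job_run_id = "jr_0123456789abcdef"

    # Driver and executor streams for the run share the job run ID prefix.
    response = logs.describe_log_streams(
        logGroupName="/aws-glue/jobs/logs/output",
        logStreamNamePrefix=job_run_id,
    )
    for stream in response["logStreams"]:
        print(stream["logStreamName"])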
Logging rules
Note
The default log group name for continuous logging is /aws-glue/jobs/logs-v2.
In AWS Glue versions 2.0 and later, continuous logging has the same behavior as in AWS Glue version 1.0:
Default log group: /aws-glue/jobs/logs-v2
Driver log stream: <jr>-driver
Executor log stream: <jr>-<executor ID>
The log group name can be changed by setting --continuous-log-logGroupName
The log stream names can be prefixed by setting --continuous-log-logStreamPrefix
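As a minimal sketch, these parameters can be passed as job arguments together with the flag that enables continuous logging; the job name, log group name, and stream prefix below are placeholders:

    import boto3

    glue = boto3.client("glue")

    # Enable continuous logging and customize the log group and stream prefix.
    glue.start_job_run(
        JobName="my-etl-job",
        Arguments={
            "--enable-continuous-cloudwatch-log": "true",
            "--continuous-log-logGroupName": "/aws-glue/jobs/my-custom-logs",
            "--continuous-log-logStreamPrefix": "my-prefix",
        },
    )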
Features not supported
The following AWS Glue features are not supported:
Development endpoints
AWS Glue versions 2.0 and later do not run on Apache YARN, so YARN settings do not apply
AWS Glue versions 2.0 and later do not have a Hadoop Distributed File System (HDFS)
AWS Glue versions 2.0 and later do not use dynamic allocation, hence the ExecutorAllocationManager metrics are not available
For AWS Glue version 2.0 or later jobs, you specify the number of workers and the worker type, but do not specify a maxCapacity.
AWS Glue versions 2.0 and later do not support s3n out of the box. We recommend using s3 or s3a. If jobs need to use s3n for any reason, you can pass the following additional argument:
--conf spark.hadoop.fs.s3n.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem
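The following sketch passes that argument when starting a run through Boto3; the job name is a placeholder:

    import boto3

    glue = boto3.client("glue")

    # Map s3n URIs to the EMR filesystem implementation, as described above.
    glue.start_job_run(
        JobName="my-etl-job",
        Arguments={
            "--conf": "spark.hadoop.fs.s3n.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem",
        },
    )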