Using Python libraries with AWS Glue
AWS Glue lets you install additional Python modules and libraries for use with AWS Glue ETL.
Topics
- Installing additional Python modules with pip in AWS Glue 2.0+
- Including Python files with PySpark native features
- Programming scripts that use visual transforms
- Python modules already provided in AWS Glue
- Zipping libraries for inclusion
- Loading Python libraries in AWS Glue Studio notebooks
- Loading Python libraries in a development endpoint
- Using Python libraries in a job or JobRun
Installing additional Python modules with pip in AWS Glue 2.0+
AWS Glue uses the Python Package Installer (pip3) to install additional modules for use by AWS Glue ETL. You can use the --additional-python-modules parameter with a comma-separated list of Python modules to add a new module or change the version of an existing module. You can install custom distributions of a library by uploading the distribution to Amazon S3 and then including the path to the Amazon S3 object in your list of modules.
You can pass additional options to pip3 with the --python-modules-installer-option parameter. For example, you could pass "--upgrade" to upgrade the packages specified by "--additional-python-modules". For more examples, see Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0.
If your Python dependencies transitively depend on native, compiled code, you may run into the following limitation: AWS Glue does not support compiling native code in the job environment. However, AWS Glue jobs run within an Amazon Linux 2 environment, so you may be able to provide your native dependencies in compiled form through a wheel distributable.
For example, to update or to add a new scikit-learn module, use the following key/value: "--additional-python-modules", "scikit-learn==0.21.3".
Also, within the --additional-python-modules option, you can specify an Amazon S3 path to a Python wheel module. For example:
--additional-python-modules s3://aws-glue-native-spark/tests/j4.2/ephem-3.7.7.1-cp37-cp37m-linux_x86_64.whl,s3://aws-glue-native-spark/tests/j4.2/fbprophet-0.6-py3-none-any.whl,scikit-learn==0.21.3
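Because native code cannot be compiled in the job environment, it can help to check whether a wheel is pure Python or platform-specific before uploading it. The following is a minimal sketch that reads the tags from a wheel file name, assuming the file follows the standard wheel naming convention (name, version, optional build tag, Python tag, ABI tag, platform tag). The helper names wheel_tags and is_pure_python are hypothetical and are not part of AWS Glue.

```python
def wheel_tags(path):
    """Split a wheel file name (or an S3 path ending in one) into its tags."""
    filename = path.rsplit("/", 1)[-1]
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    # Expected layout: name-version[-build]-python-abi-platform
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel file name layout: {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "distribution": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def is_pure_python(path):
    """Return True when the wheel declares no ABI and no specific platform."""
    tags = wheel_tags(path)
    return tags["abi"] == "none" and tags["platform"] == "any"
```

For the two wheels in the example above, is_pure_python returns True for the fbprophet wheel and False for the ephem wheel, whose platform tag is linux_x86_64.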
You specify --additional-python-modules in the Job parameters field of the AWS Glue console or by altering the job arguments in the AWS SDK. For more information about setting job parameters, see Using job parameters in AWS Glue jobs.
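When the job arguments are set through the SDK, the module list has to be assembled into a single comma-separated string. The following is a minimal sketch of one way to do that. The function name build_module_arguments is hypothetical, and rejecting entries that contain a comma is an assumption made here to keep the list unambiguous.

```python
def build_module_arguments(modules, pip_options=None):
    """Build job arguments for --additional-python-modules.

    modules: pip requirement strings or Amazon S3 paths to wheels.
    pip_options: optional string passed through to pip3, such as "--upgrade".
    """
    cleaned = [m.strip() for m in modules if m.strip()]
    if not cleaned:
        raise ValueError("at least one module is required")
    for module in cleaned:
        # A comma inside an entry would be read as a separator between modules.
        if "," in module:
            raise ValueError(f"module entry must not contain a comma: {module!r}")
    arguments = {"--additional-python-modules": ",".join(cleaned)}
    if pip_options:
        arguments["--python-modules-installer-option"] = pip_options
    return arguments
```

The returned dictionary maps parameter names to string values, so it could be merged into the job arguments that you pass through the SDK.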
Including Python files with PySpark native features
AWS Glue uses PySpark to include Python files in AWS Glue ETL jobs. Where it is available, use --additional-python-modules to manage your dependencies. You can use the --extra-py-files job parameter to include Python files. Dependencies must be hosted in Amazon S3, and the argument value should be a comma-delimited list of Amazon S3 paths with no spaces. This functionality behaves like the Python dependency management you would use with Spark. For more information on Python dependency management in Spark, see Using PySpark Native Features. The --extra-py-files parameter is useful in cases where your additional code is not packaged, or when you are migrating a Spark program with an existing toolchain for managing dependencies. For your dependency tooling to be maintainable, you will have to bundle your dependencies before submitting.
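The format rules for the --extra-py-files value (Amazon S3 paths, comma-delimited, no spaces) can be checked before a job is submitted. The following is a minimal sketch. The function name build_extra_py_files is hypothetical.

```python
def build_extra_py_files(paths):
    """Join Amazon S3 paths into a value suitable for --extra-py-files."""
    if not paths:
        raise ValueError("at least one path is required")
    for path in paths:
        if not path.startswith("s3://"):
            raise ValueError(f"dependency must be hosted in Amazon S3: {path!r}")
        # Whitespace or commas inside a path would corrupt the delimited list.
        if "," in path or any(ch.isspace() for ch in path):
            raise ValueError(f"path must not contain spaces or commas: {path!r}")
    return ",".join(paths)
```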
Programming scripts that use visual transforms
When you create an AWS Glue job using the AWS Glue Studio visual interface, you can transform your data with managed data transform nodes and custom visual transforms. For more information about managed data transform nodes, see Transform data with AWS Glue managed transforms. For more information about custom visual transforms, see Transform data with custom visual transforms. Scripts using visual transforms can only be generated when your job Language is set to use Python.
When generating an AWS Glue job using visual transforms, AWS Glue Studio will include these transforms in the runtime environment using the --extra-py-files parameter in the job configuration. For more information about job parameters, see Using job parameters in AWS Glue jobs. When making changes to a generated script or runtime environment, you will need to preserve this job configuration for your script to run successfully.
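One way to preserve the generated configuration while adding your own files is to append to the existing --extra-py-files value instead of replacing it. The following is a minimal sketch that works on a plain dictionary of job arguments. The function name add_extra_py_files and the paths in the usage note are hypothetical.

```python
def add_extra_py_files(job_arguments, new_paths):
    """Return a copy of job_arguments with new_paths appended to --extra-py-files.

    Existing entries are kept in their original order, and duplicates are dropped.
    """
    key = "--extra-py-files"
    existing = [p for p in job_arguments.get(key, "").split(",") if p]
    # dict.fromkeys removes duplicates while preserving first-seen order.
    merged = list(dict.fromkeys(existing + list(new_paths)))
    updated = dict(job_arguments)
    updated[key] = ",".join(merged)
    return updated
```

For example, if the generated arguments already contain s3://bucket/prefix/transforms.zip, calling add_extra_py_files with ["s3://bucket/prefix/lib_A.zip"] yields a value that lists both paths, with the generated one first.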
Python modules already provided in AWS Glue
To change the version of a module that AWS Glue already provides, supply the new version with the --additional-python-modules job parameter.
Zipping libraries for inclusion
Unless a library is contained in a single .py file, it should be packaged in a .zip archive. The package directory should be at the root of the archive, and must contain an __init__.py file for the package. Python will then be able to import the package in the normal way.
If your library only consists of a single Python module in one .py file, you do not need to place it in a .zip file.
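The layout described above can be produced with the Python standard library. The following is a minimal sketch that writes a package into a .zip archive with the package directory at the archive root. The function name zip_package and the package name mylib are hypothetical, and including only .py files is a simplification made for this example.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_package(package_dir, zip_path):
    """Write a package into a .zip archive with the package directory at the root."""
    package_dir = Path(package_dir).resolve()
    if not (package_dir / "__init__.py").is_file():
        raise ValueError(f"{package_dir} has no __init__.py file")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(package_dir.rglob("*.py")):
            # Names are relative to the package's parent, e.g. "mylib/__init__.py".
            archive.write(path, path.relative_to(package_dir.parent).as_posix())
    return Path(zip_path)


# Demonstration with a throwaway package named "mylib".
with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp) / "mylib"
    package.mkdir()
    (package / "__init__.py").write_text("VALUE = 7\n")
    (package / "helpers.py").write_text("def double(x):\n    return 2 * x\n")
    archive_path = zip_package(package, Path(tmp) / "mylib.zip")
    with zipfile.ZipFile(archive_path) as archive:
        names = sorted(archive.namelist())
```

After the demonstration runs, names holds the archive entries, each beginning with the package directory name, which is what lets Python import the package from the archive.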
Loading Python libraries in AWS Glue Studio notebooks
To specify Python libraries in AWS Glue Studio notebooks, see Installing additional Python modules.
Loading Python libraries in a development endpoint
If you are using different library sets for different ETL scripts, you can either set up a separate development endpoint for each set, or you can overwrite the library .zip file(s) that your development endpoint loads every time you switch scripts.
You can use the console to specify one or more library .zip files for a development endpoint when you create it. After assigning a name and an IAM role, choose Script Libraries and job parameters (optional) and enter the full Amazon S3 path to your library .zip file in the Python library path box. For example:
s3://bucket/prefix/site-packages.zip
If you want, you can specify multiple full paths to files, separating them with commas but no spaces, like this:
s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip
If you update these .zip files later, you can use the console to re-import them into your development endpoint. Navigate to the development endpoint in question, check the box beside it, and choose Update ETL libraries from the Action menu.
In a similar way, you can specify library files using the AWS Glue APIs. When you create a development endpoint by calling the CreateDevEndpoint action (Python: create_dev_endpoint), you can specify one or more full paths to libraries in the ExtraPythonLibsS3Path parameter, in a call that looks like this:
import boto3

# Create an AWS Glue client with the AWS SDK for Python (Boto3).
glue = boto3.client("glue")

dep = glue.create_dev_endpoint(
    EndpointName="testDevEndpoint",
    RoleArn="arn:aws:iam::123456789012",
    # SecurityGroupIds takes a list of security group IDs.
    SecurityGroupIds=["sg-7f5ad1ff"],
    SubnetId="subnet-c12fdba4",
    PublicKey="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCtp04H/y...",
    NumberOfNodes=3,
    ExtraPythonLibsS3Path="s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip")
When you update a development endpoint, you can also update the libraries it loads using a DevEndpointCustomLibraries object and setting the UpdateEtlLibraries parameter to True when calling UpdateDevEndpoint (update_dev_endpoint).
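The following is a minimal sketch of how the request for such an update might be assembled. It builds the keyword arguments only and does not call AWS. The function name build_update_request is hypothetical, and the endpoint name and paths in the usage note reuse the placeholder values from the example above.

```python
def build_update_request(endpoint_name, library_paths):
    """Build keyword arguments for update_dev_endpoint that reload Python libraries."""
    if not library_paths:
        raise ValueError("at least one library path is required")
    return {
        "EndpointName": endpoint_name,
        # This dictionary takes the shape of a DevEndpointCustomLibraries object.
        "CustomLibraries": {"ExtraPythonLibsS3Path": ",".join(library_paths)},
        "UpdateEtlLibraries": True,
    }


# With a Boto3 AWS Glue client named glue, the request could be sent as:
#     glue.update_dev_endpoint(**build_update_request(name, paths))
```

For example, build_update_request("testDevEndpoint", ["s3://bucket/prefix/lib_A.zip"]) returns a dictionary whose UpdateEtlLibraries entry is True and whose CustomLibraries entry carries the single library path.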
Using Python libraries in a job or JobRun
When you are creating a new Job on the console, you can specify one or more library .zip files by choosing Script Libraries and job parameters (optional) and entering the full Amazon S3 library path(s) in the same way you would when creating a development endpoint:
s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip
If you are calling CreateJob (create_job), you can specify one or more full paths to default libraries using the --extra-py-files default parameter, like this:
job = glue.create_job(Name='sampleJob',
                      Role='Glue_DefaultRole',
                      Command={'Name': 'glueetl',
                               'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'},
                      DefaultArguments={'--extra-py-files': 's3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip'})
Then when you are starting a JobRun, you can override the default library setting with a different one:
runId = glue.start_job_run(JobName='sampleJob',
                           Arguments={'--extra-py-files': 's3://bucket/prefix/lib_B.zip'})
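To reason about which libraries a particular run will see, the override can be modeled as a dictionary merge in which run arguments take precedence over the job defaults. The following is a minimal sketch of that mental model, not a description of how the service is implemented. The function name effective_arguments and the --other key are hypothetical.

```python
def effective_arguments(default_arguments, run_arguments=None):
    """Model run-level arguments overriding job-level defaults, key by key."""
    merged = dict(default_arguments)
    # Keys supplied for the run replace the job defaults with the same name.
    merged.update(run_arguments or {})
    return merged


defaults = {
    "--extra-py-files": "s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip",
    "--other": "kept",
}
overrides = {"--extra-py-files": "s3://bucket/prefix/lib_B.zip"}
effective = effective_arguments(defaults, overrides)
```

In this model, effective carries lib_B.zip as its only --extra-py-files entry, while defaults that the run did not override are kept unchanged.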