Using Python Libraries with AWS Glue - AWS Glue

Using Python Libraries with AWS Glue

You can use Python extension modules and libraries with your AWS Glue ETL scripts as long as they are written in pure Python. C libraries such as pandas are not supported at the present time, nor are extensions written in other languages.

Zipping Libraries for Inclusion

Unless a library is contained in a single .py file, it should be packaged in a .zip archive. The package directory should be at the root of the archive, and must contain an file for the package. Python will then be able to import the package in the normal way.

If your library only consists of a single Python module in one .py file, you do not need to place it in a .zip file.

Loading Python Libraries in a Development Endpoint

If you are using different library sets for different ETL scripts, you can either set up a separate development endpoint for each set, or you can overwrite the library .zip file(s) that your development endpoint loads every time you switch scripts.

You can use the console to specify one or more library .zip files for a development endpoint when you create it. After assigning a name and an IAM role, choose Script Libraries and job parameters (optional) and enter the full Amazon S3 path to your library .zip file in the Python library path box. For example:


If you want, you can specify multiple full paths to files, separating them with commas but no spaces, like this:


If you update these .zip files later, you can use the console to re-import them into your development endpoint. Navigate to the developer endpoint in question, check the box beside it, and choose Update ETL libraries from the Action menu.

In a similar way, you can specify library files using the AWS Glue APIs. When you create a development endpoint by calling CreateDevEndpoint Action (Python: create_dev_endpoint), you can specify one or more full paths to libraries in the ExtraPythonLibsS3Path parameter, in a call that looks this:

dep = glue.create_dev_endpoint( EndpointName="testDevEndpoint", RoleArn="arn:aws:iam::123456789012", SecurityGroupIds="sg-7f5ad1ff", SubnetId="subnet-c12fdba4", PublicKey="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCtp04H/y...", NumberOfNodes=3, ExtraPythonLibsS3Path="s3://bucket/prefix/,s3://bucket_B/prefix/")

When you update a development endpoint, you can also update the libraries it loads using a DevEndpointCustomLibraries object and setting the UpdateEtlLibraries parameter to True when calling UpdateDevEndpoint (update_dev_endpoint).

If you are using a Zeppelin Notebook with your development endpoint, you will need to call the following PySpark function before importing a package or packages from your .zip file:


Using Python Libraries in a Job or JobRun

When you are creating a new Job on the console, you can specify one or more library .zip files by choosing Script Libraries and job parameters (optional) and entering the full Amazon S3 library path(s) in the same way you would when creating a development endpoint:


If you are calling CreateJob (create_job), you can specify one or more full paths to default libraries using the --extra-py-files default parameter, like this:

job = glue.create_job(Name='sampleJob', Role='Glue_DefaultRole', Command={'Name': 'glueetl', 'ScriptLocation': 's3://my_script_bucket/scripts/'}, DefaultArguments={'--extra-py-files': 's3://bucket/prefix/,s3://bucket_B/prefix/'})

Then when you are starting a JobRun, you can override the default library setting with a different one:

runId = glue.start_job_run(JobName='sampleJob', Arguments={'--extra-py-files': 's3://bucket/prefix/'})

Specifying Additional Python Modules with AWS Glue Version 2.0

AWS Glue Version 2.0 also lets you provide additional Python modules or different versions at the job level. You can use the --additional-python-modules option with a list of comma-separated Python modules to add a new module or change the version of an existing module.

For more information, see Running Spark ETL Jobs with Reduced Startup Times.