
Using Python libraries with AWS Glue

You can install additional Python modules and libraries for use with AWS Glue ETL. For AWS Glue 2.0 and above, AWS Glue uses the Python Package Installer (pip3) to install additional modules used by AWS Glue ETL. AWS Glue provides multiple options for bringing additional Python modules into your AWS Glue job environment. You can use the --additional-python-modules parameter to bring in modules using Python wheel files, a requirements file (requirements.txt, AWS Glue 5.0 and above), or a comma-separated list of Python modules.

Topics

Installing additional Python modules with pip in AWS Glue 2.0 or later
Best Practices for Installing additional Python libraries in AWS Glue
Including Python files with PySpark native features
Programming scripts that use visual transforms
Zipping libraries for inclusion
Loading Python libraries in AWS Glue Studio notebooks
Loading Python libraries in a development endpoint in AWS Glue 0.9/1.0
Using Python libraries in a job or JobRun
Proactively analyze Python dependencies
Python modules already provided in AWS Glue

Glue Version Compatibility and Installation Methods
Glue version	Python version	Base image	glibc version	Supported Installation Methods
5.0	3.11	Amazon Linux 2023 (AL2023)	2.34	(Recommended) Packaging Python environment into a single wheel file; Installing additional Python libraries in AWS Glue 5.0 or above using requirements.txt; Installing additional Python libraries using Wheel; Installing additional Python modules with pip in AWS Glue 2.0 or later
4.0	3.10	Amazon Linux 2 (AL2)	2.26	Best Practices for Installing additional Python libraries in AWS Glue
3.0	3.7	Amazon Linux 2 (AL2)	2.26	Best Practices for Installing additional Python libraries in AWS Glue
2.0	3.7	Amazon Linux AMI (AL1)	2.17	Best Practices for Installing additional Python libraries in AWS Glue
1.0	3.6	Amazon Linux AMI (AL1)	2.17	Best Practices for Installing additional Python libraries in AWS Glue
0.9	2.7	Amazon Linux AMI (AL1)	2.17	Best Practices for Installing additional Python libraries in AWS Glue

Under the AWS shared responsibility model, you are responsible for the management of additional Python modules, libraries, and their dependencies that you use with your AWS Glue ETL jobs. This includes applying updates and security patches.

If your Python dependencies transitively depend on native, compiled code, you may run into the following limitation: AWS Glue does not support compiling native code in the job environment. However, AWS Glue jobs run within an Amazon-managed Linux environment, so you may be able to provide your native dependencies in compiled form through a Python wheel distribution. Refer to the table above for AWS Glue version compatibility details.

Important

Using incompatible dependencies can result in runtime issues, particularly for libraries with native extensions that must match the target environment's architecture and system libraries. Each AWS Glue version runs on a specific Python version with pre-installed libraries and system configurations.
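
As a rough illustration of the compatibility concern above, the following sketch compares a build machine's Python and glibc versions with the values in the compatibility table. The function name environment_mismatches is hypothetical and not part of any AWS tooling; the version values are copied from the table, and the check is only meaningful on a glibc-based Linux host.

```python
import platform
import sys

# Python and glibc versions per AWS Glue version, copied from the table above.
GLUE_ENVIRONMENTS = {
    "5.0": ("3.11", "2.34"),
    "4.0": ("3.10", "2.26"),
    "3.0": ("3.7", "2.26"),
    "2.0": ("3.7", "2.17"),
    "1.0": ("3.6", "2.17"),
    "0.9": ("2.7", "2.17"),
}


def environment_mismatches(glue_version, python_version, glibc_version):
    """Return readable differences between a build host and a Glue version."""
    expected_python, expected_glibc = GLUE_ENVIRONMENTS[glue_version]
    problems = []
    if python_version != expected_python:
        problems.append(f"Python {python_version} != {expected_python}")
    if glibc_version != expected_glibc:
        problems.append(f"glibc {glibc_version or 'unknown'} != {expected_glibc}")
    return problems


if __name__ == "__main__":
    local_python = f"{sys.version_info.major}.{sys.version_info.minor}"
    # libc_ver() returns a (library, version) pair; both are empty off glibc.
    _, local_glibc = platform.libc_ver()
    for problem in environment_mismatches("5.0", local_python, local_glibc):
        print("Mismatch:", problem)
```

An exact match is stricter than necessary for glibc (wheels built against an older glibc generally run on a newer one), so treat a glibc mismatch as a prompt to check, not as proof of incompatibility.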

Installing additional Python modules with pip in AWS Glue 2.0 or later

AWS Glue uses the Python Package Installer (pip3) to install additional modules to be used by AWS Glue ETL. You can use the --additional-python-modules parameter with a list of comma-separated Python modules to add a new module or change the version of an existing module. You can install custom distributions of a library by uploading the distribution to Amazon S3 and then including the path to the Amazon S3 object in your list of modules.

You can pass additional options to pip3 with the --python-modules-installer-option parameter. For example, you could pass --upgrade to upgrade the packages specified by --additional-python-modules. For more examples, see Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0.

AWS Glue supports installing custom Python packages using wheel (.whl) files stored in Amazon S3. To include wheel files in your AWS Glue jobs, provide a comma-separated list of the wheel files stored in Amazon S3 to the --additional-python-modules job parameter. For example:


--additional-python-modules s3://amzn-s3-demo-bucket/path/to/package-1.0.0-py3-none-any.whl,s3://your-bucket/path/to/another-package-2.1.0-cp311-cp311-linux_x86_64.whl

This approach also works when you need custom distributions, or packages with native dependencies that are precompiled for the correct operating system. For more examples, see Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0.

You specify the --additional-python-modules in the Job parameters field of the AWS Glue console or by altering the job arguments in the AWS SDK. For more information about setting job parameters, see Using job parameters in AWS Glue jobs.
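
A minimal sketch of setting this parameter through the AWS SDK for Python (boto3) might look like the following. The helper names additional_modules_argument and create_job_with_modules are hypothetical, the choice of GlueVersion "5.0" is an assumption for illustration, and the job name, role, and script location are values you would supply yourself.

```python
def additional_modules_argument(modules):
    """Join module specifiers or S3 wheel paths into one comma-separated value."""
    cleaned = [module.strip() for module in modules if module.strip()]
    return {"--additional-python-modules": ",".join(cleaned)}


def create_job_with_modules(job_name, role, script_location, modules):
    """Create a Spark ETL job whose default arguments install the given modules."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    return glue.create_job(
        Name=job_name,
        Role=role,
        GlueVersion="5.0",  # assumption: adjust to the Glue version you target
        Command={"Name": "glueetl", "ScriptLocation": script_location},
        DefaultArguments=additional_modules_argument(modules),
    )
```

Keeping the string-building step separate from the API call makes it easy to log or review the exact parameter value before the job is created.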

In AWS Glue 5.0, you can provide the de facto standard requirements.txt file to manage Python library dependencies. To do that, provide the following two job parameters:

Key: --python-modules-installer-option
Value: -r

Key: --additional-python-modules
Value: s3://path_to_requirements.txt

AWS Glue 5.0 nodes initially load the Python libraries specified in requirements.txt.
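
The two job parameters can be expressed as a job-arguments dictionary, for example for use as DefaultArguments in the AWS SDK. This is a small sketch; the function name requirements_file_arguments is hypothetical, and the S3 URI in the usage note is a placeholder.

```python
def requirements_file_arguments(requirements_s3_uri):
    """Build the two job parameters that install from a requirements.txt in S3."""
    if not requirements_s3_uri.startswith("s3://"):
        raise ValueError(f"Expected an Amazon S3 URI, got {requirements_s3_uri!r}")
    return {
        "--python-modules-installer-option": "-r",
        "--additional-python-modules": requirements_s3_uri,
    }
```

For example, requirements_file_arguments("s3://amzn-s3-demo-bucket/path/to/requirements.txt") returns a dictionary with the -r installer option and the requirements file location.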

Here's a sample requirements.txt:


awswrangler==3.9.1
elasticsearch==8.15.1
PyAthena==3.9.0
PyMySQL==1.1.1
PyYAML==6.0.2
pyodbc==5.2.0
pyorc==0.9.0
redshift-connector==2.1.3
scipy==1.14.1
scikit-learn==1.5.2
SQLAlchemy==2.0.36

Important

Please avoid unpinned library versions in your requirements.txt to ensure you have a reliable and deterministic AWS Glue environment for your jobs.
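
One way to catch unpinned entries before uploading a requirements.txt is a simple local check such as the sketch below. The function find_unpinned and its regular expression are illustrative assumptions, not an AWS tool: the check only accepts the plain name==version form (optionally with extras), so entries with environment markers, URLs, or wildcard versions are reported as unpinned.

```python
import re

# A pinned requirement has the form name==version, optionally with extras.
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s*]+$")


def find_unpinned(requirements_text):
    """Return the requirement lines that are not pinned with ==."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):      # skip blanks and pip options
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned
```

Running find_unpinned on the contents of your requirements file and failing the build when the result is non-empty keeps unpinned versions out of production jobs.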

When you use wheel files for direct dependencies, you can bring in incompatible versions of your transitive dependencies if they are not pinned correctly. As a best practice, pin all library versions for consistency across AWS Glue jobs. AWS Glue recommends packaging your Python environment into a wheel file to ensure consistency and reliability for your production workload.

To update an existing Python module or add a new one, pass the --additional-python-modules parameter with a comma-separated list of Python modules as its value. For example, to update or add the scikit-learn module, use the following key/value: "--additional-python-modules", "scikit-learn==0.21.3". You have two options for configuring Python modules directly.

Pinned Python modules (recommended)

"--additional-python-modules", "scikit-learn==0.21.3,ephem==4.1.6"

Unpinned Python modules (not recommended for production workloads)

"--additional-python-modules", "scikit-learn>=0.20.0,ephem>=4.0.0"

OR

"--additional-python-modules", "scikit-learn,ephem"

Important

When configuring Python modules directly in --additional-python-modules, AWS Glue recommends using pinned library versions to keep the AWS Glue job environment consistent. Unpinned library versions pull the latest version of each Python module, which can introduce breaking changes or bring in an incompatible module, causing the job to fail during Python module installation. We recommend that you do not use unpinned library versions for production workloads. As a best practice, AWS Glue recommends packaging your Python environment into a wheel file to ensure consistency and reliability for your production workload.

Best Practices for Installing additional Python libraries in AWS Glue

(Recommended) Packaging Python environment into a single wheel file

For a safe and consistent environment, AWS Glue recommends that you snapshot and package your Python environment into a wheel file. The benefit is that the referenced Python modules and their transitive dependencies are locked. This ensures that your AWS Glue job is not affected when an upstream repository such as PyPI, or one of your dependencies, introduces an incompatible update.

This file can then be used in your AWS Glue job using the --additional-python-modules flag.

Important

You must run the following script in an environment similar to the AWS Glue version you are running. Refer to the AWS Glue version compatibility table and make sure you use the same base OS image and Python version.


#!/bin/bash
set -e
REQUIREMENTS_FILE="requirements.txt"
FINAL_WHEEL_OUTPUT_DIRECTORY="."
PACKAGE_NAME=$(basename "$(pwd)")
PACKAGE_VERSION="0.1.0"
# Help message
show_help() {
    echo "Usage: $0 [options]"
    echo ""
    echo "Options:"
    echo "  -r, --requirements FILE   Path to requirements.txt file (default: requirements.txt)"
    echo "  -o, --wheel-output DIR    Output directory for final wheel (default: current directory)"
    echo "  -n, --name NAME           Package name (default: current directory name)"
    echo "  -v, --version VERSION     Package version (default: 0.1.0)"
    echo "  -h, --help                Show this help message"
    echo "  -g, --glue-version        Glue version (required)"
    echo ""
    echo "Example:"
    echo "  $0 -r custom-requirements.txt -o dist -n my_package -v 1.2.3 -g 4.0"
}
# Parse command line arguments
while [[ $# -gt 0 ]]; do
    key="$1"
    case $key in
    -r | --requirements)
        REQUIREMENTS_FILE="$2"
        shift 2
        ;;
    -o | --wheel-output)
        FINAL_WHEEL_OUTPUT_DIRECTORY="$2"
        shift 2
        ;;
    -n | --name)
        PACKAGE_NAME="$2"
        shift 2
        ;;
    -v | --version)
        PACKAGE_VERSION="$2"
        shift 2
        ;;
    -g | --glue-version)
        GLUE_VERSION="$2"
        shift 2
        ;;
    -h | --help)
        show_help
        exit 0
        ;;
    *)
        echo "Unknown option: $1"
        show_help
        exit 1
        ;;
    esac
done
# If the package name has dashes, convert them to underscores and notify the user, because a package with dashes in its name can't be imported.
if [[ "$PACKAGE_NAME" =~ "-" ]]; then
    echo "Warning: Package name '$PACKAGE_NAME' contains dashes. Converting to underscores."
    PACKAGE_NAME=$(echo "$PACKAGE_NAME" | tr '-' '_')
fi
UBER_WHEEL_NAME="${PACKAGE_NAME}-${PACKAGE_VERSION}-py3-none-any.whl"
# Check if glue version is provided
if [ -z "$GLUE_VERSION" ]; then
    echo "Error: Glue version is required."
    exit 1
fi
# Validate version format (basic check)
if [[ ! "$PACKAGE_VERSION" =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]] && [[ ! "$PACKAGE_VERSION" =~ ^[0-9]+\.[0-9]+$ ]]; then
    echo "Warning: Version '$PACKAGE_VERSION' doesn't follow semantic versioning (x.y.z or x.y)"
fi
# Check if requirements file exists
if [ ! -f "$REQUIREMENTS_FILE" ]; then
    echo "Error: Requirements file '$REQUIREMENTS_FILE' not found."
    exit 1
fi
# Get relevant platform tags/python versions based on glue version
if [[ "$GLUE_VERSION" == "5.0" ]]; then
    PYTHON_VERSION="3.11"
    GLIBC_VERSION="2.34"
elif [[ "$GLUE_VERSION" == "4.0" ]]; then
    PYTHON_VERSION="3.10"
    GLIBC_VERSION="2.26"
elif [[ "$GLUE_VERSION" == "3.0" ]]; then
    PYTHON_VERSION="3.7"
    GLIBC_VERSION="2.26"
elif [[ "$GLUE_VERSION" == "2.0" ]]; then
    PYTHON_VERSION="3.7"
    GLIBC_VERSION="2.17"
elif [[ "$GLUE_VERSION" == "1.0" ]]; then
    PYTHON_VERSION="3.6"
    GLIBC_VERSION="2.17"
elif [[ "$GLUE_VERSION" == "0.9" ]]; then
    PYTHON_VERSION="2.7"
    GLIBC_VERSION="2.17"
else
    echo "Error: Unsupported glue version '$GLUE_VERSION'."
    exit 1
fi
echo "Using Glue version $GLUE_VERSION"
echo "Using Glue python version $PYTHON_VERSION"
echo "Using Glue glibc version $GLIBC_VERSION"
PIP_PLATFORM_FLAG=""
is_glibc_compatible() {
    # assumes glibc version in the form of major.minor (ex: 2.17)
    # glue glibc must be >= platform glibc
    local glue_glibc_version="$GLIBC_VERSION"
    local platform_glibc_version="$1"
    # 2.27 (platform) can run on 2.27 (glue)
    if [[ "$platform_glibc_version" == "$glue_glibc_version" ]]; then
        return 0
    fi
    local glue_glibc_major="${glue_glibc_version%%.*}"
    local glue_glibc_minor="${glue_glibc_version#*.}"
    local platform_glibc_major="${platform_glibc_version%%.*}"
    local platform_glibc_minor="${platform_glibc_version#*.}"
    # 3.27 (platform) cannot run on 2.27 (glue)
    if [[ "$platform_glibc_major" -gt "$glue_glibc_major" ]]; then
        return 1
    fi
    # 2.34 (platform) cannot run on 2.27 (glue)
    if [[ "$platform_glibc_major" -eq "$glue_glibc_major" ]] && [[ "$platform_glibc_minor" -gt "$glue_glibc_minor" ]]; then
        return 1
    fi
    # 2.17 (platform) can run on 2.27 (glue)
    return 0
}
PIP_PLATFORM_FLAG=""
if is_glibc_compatible "2.17"; then
    PIP_PLATFORM_FLAG="${PIP_PLATFORM_FLAG} --platform manylinux2014_x86_64"
fi
if is_glibc_compatible "2.28"; then
    PIP_PLATFORM_FLAG="${PIP_PLATFORM_FLAG} --platform manylinux_2_28_x86_64"
fi
if is_glibc_compatible "2.34"; then
    PIP_PLATFORM_FLAG="${PIP_PLATFORM_FLAG} --platform manylinux_2_34_x86_64"
fi
if is_glibc_compatible "2.39"; then
    PIP_PLATFORM_FLAG="${PIP_PLATFORM_FLAG} --platform manylinux_2_39_x86_64"
fi
echo "Using pip platform flags: $PIP_PLATFORM_FLAG"
# Convert to absolute paths
REQUIREMENTS_FILE=$(realpath "$REQUIREMENTS_FILE")
FINAL_WHEEL_OUTPUT_DIRECTORY=$(realpath "$FINAL_WHEEL_OUTPUT_DIRECTORY")
TEMP_WORKING_DIR=$(mktemp -d)
VENV_DIR="${TEMP_WORKING_DIR}/.build_venv"
WHEEL_OUTPUT_DIRECTORY="${TEMP_WORKING_DIR}/wheelhouse"
# Cleanup function
cleanup() {
    echo "Cleaning up temporary files..."
    rm -rf "$TEMP_WORKING_DIR"
}
trap cleanup EXIT
echo "========================================="
echo "Building wheel for $PACKAGE_NAME with all dependencies from $REQUIREMENTS_FILE"
echo "========================================="
# Determine Python executable to use consistently
PYTHON_EXEC=$(which python3 2>/dev/null || which python 2>/dev/null)
if [ -z "$PYTHON_EXEC" ]; then
    echo "Error: No Python executable found"
    exit 1
fi
echo "Using Python: $PYTHON_EXEC"
echo ""
# Install build requirements
echo "Step 1/5: Installing build tools..."
echo "----------------------------------------"
"$PYTHON_EXEC" -m pip install --upgrade pip build wheel setuptools
echo "✓ Build tools installed successfully"
echo ""
# Create a virtual environment for building
echo "Step 2/5: Creating build environment..."
echo "----------------------------------------"
"$PYTHON_EXEC" -m venv "$VENV_DIR"
# Check if virtual environment was created successfully
if [ ! -f "$VENV_DIR/bin/activate" ]; then
    echo "Error: Failed to create virtual environment"
    exit 1
fi
source "$VENV_DIR/bin/activate"
# Install pip-tools for dependency resolution
"$VENV_DIR/bin/pip" install pip-tools
echo "✓ Build environment created successfully"
echo ""
# Compile requirements to get all transitive dependencies
GLUE_PIP_ARGS="$PIP_PLATFORM_FLAG --python-version $PYTHON_VERSION --only-binary=:all:"
echo "Step 3/5: Resolving all dependencies..."
echo "----------------------------------------"
if ! "$VENV_DIR/bin/pip-compile" --pip-args "$GLUE_PIP_ARGS" --no-emit-index-url --output-file "$TEMP_WORKING_DIR/.compiled_requirements.txt" "$REQUIREMENTS_FILE"; then
    echo "Error: Failed to resolve dependencies. Check for conflicts in $REQUIREMENTS_FILE"
    exit 1
fi
echo "✓ Dependencies resolved successfully"
echo ""
# Download all wheels for dependencies
echo "Step 4/5: Downloading all dependency wheels..."
echo "----------------------------------------"
"$VENV_DIR/bin/pip" download -r "$TEMP_WORKING_DIR/.compiled_requirements.txt" -d "$WHEEL_OUTPUT_DIRECTORY" $GLUE_PIP_ARGS
# Check if any wheels were downloaded
if [ ! "$(ls -A "$WHEEL_OUTPUT_DIRECTORY")" ]; then
    echo "Error: No wheels were downloaded. Check your requirements file."
    exit 1
fi
# Count downloaded wheels (using find instead of ls for better handling)
WHEEL_COUNT=$(find "$WHEEL_OUTPUT_DIRECTORY" -name "*.whl" -type f | wc -l | tr -d ' ')
echo "✓ Downloaded $WHEEL_COUNT dependency wheels successfully"
echo ""
# Create a single uber wheel with all dependencies
echo "Step 5/5: Creating uber wheel with all dependencies included..."
echo "----------------------------------------"
# Create a temporary directory for the uber wheel
UBER_WHEEL_DIR="$TEMP_WORKING_DIR/uber"
mkdir -p "$UBER_WHEEL_DIR"
# Create the setup.py file with custom install command
cat >"$UBER_WHEEL_DIR/setup.py" <<EOF
from setuptools import setup, find_packages
import setuptools.command.install
import os
import glob
import subprocess
import sys
setup(
    name='${PACKAGE_NAME}',
    version='${PACKAGE_VERSION}',
    description='Bundle containing dependencies for ${PACKAGE_NAME}',
    author='Package Builder',
    author_email='builder@example.com',
    packages=['${PACKAGE_NAME}'],  # Include the package directory to hold wheels
    include_package_data=True,
    package_data={
        '${PACKAGE_NAME}': ['wheels/*.whl'],  # Include wheels in the package directory
    }
)
EOF
# Create a MANIFEST.in file to include all wheels
cat >"$UBER_WHEEL_DIR/MANIFEST.in" <<EOF
recursive-include ${PACKAGE_NAME}/wheels *.whl
EOF
# Create an __init__.py file that imports all the bundled wheel files (no auto-install logic)
mkdir -p "$UBER_WHEEL_DIR/${PACKAGE_NAME}"
cat >"$UBER_WHEEL_DIR/${PACKAGE_NAME}/__init__.py" <<EOF
"""
${PACKAGE_NAME} - dependencies can be installed at runtime using the load_wheels() function
"""
from pathlib import Path
import logging
import subprocess
import sys
__version__ = "${PACKAGE_VERSION}"

def load_wheels(log_level=logging.INFO):
    logger = logging.getLogger(__name__)
    handler = logging.StreamHandler(sys.stdout)
    formatter = logging.Formatter("[Glue Python Wheel Installer] %(asctime)s - %(name)s - %(levelname)s - %(message)s")
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(log_level)
    logger.info("Starting wheel installation process")
    package_dir = Path(__file__).parent.absolute()
    wheels_dir = package_dir / "wheels"
    logger.debug(f"Package directory: {package_dir}")
    logger.debug(f"Looking for wheels in: {wheels_dir}")
    if not wheels_dir.exists():
        logger.error(f"Wheels directory not found: {wheels_dir}")
        return False
    wheel_files = list(wheels_dir.glob("*.whl"))
    if not wheel_files:
        logger.warning(f"No wheels found in: {wheels_dir}")
        return False
    logger.info(f"Found {len(wheel_files)} wheels")
    wheel_file_paths = [str(wheel_file) for wheel_file in wheel_files]
    logger.info(f"Installing {wheel_file_paths}...")
    try:
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", *wheel_file_paths], check=True, capture_output=True, text=True
        )
        logger.info(f"✓ Successfully installed wheel files")
        logger.debug(f"pip output: {result.stdout}")
    except subprocess.CalledProcessError as e:
        error_msg = f"Failed to install wheel files"
        logger.error(f"✗ {error_msg}: {e}")
        if e.stderr:
            logger.error(f"Error details: {e.stderr}")
        return False
    logger.info("All wheels installed successfully")
    return True
EOF
cat >"$UBER_WHEEL_DIR/${PACKAGE_NAME}/auto.py" <<EOF
"""
${PACKAGE_NAME} - utility module that allows users to automatically install modules by adding 'import ${PACKAGE_NAME}.auto' to the top of their script
"""
from ${PACKAGE_NAME} import load_wheels
load_wheels()
EOF
# Copy all wheels to the uber wheel directory
mkdir -p "$UBER_WHEEL_DIR/${PACKAGE_NAME}/wheels"
cp "$WHEEL_OUTPUT_DIRECTORY"/*.whl "$UBER_WHEEL_DIR/${PACKAGE_NAME}/wheels/"
# Build the uber wheel
echo "Building uber wheel package..."
# Install build tools in the current environment
"$VENV_DIR/bin/pip" install build
if ! (cd "$UBER_WHEEL_DIR" && "$VENV_DIR/bin/python" -m build --skip-dependency-check --wheel --outdir .); then
    echo "Error: Failed to build uber wheel"
    exit 1
fi
# Ensure output directory exists
mkdir -p "$FINAL_WHEEL_OUTPUT_DIRECTORY"
# Copy the uber wheel to the output directory
FINAL_WHEEL_OUTPUT_PATH="$FINAL_WHEEL_OUTPUT_DIRECTORY/$UBER_WHEEL_NAME"
# Find the generated wheel (should be only one in the root directory)
GENERATED_WHEEL=$(find "$UBER_WHEEL_DIR" -maxdepth 1 -name "*.whl" -type f | head -1)
if [ -z "$GENERATED_WHEEL" ]; then
    echo "Error: No uber wheel was generated"
    exit 1
fi
cp "$GENERATED_WHEEL" "$FINAL_WHEEL_OUTPUT_PATH"
# Get final wheel size for user feedback
WHEEL_SIZE=$(du -h "$FINAL_WHEEL_OUTPUT_PATH" | cut -f1)
echo "✓ Uber wheel created successfully!"
echo ""
echo "========================================="
echo "BUILD COMPLETED SUCCESSFULLY!"
echo "========================================="
echo "Final wheel: $FINAL_WHEEL_OUTPUT_PATH"
echo "Wheel size: $WHEEL_SIZE"
echo "Dependencies included: $WHEEL_COUNT packages"
echo ""
echo "To install the bundle, run:"
echo "  pip install $FINAL_WHEEL_OUTPUT_PATH"
echo ""
echo "After installation, you can verify that the bundle works by running:"
echo "  python -c \"import ${PACKAGE_NAME}; ${PACKAGE_NAME}.load_wheels()\""
echo "  or "
echo "  python -c \"import ${PACKAGE_NAME}.auto\""
echo "========================================="


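Save the script as wheel_packager.sh and run it with your requirements file and target AWS Glue version:

./wheel_packager.sh -r <path to requirements.txt> -g <glue version> -o <wheel output directory> -n <package name> -v <wheel version>

Upload the generated wheel to Amazon S3 and reference it in the --additional-python-modules job parameter:

--additional-python-modules s3://your-bucket/path/to/package_with_dependencies-1.0.0-py3-none-any.whl

Then install the bundled dependencies at the top of your job script, using either of the following options:

# Option 1: automatic installation via import
import package_with_dependencies.auto

# Option 2: manual installation
from package_with_dependencies import load_wheels
load_wheels()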
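
The packaging script chooses pip --platform tags by comparing the glibc version of the target AWS Glue environment with the glibc version each manylinux tag requires. The same rule can be sketched in Python as follows; the tag list mirrors the one in the script, and the function names are hypothetical.

```python
# manylinux platform tags and the minimum glibc version each one requires,
# as listed in the packaging script.
MANYLINUX_TAGS = [
    ("2.17", "manylinux2014_x86_64"),
    ("2.28", "manylinux_2_28_x86_64"),
    ("2.34", "manylinux_2_34_x86_64"),
    ("2.39", "manylinux_2_39_x86_64"),
]


def parse_version(version):
    """Turn a 'major.minor' string into a tuple that compares numerically."""
    return tuple(int(part) for part in version.split("."))


def compatible_platform_tags(glue_glibc_version):
    """Return the tags whose required glibc is not newer than the Glue glibc."""
    glue = parse_version(glue_glibc_version)
    return [tag for required, tag in MANYLINUX_TAGS if parse_version(required) <= glue]
```

For AWS Glue 4.0 (glibc 2.26) this yields only manylinux2014_x86_64, while AWS Glue 5.0 (glibc 2.34) also accepts the manylinux_2_28 and manylinux_2_34 tags.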

Including Python files with PySpark native features

AWS Glue uses PySpark to include Python files in AWS Glue ETL jobs. Where it is available, use --additional-python-modules to manage your dependencies. You can also use the --extra-py-files job parameter to include Python files. Dependencies must be hosted in Amazon S3, and the argument value must be a comma-delimited list of Amazon S3 paths with no spaces. This functionality behaves like the Python dependency management you would use with Spark. For more information on Python dependency management in Spark, see the Using PySpark Native Features page in the Apache Spark documentation. --extra-py-files is useful when your additional code is not packaged, or when you are migrating a Spark program with an existing toolchain for managing dependencies. For your dependency tooling to be maintainable, you will have to bundle your dependencies before submitting.
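
Because the --extra-py-files value must be a comma-delimited list of Amazon S3 paths with no spaces, it can help to build the value programmatically. The following is a minimal sketch; the function name extra_py_files_argument is hypothetical.

```python
def extra_py_files_argument(s3_paths):
    """Validate S3 paths and join them into the --extra-py-files value."""
    for path in s3_paths:
        if not path.startswith("s3://"):
            raise ValueError(f"Not an Amazon S3 path: {path!r}")
        if "," in path or any(ch.isspace() for ch in path):
            raise ValueError(f"Path must not contain commas or whitespace: {path!r}")
    return {"--extra-py-files": ",".join(s3_paths)}
```

The returned dictionary can be merged into the job's default arguments or into the arguments of a single job run.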

Programming scripts that use visual transforms

When you create an AWS Glue job using the AWS Glue Studio visual interface, you can transform your data with managed data transform nodes and custom visual transforms. For more information about managed data transform nodes, see Transform data with AWS Glue managed transforms. For more information about custom visual transforms, see Transform data with custom visual transforms. Scripts using visual transforms can only be generated when your job Language is set to use Python.

When generating an AWS Glue job using visual transforms, AWS Glue Studio will include these transforms in the runtime environment using the --extra-py-files parameter in the job configuration. For more information about job parameters, see Using job parameters in AWS Glue jobs. When making changes to a generated script or runtime environment, you will need to preserve this job configuration for your script to run successfully.

Zipping libraries for inclusion

Unless a library is contained in a single .py file, it should be packaged in a .zip archive. The package directory should be at the root of the archive, and must contain an __init__.py file for the package. Python will then be able to import the package in the normal way.

If your library only consists of a single Python module in one .py file, you do not need to place it in a .zip file.
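
The layout described above (package directory at the root of the archive, with an __init__.py inside it) can be produced with Python's standard zipfile module. This is a sketch under the assumption that only .py files need to be shipped; the function name zip_package is hypothetical.

```python
import zipfile
from pathlib import Path


def zip_package(package_dir, archive_path):
    """Zip a package so that its directory sits at the root of the archive."""
    package_dir = Path(package_dir).resolve()
    if not (package_dir / "__init__.py").is_file():
        raise ValueError(f"{package_dir} has no __init__.py, so it is not a package")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(package_dir.rglob("*.py")):
            # Store paths relative to the parent so entries start with the package name.
            archive.write(path, path.relative_to(package_dir.parent).as_posix())
    return Path(archive_path)
```

Calling zip_package("my_library", "my_library.zip") on a package directory named my_library produces entries such as my_library/__init__.py, which is the structure Python needs to import the package from the archive.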

Loading Python libraries in AWS Glue Studio notebooks

To specify Python libraries in AWS Glue Studio notebooks, see Installing additional Python modules.

Loading Python libraries in a development endpoint in AWS Glue 0.9/1.0

If you are using different library sets for different ETL scripts, you can either set up a separate development endpoint for each set, or you can overwrite the library .zip file(s) that your development endpoint loads every time you switch scripts.

You can use the console to specify one or more library .zip files for a development endpoint when you create it. After assigning a name and an IAM role, choose Script Libraries and job parameters (optional) and enter the full Amazon S3 path to your library .zip file in the Python library path box. For example:


s3://bucket/prefix/site-packages.zip

If you want, you can specify multiple full paths to files, separating them with commas but no spaces, like this:


s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip

If you update these .zip files later, you can use the console to re-import them into your development endpoint. Navigate to the development endpoint in question, check the box beside it, and choose Update ETL libraries from the Action menu.

In a similar way, you can specify library files using the AWS Glue APIs. When you create a development endpoint by calling the CreateDevEndpoint action (Python: create_dev_endpoint), you can specify one or more full paths to libraries in the ExtraPythonLibsS3Path parameter, in a call that looks like this:



import boto3

glue = boto3.client("glue")

dep = glue.create_dev_endpoint(
             EndpointName="testDevEndpoint",
             RoleArn="arn:aws:iam::123456789012",
             SecurityGroupIds=["sg-7f5ad1ff"],  # the API expects a list of security group IDs
             SubnetId="subnet-c12fdba4",
             PublicKey="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCtp04H/y...",
             NumberOfNodes=3,
             ExtraPythonLibsS3Path="s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip")

When you update a development endpoint, you can also update the libraries it loads using a DevEndpointCustomLibraries object and setting the UpdateEtlLibraries parameter to True when calling UpdateDevEndpoint (update_dev_endpoint).
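
An update call along those lines might be sketched as follows with boto3. The helper names update_libraries_request and update_dev_endpoint_libraries are hypothetical; the request keys EndpointName, CustomLibraries, ExtraPythonLibsS3Path, and UpdateEtlLibraries are the ones named in the UpdateDevEndpoint API.

```python
def update_libraries_request(endpoint_name, library_s3_paths):
    """Build the UpdateDevEndpoint request that reloads the Python libraries."""
    return {
        "EndpointName": endpoint_name,
        # DevEndpointCustomLibraries object: comma-separated S3 paths, no spaces.
        "CustomLibraries": {"ExtraPythonLibsS3Path": ",".join(library_s3_paths)},
        "UpdateEtlLibraries": True,
    }


def update_dev_endpoint_libraries(endpoint_name, library_s3_paths):
    """Send the request with boto3 (requires AWS credentials and permissions)."""
    import boto3  # imported lazily so the request builder works without boto3

    glue = boto3.client("glue")
    return glue.update_dev_endpoint(
        **update_libraries_request(endpoint_name, library_s3_paths)
    )
```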

Using Python libraries in a job or JobRun

When you are creating a new Job on the console, you can specify one or more library .zip files by choosing Script Libraries and job parameters (optional) and entering the full Amazon S3 library path(s) in the same way you would when creating a development endpoint:


s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip

If you are calling CreateJob (create_job), you can specify one or more full paths to default libraries using the --extra-py-files default parameter, like this:


import boto3

glue = boto3.client('glue')

job = glue.create_job(Name='sampleJob',
                      Role='Glue_DefaultRole',
                      Command={'Name': 'glueetl',
                               'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'},
                      DefaultArguments={'--extra-py-files': 's3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip'})

Then when you are starting a JobRun, you can override the default library setting with a different one:


runId = glue.start_job_run(JobName='sampleJob',
                           Arguments={'--extra-py-files': 's3://bucket/prefix/lib_B.zip'})

Proactively analyze Python dependencies

To proactively identify potential dependency issues before deploying to AWS Glue, you can use the dependency analysis tool to validate your Python packages against your target AWS Glue environment.

AWS provides an open-source Python dependency analyzer tool specifically designed for AWS Glue environments. This tool is available in the AWS Glue samples repository and can be used locally to validate your dependencies before deployment.

This analysis helps ensure your dependencies follow the recommended practice of pinning all library versions for consistent production deployments. For more details, see the tool's README.

The AWS Glue Python Dependency Analyzer helps identify unpinned dependencies and version conflicts by simulating pip installation with platform-specific constraints that match your target AWS Glue environment.


# Analyze a single Glue job
python glue_dependency_analyzer.py -j my-glue-job

# Analyze multiple jobs with specific AWS configuration
python glue_dependency_analyzer.py -j job1 -j job2 --aws-profile production --aws-region us-west-2

The tool will flag:

Unpinned dependencies that could install different versions across job runs
Version conflicts between packages
Dependencies not available for your target AWS Glue environment

Amazon Q Developer is a generative artificial intelligence (AI) powered conversational assistant that can help you understand, build, extend, and operate AWS applications. You can download it by following the instructions in the Getting started guide for Amazon Q.

Amazon Q Developer can be used to analyze and fix job failures caused by Python dependency issues. We suggest using the following prompt, replacing the <Job-Name> placeholder with the name of your AWS Glue job.


I have an AWS Glue job named <Job-Name> that has failed due to Python module installation conflicts. Please assist in diagnosing and resolving this issue using the following systematic approach. Proceed once sufficient information is available.

Objective: Implement a fix that addresses the root cause module while minimizing disruption to the existing working environment.

Step 1: Root Cause Analysis
• Retrieve the most recent failed job run ID for the specified Glue job
• Extract error logs from CloudWatch Logs using the job run ID as a log stream prefix
• Analyze the logs to identify:
  • The recently added or modified Python module that triggered the dependency conflict
  • The specific dependency chain causing the installation failure
  • Version compatibility conflicts between required and existing modules

Step 2: Baseline Configuration Identification
• Locate the last successful job run ID prior to the dependency failure
• Document the Python module versions that were functioning correctly in that baseline run
• Establish the compatible version constraints for conflicting dependencies

Step 3: Targeted Resolution Implementation
• Apply pinning by updating the job's additional_python_modules parameter
• Pin only the root cause module and its directly conflicting dependencies to compatible versions, and do not remove python modules unless necessary
• Preserve flexibility for non-conflicting modules by avoiding unnecessary version constraints
• Deploy the configuration changes with minimal changes to the existing configuration and execute a validation test run. Do not change the Glue versions.

Implementation Example:
Scenario: Recently added pandas==2.0.0 to additional_python_modules
Error: numpy version conflict (pandas 2.0.0 requires numpy>=1.21, but existing job code requires numpy<1.20)
Resolution: Update additional_python_modules to "pandas==1.5.3,numpy==1.19.5"
Rationale: Use pandas 1.5.3 (compatible with numpy 1.19.5) and pin numpy to last known working version

Expected Outcome: Restore job functionality with minimal configuration changes while maintaining system stability.

The prompt instructs Q to:

Fetch the latest failed job run ID
Find associated logs and details
Find successful job runs to detect any changed Python packages
Make any configuration fixes and trigger another test run

Python modules already provided in AWS Glue

To change the version of these provided modules, provide new versions with the --additional-python-modules job parameter.
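
To see which provided modules a given --additional-python-modules value would change, you can compare the requested pins with the preinstalled versions. The sketch below uses a small excerpt of the AWS Glue 5.0 list that follows; the function name describe_overrides is hypothetical, and only pinned name==version entries are considered.

```python
# A few of the modules preinstalled in AWS Glue 5.0, taken from the list below.
PROVIDED_GLUE_5_0 = {
    "boto3": "1.34.131",
    "numpy": "1.26.4",
    "pandas": "2.2.2",
    "pyarrow": "17.0.0",
}


def describe_overrides(additional_python_modules, provided):
    """Map each provided module whose version would change to (old, new)."""
    changes = {}
    for spec in additional_python_modules.split(","):
        name, separator, version = spec.strip().partition("==")
        key = name.lower()
        if separator and key in provided and provided[key] != version:
            changes[key] = (provided[key], version)
    return changes
```

Modules that are not preinstalled, or that are pinned to the version already provided, do not appear in the result.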

AWS Glue version 5.0

AWS Glue version 5.0 includes the following Python modules out of the box:

aiobotocore==2.13.1
aiohappyeyeballs==2.3.5
aiohttp==3.10.1
aioitertools==0.11.0
aiosignal==1.3.1
appdirs==1.4.4
attrs==24.2.0
boto3==1.34.131
botocore==1.34.131
certifi==2024.7.4
charset-normalizer==3.3.2
contourpy==1.2.1
cycler==0.12.1
fonttools==4.53.1
frozenlist==1.4.1
fsspec==2024.6.1
idna==2.10
jmespath==0.10.0
kaleido==0.2.1
kiwisolver==1.4.5
matplotlib==3.9.0
multidict==6.0.5
numpy==1.26.4
packaging==24.1
pandas==2.2.2
pillow==10.4.0
pip==23.0.1
plotly==5.23.0
pyarrow==17.0.0
pyparsing==3.1.2
python-dateutil==2.9.0.post0
pytz==2024.1
requests==2.32.2
s3fs==2024.6.1
s3transfer==0.10.2
seaborn==0.13.2
setuptools==59.6.0
six==1.16.0
tenacity==9.0.0
tzdata==2024.1
urllib3==1.25.10
virtualenv==20.4.0
wrapt==1.16.0
yarl==1.9.4

AWS Glue version 4.0

AWS Glue version 4.0 includes the following Python modules out of the box:

aiobotocore==2.4.1
aiohttp==3.8.3
aioitertools==0.11.0
aiosignal==1.3.1
async-timeout==4.0.2
asynctest==0.13.0
attrs==22.2.0
avro-python3==1.10.2
boto3==1.24.70
botocore==1.27.59
certifi==2021.5.30
chardet==3.0.4
charset-normalizer==2.1.1
click==8.1.3
cycler==0.10.0
Cython==0.29.32
fsspec==2021.8.1
idna==2.10
importlib-metadata==5.0.0
jmespath==0.10.0
joblib==1.0.1
kaleido==0.2.1
kiwisolver==1.4.4
matplotlib==3.4.3
mpmath==1.2.1
multidict==6.0.4
nltk==3.7
numpy==1.23.5
packaging==23.0
pandas==1.5.1
patsy==0.5.1
Pillow==9.4.0
pip==23.0.1
plotly==5.16.0
pmdarima==2.0.1
ptvsd==4.3.2
pyarrow==10.0.0
pydevd==2.5.0
pyhocon==0.3.58
PyMySQL==1.0.2
pyparsing==2.4.7
python-dateutil==2.8.2
pytz==2021.1
PyYAML==6.0.1
regex==2022.10.31
requests==2.23.0
s3fs==2022.11.0
s3transfer==0.6.0
scikit-learn==1.1.3
scipy==1.9.3
setuptools==49.1.3
six==1.16.0
statsmodels==0.13.5
subprocess32==3.5.4
sympy==1.8
tbats==1.1.0
threadpoolctl==3.1.0
tqdm==4.64.1
typing_extensions==4.4.0
urllib3==1.25.11
wheel==0.37.0
wrapt==1.14.1
yarl==1.8.2
zipp==3.10.0

AWS Glue version 3.0

AWS Glue version 3.0 includes the following Python modules out of the box:

aiobotocore==1.4.2
aiohttp==3.8.3
aioitertools==0.11.0
aiosignal==1.3.1
async-timeout==4.0.2
asynctest==0.13.0
attrs==22.2.0
avro-python3==1.10.2
boto3==1.18.50
botocore==1.21.50
certifi==2021.5.30
chardet==3.0.4
charset-normalizer==2.1.1
click==8.1.3
cycler==0.10.0
Cython==0.29.4
docutils==0.17.1
enum34==1.1.10
frozenlist==1.3.3
fsspec==2021.8.1
idna==2.10
importlib-metadata==6.0.0
jmespath==0.10.0
joblib==1.0.1
kiwisolver==1.3.2
matplotlib==3.4.3
mpmath==1.2.1
multidict==6.0.4
nltk==3.6.3
numpy==1.19.5
packaging==23.0
pandas==1.3.2
patsy==0.5.1
Pillow==9.4.0
pip==23.0
pmdarima==1.8.2
ptvsd==4.3.2
pyarrow==5.0.0
pydevd==2.5.0
pyhocon==0.3.58
PyMySQL==1.0.2
pyparsing==2.4.7
python-dateutil==2.8.2
pytz==2021.1
PyYAML==5.4.1
regex==2022.10.31
requests==2.23.0
s3fs==2021.8.1
s3transfer==0.5.0
scikit-learn==0.24.2
scipy==1.7.1
six==1.16.0
Spark==1.0
statsmodels==0.12.2
subprocess32==3.5.4
sympy==1.8
tbats==1.1.0
threadpoolctl==3.1.0
tqdm==4.64.1
typing_extensions==4.4.0
urllib3==1.25.11
wheel==0.37.0
wrapt==1.14.1
yarl==1.8.2
zipp==3.12.0

AWS Glue version 2.0

AWS Glue version 2.0 includes the following Python modules out of the box:

avro-python3==1.10.0
awscli==1.27.60
boto3==1.12.4
botocore==1.15.4
certifi==2019.11.28
chardet==3.0.4
click==8.1.3
colorama==0.4.4
cycler==0.10.0
Cython==0.29.15
docutils==0.15.2
enum34==1.1.9
fsspec==0.6.2
idna==2.9
importlib-metadata==6.0.0
jmespath==0.9.4
joblib==0.14.1
kiwisolver==1.1.0
matplotlib==3.1.3
mpmath==1.1.0
nltk==3.5
numpy==1.18.1
pandas==1.0.1
patsy==0.5.1
pmdarima==1.5.3
ptvsd==4.3.2
pyarrow==0.16.0
pyasn1==0.4.8
pydevd==1.9.0
pyhocon==0.3.54
PyMySQL==0.9.3
pyparsing==2.4.6
python-dateutil==2.8.1
pytz==2019.3
PyYAML==5.3.1
regex==2022.10.31
requests==2.23.0
rsa==4.7.2
s3fs==0.4.0
s3transfer==0.3.3
scikit-learn==0.22.1
scipy==1.4.1
setuptools==45.2.0
six==1.14.0
Spark==1.0
statsmodels==0.11.1
subprocess32==3.5.4
sympy==1.5.1
tbats==1.0.9
tqdm==4.64.1
typing-extensions==4.4.0
urllib3==1.25.8
wheel==0.35.1
zipp==3.12.0
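
To confirm which versions are actually present at run time, for example after overriding a provided module, a job script can query the installed package metadata. This is a minimal sketch using the standard library importlib.metadata module; on interpreters older than Python 3.8 it falls back to the importlib_metadata backport, which is assumed to be installed (it appears in the lists above for AWS Glue 2.0 through 4.0). The helper names installed_versions and report_mismatches are hypothetical.

```python
try:
    from importlib import metadata  # standard library in Python 3.8 and later
except ImportError:
    import importlib_metadata as metadata  # backport for older interpreters


def installed_versions(names):
    """Return {name: installed version, or None if the module is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def report_mismatches(expected):
    """Return {name: (expected, actual)} for modules whose version differs."""
    actual = installed_versions(expected)
    return {
        name: (want, actual[name])
        for name, want in expected.items()
        if actual[name] != want
    }


# Illustrative expectation; replace with the versions your job depends on.
print(report_mismatches({"pandas": "1.5.3", "numpy": "1.19.5"}))
```

An empty result means every listed module is installed at the expected version. Logging this at the start of a job gives a baseline to compare against if a later run fails.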
