Introduction to Using Python with AWS Glue

AWS Glue uses an extension of the PySpark Python dialect for extract, transform, and load (ETL) jobs. This section describes how to set up your system to use Python, and how to call AWS Glue APIs from Python in ETL scripts.

You can find Python code examples and utilities for AWS Glue in the AWS Glue samples repository on the GitHub website.

To set up your system for using Python with AWS Glue

Follow these steps to install Python and the tools that you need to call the AWS Glue APIs.

  1. If you don't already have Python 3 installed, download and install it from the Python.org download page.

  2. Install the AWS Command Line Interface (AWS CLI) as documented in the AWS CLI documentation.

    The AWS CLI is not directly necessary for using Python. However, installing and configuring it is a convenient way to set up AWS with your account credentials and verify that they work.

  3. Install the AWS SDK for Python (Boto 3), as documented in the Boto3 Quickstart.

    For more information about Boto 3, see AWS SDK for Python (Boto 3) Getting Started.
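
As a quick sanity check after the steps above, the following minimal sketch reports whether the current interpreter is Python 3 and whether the boto3 package can be found. The function name check_glue_python_setup is hypothetical and used only for illustration. The sketch does not verify AWS credentials.

```python
import importlib.util
import sys


def check_glue_python_setup():
    """Report whether this interpreter meets the prerequisites listed above.

    Hypothetical helper for illustration; it does not check AWS credentials.
    """
    return {
        # Step 1: the running interpreter is Python 3 or later.
        "python3": sys.version_info.major >= 3,
        # Step 3: the boto3 package can be located (it is not imported here).
        "boto3_installed": importlib.util.find_spec("boto3") is not None,
    }


if __name__ == "__main__":
    for check, ok in check_glue_python_setup().items():
        print(f"{check}: {'ok' if ok else 'missing'}")
```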

AWS Glue API Names in Python

Generic API names in the AWS Glue API reference generally use CamelCase, for use in Java and other programming languages. When called from Python, however, these generic names are changed to lowercase, with the parts of the name separated by underscore characters ('_'), so that CreateJob becomes create_job. These more "Pythonic" names are listed in parentheses following the generic names in the AWS Glue API reference documentation. Only the API names are transformed: parameter names keep the capitalization shown in the API reference (for example, Name and Role).
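
The naming rule can be sketched in a few lines of Python. The helper to_python_name below is hypothetical and is not part of Boto 3. It handles only simple names such as CreateJob, and names that contain acronyms may be converted differently by the SDK, so treat the API reference as the authority.

```python
import re


def to_python_name(api_name):
    """Convert a simple CamelCase API name to its lowercase, underscored form.

    Hypothetical helper for illustration only; it is not how Boto 3 is
    implemented and it does not special-case acronyms.
    """
    # Insert an underscore before every capital letter except the first
    # character, then lowercase the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", api_name).lower()


for generic in ("CreateJob", "StartJobRun", "GetJobRun"):
    print(generic, "->", to_python_name(generic))
```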

Note

Boto 3 resource APIs are not yet available for AWS Glue. Currently, only the Boto 3 client APIs can be used.

Passing Python Parameters in AWS Glue

In Python calls to AWS Glue APIs, pass parameters explicitly by name; Boto 3 client methods accept keyword arguments only. For example:


>>> job = glue.create_job(Name='sample', Role='Glue_DefaultRole',
...                       Command={'Name': 'glueetl',
...                                'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'})

It can help to know what happens behind such a call: Python collects the named parameters into a dictionary of name/value pairs, and Boto 3 then passes them to AWS Glue in JSON format by way of a REST API call.
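
The following sketch illustrates that flow without contacting AWS. The function show_request_body is a hypothetical stand-in for a client method: it only shows how named arguments arrive as a dictionary and how that dictionary looks when serialized as JSON. It is not the code that Boto 3 runs internally, and the parameter names are assumed to follow the capitalized spelling used in the API reference.

```python
import json


def show_request_body(**kwargs):
    """Hypothetical stand-in for a client method.

    Python gathers the named arguments into the dict `kwargs`; here the dict
    is simply serialized to JSON instead of being sent to a service.
    """
    return json.dumps(kwargs, sort_keys=True)


# The same parameters can be kept in a dictionary and unpacked with **.
params = {
    "Name": "sample",
    "Role": "Glue_DefaultRole",
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://my_script_bucket/scripts/my_etl_script.py",
    },
}

body = show_request_body(**params)
print(body)
```

Keeping the parameters in a dictionary and unpacking them with ** is equivalent to writing each name=value pair in the call, which can be convenient when the same job definition is reused.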

Example: Running an ETL Job

The following example shows how to use Python to create and run an ETL job in AWS Glue.

To create and run a job

  1. Create an instance of the AWS Glue client:

    
    >>> import boto3
    >>> glue = boto3.client(service_name='glue', region_name='us-east-1',
    ...                     endpoint_url='https://glue.us-east-1.amazonaws.com')
    
  2. Create a job. You must use glueetl as the name for the ETL command, as shown in the following code:

    
    >>> job = glue.create_job(Name='sample', Role='Glue_DefaultRole',
    ...                       Command={'Name': 'glueetl',
    ...                                'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'})
    
  3. Start a new run of the job that you created in the previous step:

    
    >>> jobRun = glue.start_job_run(JobName=job['Name'])
    
  4. Get the job status:

    
    >>> status = glue.get_job_run(JobName=job['Name'], RunId=jobRun['JobRunId'])
    
  5. Print the current state of the job run:

    
    >>> print(status['JobRun']['JobRunState'])
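
A single status check only reports the state at that moment, so scripts commonly poll until the run finishes. The following is a minimal sketch of such a loop, not an official utility. The helper name wait_for_job_run, the polling interval, the poll limit, and the set of states treated as terminal are all assumptions made for illustration. The client is passed in as an argument, so the function itself does not import Boto 3.

```python
import time

# Assumption: states after which a job run is treated as finished.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def wait_for_job_run(glue, job_name, run_id, poll_seconds=30, max_polls=120):
    """Poll get_job_run until the run reaches a terminal state.

    `glue` is an AWS Glue client such as the one created in step 1.
    Returns the final state string, or raises TimeoutError if the run is
    still in progress after `max_polls` checks.
    """
    for _ in range(max_polls):
        response = glue.get_job_run(JobName=job_name, RunId=run_id)
        state = response["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        # Not finished yet: wait before asking again.
        time.sleep(poll_seconds)
    raise TimeoutError(
        f"Job run {run_id} was still in progress after {max_polls} polls"
    )
```

With the objects from the steps above, the helper could be called as wait_for_job_run(glue, job['Name'], jobRun['JobRunId']).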