AWS Glue
Developer Guide

Adding Python Shell Jobs in AWS Glue

You can use a Python shell job to run Python scripts as a shell in AWS Glue. With a Python shell job, you can run scripts that are compatible with Python 2.7 or Python 3.6.

You can't use job bookmarks with Python shell jobs. Most of the other features that are available for Apache Spark jobs are also available for Python shell jobs.

The Amazon CloudWatch Logs group for Python shell job output is /aws-glue/python-jobs/output. For errors, see the log group /aws-glue/python-jobs/errors.
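
As an illustrative sketch (not part of the console workflow), the following script uses Boto3, which is available in the Python shell environment, to read the output log of a single job run. The job_run_id value is a placeholder, and the sketch assumes that the log stream for the run is named after the job run ID.

import boto3

logs = boto3.client("logs")

# Placeholder: the run ID returned when the job run was started.
job_run_id = "jr_example0123456789abcdef"

# Read the output log events for this run from the Python shell output group.
# Assumption: the log stream name matches the job run ID.
response = logs.get_log_events(
    logGroupName="/aws-glue/python-jobs/output",
    logStreamName=job_run_id,
    startFromHead=True,
)

for event in response["events"]:
    print(event["message"])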

Defining Job Properties for Python Shell Jobs

When you define your Python shell job on the console (see Working with Jobs on the AWS Glue Console), you provide some of the following properties:

IAM role

Specify the AWS Identity and Access Management (IAM) role that is used for authorization to resources that are used to run the job and access data stores. For more information about permissions for running jobs in AWS Glue, see Managing Access Permissions for AWS Glue Resources.

Type

Choose Python shell to run a Python script with the job command named pythonshell.

Python version

Choose the Python version. The default is Python 3.

Custom script

The code in the script defines your job's procedural logic. You provide the script name and location in Amazon Simple Storage Service (Amazon S3). Confirm that there isn't a file with the same name as the script directory in the path. To learn more about using scripts, see Editing Scripts in AWS Glue.

An existing or new script

The code in the script defines your job's procedural logic. You can code the script in Python 2.7 or Python 3.6. You can edit a script on the AWS Glue console, but it is not generated by AWS Glue.

Maximum capacity

The maximum number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see AWS Glue pricing.

You can set the value to 0.0625 or 1. The default is 0.0625.

For descriptions of additional properties, see Defining Job Properties. For more information about adding a job using the AWS Glue console, see Working with Jobs on the AWS Glue Console.

You can also create a Python shell job using the AWS CLI, as in the following example.

aws glue create-job --name python-job-cli --role Glue_DefaultRole --command '{"Name" : "pythonshell", "ScriptLocation" : "s3://aws-glue-scripts-123456789012-us-east-1/Admin/python-job-cli.py"}'

Jobs that you create with the AWS CLI default to Python 2. To specify Python 3, add this key-value pair to the --command parameter:

"PythonVersion":"3"

To set the maximum capacity used by a Python shell job, provide the --max-capacity parameter. For Python shell jobs, the --allocated-capacity parameter can't be used.
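
You can also create the same kind of job with the AWS SDK for Python (Boto3). The following sketch mirrors the CLI example above and shows where PythonVersion and the maximum capacity fit in the create_job call; the job name python-job-sdk is illustrative.

import boto3

glue = boto3.client("glue")

# Illustrative sketch: same role and script location as the CLI example above,
# with Python 3 selected and the minimum capacity of 0.0625 DPU.
glue.create_job(
    Name="python-job-sdk",
    Role="Glue_DefaultRole",
    Command={
        "Name": "pythonshell",
        "PythonVersion": "3",
        "ScriptLocation": "s3://aws-glue-scripts-123456789012-us-east-1/Admin/python-job-cli.py",
    },
    MaxCapacity=0.0625,
)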

Supported Libraries for Python Shell Jobs

The environment for running a Python shell job supports the following libraries:

  • Boto3

  • collections

  • CSV

  • gzip

  • multiprocessing

  • NumPy

  • pandas (must be installed via the Python setuptools configuration, setup.py)

  • pickle

  • PyGreSQL

  • re

  • SciPy

  • sklearn

  • sklearn.feature_extraction

  • sklearn.preprocessing

  • xml.etree.ElementTree

  • zipfile

You can use the NumPy library in a Python shell job for scientific computing. For more information, see NumPy. The following example shows a NumPy script that can be used in a Python shell job. The script prints "Hello world" and the results of several mathematical calculations.

import numpy as np

print("Hello world")

a = np.array([20, 30, 40, 50])
print(a)

b = np.arange(4)
print(b)

c = a - b
print(c)

d = b**2
print(d)
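
For reference, with these inputs the script prints a = [20 30 40 50], b = [0 1 2 3], c = a - b = [20 29 38 47], and d = b**2 = [0 1 4 9] after the "Hello world" line.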

Limitations

Note the following limitations on packaging your Python libraries:

  • Creating an .egg file on Windows 10 Pro using Python 3.7 is not supported.

  • Creating an .egg file on WSL (Windows Subsystem for Linux, hosted by Windows 10 Pro) using Python 3.6 is supported.

Providing Your Own Python Library

You might already have one or more Python libraries packaged as an .egg or a .whl file. If so, you can specify them for your job using the AWS Command Line Interface (AWS CLI) with the --extra-py-files flag, as in the following example.

aws glue create-job --name python-redshift-test-cli --role role --command '{"Name" : "pythonshell", "ScriptLocation" : "s3://MyBucket/python/library/redshift_test.py"}' --connections Connections=connection-name --default-arguments '{"--extra-py-files" : ["s3://MyBucket/python/library/redshift_module-0.1-py2.7.egg", "s3://MyBucket/python/library/redshift_module-0.1-py2.7-none-any.whl"]}'

If you aren't sure how to create an .egg or a .whl file from a Python library, use the following steps. This example is applicable on macOS, Linux, and Windows Subsystem for Linux (WSL).

To create a Python .egg or .whl file

  1. Create an Amazon Redshift cluster in a virtual private cloud (VPC), and add some data to a table.

  2. Create an AWS Glue connection for the VPC-SecurityGroup-Subnet combination that you used to create the cluster. Test that the connection is successful.

  3. Create a directory named redshift_example, and create a file named setup.py. Paste the following code into setup.py.

    from setuptools import setup

    setup(
        name="redshift_module",
        version="0.1",
        packages=['redshift_module']
    )
  4. In the redshift_example directory, create a redshift_module directory. In the redshift_module directory, create the files __init__.py and pygresql_redshift_common.py.

  5. Leave the __init__.py file empty. In pygresql_redshift_common.py, paste the following code. Replace port, db_name, user, and password_for_user with details specific to your Amazon Redshift cluster. Replace table_name with the name of the table in Amazon Redshift.

    import pg

    def get_connection(host):
        rs_conn_string = "host=%s port=%s dbname=%s user=%s password=%s" % (
            host, port, db_name, user, password_for_user)
        rs_conn = pg.connect(dbname=rs_conn_string)
        rs_conn.query("set statement_timeout = 1200000")
        return rs_conn

    def query(con):
        statement = "Select * from table_name;"
        res = con.query(statement)
        return res
  6. If you're not already there, change to the redshift_example directory (the directory that contains setup.py).

  7. Do one of the following:

    • To create an .egg file, run the following command.

      python setup.py bdist_egg
    • To create a .whl file, run the following command.

      python setup.py bdist_wheel
  8. Install the dependencies that are required for the preceding command, and rerun the command if it failed because a dependency was missing (see the example following this step).
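
    For example, the bdist_wheel command requires the wheel package. The following install command is one illustrative way to add it together with setuptools; it assumes that pip is available in your environment.

    pip install setuptools wheel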

  9. The command creates a file in the dist directory:

    • If you created an egg file, it's named redshift_module-0.1-py2.7.egg.

    • If you created a wheel file, it's named redshift_module-0.1-py2.7-none-any.whl.

    Upload this file to Amazon S3.

    In this example, the uploaded file path is either s3://MyBucket/python/library/redshift_module-0.1-py2.7.egg or s3://MyBucket/python/library/redshift_module-0.1-py2.7-none-any.whl.

  10. Create a Python file to be used as a script for the AWS Glue job, and add the following code to the file. Replace redshift_endpoint with the endpoint of your Amazon Redshift cluster.

    from redshift_module import pygresql_redshift_common as rs_common

    con1 = rs_common.get_connection(redshift_endpoint)
    res = rs_common.query(con1)
    print "Rows in the table cities are: "
    print res
  11. Upload the preceding file to Amazon S3. In this example, the uploaded file path is s3://MyBucket/python/library/redshift_test.py.

  12. Create a Python shell job using this script. On the AWS Glue console, on the Job properties page, specify the path to the .egg/.whl file in the Python library path box. If you have multiple .egg/.whl files and Python files, provide a comma-separated list in this box.

    Using the AWS CLI, create a job with a command, as in the following example.

    aws glue create-job --name python-redshift-test-cli --role Role --command '{"Name" : "pythonshell", "ScriptLocation" : "s3://MyBucket/python/library/redshift_test.py"}' --connections Connections="connection-name" --default-arguments '{"--extra-py-files" : ["s3://MyBucket/python/library/redshift_module-0.1-py2.7.egg", "s3://MyBucket/python/library/redshift_module-0.1-py2.7-none-any.whl"]}'

    When the job runs, the script queries the table_name table in the Amazon Redshift cluster and prints the resulting rows.
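
    To try the job from the command line, you can start a run with the AWS CLI; the job name below is the illustrative one created earlier in this example.

    aws glue start-job-run --job-name python-redshift-test-cli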