AWS Glue
Developer Guide

Adding Python Shell Jobs in AWS Glue

You can use a Python shell job to run Python scripts as a shell. With a Python shell job, you can run scripts that are compatible with Python 2.7 or Python 3.6.

You can't use job bookmarks with Python shell jobs. Most of the other features that are available for Spark jobs are also available for Python shell jobs.

The CloudWatch Logs group for Python shell job output is /aws-glue/python-jobs/output. For errors, see the log group /aws-glue/python-jobs/errors.
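For example, assuming the AWS CLI is configured, you can retrieve a run's output with a command like the following. The log stream is typically named after the job run ID; jr_example here is a placeholder:

aws logs get-log-events --log-group-name /aws-glue/python-jobs/output --log-stream-name jr_example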

Defining Job Properties for Python Shell Jobs

When you define your Python shell job on the console (see Working with Jobs on the AWS Glue Console), you provide some of the following properties:

IAM role

Specify the IAM role that is used for authorization to resources used to run the job and access data stores. For more information about permissions for running jobs in AWS Glue, see Managing Access Permissions for AWS Glue Resources.

Type

Choose Python shell to run a Python script with the job command named pythonshell.

Python version

Choose the Python version. The default is Python 3.

Custom script

The code in the script defines your job's procedural logic. You provide the script name and location in Amazon Simple Storage Service (Amazon S3). Confirm that there isn't a file with the same name as the script directory in the path. To learn more about using scripts, see Editing Scripts in AWS Glue.

An existing or new script

The code in the script defines your job's procedural logic. The script can be coded in Python 2.7 or Python 3.6. You can edit your script on the AWS Glue console, but it is not generated by AWS Glue.

Maximum capacity

The maximum number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing page.

You can set the value to 0.0625 or 1. The default is 0.0625.

For descriptions of additional properties, see Defining Job Properties. For more information about adding a job using the AWS Glue console, see Working with Jobs on the AWS Glue Console.

You can also create a Python shell job using the AWS CLI; for example:

aws glue create-job --name python-job-cli --role Glue_DefaultRole --command '{"Name" : "pythonshell", "ScriptLocation" : "s3://aws-glue-scripts-123456789012-us-east-1/Admin/python-job-cli.py"}'

Jobs created with the AWS CLI default to Python 2. To specify Python 3, add this key-value pair to the --command parameter:

"PythonVersion":"3"

To set the maximum capacity used by a Python shell job, provide the --max-capacity parameter. For Python shell jobs, the --allocated-capacity parameter can't be used.
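For example, a command along these lines simply combines the options described above with the earlier example, creating a Python 3 job capped at 0.0625 DPUs:

aws glue create-job --name python-job-cli --role Glue_DefaultRole --command '{"Name" : "pythonshell", "PythonVersion" : "3", "ScriptLocation" : "s3://aws-glue-scripts-123456789012-us-east-1/Admin/python-job-cli.py"}' --max-capacity 0.0625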

Supported Libraries for Python Shell Jobs

The environment for running a Python shell job supports the following libraries:

  • Boto3

  • collections

  • CSV

  • gzip

  • multiprocessing

  • NumPy

  • pandas

  • pickle

  • re

  • SciPy

  • sklearn

  • sklearn.feature_extraction

  • sklearn.preprocessing

  • xml.etree.ElementTree

  • zipfile

The NumPy library can be used in a Python shell job for scientific computing. For more information, see NumPy. The following example shows a NumPy script that can be used in a Python shell job. The script prints "Hello world" and the results of several mathematical calculations.

import numpy as np

print("Hello world")
a = np.array([20, 30, 40, 50])
print(a)
b = np.arange(4)
print(b)
c = a - b
print(c)
d = b**2
print(d)
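Other libraries from the preceding list can be combined in the same way. The following sketch uses Boto3 and pandas to load a CSV object from Amazon S3 into a DataFrame and print summary statistics; the bucket and key names are placeholders, not values used elsewhere in this guide:

import boto3
import pandas as pd

# Placeholder bucket and object key; replace with your own values.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-example-bucket", Key="data/input.csv")

# pandas can read directly from the streaming body that get_object returns.
df = pd.read_csv(obj["Body"])
print(df.describe())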

Providing Your Own Python Library

You can create .egg files that contain Python libraries and pass them as a Python library path parameter to an AWS Glue job. For example, to create a Python shell job to work with an Amazon Redshift cluster, follow these steps:

  1. Create an Amazon Redshift cluster in a virtual private cloud (VPC), and add some data to a table.

  2. Create an AWS Glue connection for the VPC-SecurityGroup-Subnet combination that you used to create the cluster. Test that the connection is successful.

  3. Create a directory named redshift_example, and create a file named setup.py. Paste the following code in setup.py:

    from setuptools import setup

    setup(
        name="redshift_module",
        version="0.1",
        packages=['redshift_module']
    )
  4. In the redshift_example directory, create a redshift_module directory. In the redshift_module directory, create the files __init__.py and pygresql_redshift_common.py.

  5. Leave the __init__.py file empty. In pygresql_redshift_common.py, paste the following code:

    import pg

    def get_connection(host):
        rs_conn_string = "host=%s port=%s dbname=%s user=%s password=%s" % (
            host, port, db_name, user, password_for_user)
        rs_conn = pg.connect(dbname=rs_conn_string)
        rs_conn.query("set statement_timeout = 1200000")
        return rs_conn

    def query(con):
        statement = "Select * from table_name;"
        res = con.query(statement)
        return res

    In the preceding code, replace port, db_name, user, and password_for_user with details specific to your Amazon Redshift cluster. Replace table_name with the name of the table in Amazon Redshift.

  6. If you're not already there, change to the redshift_example directory. Run the following command to create an .egg file:

    python setup.py bdist_egg
  7. Install the dependencies that are required for the preceding command.

  8. The command creates a file named redshift_module-0.1-py2.7.egg in the dist directory. Upload this file to Amazon S3.

    In this example, the path is s3://MyBucket/python/library/redshift_module-0.1-py2.7.egg after the file is uploaded.

  9. Create a Python file to be used as a script for the AWS Glue job, and add the following code to the file:

    from redshift_module import pygresql_redshift_common as rs_common

    con1 = rs_common.get_connection(redshift_endpoint)
    res = rs_common.query(con1)
    print("Rows in the table table_name are: ")
    print(res)

    In the preceding code, replace redshift_endpoint with the endpoint of your Amazon Redshift cluster.

  10. Upload the preceding file to Amazon S3. In this example, the path is s3://MyBucket/python/library/redshift_test.py after the file is uploaded.

  11. Create a Python shell job using this script. On the AWS Glue console, on the Job properties page, specify the path to the .egg file in the Python library path box. If you have multiple .egg files and Python files, provide a comma-separated list in this box.

    Using the AWS CLI, you can create a job with a command; for example:

    aws glue create-job --name python-redshift-test-cli --role Role --command '{"Name" : "pythonshell", "ScriptLocation" : "s3://MyBucket/python/library/redshift_test.py"}' --connections Connections="connection-name" --default-arguments '{"--extra-py-files" : "s3://MyBucket/python/library/redshift_module-0.1-py2.7.egg"}'

    When the job runs, the script prints the rows in the table table_name in the Amazon Redshift cluster. You can also start and monitor the job run from the AWS CLI, as shown in the sketch after this procedure.
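If you created the job with the AWS CLI as in the preceding step, you can also start it and check its status from the CLI. The following commands are a minimal sketch; jr_example is a placeholder for the JobRunId value that the first command returns:

aws glue start-job-run --job-name python-redshift-test-cli
aws glue get-job-run --job-name python-redshift-test-cli --run-id jr_example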