Python shell jobs in AWS Glue - AWS Glue

Python shell jobs in AWS Glue

You can use a Python shell job to run Python scripts as a shell in AWS Glue. With a Python shell job, you can run scripts that are compatible with Python 2.7, Python 3.6, or Python 3.9.

You can't use job bookmarks with Python shell jobs.

The Amazon CloudWatch Logs log group for Python shell job output is /aws-glue/python-jobs/output. For errors, see the log group /aws-glue/python-jobs/errors.
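The split between the two log groups follows the job's standard streams. A minimal sketch (the messages below are illustrative, not part of any real job):

```python
import sys

# In a Python shell job, anything written to stdout lands in the
# /aws-glue/python-jobs/output log group for the run, while stderr
# (including Python tracebacks) lands in /aws-glue/python-jobs/errors.
status = "record batch processed"
print(status)                            # -> output log group
print("debug detail", file=sys.stderr)   # -> errors log group
```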

Defining job properties for Python shell jobs

The following sections describe how to define job properties in AWS Glue Studio or by using the AWS CLI.

AWS Glue Studio

When you define your Python shell job in AWS Glue Studio, you provide some of the following properties:

IAM role

Specify the AWS Identity and Access Management (IAM) role that is used for authorization to resources that are used to run the job and access data stores. For more information about permissions for running jobs in AWS Glue, see Managing access permissions for AWS Glue resources.

Type

Choose Python shell to run a Python script with the job command named pythonshell.

Python version

Choose the Python version. The default is Python 3.6. Valid versions are Python 3.6 and Python 3.9.

Load common analytics libraries (Recommended)

Choose this option to include common libraries for Python 3.9 in the Python shell.

If your libraries are custom or conflict with the pre-installed ones, you can choose not to install the common libraries. You can still install additional libraries beyond the common set.

When you select this option, the library-set option is set to analytics. When you clear this option, the library-set option is set to none.
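When creating the job from the CLI instead of AWS Glue Studio, the same setting is passed as a job argument. A sketch of the JSON you would supply to --default-arguments:

```python
import json

# The checkbox in AWS Glue Studio corresponds to this job argument;
# use "none" instead of "analytics" when the option is cleared.
default_arguments = {"library-set": "analytics"}

print(json.dumps(default_arguments))
```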

Script filename and Script path

The code in the script defines your job's procedural logic. You provide the script name and location in Amazon Simple Storage Service (Amazon S3). Confirm that there isn't a file with the same name as the script directory in the path. To learn more about using scripts, see Editing Spark scripts in the AWS Glue console.

Script

The code in the script defines your job's procedural logic. You can code the script in Python 3.6 or Python 3.9. You can edit a script in AWS Glue Studio.

Data processing units

The maximum number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see AWS Glue pricing.

You can set the value to 0.0625 or 1. The default is 0.0625.

CLI

You can also create a Python shell job using the AWS CLI, as in the following example.

aws glue create-job --name python-job-cli --role Glue_DefaultRole --command '{"Name" : "pythonshell", "PythonVersion": "3.9", "ScriptLocation" : "s3://aws-glue-scripts-123456789012-us-east-1/Admin/python-job-cli.py"}' --max-capacity 0.0625

Jobs that you create with the AWS CLI default to Python 3. Valid Python versions are 3 (corresponding to 3.6) and 3.9. To specify Python 3.6, add this key-value pair to the --command parameter: "PythonVersion":"3"

To specify Python 3.9, add this key-value pair to the --command parameter: "PythonVersion":"3.9"

To set the maximum capacity used by a Python shell job, provide the --max-capacity parameter. For Python shell jobs, the --allocated-capacity parameter can't be used.
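The same job definition can be sketched as parameters for the boto3 create_job call; the role and bucket names below are the placeholders from the CLI example above:

```python
# Sketch of the CLI example expressed as boto3 create_job parameters.
# Note MaxCapacity: AllocatedCapacity (--allocated-capacity) can't be
# used for Python shell jobs.
params = {
    "Name": "python-job-cli",
    "Role": "Glue_DefaultRole",
    "Command": {
        "Name": "pythonshell",
        "PythonVersion": "3.9",
        "ScriptLocation": "s3://aws-glue-scripts-123456789012-us-east-1/Admin/python-job-cli.py",
    },
    "MaxCapacity": 0.0625,
}

# To actually create the job (requires AWS credentials and permissions):
# import boto3
# boto3.client("glue").create_job(**params)
```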

Supported libraries for Python shell jobs

In Python shell jobs that use Python 3.9, you can choose a pre-packaged library set that suits your needs. Use the library-set option to choose the library set. Valid values are analytics and none.

The environment for running a Python shell job supports the following libraries:

Library              Python 3.6   Python 3.9 (analytics)   Python 3.9 (none)
avro                 -            1.11.0                   -
awscli               1.16.242     1.23.5                   1.23.5
awswrangler          -            2.15.1                   -
botocore             1.12.232     1.23.5                   1.23.5
boto3                1.9.203      1.22.5                   -
elasticsearch        -            8.2.0                    -
numpy                1.16.2       1.22.3                   -
pandas               0.24.2       1.4.2                    -
psycopg2             -            2.9.3                    -
pyathena             -            2.5.3                    -
PyGreSQL             5.0.6        -                        -
PyMySQL              -            1.0.2                    -
pyodbc               -            4.0.32                   -
pyorc                -            0.6.0                    -
redshift-connector   -            2.0.907                  -
requests             2.22.0       2.27.1                   -
scikit-learn         0.20.3       1.0.2                    -
scipy                1.2.1        1.8.0                    -
SQLAlchemy           -            1.4.36                   -
s3fs                 -            2022.3.0                 -

You can use the NumPy library in a Python shell job for scientific computing. For more information, see NumPy. The following example shows a NumPy script that can be used in a Python shell job. The script prints "Hello world" and the results of several mathematical calculations.

import numpy as np

print("Hello world")

a = np.array([20, 30, 40, 50])
print(a)

b = np.arange(4)
print(b)

c = a - b
print(c)

d = b ** 2
print(d)

Limitations

Note the following limitations on packaging your Python libraries:

  • Adding an .egg file is not supported in Python 3.9.

Providing your own Python library

Using PIP

Python shell jobs that use Python 3.9 let you provide additional Python modules or different versions at the job level. Use the --additional-python-modules option with a list of comma-separated Python modules to add a new module or change the version of an existing module.

For example, to update or add a new scikit-learn module, use the following key-value pair: "--additional-python-modules", "scikit-learn==0.21.3".

Also, within the --additional-python-modules option you can specify an Amazon S3 path to a Python wheel module. For example:

--additional-python-modules s3://aws-glue-native-spark/tests/j4.2/fbprophet-0.6-py3-none-any.whl,scikit-learn==0.21.3

AWS Glue uses the Python Package Installer (pip3) to install the additional modules. You can pass additional options to pip3 by specifying the --python-modules-installer-option parameter. Any incompatibilities or limitations of pip3 apply.
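Putting these pieces together, a sketch of default arguments that combine the S3 wheel from the example above, a pinned PyPI module, and an extra pip3 option (--upgrade is an ordinary pip3 flag; the combination shown is illustrative):

```python
import json

# Job arguments as they would appear in --default-arguments:
# a wheel from Amazon S3 plus a pinned module, with an extra
# option passed through to pip3.
default_arguments = {
    "--additional-python-modules":
        "s3://aws-glue-native-spark/tests/j4.2/fbprophet-0.6-py3-none-any.whl,"
        "scikit-learn==0.21.3",
    "--python-modules-installer-option": "--upgrade",
}

print(json.dumps(default_arguments, indent=2))
```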

Note

We recommend using libraries built for Python 3.9 in order to avoid possible incompatibilities in the future.

Using an Egg or Whl file

You might already have one or more Python libraries packaged as an .egg or a .whl file. If so, you can specify them to your job using the AWS Command Line Interface (AWS CLI) with the --extra-py-files flag, as in the following example.

aws glue create-job --name python-redshift-test-cli --role role --command '{"Name" : "pythonshell", "ScriptLocation" : "s3://MyBucket/python/library/redshift_test.py"}' --connections Connections=connection-name --default-arguments '{"--extra-py-files" : ["s3://MyBucket/python/library/redshift_module-0.1-py2.7.egg", "s3://MyBucket/python/library/redshift_module-0.1-py2.7-none-any.whl"]}'

If you aren't sure how to create an .egg or a .whl file from a Python library, use the following steps. This example is applicable on macOS, Linux, and Windows Subsystem for Linux (WSL).

To create a Python .egg or .whl file

  1. Create an Amazon Redshift cluster in a virtual private cloud (VPC), and add some data to a table.

  2. Create an AWS Glue connection for the VPC-SecurityGroup-Subnet combination that you used to create the cluster. Test that the connection is successful.

  3. Create a directory named redshift_example, and create a file named setup.py. Paste the following code into setup.py.

    from setuptools import setup

    setup(
        name="redshift_module",
        version="0.1",
        packages=['redshift_module'],
    )
  4. In the redshift_example directory, create a redshift_module directory. In the redshift_module directory, create the files __init__.py and pygresql_redshift_common.py.

  5. Leave the __init__.py file empty. In pygresql_redshift_common.py, paste the following code. Replace port, db_name, user, and password_for_user with details specific to your Amazon Redshift cluster. Replace table_name with the name of the table in Amazon Redshift.

    import pg

    def get_connection(host):
        rs_conn_string = "host=%s port=%s dbname=%s user=%s password=%s" % (
            host, port, db_name, user, password_for_user)
        rs_conn = pg.connect(dbname=rs_conn_string)
        rs_conn.query("set statement_timeout = 1200000")
        return rs_conn

    def query(con):
        statement = "Select * from table_name;"
        res = con.query(statement)
        return res
  6. If you're not already there, change to the redshift_example directory.

  7. Do one of the following:

    • To create an .egg file, run the following command.

      python setup.py bdist_egg
    • To create a .whl file, run the following command.

      python setup.py bdist_wheel
  8. Install the dependencies that are required for the preceding command.

  9. The command creates a file in the dist directory:

    • If you created an egg file, it's named redshift_module-0.1-py2.7.egg.

    • If you created a wheel file, it's named redshift_module-0.1-py2.7-none-any.whl.

    Upload this file to Amazon S3.

    In this example, the uploaded file path is either s3://MyBucket/python/library/redshift_module-0.1-py2.7.egg or s3://MyBucket/python/library/redshift_module-0.1-py2.7-none-any.whl.

  10. Create a Python file to be used as a script for the AWS Glue job, and add the following code to the file.

    from redshift_module import pygresql_redshift_common as rs_common

    con1 = rs_common.get_connection(redshift_endpoint)
    res = rs_common.query(con1)
    print("Rows in the table cities are: ")
    print(res)
  11. Upload the preceding file to Amazon S3. In this example, the uploaded file path is s3://MyBucket/python/library/redshift_test.py.

  12. Create a Python shell job using this script. On the AWS Glue console, on the Job properties page, specify the path to the .egg/.whl file in the Python library path box. If you have multiple .egg/.whl files and Python files, provide a comma-separated list in this box.

    When modifying or renaming .egg files, the file names must use the default names generated by the "python setup.py bdist_egg" command or must adhere to the Python module naming conventions. For more information, see the Style Guide for Python Code.

    Using the AWS CLI, create a job with a command, as in the following example.

    aws glue create-job --name python-redshift-test-cli --role Role --command '{"Name" : "pythonshell", "ScriptLocation" : "s3://MyBucket/python/library/redshift_test.py"}' --connections Connections="connection-name" --default-arguments '{"--extra-py-files" : ["s3://MyBucket/python/library/redshift_module-0.1-py2.7.egg", "s3://MyBucket/python/library/redshift_module-0.1-py2.7-none-any.whl"]}'

    When the job runs, the script prints the rows created in the table_name table in the Amazon Redshift cluster.
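The CLI call in step 12 can also be sketched as boto3 create_job parameters; the bucket, role, and connection names are the same placeholders used above. Note that in the DefaultArguments map, --extra-py-files takes a single comma-separated string:

```python
# Sketch of the job creation from step 12 expressed for boto3;
# all resource names are placeholders from the example.
params = {
    "Name": "python-redshift-test-cli",
    "Role": "Role",
    "Command": {
        "Name": "pythonshell",
        "ScriptLocation": "s3://MyBucket/python/library/redshift_test.py",
    },
    "Connections": {"Connections": ["connection-name"]},
    "DefaultArguments": {
        "--extra-py-files": (
            "s3://MyBucket/python/library/redshift_module-0.1-py2.7.egg,"
            "s3://MyBucket/python/library/redshift_module-0.1-py2.7-none-any.whl"
        ),
    },
}

# To actually create the job (requires AWS credentials and permissions):
# import boto3
# boto3.client("glue").create_job(**params)
```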