AWS Glue
Developer Guide

Adding Jobs in AWS Glue

A job consists of the business logic that performs work in AWS Glue. Typically, a job runs extract, transform, and load (ETL) scripts. You can monitor job runs to understand runtime metrics such as success, duration, and start time. The output of a job is your transformed data, written to a location that you specify.

Job runs can be initiated by triggers that start a job when they fire. A job contains a script that connects to your source data, processes your data using the script's logic, and then writes it out to your data target. Your job can have multiple data sources and multiple data targets. You can use scripts that are generated by AWS Glue to transform data, or you can provide your own. The AWS Glue code generator can automatically create an Apache Spark API (PySpark) script given a source schema and target location or schema. You can use this script as a starting point and edit it to meet your goals.
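For example, a generated-style PySpark script typically follows the shape sketched below. This is a minimal illustration, not the exact output of the code generator; the database, table, and Amazon S3 path names (example_db, example_table, s3://example-bucket/output/) are placeholders.

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard job setup: resolve the job name and initialize the job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the source from a Data Catalog table (placeholder names)
source = glueContext.create_dynamic_frame.from_catalog(
    database="example_db",
    table_name="example_table",
    transformation_ctx="source")

# Rename and retype columns with the ApplyMapping transform
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("id", "long", "id", "long"), ("name", "string", "name", "string")],
    transformation_ctx="mapped")

# Write the transformed data to an Amazon S3 target as Parquet (placeholder path)
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/output/"},
    format="parquet",
    transformation_ctx="sink")

job.commit()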

AWS Glue can write output files in several data formats, including JSON, CSV, ORC (Optimized Row Columnar), Apache Parquet, and Apache Avro. For some of these formats, AWS Glue can also write the output using common compression formats.
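Continuing the sketch above, the write call is where you choose the output format. The following hedged example writes gzip-compressed JSON; the path is a placeholder, and the compression option applies only to formats that support it.

# Write gzip-compressed JSON to an Amazon S3 location (placeholder path)
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/json-output/", "compression": "gzip"},
    format="json")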

There are two types of jobs in AWS Glue: Spark and Python shell.

  • An Apache Spark ETL job consists of the business logic that performs ETL work in AWS Glue. You can monitor job runs to understand runtime metrics such as success, duration, and start time. The output of a job is your transformed data, written to a location that you specify.

  • A Python shell job runs Python scripts as a shell. With a Python shell job, you can run scripts that are compatible with Python 2.7 or Python 3.6. You can use these jobs to schedule and run tasks that don't require Spark ETL jobs.

Defining Job Properties

When you define your job on the AWS Glue console, you provide values for properties to control the AWS Glue runtime environment. The following list describes some of the properties of a Spark job. For the properties of a Python shell job, see Defining Job Properties for Python Shell Jobs.

Name

Provide a UTF-8 string with a maximum length of 255 characters.

IAM role

Specify the IAM role that AWS Glue uses for authorization to the resources that run the job and to access your data stores. For more information about permissions for running jobs in AWS Glue, see Managing Access Permissions for AWS Glue Resources.

Type

Specify the type of job environment to run:

  • Choose Spark to run an Apache Spark ETL script with the job command named glueetl.

  • Choose Python shell to run a Python script with the job command named pythonshell. For more information, see Adding Python Shell Jobs in AWS Glue.
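If you define the job with the AWS SDK for Python (Boto3) instead of the console, the same choice appears as the job command name. The following is a hedged sketch; the job names, role, and script locations are placeholders.

import boto3

glue = boto3.client("glue")

# A Spark ETL job uses the job command named glueetl
glue.create_job(
    Name="example-spark-job",
    Role="ExampleGlueServiceRole",
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://example-bucket/scripts/spark_job.py"})

# A Python shell job uses the job command named pythonshell
glue.create_job(
    Name="example-python-shell-job",
    Role="ExampleGlueServiceRole",
    Command={"Name": "pythonshell",
             "ScriptLocation": "s3://example-bucket/scripts/shell_job.py",
             "PythonVersion": "3"})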

Glue version

Glue version determines the versions of Apache Spark and Python that AWS Glue supports. The Python version indicates the version supported for jobs of type Spark. The following table lists the available AWS Glue versions and corresponding Spark and Python versions.

Glue version    Supported Spark and Python versions
Glue 0.9        Spark 2.2.1, Python 2.7
Glue 1.0        Spark 2.4.3, Python 2.7 or Python 3.6

Jobs that were created without specifying a Glue version default to Glue 0.9.
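You can set the Glue version when you create a job (for example, GlueVersion="1.0" with Boto3), and you can check the version of an existing job as in the following sketch. The job name is a placeholder.

import boto3

glue = boto3.client("glue")

# Inspect the Glue version of an existing job (placeholder job name);
# jobs created without a Glue version run on Glue 0.9
job = glue.get_job(JobName="example-spark-job")["Job"]
print(job.get("GlueVersion", "0.9"))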

Generated or custom script

The code in the ETL script defines your job's procedural logic. The script can be coded in Python or Scala. You can choose whether the script that the job runs is generated by AWS Glue or provided by you. You provide the script name and location in Amazon Simple Storage Service (Amazon S3). Confirm that there isn't a file with the same name as the script directory in the path. To learn more about using scripts, see Editing Scripts in AWS Glue.

Scala class name

If the script is coded in Scala, you must provide a class name. The default class name for AWS Glue generated scripts is GlueApp.

Temporary directory

Provide the location of a working directory in Amazon S3 where temporary intermediate results are written when AWS Glue runs the script. Confirm that there isn't a file with the same name as the temporary directory in the path. This directory is used when AWS Glue reads and writes to Amazon Redshift and by certain AWS Glue transforms.

Job bookmark

Specify how AWS Glue processes state information when the job runs. You can have it remember previously processed data, update state information, or ignore state information.
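Outside the console, this setting corresponds to the --job-bookmark-option special parameter. The following hedged sketch enables bookmarks for a single run; the job name is a placeholder.

import boto3

glue = boto3.client("glue")

# Enable job bookmarks for this run; the other accepted values are
# job-bookmark-pause and job-bookmark-disable
glue.start_job_run(
    JobName="example-spark-job",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"})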

Job metrics

Enable or disable the creation of Amazon CloudWatch metrics when this job runs. To see profiling data, you must enable this option. For more information about how to enable and visualize metrics, see Job Monitoring and Debugging.

Tags

Tag your job with a Tag key and an optional Tag value. After tag keys are created, they are read-only. Use tags on some resources to help you organize and identify them. For more information, see AWS Tags in AWS Glue.

Server-side encryption

If you select this option, when the ETL job writes to Amazon S3, the data is encrypted at rest using SSE-S3 encryption. Both your Amazon S3 data target and any data that is written to an Amazon S3 temporary directory are encrypted. For more information, see Protecting Data Using Server-Side Encryption with Amazon S3-Managed Encryption Keys (SSE-S3).

Important

Currently, a security configuration overrides any server-side encryption (SSE-S3) setting passed as an ETL job parameter. Thus, if both a security configuration and an SSE-S3 parameter are associated with a job, the SSE-S3 parameter is ignored.

Script libraries

If your script requires it, you can specify locations for the following:

  • Python library path

  • Dependent jars path

  • Referenced files path

You can define the comma-separated Amazon S3 paths for these libraries when you define a job. You can override these paths when you run the job. For more information, see Providing Your Own Custom Scripts.
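When you work with the API instead of the console, these three locations map to the --extra-py-files, --extra-jars, and --extra-files special parameters. The following hedged sketch overrides them for a single run; the job name and Amazon S3 paths are placeholders.

import boto3

glue = boto3.client("glue")

# Override the library locations for one run with comma-separated S3 paths
glue.start_job_run(
    JobName="example-spark-job",
    Arguments={
        "--extra-py-files": "s3://example-bucket/libs/helpers.py,s3://example-bucket/libs/utils.py",
        "--extra-jars": "s3://example-bucket/jars/example-driver.jar",
        "--extra-files": "s3://example-bucket/config/settings.json"})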

Worker type

The following worker types are available:

  • Standard – When you choose this type, you also provide a value for Maximum capacity. Maximum capacity is the number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. The Standard worker type has a 50 GB disk and 2 executors.

  • G.1X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 1 DPU (4 vCPU, 16 GB of memory, 64 GB disk), and provides 1 executor per worker. We recommend this worker type for memory-intensive jobs.

  • G.2X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 2 DPU (8 vCPU, 32 GB of memory, 128 GB disk), and provides 1 executor per worker. We recommend this worker type for memory-intensive jobs and jobs that run ML transforms.

You are charged an hourly rate based on the number of DPUs used to run your ETL jobs. For more information, see the AWS Glue pricing page.

When you configure a job using the console and specify a Worker type of Standard, the Maximum capacity is set and the Number of workers becomes the value of Maximum capacity - 1. If you use the AWS Command Line Interface (AWS CLI) or AWS SDK, you can specify the Max capacity parameter, or you can specify both Worker type and the Number of workers. For more information, see Jobs.
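The following Boto3 sketch illustrates the two styles described above. It is a hedged example; the job names, role, and script location are placeholders.

import boto3

glue = boto3.client("glue")

# G.1X workers: specify the Worker type and the Number of workers
glue.create_job(
    Name="example-g1x-job",
    Role="ExampleGlueServiceRole",
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://example-bucket/scripts/spark_job.py"},
    WorkerType="G.1X",
    NumberOfWorkers=10)

# Standard workers: specify Maximum capacity (in DPUs) instead
glue.create_job(
    Name="example-standard-job",
    Role="ExampleGlueServiceRole",
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://example-bucket/scripts/spark_job.py"},
    MaxCapacity=10.0)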

Number of workers

The number of workers of a defined workerType that are allocated when a job runs.

With G.1X and G.2X Worker types, you must specify the number of workers of that type.

The maximum number of workers you can define is 299 for G.1X and 149 for G.2X.

Maximum capacity

The maximum number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. You are charged an hourly rate based on the number of DPUs used to run your ETL jobs. For more information, see the AWS Glue pricing page.

With the Standard Worker type, you must specify the maximum capacity of the job.

Choose an integer from 2 to 100. The default is 10. This job type cannot have a fractional DPU allocation.

Max concurrency

Sets the maximum number of concurrent runs that are allowed for this job. The default is 1. An error is returned when this threshold is reached. The maximum value you can specify is controlled by a service limit. For example, if a previous run of a job is still running when a new run is started, you might want an error returned to prevent two instances of the same job from running concurrently.

Job timeout

Sets the maximum execution time in minutes. The default is 2880 minutes (48 hours). If the execution time exceeds this limit, the job run state changes to “TIMEOUT”.

Delay notification threshold

Sets the threshold (in minutes) before a delay notification is sent. You can set this threshold to send notifications when a RUNNING, STARTING, or STOPPING job run takes more than an expected number of minutes.

Number of retries

Specify the number of times, from 0 to 10, that AWS Glue should automatically restart the job if it fails.
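In the API, the four properties above (Max concurrency, Job timeout, Delay notification threshold, and Number of retries) map to the ExecutionProperty, Timeout, NotificationProperty, and MaxRetries fields. The following hedged Boto3 sketch sets all four; the job name, role, script location, and values are placeholders.

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="example-spark-job",
    Role="ExampleGlueServiceRole",
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://example-bucket/scripts/spark_job.py"},
    ExecutionProperty={"MaxConcurrentRuns": 1},     # Max concurrency
    Timeout=2880,                                   # Job timeout, in minutes
    NotificationProperty={"NotifyDelayAfter": 60},  # Delay notification threshold, in minutes
    MaxRetries=1)                                   # Number of retries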

Job parameters

A set of key-value pairs that are passed as named parameters to the script invoked by the job. These are default values that are used when the script runs, but you can override them at run time. The key name is prefixed with -- (for example, --myKey), and the value is value-for-myKey:

'--myKey' : 'value-for-myKey'

For more examples, see Python parameters in Passing and Accessing Python Parameters in AWS Glue.
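Inside the script, you can read such a parameter with getResolvedOptions; the key is retrieved without its -- prefix. The following sketch continues the --myKey example above.

import sys
from awsglue.utils import getResolvedOptions

# Resolve the job parameter that was passed as --myKey
args = getResolvedOptions(sys.argv, ['myKey'])
print(args['myKey'])  # prints value-for-myKey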

Source

Specify a catalog table.

Target

Do one of the following:

  • To specify an Amazon S3 path or JDBC data store, choose Create tables in your data target.

  • To specify a catalog table, choose Use tables in the data catalog and update your data target.

For Amazon S3 target locations, provide the location of a directory where your output is written. Confirm that there isn't a file with the same name as the target path directory in the path. For JDBC targets, AWS Glue creates schema objects as needed if the specified objects do not exist.

Note

Source and target are not listed under the console Details tab for a job. Review the script to see source and target details.
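For example, continuing the script sketch from earlier in this section, a script that targets an existing Data Catalog table might contain a write call along these lines; the database and table names are placeholders.

# Write to an existing Data Catalog table instead of a raw Amazon S3 path
glueContext.write_dynamic_frame.from_catalog(
    frame=mapped,
    database="example_db",
    table_name="example_output_table",
    transformation_ctx="catalog_sink")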

Use Glue Data Catalog as the Hive metastore

Enables you to use the AWS Glue Data Catalog as a Spark Hive metastore.
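When this option is enabled, your script can query Data Catalog tables directly through Spark SQL, as in the following hedged sketch that continues the earlier script example; the database and table names are placeholders.

# With the Data Catalog as the Hive metastore, Spark SQL resolves
# table names against the catalog (placeholder database and table)
spark = glueContext.spark_session
counts = spark.sql("SELECT COUNT(*) FROM example_db.example_table")
counts.show()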

For more information about adding a job using the AWS Glue console, see Working with Jobs on the AWS Glue Console.
