AWS Glue
Developer Guide

Adding Jobs in AWS Glue

A job consists of the business logic that performs extract, transform, and load (ETL) work in AWS Glue. You can monitor job runs to understand runtime metrics such as success, duration, and start time. The output of a job is your transformed data, written to a location that you specify.

Job runs can be initiated by triggers that start a job when they fire. A job contains a script that connects to your source data, processes your data using the script's logic, and then writes it out to your data target. Your job can have multiple data sources and multiple data targets. You can use scripts that are generated by AWS Glue to transform data, or you can provide your own. The AWS Glue code generator can automatically create an Apache Spark API (PySpark) script given a source schema and target location or schema. You can use this script as a starting point and edit it to meet your goals.

AWS Glue can write output files in several data formats, including JSON, CSV, ORC (Optimized Row Columnar), Apache Parquet, and Apache Avro. For some of these data formats, AWS Glue can also write the output using common compression formats.
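As a rough illustration of selecting an output format, the following sketch builds the keyword arguments that a script might pass to a writer such as glueContext.write_dynamic_frame.from_options. The helper name sink_options and the bucket path are hypothetical, and the lowercase format strings are an assumption based on the formats listed above. Compression settings are left out because this section does not say which formats support them.

```python
# Formats named in this section, written as the lowercase strings that a
# Glue script is assumed to pass as its `format` argument.
SUPPORTED_FORMATS = {"json", "csv", "orc", "parquet", "avro"}


def sink_options(path, fmt):
    """Build writer keyword arguments for an Amazon S3 target.

    Hypothetical helper: it only assembles a dictionary and does not
    call AWS Glue.
    """
    fmt = fmt.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError("unsupported output format: " + fmt)
    return {
        "connection_type": "s3",
        "connection_options": {"path": path},
        "format": fmt,
    }


# Example bucket and prefix are placeholders.
options = sink_options("s3://example-bucket/output/", "Parquet")
```

Inside a job script, a dictionary like this could be unpacked into the writer call alongside the frame to write.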

Defining Job Properties

When you define your job in the AWS Glue console, you provide the following information to control the AWS Glue runtime environment:

IAM role

Specify the IAM role that authorizes access to the resources used to run the job and to the data stores that the job accesses. For more information about permissions for running jobs in AWS Glue, see Overview of Managing Access Permissions to Your AWS Glue Resources.

Generated or custom script

The code in the ETL script defines your job's procedural logic. You can choose whether the script that the job runs is generated by AWS Glue or provided by you. You provide the script name and location in Amazon Simple Storage Service (Amazon S3). Confirm that there isn't a file with the same name as the script directory in the path. To learn more about using scripts, see Editing Scripts in AWS Glue.

Temporary directory

Provide the location of a working directory in Amazon S3 where temporary intermediate results are written when AWS Glue runs the script. Confirm that there isn't a file with the same name as the temporary directory in the path.
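The naming check described here can be sketched in a few lines. This is a minimal illustration, assuming you already have a list of object keys from the bucket. The function name conflicts_with_directory and the sample keys are hypothetical. The same check would apply to any Amazon S3 location that a job treats as a directory.

```python
def conflicts_with_directory(existing_keys, directory_key):
    """Return True if an object is named exactly like the directory.

    existing_keys: iterable of object keys already in the bucket.
    directory_key: key prefix intended as a directory, e.g. "glue/temp/".
    """
    file_name = directory_key.rstrip("/")
    return any(key == file_name for key in existing_keys)


# Hypothetical listing of keys in a bucket.
keys = ["glue/temp", "glue/scripts/job.py"]

# An object named "glue/temp" clashes with the directory "glue/temp/".
temp_conflict = conflicts_with_directory(keys, "glue/temp/")

# No object is named "glue/scripts", so this directory is safe to use.
script_conflict = conflicts_with_directory(keys, "glue/scripts/")
```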

Job bookmark

Specify how AWS Glue processes state information when the job runs. You can have it remember previously processed data, update state information, or ignore state information.

Server-side encryption

If you select this option, when the ETL job writes to Amazon S3, the data is encrypted at rest using SSE-S3 encryption. Both your Amazon S3 data target and any data that is written to an Amazon S3 temporary directory are encrypted. For more information, see Protecting Data Using Server-Side Encryption with Amazon S3-Managed Encryption Keys (SSE-S3).

Script libraries

If your script requires it, you can specify locations for the following:

  • Python library path

  • Dependent jars path

  • Referenced files path

For more information, see Providing Your Own Custom Scripts.
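When a job is defined programmatically instead of in the console, these three locations are typically passed as job arguments. The sketch below assumes the argument names --extra-py-files, --extra-jars, and --extra-files, and assumes that multiple paths are joined with commas. Confirm both against the current AWS Glue special-parameters reference. The helper name library_arguments and all paths are hypothetical.

```python
def library_arguments(python_libs=(), jars=(), files=()):
    """Build job arguments for extra libraries (hypothetical helper).

    Each parameter is a sequence of complete Amazon S3 paths. Multiple
    paths are assumed to be joined with commas and no spaces.
    """
    args = {}
    if python_libs:
        args["--extra-py-files"] = ",".join(python_libs)
    if jars:
        args["--extra-jars"] = ",".join(jars)
    if files:
        args["--extra-files"] = ",".join(files)
    return args


# Placeholder bucket and object names.
lib_args = library_arguments(
    python_libs=["s3://example-bucket/libs/helpers.zip"],
    jars=[
        "s3://example-bucket/jars/a.jar",
        "s3://example-bucket/jars/b.jar",
    ],
)
```

Only the categories that are actually supplied appear in the result, so a job with no referenced files does not carry an empty argument.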

Concurrent DPUs per job run

A data processing unit (DPU) is a relative measure of the processing power that is used by a job. Choose the number of DPUs to allocate to each job run as an integer from 2 to 100. The default is 10. A single DPU provides processing capacity that consists of 4 vCPUs of compute capacity and 16 GB of memory.
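To make the arithmetic concrete, the following sketch converts a DPU count into total vCPUs and memory using the per-DPU figures and the 2 to 100 range stated above. The function name dpu_capacity is hypothetical.

```python
VCPUS_PER_DPU = 4        # per-DPU figures as stated in this section
MEMORY_GB_PER_DPU = 16
MIN_DPUS = 2
MAX_DPUS = 100
DEFAULT_DPUS = 10


def dpu_capacity(dpus=DEFAULT_DPUS):
    """Return total vCPUs and memory (GB) for a given number of DPUs."""
    if isinstance(dpus, bool) or not isinstance(dpus, int):
        raise TypeError("dpus must be an integer")
    if not MIN_DPUS <= dpus <= MAX_DPUS:
        raise ValueError("dpus must be between 2 and 100")
    return {
        "vcpus": dpus * VCPUS_PER_DPU,
        "memory_gb": dpus * MEMORY_GB_PER_DPU,
    }


# The default of 10 DPUs corresponds to 40 vCPUs and 160 GB of memory.
default_capacity = dpu_capacity()
```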

Max concurrency

Sets the maximum number of concurrent runs that are allowed for this job. An error is returned when this threshold is reached. For example, if a previous run of a job is still running when a new instance is started, you might want to return an error to prevent two instances of the same job from running concurrently.
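The behavior described above can be modeled locally. The following is a toy model only, not the AWS Glue service or its API. The class JobRunGate and the exception ConcurrentRunsExceeded are hypothetical names chosen for illustration.

```python
class ConcurrentRunsExceeded(Exception):
    """Raised by this local model when the run limit is reached."""


class JobRunGate:
    """Toy model of a per-job limit on concurrent runs."""

    def __init__(self, max_concurrent_runs=1):
        self.max_concurrent_runs = max_concurrent_runs
        self.active_runs = 0

    def start_run(self):
        # Refuse to start when the threshold has already been reached.
        if self.active_runs >= self.max_concurrent_runs:
            raise ConcurrentRunsExceeded("maximum concurrent runs reached")
        self.active_runs += 1

    def finish_run(self):
        if self.active_runs > 0:
            self.active_runs -= 1


gate = JobRunGate(max_concurrent_runs=1)
gate.start_run()  # the first run starts normally

try:
    gate.start_run()  # a second run while the first is still active
    second_run_started = True
except ConcurrentRunsExceeded:
    second_run_started = False
```

With a limit of 1, the second start attempt fails until finish_run is called for the first run.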

Number of retries

Specify the number of times, from 0 to 10, that AWS Glue should automatically restart the job if it fails.
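One way to read this setting is that a job run is attempted once, plus up to the configured number of restarts. The sketch below models that interpretation locally. It is an assumption about the semantics, not AWS Glue code. The names run_with_retries and flaky_job are hypothetical.

```python
def run_with_retries(run_job, retries=0):
    """Call run_job until it succeeds or the retries are used up.

    run_job: callable that returns on success and raises on failure.
    Returns the number of attempts made. Re-raises the last failure
    when all attempts fail.
    """
    if not 0 <= retries <= 10:
        raise ValueError("retries must be between 0 and 10")
    attempts = 0
    while True:
        attempts += 1
        try:
            run_job()
            return attempts
        except Exception:
            # Give up once the initial attempt and all retries have failed.
            if attempts > retries:
                raise


# Simulated job that fails twice and then succeeds.
outcomes = iter([False, False, True])


def flaky_job():
    if not next(outcomes):
        raise RuntimeError("simulated job failure")


attempts_used = run_with_retries(flaky_job, retries=3)
```

In this simulation the job succeeds on its third attempt, so one of the three permitted retries goes unused.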

Target path

For Amazon S3 target locations, provide the location of a directory in Amazon S3 where your output is written when AWS Glue runs the script. Confirm that there isn't a file with the same name as the target path directory in the path.

You also tell AWS Glue the default values for parameters that are required by the components of your job, such as transforms and the data stores that your job accesses.
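Taken together, the properties above could be gathered into a single job definition. The sketch below only builds a dictionary and makes no AWS calls. Its shape is assumed to mirror the AWS Glue CreateJob request as exposed by the boto3 create_job method, and the field AllocatedCapacity may have been superseded in later API versions, so check the current API reference before relying on it. The function name, the role name, the bucket paths, and the custom --target_path parameter are all hypothetical.

```python
BOOKMARK_OPTIONS = {
    "job-bookmark-enable",
    "job-bookmark-disable",
    "job-bookmark-pause",
}


def build_job_definition(name, role, script_location, temp_dir,
                         bookmark_option="job-bookmark-disable",
                         encrypt=False, dpus=10, max_concurrent_runs=1,
                         max_retries=0, extra_arguments=None):
    """Assemble a job definition dictionary (hypothetical helper)."""
    if bookmark_option not in BOOKMARK_OPTIONS:
        raise ValueError("unknown job bookmark option: " + bookmark_option)
    default_arguments = {
        "--TempDir": temp_dir,
        "--job-bookmark-option": bookmark_option,
    }
    if encrypt:
        # Assumed argument for SSE-S3 server-side encryption.
        default_arguments["--encryption-type"] = "sse-s3"
    # Default parameter values for the script, such as a target path.
    default_arguments.update(extra_arguments or {})
    return {
        "Name": name,
        "Role": role,
        "Command": {"Name": "glueetl", "ScriptLocation": script_location},
        "DefaultArguments": default_arguments,
        "ExecutionProperty": {"MaxConcurrentRuns": max_concurrent_runs},
        "MaxRetries": max_retries,
        "AllocatedCapacity": dpus,
    }


job_definition = build_job_definition(
    name="example-etl-job",
    role="ExampleGlueServiceRole",
    script_location="s3://example-bucket/scripts/example_job.py",
    temp_dir="s3://example-bucket/temp/",
    bookmark_option="job-bookmark-enable",
    encrypt=True,
    extra_arguments={"--target_path": "s3://example-bucket/output/"},
)
```

Keeping the definition as plain data makes it easy to review or store in version control before it is submitted to the service.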

For more information about adding a job using the AWS Glue console, see Working with Jobs on the AWS Glue Console.
