AWS Glue
Developer Guide

Running and Monitoring Your Data Warehouse

You can automate the operation of your data warehouse. After you set up the AWS Glue Data Catalog with the metadata that is required to run extract, transform, and load (ETL) jobs, AWS Glue provides metrics about crawlers and jobs, along with statistics about the health of your environment, that you can monitor. You can invoke crawlers and jobs automatically on a time-based schedule defined by a cron expression. You can also start jobs when an event-based trigger fires.
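As an illustration of the two kinds of automation described above, the following sketch builds the request parameters for a scheduled trigger and for a trigger that fires when another job completes. It assumes the boto3 `glue` client and its `create_trigger` call. The trigger and job names (`nightly-etl`, `load_orders`, `aggregate_orders`) are hypothetical, and treating "event-based" as "an upstream job succeeded" (a conditional trigger) is an assumption made for this example.

```python
def scheduled_trigger(name, job_name, cron_expr):
    """Build keyword arguments for a time-based trigger.

    cron_expr uses the six-field AWS cron syntax:
    minutes hours day-of-month month day-of-week year.
    """
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_expr})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger(name, job_name, upstream_job):
    """Build keyword arguments for a trigger that fires after upstream_job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def create_triggers(trigger_defs):
    """Send each definition to AWS Glue (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the builders above work without boto3

    glue = boto3.client("glue")
    for kwargs in trigger_defs:
        glue.create_trigger(**kwargs)


# Hypothetical names: run load_orders daily at 02:00 UTC,
# then run aggregate_orders once load_orders has succeeded.
nightly = scheduled_trigger("nightly-etl", "load_orders", "0 2 * * ? *")
followup = conditional_trigger("after-load", "aggregate_orders", "load_orders")
```

Calling `create_triggers([nightly, followup])` would register both triggers. Keeping the parameter builders separate from the API call makes the definitions easy to inspect before anything is created in your account.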

The main objective of AWS Glue is to provide an easy way to extract and transform your data from source to target. To accomplish this objective, an ETL job follows these typical steps:

  1. A trigger fires to initiate a job run. The trigger can fire on a recurring schedule or when a dependency is satisfied.

  2. The job extracts data from your source. If required, connection properties are used to access your source.

  3. The job transforms your data using a script that you created and the values of any arguments. The script contains the PySpark Python code that transforms your data.

  4. The transformed data is loaded to your data targets. If required, connection properties are used to access the target.

  5. Statistics are collected about the job run and are written to your Data Catalog.

The following diagram shows the ETL workflow containing these five steps.

Figure: Dataflow showing extract, transform, and load (ETL) in AWS Glue in five basic steps.
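Seen from the caller's side, steps 1, 3, and 5 of the workflow can be sketched as follows: start a job run with script arguments, then reduce the run description to a few fields worth monitoring. This is a minimal sketch that assumes the boto3 `glue` client and its `start_job_run` and `get_job_run` calls. The job name, argument names, and sample run record are hypothetical, and the summary fields are one possible choice rather than a prescribed set.

```python
# Job run states treated as final in this sketch.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def job_run_request(job_name, **script_args):
    """Build keyword arguments for glue.start_job_run.

    Script arguments are passed as "--name": "value" string pairs.
    """
    return {
        "JobName": job_name,
        "Arguments": {f"--{key}": str(value) for key, value in script_args.items()},
    }


def summarize_job_run(job_run):
    """Reduce a JobRun description to the fields most useful for monitoring."""
    state = job_run.get("JobRunState", "UNKNOWN")
    return {
        "job": job_run.get("JobName"),
        "run_id": job_run.get("Id"),
        "state": state,
        "succeeded": state == "SUCCEEDED",
        "seconds": job_run.get("ExecutionTime", 0),
        "error": job_run.get("ErrorMessage"),
    }


def run_and_summarize(request, poll_seconds=30):
    """Start a run and poll until it finishes (requires boto3 and AWS credentials)."""
    import time

    import boto3  # imported here so the helpers above work without boto3

    glue = boto3.client("glue")
    run_id = glue.start_job_run(**request)["JobRunId"]
    while True:
        job_run = glue.get_job_run(JobName=request["JobName"], RunId=run_id)["JobRun"]
        if job_run["JobRunState"] in TERMINAL_STATES:
            return summarize_job_run(job_run)
        time.sleep(poll_seconds)


# Hypothetical job and arguments.
request = job_run_request(
    "load_orders", source_path="s3://example-bucket/raw/", run_date="2024-01-01"
)

# Made-up run record in the shape returned under the "JobRun" key.
sample_run = {
    "JobName": "load_orders",
    "Id": "jr_example",
    "JobRunState": "SUCCEEDED",
    "ExecutionTime": 95,
}
summary = summarize_job_run(sample_run)
```

Inside the job script, arguments passed this way are typically read by name with `getResolvedOptions` from `awsglue.utils`. The extract, transform, and load steps themselves run in that script and are not shown here.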