
AWS Glue Components

AWS Glue provides a console and API operations to set up and manage your extract, transform, and load (ETL) workload. You can use API operations through several language-specific SDKs and the AWS Command Line Interface (AWS CLI). For information about using the AWS CLI, see AWS Command Line Interface Reference.

AWS Glue uses the AWS Glue Data Catalog to store metadata about data sources, transforms, and targets. The Data Catalog is a drop-in replacement for the Apache Hive Metastore. The AWS Glue Jobs system provides a managed infrastructure for defining, scheduling, and running ETL operations on your data. For more information about the AWS Glue API, see AWS Glue API.

AWS Glue Console

You use the AWS Glue console to define and orchestrate your ETL workflow. The console calls several API operations in the AWS Glue Data Catalog and AWS Glue Jobs system to perform the following tasks:

  • Define AWS Glue objects such as jobs, tables, crawlers, and connections.

  • Schedule when crawlers run.

  • Define events or schedules for job triggers.

  • Search and filter lists of AWS Glue objects.

  • Edit transformation scripts.

AWS Glue Data Catalog

The AWS Glue Data Catalog is your persistent metadata store. It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive Metastore.

Each AWS account has one AWS Glue Data Catalog in each AWS Region. It provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos, and then use that metadata to query and transform the data.

You can use AWS Identity and Access Management (IAM) policies to control access to the data sources managed by the AWS Glue Data Catalog. These policies allow different groups in your enterprise to safely publish data to the wider organization while protecting sensitive information. IAM policies let you clearly and consistently define which users have access to which data, regardless of its location.

The Data Catalog also provides comprehensive audit and governance capabilities, with schema change tracking, lineage of data, and data access controls. You can audit changes to data schemas and track the movement of data across systems, to ensure that data is not inappropriately modified or inadvertently shared.

For information about how to use the AWS Glue Data Catalog, see Populating the AWS Glue Data Catalog. For information about how to program using the Data Catalog API, see Catalog API.
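
As a rough illustration of reading the Data Catalog programmatically, the following Python sketch uses the AWS SDK for Python (boto3) to walk the databases and tables in the catalog. The names summarize_table and list_catalog_tables are illustrative helpers, not part of the AWS Glue API, and the sketch assumes that boto3 is installed and that AWS credentials and a Region are already configured.

```python
def summarize_table(table):
    """Reduce one Table structure from a GetTables response to a small dict."""
    descriptor = table.get("StorageDescriptor", {})
    return {
        "database": table.get("DatabaseName"),
        "name": table["Name"],
        "location": descriptor.get("Location"),
        "columns": [
            (column["Name"], column.get("Type"))
            for column in descriptor.get("Columns", [])
        ],
    }


def list_catalog_tables(glue=None):
    """Yield a summary for every table in the Data Catalog of the current Region."""
    if glue is None:
        # Imported lazily so the helpers above can be used without boto3.
        import boto3

        glue = boto3.client("glue")

    # Both operations are paginated, so use paginators instead of single calls.
    for database_page in glue.get_paginator("get_databases").paginate():
        for database in database_page["DatabaseList"]:
            table_pages = glue.get_paginator("get_tables").paginate(
                DatabaseName=database["Name"]
            )
            for table_page in table_pages:
                for table in table_page["TableList"]:
                    yield summarize_table(table)
```

Calling list_catalog_tables() with no argument creates a client from the default credentials. Passing a client explicitly makes it easier to target a specific Region or to substitute a stub.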

AWS Glue Crawlers and Classifiers

AWS Glue also lets you set up crawlers that scan data in a variety of repositories, classify it, extract schema information from it, and store the metadata automatically in the AWS Glue Data Catalog. ETL operations can then use that metadata as their guide.

For information about how to set up crawlers and classifiers, see Cataloging Tables with a Crawler. For information about how to program crawlers and classifiers using the AWS Glue API, see Crawlers and Classifiers API.
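
To make the crawler setup more concrete, here is a minimal Python sketch that defines a crawler over an Amazon S3 path and starts it through boto3. The helper names, the crawler name, the role ARN, the database name, and the bucket path are all hypothetical placeholders. The sketch assumes an existing IAM role that AWS Glue can assume and that has read access to the path.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build the keyword arguments for the Glue CreateCrawler operation."""
    request = {
        "Name": name,
        "Role": role_arn,
        # Catalog database that receives the table definitions the crawler finds.
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Glue schedules use cron syntax, for example "cron(0 2 * * ? *)".
        request["Schedule"] = schedule
    return request


def create_and_start_crawler(request, glue=None):
    """Create the crawler described by `request`, then start one run."""
    if glue is None:
        import boto3  # requires AWS credentials and a configured Region

        glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values, shown only to illustrate the request shape.
example_request = build_crawler_request(
    name="orders-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="sales_raw",
    s3_path="s3://example-bucket/orders/",
    schedule="cron(0 2 * * ? *)",
)
```

Separating the request from the call keeps the crawler definition easy to inspect or store as configuration before anything is created in the account.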

AWS Glue ETL Operations

Using the metadata in the Data Catalog, AWS Glue can autogenerate PySpark (the Python API for Apache Spark) scripts with AWS Glue extensions that you can use and modify to perform various ETL operations. For example, you can extract, clean, and transform raw data, and then store the result in a different repository, where it can be queried and analyzed. Such a script might convert a CSV file into a relational form and save it in Amazon Redshift.

For more information about how to use AWS Glue ETL capabilities, see ETL Scripts API.
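
The CSV-to-Redshift case described above might look roughly like the following PySpark script. This is a hand-written sketch in the general shape of such scripts, not output produced by AWS Glue. The database, table, and connection names and the column mappings are hypothetical. The awsglue modules are available only inside the AWS Glue job environment, so they are imported inside run.

```python
import sys

# (source column, source type, target column, target type); hypothetical columns.
COLUMN_MAPPINGS = [
    ("order_id", "string", "order_id", "long"),
    ("amount", "string", "amount", "double"),
    ("ordered_at", "string", "ordered_at", "timestamp"),
]


def run(argv):
    """Read a cataloged CSV table, cast its columns, and load it into Amazon Redshift."""
    # These modules exist only in the AWS Glue job runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME", "TempDir"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Extract: the schema comes from the Data Catalog, not from the script.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="sales_raw", table_name="orders_csv"
    )

    # Transform: rename and cast columns into a relational shape.
    typed = ApplyMapping.apply(frame=source, mappings=COLUMN_MAPPINGS)

    # Load: write through a Glue connection, staging data in the job's TempDir.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=typed,
        catalog_connection="redshift-connection",
        connection_options={"dbtable": "public.orders", "database": "analytics"},
        redshift_tmp_dir=args["TempDir"],
    )
    job.commit()


# Run only when launched with Glue-style job arguments (a guard added for this sketch).
if __name__ == "__main__" and any(a.startswith("--JOB_NAME") for a in sys.argv):
    run(sys.argv)
```

Keeping the mappings in a module-level list is a design choice for this sketch: it makes the transformation easy to review and adjust separately from the job setup.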

The AWS Glue Jobs System

The AWS Glue Jobs system provides managed infrastructure to orchestrate your ETL workflow. You can create jobs in AWS Glue that automate the scripts you use to extract, transform, and transfer data to different locations. Jobs can be scheduled and chained, or they can be triggered by events such as the arrival of new data.

For more information about using the AWS Glue Jobs system, see Running and Monitoring Your Data Warehouse. For information about programming using the AWS Glue Jobs system API, see Jobs API.
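
The scheduling and chaining described above can be sketched with boto3 as follows. The helper names and all job names, trigger names, role ARNs, and script locations are hypothetical. The sketch covers only a scheduled trigger and a conditional trigger that chains two jobs, and it leaves out triggers driven by the arrival of new data.

```python
def build_job_request(name, role_arn, script_location):
    """Keyword arguments for CreateJob: a Spark ETL job that runs a script from S3."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {"Name": "glueetl", "ScriptLocation": script_location},
    }


def build_schedule_trigger(name, job_name, cron):
    """Keyword arguments for CreateTrigger: start `job_name` on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def build_chained_trigger(name, upstream_job, downstream_job):
    """Keyword arguments for CreateTrigger: start one job after another succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


def register(job_requests, trigger_requests, glue=None):
    """Create the jobs first, then the triggers that refer to them."""
    if glue is None:
        import boto3  # requires AWS credentials and a configured Region

        glue = boto3.client("glue")
    for job_request in job_requests:
        glue.create_job(**job_request)
    for trigger_request in trigger_requests:
        glue.create_trigger(**trigger_request)
```

For example, a nightly load followed by a reporting step could combine build_schedule_trigger("nightly", "load-orders", "cron(0 3 * * ? *)") with build_chained_trigger("after-load", "load-orders", "build-report").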