AWS Glue
Developer Guide

Populating the AWS Glue Data Catalog

The AWS Glue Data Catalog contains references to data that is used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. To create your data warehouse, you must catalog this data. The AWS Glue Data Catalog is an index to the location, schema, and runtime metrics of your data. You use the information in the Data Catalog to create and monitor your ETL jobs. Typically, you run a crawler to take inventory of the data in your data stores, but there are other ways to add metadata tables to your Data Catalog.

You can add table definitions to the AWS Glue Data Catalog in the following ways:

  • Run a crawler that connects to one or more data stores, determines the data structures, and writes tables into the Data Catalog. You can run your crawler on a schedule. For more information, see Cataloging Tables with a Crawler.

  • Use the AWS Glue console to create a table in the AWS Glue Data Catalog. For more information, see Working with Tables on the AWS Glue Console.

  • Use the CreateTable operation in the AWS Glue API to create a table in the AWS Glue Data Catalog.
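As a rough illustration of the crawler and API options above, the following Python sketch builds the request parameters that the boto3 Glue client accepts for create_crawler and create_table. The crawler name, IAM role ARN, database name, S3 paths, and column list are hypothetical placeholders, and the sketch assumes a CSV data set in Amazon S3 and a target database that already exists in the Data Catalog.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for the boto3 Glue client's create_crawler()."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Crawler schedules are cron expressions, for example "cron(0 12 * * ? *)".
        request["Schedule"] = schedule
    return request


def build_table_request(database, table, s3_location, columns):
    """Build keyword arguments for the boto3 Glue client's create_table().

    columns is a list of (name, type) pairs. The CSV input format, output
    format, and SerDe below are assumptions made for this example.
    """
    return {
        "DatabaseName": database,
        "TableInput": {
            "Name": table,
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {"classification": "csv"},
            "StorageDescriptor": {
                "Columns": [{"Name": n, "Type": t} for n, t in columns],
                "Location": s3_location,
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {"field.delim": ","},
                },
            },
        },
    }


def create_catalog_entries(glue_client):
    """Submit both requests with a Glue client; all names are hypothetical."""
    glue_client.create_crawler(
        **build_crawler_request(
            name="example-sales-crawler",
            role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
            database="example_db",
            s3_path="s3://example-bucket/sales/",
            schedule="cron(0 12 * * ? *)",  # every day at 12:00 UTC
        )
    )
    glue_client.create_table(
        **build_table_request(
            database="example_db",
            table="sales_manual",
            s3_location="s3://example-bucket/sales-manual/",
            columns=[("order_id", "bigint"), ("amount", "double")],
        )
    )
```

With AWS credentials and a Region configured, calling create_catalog_entries(boto3.client("glue")) would submit both requests. Keeping the request builders separate from the client call makes it possible to inspect the parameters without contacting AWS.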

The following workflow diagram shows how AWS Glue crawlers interact with data stores and other elements to populate the Data Catalog.

      [Figure: Workflow showing how an AWS Glue crawler populates the Data Catalog in 5 basic steps.]

The following is the general workflow for how a crawler populates the AWS Glue Data Catalog:

  1. A crawler runs any custom classifiers that you choose to infer the schema of your data. You provide the code for custom classifiers, and they run in the order that you specify.

    The first custom classifier to successfully recognize the structure of your data is used to create a schema. Custom classifiers lower in the list are skipped.

  2. If no custom classifier matches your data's schema, built-in classifiers try to recognize your data's schema.

  3. The crawler connects to the data store. Some data stores require connection properties for crawler access.

  4. The inferred schema is created for your data.

  5. The crawler writes metadata to the Data Catalog. A table definition contains metadata about the data in your data store. The table is written to a database, which is a container of tables in the Data Catalog. Attributes of a table include classification, which is a label created by the classifier that inferred the table schema.
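The classifier ordering and the final catalog write in the steps above can be modeled with a small, self-contained Python sketch. It is only an illustration of the described behavior, not how AWS Glue is implemented: the classifier functions, the in-memory catalog dictionary, and every name in the code are hypothetical, and the data sample is assumed to have been read from the data store already.

```python
import csv
import json


def classify_json(sample):
    """Hypothetical custom classifier: recognizes a single JSON object."""
    try:
        record = json.loads(sample)
    except ValueError:
        return None
    if not isinstance(record, dict):
        return None
    return {"classification": "json", "columns": sorted(record)}


def classify_csv(sample):
    """Hypothetical stand-in for a built-in classifier: reads a CSV header row."""
    lines = sample.splitlines()
    if not lines or "," not in lines[0]:
        return None
    header = next(csv.reader([lines[0]]))
    return {"classification": "csv", "columns": header}


def infer_schema(sample, custom_classifiers, builtin_classifiers):
    """Try custom classifiers in order, then built-in ones; first match wins."""
    for classifier in list(custom_classifiers) + list(builtin_classifiers):
        schema = classifier(sample)
        if schema is not None:
            # Classifiers later in the list are skipped once one matches.
            return schema
    return None


def write_table(catalog, database, table, schema):
    """Record a table definition in a database (a container of tables)."""
    catalog.setdefault(database, {})[table] = {
        "classification": schema["classification"],
        "columns": schema["columns"],
    }


# Usage: a CSV sample is not recognized by the custom JSON classifier,
# so the built-in stand-in infers the schema instead.
catalog = {}
schema = infer_schema("id,name\n1,widget", [classify_json], [classify_csv])
if schema is not None:
    write_table(catalog, "example_db", "products", schema)
```

In this model the order of the custom classifier list matters, because a match by an earlier classifier prevents later ones from running, which mirrors step 1.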