AWS Glue Developer Guide

Cataloging Tables with a Crawler

You can use a crawler to populate the AWS Glue Data Catalog with tables. This is the primary method used by most AWS Glue users. You add a crawler within your Data Catalog to traverse your data stores. The output of the crawler consists of one or more metadata tables that are defined in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these metadata tables as sources and targets.

Your crawler uses an AWS Identity and Access Management (IAM) role for permission to access your data stores and the Data Catalog. The role you pass to the crawler must have permission to access Amazon S3 paths that are crawled. Some data stores require additional authorization to establish a connection. For more information, see Adding a Connection to Your Data Store.

For more information about using the AWS Glue console to add a crawler, see Working with Crawlers on the AWS Glue Console.
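If you prefer to define crawlers programmatically, the following is a minimal sketch using the AWS SDK for Python (Boto3). The crawler name, role ARN, database name, and Amazon S3 path are hypothetical placeholders.

  import boto3

  glue = boto3.client("glue")

  # The role passed to the crawler must have permission to access the
  # Amazon S3 paths that are crawled.
  glue.create_crawler(
      Name="my-s3-crawler",
      Role="arn:aws:iam::111122223333:role/AWSGlueServiceRole-Demo",
      DatabaseName="my_catalog_db",  # Data Catalog database that receives the metadata tables
      Targets={"S3Targets": [{"Path": "s3://MyBucket/MyFolder/"}]},
  )

  glue.start_crawler(Name="my-s3-crawler")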

Defining a Crawler in the AWS Glue Data Catalog

When you define a crawler, you choose one or more classifiers that evaluate the format of your data to infer a schema. When the crawler runs, the first classifier in your list to successfully recognize your data store is used to create a schema for your table. You can use built-in classifiers or define your own. AWS Glue provides built-in classifiers to infer schemas from files in common formats, including JSON, CSV, and Avro. For the current list of built-in classifiers in AWS Glue, see Built-In Classifiers in AWS Glue.
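If the built-in classifiers do not recognize your data, you can define a custom classifier and list it on the crawler. The following is a minimal sketch using Boto3; the classifier name, classification value, and grok pattern are hypothetical.

  import boto3

  glue = boto3.client("glue")

  # A custom grok classifier for a hypothetical application log format.
  glue.create_classifier(
      GrokClassifier={
          "Name": "my-app-log-classifier",
          "Classification": "app_log",
          "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
      }
  )

When you define the crawler, you list the classifier name in the crawler's Classifiers parameter; custom classifiers in that list are tried, in order, before the built-in classifiers.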

Which Data Stores Can I Crawl?

A crawler can crawl both file-based and relational, table-based data stores:

  • Amazon Simple Storage Service (Amazon S3)

  • Amazon Redshift

  • Amazon Relational Database Service (Amazon RDS)

    • Amazon Aurora

    • MariaDB

    • MySQL

    • PostgreSQL

  • Publicly accessible databases

    • Amazon Aurora

    • MariaDB

    • MySQL

    • PostgreSQL

When you define an Amazon S3 data store to crawl, you can choose whether to crawl a path in your account or another account. You choose the path to be crawled using the form Bucket/Folder/File. The output of the crawler is one or more metadata tables defined in the AWS Glue Data Catalog. A table is created for one or more files found in your data store. If all the Amazon S3 files in a folder have the same schema, the crawler creates one table. Also, if the Amazon S3 object is partitioned, only one metadata table is created.

If the data store that is being crawled is a relational database, the output is also a set of metadata tables defined in the AWS Glue Data Catalog. You can choose all databases, schemas, and tables in your data store. Alternatively, you can choose the tables to be crawled using the form Database/Schema/Table. When you crawl a relational database, you must provide connection information that includes authorization credentials.
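The following is a minimal sketch of creating that connection with Boto3, assuming a PostgreSQL data store; the connection name, JDBC URL, and credentials are hypothetical placeholders.

  import boto3

  glue = boto3.client("glue")

  glue.create_connection(
      ConnectionInput={
          "Name": "my-postgres-connection",
          "ConnectionType": "JDBC",
          "ConnectionProperties": {
              "JDBC_CONNECTION_URL": "jdbc:postgresql://myhost.example.com:5432/MyDatabase",
              "USERNAME": "crawler_user",     # placeholder credentials
              "PASSWORD": "example-password",
          },
      }
  )

A crawler then references this connection by name when you add the relational database as a target.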

When selecting a path to include or exclude, a crawler follows these rules:

  1. For each Amazon S3 data store you want to crawl, you specify a single include path with the syntax bucket-name/folder-name/file-name.ext. To crawl all objects in a bucket, you specify % for the folder-name/file-name.ext in the include path. To exclude objects from the crawler, you specify an exclude path that is relative to the include path. All folders and files that start with the exclude path name, which are in the folder specified by the include path, are skipped. If you have multiple folders or files to exclude, you can specify multiple exclude paths.

    For example, if you specify an include path for Amazon S3 as MyBucket/MyFolder/ and exclude paths Unwanted and Ignore, all objects in bucket MyBucket and in folder MyFolder are crawled except for any folders or files that start with the names Unwanted and Ignore. However, if folders or files with these names appear at a lower folder level, for example MyBucket/MyFolder/Subfolder01/Unwanted, they are not skipped.

    In this example, the following folders and files are skipped if they appear at the next level under the include path:

    • Unwanted01

    • Unwanted02

    • Ignore.txt

  2. For each relational database you want to crawl, you specify a single include path with the syntax database-name/schema-name/table-name. For relational database engines that do not use a schema-name, such as MySQL, you can specify database-name/table-name. You can substitute the percent (%) character, which acts as a multiple-character wildcard, for a schema-name or a table-name. For example, an include path of MyDatabase/MySchema/% crawls all tables in database MyDatabase and schema MySchema, as shown in the sketch after this list.

    The exclude path evaluation is relative to its corresponding include path. For example, if you specify an include path for a JDBC database as MyDatabase/MySchema/% and exclude path MyTrash, all tables in the database MyDatabase and in the schema MySchema are crawled except for table MyTrash.

  3. You must specify one include path for each data store that is defined in your crawler. For each include path, you can specify zero or more exclude paths.
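The following is a minimal sketch of how the include and exclude paths described in these rules map onto crawler targets in the Boto3 API. The names reuse the examples above; the exclusion patterns are expressed in the API's glob-style syntax and are illustrative only.

  import boto3

  glue = boto3.client("glue")

  glue.create_crawler(
      Name="my-multi-target-crawler",
      Role="arn:aws:iam::111122223333:role/AWSGlueServiceRole-Demo",
      DatabaseName="my_catalog_db",
      Targets={
          # Rule 1: one include path per Amazon S3 data store, with optional exclusions.
          "S3Targets": [
              {
                  "Path": "s3://MyBucket/MyFolder/",
                  "Exclusions": ["Unwanted*", "Ignore*"],
              }
          ],
          # Rule 2: one include path per relational database, with % as a wildcard.
          "JdbcTargets": [
              {
                  "ConnectionName": "my-postgres-connection",
                  "Path": "MyDatabase/MySchema/%",
                  "Exclusions": ["MyTrash"],
              }
          ],
      },
  )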

What Happens When a Crawler Runs?

The metadata tables that a crawler creates are contained in a database as defined by the crawler. If your crawler does not define a database, your tables are placed in the default database. In addition, each table has a classification column that is filled in by the classifier that first successfully recognized the data store.

The crawler can process both relational database and file data stores. If the file that is crawled is compressed, the crawler must download it to process it.

The crawler generates the names for the tables it creates. The names of the tables that are stored in the AWS Glue Data Catalog follow these rules:

  • Only alphanumeric characters and underscore (_) are allowed.

  • Any custom prefix cannot be longer than 64 characters.

  • A name cannot be longer than 128 characters. The crawler truncates generated names to fit within the limit.

  • If duplicate table names are encountered, the crawler adds a hash string suffix to the name.

If your crawler runs more than once, perhaps on a schedule, it looks for new or changed files or tables in your data store. The output of the crawler includes new tables found since a previous run.
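The following is a minimal sketch of inspecting a crawler's output with Boto3 after a run, reusing the hypothetical names from earlier in this section. Crawler-created tables typically record the classification as a table parameter.

  import boto3

  glue = boto3.client("glue")

  # Check whether the crawler is READY, RUNNING, or STOPPING.
  state = glue.get_crawler(Name="my-s3-crawler")["Crawler"]["State"]
  print("Crawler state:", state)

  # List the metadata tables the crawler wrote to its target database.
  tables = glue.get_tables(DatabaseName="my_catalog_db")["TableList"]
  for table in tables:
      print(table["Name"], table.get("Parameters", {}).get("classification"))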

What Happens When a Crawler Detects Schema Changes?

When a crawler runs against a previously crawled data store, it might discover that a schema has changed or that objects in the data store have been deleted. The crawler logs schema changes as it runs. You specify the behavior of the crawler when it finds changes in the schema.

When the crawler finds a changed schema, you choose one of the following actions:

  • Update the table in the AWS Glue Data Catalog. This is the default.

  • Ignore the change. Do not modify the table in the Data Catalog.

When the crawler finds a deleted object in the data store, you choose one of the following actions:

  • Delete the table from the Data Catalog.

  • Ignore the change. Do not modify the table in the Data Catalog.

  • Mark the table as deprecated in the Data Catalog. This is the default.
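When you define a crawler through the API, these choices correspond to the crawler's SchemaChangePolicy parameter. The following is a minimal sketch with Boto3 that sets the default behaviors described above; the crawler definition reuses the hypothetical names from earlier in this section.

  import boto3

  glue = boto3.client("glue")

  glue.create_crawler(
      Name="my-s3-crawler",
      Role="arn:aws:iam::111122223333:role/AWSGlueServiceRole-Demo",
      DatabaseName="my_catalog_db",
      Targets={"S3Targets": [{"Path": "s3://MyBucket/MyFolder/"}]},
      SchemaChangePolicy={
          "UpdateBehavior": "UPDATE_IN_DATABASE",    # or "LOG" to ignore schema changes
          "DeleteBehavior": "DEPRECATE_IN_DATABASE", # or "DELETE_FROM_DATABASE" or "LOG"
      },
  )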