
Which data stores can I crawl?

Crawlers can crawl the following file-based and table-based data stores.

The data stores are grouped below by the access type that the crawler uses.

Native client

  • Amazon Simple Storage Service (Amazon S3)

  • Amazon DynamoDB

  • Delta Lake

JDBC

  • Amazon Redshift

  • Snowflake

  • Within Amazon Relational Database Service (Amazon RDS) or external to Amazon RDS:

      • Amazon Aurora

      • MariaDB

      • Microsoft SQL Server

      • MySQL

      • Oracle

      • PostgreSQL

MongoDB client

  • MongoDB

  • MongoDB Atlas

  • Amazon DocumentDB (with MongoDB compatibility)

Note

Currently, AWS Glue does not support crawlers for data streams.

For JDBC, MongoDB, MongoDB Atlas, and Amazon DocumentDB (with MongoDB compatibility) data stores, you must specify an AWS Glue connection that the crawler can use to connect to the data store. For Amazon S3, you can optionally specify a connection of type Network. A connection is a Data Catalog object that stores connection information, such as credentials, URL, Amazon Virtual Private Cloud information, and more. For more information, see Defining connections in the AWS Glue Data Catalog.
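A connection can also be defined programmatically. The following is a minimal sketch using the AWS SDK for Python (Boto3); the connection name, JDBC URL, and credentials are placeholders, and in practice you would store the password in AWS Secrets Manager rather than embed it inline.

```python
import boto3

glue = boto3.client("glue")

# Store a JDBC connection in the Data Catalog (all values are examples).
glue.create_connection(
    ConnectionInput={
        "Name": "my-jdbc-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://dbhost.example.com:5432/mydb",
            "USERNAME": "crawler_user",
            "PASSWORD": "placeholder-password",  # prefer AWS Secrets Manager
        },
    }
)
```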

The following are notes about the various data stores.

Amazon S3

You can choose to crawl a path in your account or in another account. If all the Amazon S3 files in a folder have the same schema, the crawler creates one table. Also, if the Amazon S3 object is partitioned, only one metadata table is created and partition information is added to the Data Catalog for that table.
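For illustration, a crawler over an Amazon S3 path can be created and started with Boto3 as follows; the crawler name, role ARN, database, and bucket path are placeholders.

```python
import boto3

glue = boto3.client("glue")

# One table is created per folder of same-schema files; partitioned
# objects become partitions of a single table.
glue.create_crawler(
    Name="s3-sales-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://amzn-s3-demo-bucket/sales/"}]},
)
glue.start_crawler(Name="s3-sales-crawler")
```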

Amazon S3 and Amazon DynamoDB

Crawlers use an AWS Identity and Access Management (IAM) role for permission to access your data stores. The role you pass to the crawler must have permission to access Amazon S3 paths and Amazon DynamoDB tables that are crawled.
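As a sketch, the data-access portion of such a role might be granted with an inline policy like the following (all resource names are placeholders; the role also needs AWS Glue service permissions, for example through the AWSGlueServiceRole managed policy).

```python
import json

import boto3

iam = boto3.client("iam")

# Grant read access to the crawled S3 path and DynamoDB table.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket",
                "arn:aws:s3:::amzn-s3-demo-bucket/sales/*",
            ],
        },
        {
            "Effect": "Allow",
            "Action": ["dynamodb:DescribeTable", "dynamodb:Scan"],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/Orders",
        },
    ],
}

iam.put_role_policy(
    RoleName="MyGlueCrawlerRole",
    PolicyName="crawler-data-access",
    PolicyDocument=json.dumps(policy),
)
```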

Amazon DynamoDB

When defining a crawler using the AWS Glue console, you specify one DynamoDB table. If you're using the AWS Glue API, you can specify a list of tables. You can choose to crawl only a small sample of the data to reduce crawler run times.
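A sketch of the API form, with placeholder table names: each entry in DynamoDBTargets names one table, and setting scanAll to False samples the data rather than scanning every item.

```python
import boto3

glue = boto3.client("glue")

# Crawl a list of DynamoDB tables with sampling enabled.
glue.create_crawler(
    Name="ddb-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="ddb_db",
    Targets={
        "DynamoDBTargets": [
            {"Path": "Orders", "scanAll": False},
            {"Path": "Customers", "scanAll": False},
        ]
    },
)
```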

Delta Lake

For each Delta Lake data store, you specify how to create the Delta tables (a code sketch follows this list):

  • Create Native tables: Allow integration with query engines that support querying of the Delta transaction log directly. For more information, see Querying Delta Lake tables.

  • Create Symlink tables: Create a _symlink_manifest folder with manifest files partitioned by the partition keys, based on the specified configuration parameters.
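In the CreateCrawler API, this choice corresponds to the CreateNativeDeltaTable and WriteManifest fields of a Delta target. A minimal Boto3 sketch with placeholder names:

```python
import boto3

glue = boto3.client("glue")

# CreateNativeDeltaTable=True creates native Delta tables; set it to
# False and WriteManifest=True to generate symlink manifest files.
glue.create_crawler(
    Name="delta-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="lakehouse_db",
    Targets={
        "DeltaTargets": [
            {
                "DeltaTables": ["s3://amzn-s3-demo-bucket/delta/events/"],
                "WriteManifest": False,
                "CreateNativeDeltaTable": True,
            }
        ]
    },
)
```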

MongoDB and Amazon DocumentDB (with MongoDB compatibility)

MongoDB versions 3.2 and later are supported. You can choose to crawl only a small sample of the data to reduce crawler run times.
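For example, a crawler over an Amazon DocumentDB collection might be defined as follows; the connection, database, and collection names are placeholders, and ScanAll set to False enables sampling.

```python
import boto3

glue = boto3.client("glue")

# Path takes the form "database/collection"; ScanAll=False samples
# documents to infer the schema faster.
glue.create_crawler(
    Name="docdb-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="docdb_db",
    Targets={
        "MongoDBTargets": [
            {
                "ConnectionName": "my-docdb-connection",
                "Path": "inventory/products",
                "ScanAll": False,
            }
        ]
    },
)
```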

Relational database

Authentication is with a database user name and password. Depending on the type of database engine, you can choose which objects are crawled, such as databases, schemas, and tables.
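For a JDBC crawler, the objects to crawl are expressed as an include path such as database/schema/% (or database/% for engines without schemas, such as MySQL). A sketch with placeholder names:

```python
import boto3

glue = boto3.client("glue")

# The include path limits the crawl to one schema; credentials come
# from the referenced connection.
glue.create_crawler(
    Name="rds-postgres-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="rds_db",
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "my-jdbc-connection",
                "Path": "mydb/public/%",
            }
        ]
    },
)
```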

Snowflake

The Snowflake JDBC crawler supports crawling tables, external tables, views, and materialized views. The materialized view definition is not populated.

For Snowflake external tables, the crawler crawls a table only if it points to an Amazon S3 location. In addition to the table schema, the crawler also crawls the Amazon S3 location and file format and outputs them as table parameters in the Data Catalog table. Note that partition information for a partitioned external table is not populated.

ETL is currently not supported for Data Catalog tables created using the Snowflake crawler.
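To inspect what the Snowflake crawler recorded, you can read the table back from the Data Catalog; the Amazon S3 location and file format appear among the table parameters. The database and table names below are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Print the parameters the crawler attached to the Data Catalog table.
table = glue.get_table(DatabaseName="snowflake_db", Name="my_external_table")
print(table["Table"]["Parameters"])
```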