Crawler properties
When defining a crawler using the AWS Glue console or the AWS Glue API, you specify the following information:
Step 1: Set crawler properties
- Name
  The name may contain letters (A-Z), numbers (0-9), hyphens (-), or underscores (_), and can be up to 255 characters long.
- Description
  The description can be up to 2048 characters long.
- Tags
  Use tags to organize and identify your resources. For more information, see AWS tags in AWS Glue.
Step 2: Choose data sources and classifiers
- Data source configuration
  Select the appropriate option for Is your data already mapped to AWS Glue tables?
  The crawler can access data stores directly as the source of the crawl, or it can use existing tables in the Data Catalog as the source. If the crawler uses existing catalog tables, it crawls the data stores that those catalog tables specify. For more information, see Crawler source type.
  Not yet: Select one or more data sources to be crawled. A crawler can crawl multiple data stores of different types (Amazon S3, JDBC, and so on). You can configure only one data store at a time; after you have provided the connection information, include paths, and exclude patterns, you then have the option of adding another data store. For more information, see Crawler source type.
  Yes: Select existing tables from your AWS Glue Data Catalog. The catalog tables specify the data stores to crawl. In a single run, the crawler can crawl only catalog tables; it can't mix in other source types.
- Data sources
  Select or add the list of data sources to be scanned by the crawler. (A combined API sketch of these options appears at the end of this list.)
- Include path
  - For an Amazon S3 data store
    Choose whether to specify a path in this account or in a different account, and then browse to choose an Amazon S3 path.
  - For a Delta Lake data store
    Specify one or more Amazon S3 paths to Delta tables as `s3://bucket/prefix/object`.
  - For a JDBC data store
    Enter `<database>/<schema>/<table>` or `<database>/<table>`, depending on the database product. Oracle Database and MySQL don't support a schema in the path. You can substitute the percent (%) character for `<schema>` or `<table>`. For example, for an Oracle database with a system identifier (SID) of `orcl`, enter `orcl/%` to import all tables to which the user named in the connection has access.
    Important: This field is case-sensitive.
  - For a MongoDB, MongoDB Atlas, or Amazon DocumentDB data store
    Enter `database/collection`.
  For more information, see Include and exclude patterns.
- Exclude patterns
  These enable you to exclude certain files or tables from the crawl. For more information, see Include and exclude patterns.
- Additional crawler source parameters
  Each source type requires a different set of additional parameters. The following is an incomplete list:
- Connection
  Select or add an AWS Glue connection. For information about connections, see Defining connections in the AWS Glue Data Catalog.
- Additional metadata - optional (for JDBC data stores)
  Select additional metadata properties for the crawler to crawl.
  Comments: Crawl associated table-level and column-level comments.
  Raw types: Persist the raw data types of the table columns in additional metadata. By default, the crawler translates the raw data types to Hive-compatible types.
- Enable data sampling (for Amazon DynamoDB, MongoDB, MongoDB Atlas, and Amazon DocumentDB data stores only)
  Select whether to crawl only a data sample. If this option is not selected, the entire table is crawled. Scanning all the records can take a long time when the table is not a high-throughput table.
- Create tables for querying (for Delta Lake data stores only)
  Select how you want to create the Delta Lake tables:
  Create Native tables: Allow integration with query engines that support querying the Delta transaction log directly.
  Create Symlink tables: Create a symlink manifest folder with manifest files partitioned by the partition keys, based on the specified configuration parameters.
- Scanning rate - optional (for DynamoDB data stores only)
  Specify the percentage of the DynamoDB table read capacity units for the crawler to use. Read capacity units is a term defined by DynamoDB; it is a numeric value that acts as a rate limiter for the number of reads that can be performed on the table per second. Enter a value between 0.1 and 1.5. If not specified, the value defaults to 0.5% for provisioned tables and to 1/4 of the maximum configured capacity for on-demand tables.
  Note: For DynamoDB data stores, set the provisioned capacity mode for processing reads and writes on your tables. Don't use the AWS Glue crawler with the on-demand capacity mode.
- Network connection - optional (for Amazon S3 data stores only)
  Optionally include a Network connection to use with this Amazon S3 target. Each crawler is limited to one Network connection, so any other Amazon S3 targets also use the same connection (or none, if this field is left blank).
  For information about connections, see Defining connections in the AWS Glue Data Catalog.
- Sample only a subset of files and Sample size (for Amazon S3 data stores only)
  Specify the number of files in each leaf folder to be crawled when crawling sample files in a dataset. When this feature is turned on, instead of crawling all the files in the dataset, the crawler randomly selects some files in each leaf folder to crawl.
  The sampling crawler is best suited for customers who have previous knowledge about their data formats and know that schemas in their folders do not change. Turning on this feature significantly reduces crawler runtime.
  A valid value is an integer between 1 and 249. If not specified, all the files are crawled.
- Subsequent crawler runs
  This field is a global field that affects all Amazon S3 data sources.
  Crawl all sub-folders: Crawl all folders again with every subsequent crawl.
  Crawl new sub-folders only: Only Amazon S3 folders that were added since the last crawl are crawled. If the schemas are compatible, new partitions are added to existing tables. For more information, see Incremental crawls in AWS Glue.
  Crawl based on events: Rely on Amazon S3 events to control which folders to crawl. For more information, see Accelerating crawls using Amazon S3 event notifications.
- Custom classifiers - optional
  Define custom classifiers before defining crawlers. A classifier checks whether a given file is in a format the crawler can handle. If it is, the classifier creates a schema in the form of a `StructType` object that matches that data format. For more information, see Adding classifiers to a crawler in AWS Glue.
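Taken together, the choices in this step correspond to the `Targets` parameter of the AWS Glue `CreateCrawler` API. The following boto3 sketch is a minimal illustration only; the crawler, role, database, bucket, connection, and table names are hypothetical placeholders, and you would keep only the target types you actually use.

```python
import boto3

glue = boto3.client("glue")

# Minimal sketch: one crawler with several target types (all names are placeholders).
glue.create_crawler(
    Name="my-crawler",
    Role="AWSGlueServiceRole-MyCrawler",   # IAM role the crawler assumes (see Step 3)
    DatabaseName="my_catalog_db",          # Data Catalog database for the created tables
    Targets={
        "S3Targets": [
            {
                "Path": "s3://my-bucket/myfolder/",
                "Exclusions": ["**.tmp"],   # Unix-style glob exclude patterns
                "SampleSize": 10,           # optional: sample N files per leaf folder
            }
        ],
        "JdbcTargets": [
            {
                "ConnectionName": "my-jdbc-connection",
                "Path": "MyDatabase/MySchema/%",   # <database>/<schema>/<table>
            }
        ],
        "DynamoDBTargets": [
            {"Path": "my-dynamodb-table", "scanAll": False, "scanRate": 0.5}
        ],
    },
)
```

The same `Targets` structure also accepts entries for the other source types described above, such as MongoDB, Delta Lake, and Data Catalog tables.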
Step 3: Configure security settings
- IAM role
  The crawler assumes this role. It must have permissions similar to the AWS managed policy `AWSGlueServiceRole`. For Amazon S3 and DynamoDB sources, it must also have permissions to access the data store. If the crawler reads Amazon S3 data encrypted with AWS Key Management Service (AWS KMS), then the role must have decrypt permissions on the AWS KMS key.
  For an Amazon S3 data store, additional permissions attached to the role would be similar to the following:

  ```
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Action": [
                  "s3:GetObject",
                  "s3:PutObject"
              ],
              "Resource": [
                  "arn:aws:s3:::bucket/object*"
              ]
          }
      ]
  }
  ```

  For an Amazon DynamoDB data store, additional permissions attached to the role would be similar to the following:

  ```
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Action": [
                  "dynamodb:DescribeTable",
                  "dynamodb:Scan"
              ],
              "Resource": [
                  "arn:aws:dynamodb:region:account-id:table/table-name*"
              ]
          }
      ]
  }
  ```

  For more information, see Step 2: Create an IAM role for AWS Glue and Identity and access management for AWS Glue. (A sketch of creating such a role through the API appears after this list.)
- Lake Formation configuration - optional
  Allow the crawler to use Lake Formation credentials for crawling the data source.
  Checking Use Lake Formation credentials for crawling S3 data source allows the crawler to use Lake Formation credentials for the crawl. If the data source belongs to another account, you must provide the registered account ID. Otherwise, the crawler crawls only those data sources associated with the account. This setting is applicable only to Amazon S3 and Data Catalog data sources.
- Security configuration - optional
  Optionally specify a security configuration for the crawler to use.
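The role described under IAM role can also be created programmatically. The following boto3 sketch is a minimal illustration under assumed names (the role name, policy name, and bucket prefix are placeholders); it creates a role that AWS Glue can assume, attaches the AWSGlueServiceRole managed policy, and adds the Amazon S3 permissions shown above.

```python
import json
import boto3

iam = boto3.client("iam")

ROLE_NAME = "AWSGlueServiceRole-MyCrawler"   # placeholder role name

# Trust policy that lets the AWS Glue service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}
iam.create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the AWS managed policy for crawlers.
iam.attach_role_policy(
    RoleName=ROLE_NAME,
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)

# Inline policy granting access to the Amazon S3 data store (placeholder bucket and prefix).
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": ["arn:aws:s3:::my-bucket/myfolder/*"],
        }
    ],
}
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="MyCrawlerS3Access",
    PolicyDocument=json.dumps(s3_policy),
)
```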
Step 4: Set output and scheduling
- Output configuration
  Options include how the crawler should handle detected schema changes, deleted objects in the data store, and more. For more information, see Setting crawler configuration options.
- Crawler schedule
  You can run a crawler on demand or define a time-based schedule for your crawlers and jobs in AWS Glue. The definition of these schedules uses the Unix-like cron syntax. For more information, see Scheduling an AWS Glue crawler.
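For illustration, the schedule and the schema-change handling described above can also be set through the API. This is a minimal sketch that assumes the hypothetical crawler name used earlier; the cron expression shown runs the crawler every day at 12:15 UTC.

```python
import boto3

glue = boto3.client("glue")

# Minimal sketch for an existing, hypothetical crawler named "my-crawler".
glue.update_crawler(
    Name="my-crawler",
    # Unix-like cron syntax: cron(Minutes Hours Day-of-month Month Day-of-week Year)
    Schedule="cron(15 12 * * ? *)",   # every day at 12:15 UTC
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",   # apply detected schema changes
        "DeleteBehavior": "LOG",                  # log deleted objects instead of removing tables
    },
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},  # incremental crawls
)
```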
Step 5: Review and create
Review the crawler settings you configured, and create the crawler.
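Once created, the crawler can be started and monitored through the API. A minimal boto3 sketch, again using the hypothetical crawler name:

```python
import time
import boto3

glue = boto3.client("glue")

# Start the hypothetical crawler created earlier and wait for it to finish.
glue.start_crawler(Name="my-crawler")

while True:
    state = glue.get_crawler(Name="my-crawler")["Crawler"]["State"]
    if state == "READY":   # READY means the crawler is no longer running
        break
    time.sleep(30)

# Summarize the last run.
metrics = glue.get_crawler_metrics(CrawlerNameList=["my-crawler"])["CrawlerMetricsList"]
print(metrics)
```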
Crawler source type
A crawler can access data stores directly as the source of the crawl or use existing catalog tables as the source. If the crawler uses existing catalog tables, it crawls the data stores specified by those catalog tables.
A common reason to specify a catalog table as the source is when you create the table manually (because you already know the structure of the data store) and you want a crawler to keep the table updated, including adding new partitions. For a discussion of other reasons, see Updating manually created Data Catalog tables using crawlers.
When you specify existing tables as the crawler source type, the following conditions apply (a short API sketch follows the list):
- Database name is optional.
- Only catalog tables that specify Amazon S3 or Amazon DynamoDB data stores are permitted.
- No new catalog tables are created when the crawler runs. Existing tables are updated as needed, including adding new partitions.
- Deleted objects found in the data stores are ignored; no catalog tables are deleted. Instead, the crawler writes a log message. (`SchemaChangePolicy.DeleteBehavior=LOG`)
- The crawler configuration option to create a single schema for each Amazon S3 path is enabled by default and cannot be disabled. (`TableGroupingPolicy=CombineCompatibleSchemas`) For more information, see How to create a single schema for each Amazon S3 include path.
- You can't mix catalog tables as a source with any other source types (for example, Amazon S3 or Amazon DynamoDB).
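For illustration, a crawler that uses existing catalog tables as its source specifies catalog targets and the `LOG` delete behavior described above. This boto3 sketch uses hypothetical names for the crawler, role, database, and table:

```python
import boto3

glue = boto3.client("glue")

# Minimal sketch: crawl existing Data Catalog tables instead of a data store directly.
glue.create_crawler(
    Name="my-catalog-crawler",
    Role="AWSGlueServiceRole-MyCrawler",
    Targets={
        "CatalogTargets": [
            {
                "DatabaseName": "my_catalog_db",
                "Tables": ["my_manually_created_table"],
            }
        ]
    },
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",   # deleted objects are only logged, per the conditions above
    },
)
```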
Include and exclude patterns
When evaluating what to include or exclude in a crawl, a crawler starts by evaluating the required include path. For Amazon S3, MongoDB, MongoDB Atlas, Amazon DocumentDB (with MongoDB compatibility), and relational data stores, you must specify an include path.
For Amazon S3 data stores, include path syntax is `bucket-name/folder-name/file-name.ext`. To crawl all objects in a bucket, you specify just the bucket name in the include path. The exclude pattern is relative to the include path.
For MongoDB, MongoDB Atlas, and Amazon DocumentDB (with MongoDB compatibility), the syntax is `database/collection`.
For JDBC data stores, the syntax is either `database-name/schema-name/table-name` or `database-name/table-name`. The syntax depends on whether the database engine supports schemas within a database. For example, for database engines such as MySQL or Oracle, don't specify a schema-name in your include path. You can substitute the percent sign (`%`) for a schema or table in the include path to represent all schemas or all tables in a database. You cannot substitute the percent sign (`%`) for database in the include path. The exclude path is relative to the include path. For example, to exclude a table in your JDBC data store, type the table name in the exclude path.
A crawler connects to a JDBC data store using an AWS Glue connection that contains a JDBC URI connection string. The crawler only has access to objects in the database engine using the JDBC user name and password in the AWS Glue connection. The crawler can only create tables that it can access through the JDBC connection. After the crawler accesses the database engine with the JDBC URI, the include path is used to determine which tables in the database engine are created in the Data Catalog. For example, with MySQL, if you specify an include path of `MyDatabase/%`, then all tables within `MyDatabase` are created in the Data Catalog. When accessing Amazon Redshift, if you specify an include path of `MyDatabase/%`, then all tables within all schemas for database `MyDatabase` are created in the Data Catalog. If you specify an include path of `MyDatabase/MySchema/%`, then all tables in database `MyDatabase` and schema `MySchema` are created.
After you specify an include path, you can then exclude objects from the crawl that your include path would otherwise include by specifying one or more Unix-style glob exclude patterns. These patterns are applied to your include path to determine which objects are excluded. These patterns are also stored as a property of tables created by the crawler. AWS Glue PySpark extensions, such as `create_dynamic_frame.from_catalog`, read the table properties and exclude objects defined by the exclude pattern.
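For example, a Glue job script that reads such a table through the Data Catalog picks up those stored exclude patterns automatically. This is a minimal sketch with placeholder database and table names:

```python
# Minimal AWS Glue job sketch; runs inside the Glue job environment.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Objects matched by the table's stored exclude patterns are skipped when reading.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_db",        # placeholder Data Catalog database
    table_name="my_crawled_table",   # placeholder table created by the crawler
)
print(dyf.count())
```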
AWS Glue supports the following kinds of glob patterns in the exclude pattern.
| Exclude pattern | Description |
| --- | --- |
| `*.csv` | Matches an Amazon S3 path that represents an object name in the current folder ending in `.csv` |
| `*.*` | Matches all object names that contain a dot |
| `*.{csv,avro}` | Matches object names ending with `.csv` or `.avro` |
| `foo.?` | Matches object names starting with `foo.` that are followed by a single character extension |
| `myfolder/*` | Matches objects in one level of subfolder from `myfolder`, such as `/myfolder/mysource` |
| `myfolder/*/*` | Matches objects in two levels of subfolders from `myfolder`, such as `/myfolder/mysource/data` |
| `myfolder/**` | Matches objects in all subfolders of `myfolder`, such as `/myfolder/mysource/mydata` and `/myfolder/mysource/data` |
| `myfolder**` | Matches subfolder `myfolder` as well as files below `myfolder`, such as `/myfolder` and `/myfolder/mydata.txt` |
| `Market*` | Matches tables in a JDBC database with names that begin with `Market`, such as `Market_us` and `Market_fr` |
AWS Glue interprets glob exclude patterns as follows:

- The slash (`/`) character is the delimiter to separate Amazon S3 keys into a folder hierarchy.
- The asterisk (`*`) character matches zero or more characters of a name component without crossing folder boundaries.
- A double asterisk (`**`) matches zero or more characters crossing folder or schema boundaries.
- The question mark (`?`) character matches exactly one character of a name component.
- The backslash (`\`) character is used to escape characters that otherwise can be interpreted as special characters. The expression `\\` matches a single backslash, and `\{` matches a left brace.
- Brackets (`[ ]`) create a bracket expression that matches a single character of a name component out of a set of characters. For example, `[abc]` matches `a`, `b`, or `c`. The hyphen (`-`) can be used to specify a range, so `[a-z]` specifies a range that matches from `a` through `z` (inclusive). These forms can be mixed, so `[abce-g]` matches `a`, `b`, `c`, `e`, `f`, or `g`. If the character after the bracket (`[`) is an exclamation point (`!`), the bracket expression is negated. For example, `[!a-c]` matches any character except `a`, `b`, or `c`.
  Within a bracket expression, the `*`, `?`, and `\` characters match themselves. The hyphen (`-`) character matches itself if it is the first character within the brackets, or if it's the first character after the `!` when you are negating.
- Braces (`{ }`) enclose a group of subpatterns, where the group matches if any subpattern in the group matches. A comma (`,`) character is used to separate the subpatterns. Groups cannot be nested.
- Leading period or dot characters in file names are treated as normal characters in match operations. For example, the `*` exclude pattern matches the file name `.hidden`.
Example: Amazon S3 exclude patterns
Each exclude pattern is evaluated against the include path. For example, suppose that you have the following Amazon S3 directory structure:
```
/mybucket/myfolder/
    departments/
        finance.json
        market-us.json
        market-emea.json
        market-ap.json
    employees/
        hr.json
        john.csv
        jane.csv
        juan.txt
```
Given the include path `s3://mybucket/myfolder/`, the following are some sample results for exclude patterns:
| Exclude pattern | Results |
| --- | --- |
| `departments/**` | Excludes all files and folders below `departments` and includes the `employees` folder and its files |
| `departments/market*` | Excludes `market-us.json`, `market-emea.json`, and `market-ap.json` |
| `**.csv` | Excludes all objects below `myfolder` that have a name ending with `.csv` |
| `employees/*.csv` | Excludes all `.csv` files in the `employees` folder |
Example: Excluding a subset of Amazon S3 partitions
Suppose that your data is partitioned by day, so that each day in a year is in a separate Amazon S3 partition. For January 2015, there are 31 partitions. Now, to crawl data for only the first week of January, you must exclude all partitions except days 1 through 7:
2015/01/{[!0],0[8-9]}**, 2015/0[2-9]/**, 2015/1[0-2]/**
Take a look at the parts of this glob pattern. The first part, `2015/01/{[!0],0[8-9]}**`, excludes all days that don't begin with a "0" in addition to day 08 and day 09 from month 01 in year 2015. Notice that "**" is used as the suffix to the day number pattern and crosses folder boundaries to lower-level folders. If "*" is used, lower folder levels are not excluded.

The second part, `2015/0[2-9]/**`, excludes days in months 02 to 09, in year 2015.

The third part, `2015/1[0-2]/**`, excludes days in months 10, 11, and 12, in year 2015.
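For illustration, these three patterns would be supplied as the `Exclusions` list of the Amazon S3 target when creating or updating the crawler. A minimal boto3 sketch with a hypothetical crawler name:

```python
import boto3

glue = boto3.client("glue")

# Minimal sketch: exclude every 2015 partition except January days 01-07.
glue.update_crawler(
    Name="my-crawler",
    Targets={
        "S3Targets": [
            {
                "Path": "s3://mybucket/myfolder/",
                "Exclusions": [
                    "2015/01/{[!0],0[8-9]}**",  # January days not starting with 0, plus 08 and 09
                    "2015/0[2-9]/**",           # months 02-09
                    "2015/1[0-2]/**",           # months 10-12
                ],
            }
        ]
    },
)
```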
Example: JDBC exclude patterns
Suppose that you are crawling a JDBC database with the following schema structure:
```
MyDatabase/MySchema/
    HR_us
    HR_fr
    Employees_Table
    Finance
    Market_US_Table
    Market_EMEA_Table
    Market_AP_Table
```
Given the include path `MyDatabase/MySchema/%`, the following are some sample results for exclude patterns:
| Exclude pattern | Results |
| --- | --- |
| `HR*` | Excludes the tables with names that begin with `HR` |
| `Market_*` | Excludes the tables with names that begin with `Market_` |
| `**_Table` | Excludes all tables with names that end with `_Table` |
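Similarly, JDBC exclude patterns are supplied on the JDBC target. A minimal boto3 sketch with hypothetical crawler and connection names:

```python
import boto3

glue = boto3.client("glue")

# Minimal sketch: crawl MySchema but skip the HR tables and anything ending in _Table.
glue.update_crawler(
    Name="my-jdbc-crawler",
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "my-jdbc-connection",
                "Path": "MyDatabase/MySchema/%",
                "Exclusions": ["HR*", "**_Table"],
            }
        ]
    },
)
```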