Setting crawler configuration options
When a crawler runs, it might encounter changes to your data store that result in a schema or partition that is different from a previous crawl. You can use the AWS Management Console or the AWS Glue API to configure how your crawler processes certain types of changes.
Topics
- Setting crawler configuration options on the AWS Glue console
- Setting the partition index crawler configuration option
- Setting crawler configuration options using the API
- How to prevent the crawler from changing an existing schema
- How to create a single schema for each Amazon S3 include path
- How to specify the table location and partitioning level
- How to specify the maximum number of tables the crawler is allowed to create
- How to specify configuration options for a Delta Lake data store
- How to configure a crawler to use Lake Formation credentials
Setting crawler configuration options on the AWS Glue console
When you define a crawler using the AWS Glue console, you have several options for configuring the behavior of your crawler. For more information about using the AWS Glue console to add a crawler, see Working with crawlers on the AWS Glue console.
When a crawler runs against a previously crawled data store, it might discover that a schema has changed or that some objects in the data store have been deleted. The crawler logs changes to a schema. Depending on the source type for the crawler, new tables and partitions might be created regardless of the schema change policy.
To specify what the crawler does when it finds changes in the schema, you can choose one of the following actions on the console:
Update the table definition in the Data Catalog – Add new columns, remove missing columns, and modify the definitions of existing columns in the AWS Glue Data Catalog. Remove any metadata that is not set by the crawler. This is the default setting.
Add new columns only – For tables that map to an Amazon S3 data store, add new columns as they are discovered, but don't remove or change the type of existing columns in the Data Catalog. Choose this option when the current columns in the Data Catalog are correct and you don't want the crawler to remove or change the type of the existing columns. If a fundamental Amazon S3 table attribute changes, such as classification, compression type, or CSV delimiter, mark the table as deprecated. Maintain input format and output format as they exist in the Data Catalog. Update SerDe parameters only if the parameter is one that is set by the crawler. For all other data stores, modify existing column definitions.
Ignore the change and don't update the table in the Data Catalog – Only new tables and partitions are created.
This is the default setting for incremental crawls.
A crawler might also discover new or changed partitions. By default, new partitions are added and existing partitions are updated if they have changed. In addition, you can set a crawler configuration option to Update all new and existing partitions with metadata from the table on the AWS Glue console. When this option is set, partitions inherit metadata properties—such as their classification, input format, output format, SerDe information, and schema—from their parent table. Any changes to these properties in a table are propagated to its partitions. When this configuration option is set on an existing crawler, existing partitions are updated to match the properties of their parent table the next time the crawler runs.
To specify what the crawler does when it finds a deleted object in the data store, choose one of the following actions:
Delete tables and partitions from the Data Catalog
Ignore the change and don't update the table in the Data Catalog
This is the default setting for incremental crawls.
Mark the table as deprecated in the Data Catalog – This is the default setting.
Setting the partition index crawler configuration option
The Data Catalog supports partition indexes to provide efficient lookup for specific partitions. For more information, see Working with partition indexes in AWS Glue.
Currently, the AWS Glue crawler supports creating partition indexes for Amazon S3 and Delta Lake targets.
To specify that the crawler create a separate partition index for every Data Catalog table, choose the following option under Advanced options on the Set output and scheduling page of the console:
Create partition indexes automatically
To specify this behavior using the crawler API, set the CreatePartitionIndex option in the Configuration field. The default value is true.
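For example, the following AWS CLI call is a minimal sketch that turns off automatic partition index creation on an existing crawler. The crawler name my-crawler is a placeholder, and the placement of the CreatePartitionIndex key inside the Configuration JSON is an assumption to verify against the crawler API reference.
# Sketch: disable automatic partition index creation for an existing crawler
aws glue update-crawler \
    --name my-crawler \
    --configuration '{"Version": 1.0, "CreatePartitionIndex": false}'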
Usage notes for partition indexes
Tables created by the crawler do not have the partition_filtering.enabled property set by default. For more information, see AWS Glue partition indexing and filtering.
Creating partition indexes for encrypted partitions is not supported.
Setting crawler configuration options using the API
When you define a crawler using the AWS Glue API, you can choose from several fields to configure your crawler. The SchemaChangePolicy in the crawler API determines what the crawler does when it discovers a changed schema or a deleted object. The crawler logs schema changes as it runs.
When a crawler runs, new tables and partitions are always created regardless of the schema change policy. You can choose one of the following actions in the UpdateBehavior field in the SchemaChangePolicy structure to determine what the crawler does when it finds a changed table schema:
UPDATE_IN_DATABASE – Update the table in the AWS Glue Data Catalog. Add new columns, remove missing columns, and modify the definitions of existing columns. Remove any metadata that is not set by the crawler.
LOG – Ignore the changes, and don't update the table in the Data Catalog. This is the default setting for incremental crawls.
You can also override the SchemaChangePolicy structure using a JSON object supplied in the crawler API Configuration field. This JSON object can contain a key-value pair to set the policy to not update existing columns and only add new columns. For example, provide the following JSON object as a string:
{ "Version": 1.0, "CrawlerOutput": { "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" } } }
This option corresponds to the Add new columns only option on the AWS Glue console. It overrides the SchemaChangePolicy structure for tables that result from crawling Amazon S3 data stores only. Choose this option if you want to maintain the metadata as it exists in the Data Catalog (the source of truth). New columns are added as they are encountered, including nested data types. But existing columns are not removed, and their type is not changed. If an Amazon S3 table attribute changes significantly, mark the table as deprecated, and log a warning that an incompatible attribute needs to be resolved.
This option is not applicable for incremental crawls.
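For example, the following AWS CLI call is a minimal sketch of how the configuration string might be supplied to an existing crawler; the crawler name my-crawler is a placeholder.
# Sketch: only add new columns for Amazon S3 tables; don't remove or retype existing ones
aws glue update-crawler \
    --name my-crawler \
    --configuration '{"Version": 1.0, "CrawlerOutput": {"Tables": {"AddOrUpdateBehavior": "MergeNewColumns"}}}'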
When a crawler runs against a previously crawled data store, it might discover new or changed partitions. By default, new partitions are added and existing partitions are updated if they have changed. In addition, you can set a crawler configuration option to InheritFromTable (corresponding to the Update all new and existing partitions with metadata from the table option on the AWS Glue console). When this option is set, partitions inherit metadata properties from their parent table, such as their classification, input format, output format, SerDe information, and schema. Any property changes to the parent table are propagated to its partitions.
When this configuration option is set on an existing crawler, existing partitions are updated to match the properties of their parent table the next time the crawler runs. This behavior is set in the crawler API Configuration field. For example, provide the following JSON object as a string:
{ "Version": 1.0, "CrawlerOutput": { "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" } } }
The crawler API Configuration field can set multiple configuration options. For example, to configure the crawler output for both partitions and tables, you can provide a string representation of the following JSON object:
{ "Version": 1.0, "CrawlerOutput": { "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }, "Tables": {"AddOrUpdateBehavior": "MergeNewColumns" } } }
You can choose one of the following actions to determine what the crawler does when it finds a deleted object in the data store. The DeleteBehavior field in the SchemaChangePolicy structure in the crawler API sets the behavior of the crawler when it discovers a deleted object.
DELETE_FROM_DATABASE – Delete tables and partitions from the Data Catalog.
LOG – Ignore the change. Don't update the Data Catalog. Write a log message instead.
DEPRECATE_IN_DATABASE – Mark the table as deprecated in the Data Catalog. This is the default setting.
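Both behaviors can be set together through the SchemaChangePolicy structure. The following AWS CLI call is a minimal sketch for an existing crawler; the crawler name my-crawler is a placeholder.
# Sketch: update schemas in place, and deprecate tables whose source objects are deleted
aws glue update-crawler \
    --name my-crawler \
    --schema-change-policy UpdateBehavior=UPDATE_IN_DATABASE,DeleteBehavior=DEPRECATE_IN_DATABASE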
How to prevent the crawler from changing an existing schema
If you don't want a crawler to overwrite updates you made to existing fields in an Amazon S3 table definition, choose the option on the console to Add new columns only, or set the configuration option MergeNewColumns. This applies to tables and partitions, unless Partitions.AddOrUpdateBehavior is overridden to InheritFromTable.
If you don't want a table schema to change at all when a crawler runs, set the schema change policy to LOG. You can also set a configuration option that sets partition schemas to inherit from the table.
If you are configuring the crawler on the console, you can choose the following actions:
Ignore the change and don't update the table in the Data Catalog
Update all new and existing partitions with metadata from the table
When you configure the crawler using the API, set the following parameters:
Set the UpdateBehavior field in the SchemaChangePolicy structure to LOG.
Set the Configuration field with a string representation of the following JSON object in the crawler API; for example:
{ "Version": 1.0, "CrawlerOutput": { "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" } } }
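Put together as an AWS CLI call, a minimal sketch looks like the following; the crawler name my-crawler is a placeholder.
# Sketch: keep existing table schemas unchanged and have partitions inherit from the table
aws glue update-crawler \
    --name my-crawler \
    --schema-change-policy UpdateBehavior=LOG \
    --configuration '{"Version": 1.0, "CrawlerOutput": {"Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}}}'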
How to create a single schema for each Amazon S3 include path
By default, when a crawler defines tables for data stored in Amazon S3, it considers both data compatibility and schema similarity. Data compatibility factors that it considers include whether the data is of the same format (for example, JSON), the same compression type (for example, GZIP), the structure of the Amazon S3 path, and other data attributes. Schema similarity is a measure of how closely the schemas of separate Amazon S3 objects are similar.
You can configure a crawler to CombineCompatibleSchemas into a common table definition when possible. With this option, the crawler still considers data compatibility, but ignores the similarity of the specific schemas when evaluating Amazon S3 objects in the specified include path.
If you are configuring the crawler on the console, to combine schemas, select the crawler option Create a single schema for each S3 path.
When you configure the crawler using the API, set the following configuration option:
Set the Configuration field with a string representation of the following JSON object in the crawler API; for example:
{ "Version": 1.0, "Grouping": { "TableGroupingPolicy": "CombineCompatibleSchemas" } }
To help illustrate this option, suppose that you define a crawler with an include path s3://bucket/table1/. When the crawler runs, it finds two JSON files with the following characteristics:
- File 1 – s3://bucket/table1/year=2017/data1.json
- File content – {"A": 1, "B": 2}
- Schema – A:int, B:int
- File 2 – s3://bucket/table1/year=2018/data2.json
- File content – {"C": 3, "D": 4}
- Schema – C: int, D: int
By default, the crawler creates two tables, named year_2017 and year_2018, because the schemas are not sufficiently similar. However, if the option Create a single schema for each S3 path is selected, and if the data is compatible, the crawler creates one table. The table has the schema A:int,B:int,C:int,D:int and the partition key year:string.
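To turn this option on from the command line, the following AWS CLI call is a minimal sketch; the crawler name my-crawler is a placeholder.
# Sketch: combine compatible schemas under each include path into a single table
aws glue update-crawler \
    --name my-crawler \
    --configuration '{"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}'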
How to specify the table location and partitioning level
By default, when a crawler defines tables for data stored in Amazon S3, the crawler attempts to merge schemas together and create top-level tables (year=2019). In some cases, you may expect the crawler to create a table for the folder month=Jan, but instead the crawler creates a partition because a sibling folder (month=Mar) was merged into the same table.
The Table level crawler option gives you the flexibility to tell the crawler where tables are located and how you want partitions created. When you specify a Table level, the table is created at that absolute level from the Amazon S3 bucket.

When configuring the crawler on the console, you can specify a value for the Table level crawler option. The value must be a positive integer that indicates the table location (the absolute level in the dataset). The level for the top-level folder is 1. For example, for the path mydataset/a/b, if the level is set to 3, the table is created at location mydataset/a/b.
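When you configure the crawler through the API, the table level is supplied in the Configuration field. The following AWS CLI call is a minimal sketch; the key name Grouping.TableLevelConfiguration used here is an assumption to verify against the crawler API reference, and my-crawler is a placeholder name.
# Sketch: create tables at the third folder level under the Amazon S3 bucket
aws glue update-crawler \
    --name my-crawler \
    --configuration '{"Version": 1.0, "Grouping": {"TableLevelConfiguration": 3}}'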
How to specify the maximum number of tables the crawler is allowed to create
You can optionally specify the maximum number of tables the crawler is allowed to create by specifying a TableThreshold via the AWS Glue console or CLI. If the number of tables detected by the crawler during its crawl is greater than this value, the crawl fails and no data is written to the Data Catalog.
This parameter is useful when the number of tables that would be detected and created by the crawler is much greater than what you expect. There can be multiple reasons for this, such as:
When using an AWS Glue job to populate your Amazon S3 locations, you can end up with empty files at the same level as a folder. When you run a crawler on such an Amazon S3 location, the crawler creates multiple tables because of the files and folders present at the same level.
If you do not configure "TableGroupingPolicy": "CombineCompatibleSchemas", you may end up with more tables than expected.
You specify the TableThreshold as an integer value greater than 0. This value is configured on a per-crawler basis; that is, the value is considered for every crawl. For example, suppose a crawler has the TableThreshold value set to 5. In each crawl, AWS Glue compares the number of tables detected with this table threshold value (5). If the number of tables detected is less than 5, AWS Glue writes the tables to the Data Catalog; if not, the crawl fails without writing to the Data Catalog.
Console
To set TableThreshold using the AWS console:

CLI
To set TableThreshold using the AWS CLI:
"{"Version":1.0, "CrawlerOutput": {"Tables":{"AddOrUpdateBehavior":"MergeNewColumns", "TableThreshold":5}}}";
Error messages are logged to help you identify table paths and clean up your data. The following is an example log in your account when the crawler fails because the table count was greater than the table threshold value provided:
Table Threshold value = 28, Tables detected - 29
In CloudWatch, we log all table locations detected as an INFO message. An error is logged as the reason for the failure.
ERROR com.amazonaws.services.glue.customerLogs.CustomerLogService - CustomerLogService received CustomerFacingException with message The number of tables detected by crawler: 29 is greater than the table threshold value provided: 28. Failing crawler without writing to Data Catalog. com.amazonaws.services.glue.exceptions.CustomerFacingInternalException: The number of tables detected by crawler: 29 is greater than the table threshold value provided: 28. Failing crawler without writing to Data Catalog.
How to specify configuration options for a Delta Lake data store
When you configure a crawler for a Delta Lake data store, you specify these configuration parameters:
- Connection
Optionally select or add a Network connection to use with this Amazon S3 target. For information about connections, see Connecting to data.
- Create tables for querying
Select how you want to create the Delta Lake tables:
Create Native tables: Allow integration with query engines that support querying of the Delta transaction log directly.
Create Symlink tables: Create a symlink manifest folder with manifest files partitioned by the partition keys, based on the specified configuration parameters.
- Enable write manifest (configurable only if you've selected Create Symlink tables for a Delta Lake source)
Select whether to detect table metadata or schema changes in the Delta Lake transaction log; when a change is detected, the manifest file is regenerated. You should not choose this option if you configured an automatic manifest update with Delta Lake SET TBLPROPERTIES.
- Include delta lake table path(s)
Specify one or more Amazon S3 paths to Delta tables as s3://bucket/prefix/object.
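When you configure these options through the API instead of the console, they map to the crawler's DeltaTargets. The following AWS CLI call is a minimal sketch; the crawler name, IAM role, database name, and Amazon S3 path are placeholders.
# Sketch: create a crawler with a Delta Lake target that builds native Delta tables
aws glue create-crawler \
    --name my-delta-crawler \
    --role AWSGlueServiceRole-delta \
    --database-name delta_db \
    --targets '{"DeltaTargets": [{"DeltaTables": ["s3://amzn-s3-demo-bucket/delta/table1/"], "WriteManifest": false, "CreateNativeDeltaTable": true}]}'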

How to configure a crawler to use Lake Formation credentials
You can configure a crawler to use AWS Lake Formation credentials to access an Amazon S3 data store or a Data Catalog table with an underlying Amazon S3 location within the same AWS account or another AWS account. You can configure an existing Data Catalog table as a crawler's target if the crawler and the Data Catalog table reside in the same account. Currently, only a single catalog target with a single catalog table is allowed when using a Data Catalog table as a crawler's target.
Note
When you are defining a Data Catalog table as a crawler target, make sure that the underlying location of the Data Catalog table is an Amazon S3 location. Crawlers that use Lake Formation credentials only support Data Catalog targets with underlying Amazon S3 locations.
Setup required when the crawler and registered Amazon S3 location or Data Catalog table reside in the same account (in-account crawling)
To allow the crawler to access a data store or Data Catalog table by using Lake Formation credentials, you need to register the data location with Lake Formation. Also, the crawler's IAM role must have permissions to read the data from the destination where the Amazon S3 bucket is registered.
You can complete the following configuration steps using the AWS Management Console or AWS Command Line Interface (AWS CLI).
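As a rough sketch, the Lake Formation credential setting itself is passed through the crawler's LakeFormationConfiguration. In the following AWS CLI call, the crawler name, IAM role, database name, and Amazon S3 path are placeholders; the AccountId field of LakeFormationConfiguration is only needed for cross-account crawling, described in the next section.
# Sketch: create a crawler that accesses a registered Amazon S3 location with Lake Formation credentials
aws glue create-crawler \
    --name my-lf-crawler \
    --role AWSGlueServiceRole-lf \
    --database-name lf_db \
    --targets '{"S3Targets": [{"Path": "s3://amzn-s3-demo-bucket/data/"}]}' \
    --lake-formation-configuration '{"UseLakeFormationCredentials": true}'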
Setup required when the crawler and registered Amazon S3 location reside in different accounts (cross-account crawling)
To allow the crawler to access a data store in a different account using Lake Formation credentials, you must first register the Amazon S3 data location with Lake Formation. Then, you grant data location permissions to the crawler's account by taking the following steps.
You can complete the following steps using the AWS Management Console or AWS CLI.
Note
A crawler using Lake Formation credentials is only supported for Amazon S3 and Data Catalog targets.
For targets using Lake Formation credential vending, the underlying Amazon S3 locations must belong to the same bucket. For example, customers can use multiple targets (s3://bucket1/folder1, s3://bucket1/folder2) as long as all target locations are under the same bucket (bucket1). Specifying different buckets (s3://bucket1/folder1, s3://bucket2/folder2) is not allowed.
Currently for Data Catalog target crawlers, only a single catalog target with a single catalog table is allowed.