Setting crawler configuration options
When a crawler runs, it might encounter changes to your data store that result in a schema or partition that is different from a previous crawl. You can use the AWS Management Console or the AWS Glue API to configure how your crawler processes certain types of changes.
Topics
- Setting the partition index crawler configuration option
- How to prevent the crawler from changing an existing schema
- How to create a single schema for each Amazon S3 include path
- How to specify the table location and partitioning level
- How to specify the maximum number of tables the crawler is allowed to create
- How to specify configuration options for a Delta Lake data store
- How to configure a crawler to use Lake Formation credentials
Setting the partition index crawler configuration option
The Data Catalog supports partition indexes to provide efficient lookup for specific partitions. For more information, see Working with partition indexes in AWS Glue. The AWS Glue crawler creates partition indexes for Amazon S3 and Delta Lake targets by default.
When you define a crawler, the option Create partition indexes automatically is enabled by default under Advanced options on the Set output and scheduling page.
To disable this option, clear the Create partition indexes automatically checkbox in the console. You can also disable this option through the crawler API by setting CreatePartitionIndex to false in the Configuration field. The default value is true.
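For example, the following Configuration string disables automatic partition index creation. This is a minimal sketch built from the CreatePartitionIndex option described above:

```
{
    "Version": 1.0,
    "CreatePartitionIndex": false
}
```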
Usage notes for partition indexes
- Tables created by the crawler do not have the variable partition_filtering.enabled set by default. For more information, see AWS Glue partition indexing and filtering.
- Creating partition indexes for encrypted partitions is not supported.
How to prevent the crawler from changing an existing schema
If you don't want a crawler to overwrite updates you made to existing fields in an Amazon S3 table definition, choose the option on the console to Add new columns only, or set the configuration option MergeNewColumns. This applies to tables and partitions, unless Partitions.AddOrUpdateBehavior is overridden to InheritFromTable.
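For example, the following Configuration string (a sketch built from the MergeNewColumns behavior named above) tells the crawler to add new columns without overwriting your edits to existing ones:

```
{
    "Version": 1.0,
    "CrawlerOutput": {
        "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" }
    }
}
```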
If you don't want a table schema to change at all when a crawler runs, set the schema change policy to LOG. You can also set a configuration option that sets partition schemas to inherit from the table.
If you are configuring the crawler on the console, you can choose the following actions:
- Ignore the change and don't update the table in the Data Catalog
- Update all new and existing partitions with metadata from the table
When you configure the crawler using the API, set the following parameters:
- Set the UpdateBehavior field in the SchemaChangePolicy structure to LOG.
- Set the Configuration field with a string representation of the following JSON object in the crawler API; for example:

```
{
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
    }
}
```
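As a concrete sketch, the same settings can be applied with the AWS CLI; the crawler name my-crawler is a placeholder:

```
aws glue update-crawler \
    --name my-crawler \
    --schema-change-policy UpdateBehavior=LOG \
    --configuration '{"Version":1.0,"CrawlerOutput":{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}'
```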
How to create a single schema for each Amazon S3 include path
By default, when a crawler defines tables for data stored in Amazon S3, it considers both data compatibility and schema similarity. Data compatibility factors that it considers include whether the data is of the same format (for example, JSON), the same compression type (for example, GZIP), the structure of the Amazon S3 path, and other data attributes. Schema similarity is a measure of how closely the schemas of separate Amazon S3 objects are similar.
You can configure a crawler to CombineCompatibleSchemas into a common table definition when possible. With this option, the crawler still considers data compatibility, but ignores the similarity of the specific schemas when evaluating Amazon S3 objects in the specified include path.
If you are configuring the crawler on the console, to combine schemas, select the crawler option Create a single schema for each S3 path.
When you configure the crawler using the API, set the following configuration option:
Set the Configuration field with a string representation of the following JSON object in the crawler API; for example:

```
{
    "Version": 1.0,
    "Grouping": { "TableGroupingPolicy": "CombineCompatibleSchemas" }
}
```
To help illustrate this option, suppose that you define a crawler with an include path s3://bucket/table1/. When the crawler runs, it finds two JSON files with the following characteristics:
- File 1 – s3://bucket/table1/year=2017/data1.json
  - File content – {"A": 1, "B": 2}
  - Schema – A:int, B:int
- File 2 – s3://bucket/table1/year=2018/data2.json
  - File content – {"C": 3, "D": 4}
  - Schema – C:int, D:int
By default, the crawler creates two tables, named year_2017 and year_2018, because the schemas are not sufficiently similar. However, if the option Create a single schema for each S3 path is selected, and if the data is compatible, the crawler creates one table. The table has the schema A:int,B:int,C:int,D:int and the partition key year:string.
How to specify the table location and partitioning level
By default, when a crawler defines tables for data stored in Amazon S3, the crawler attempts to merge schemas together and create top-level tables (year=2019). In some cases, you may expect the crawler to create a table for the folder month=Jan, but instead the crawler creates a partition, because a sibling folder (month=Mar) was merged into the same table.
The table level crawler option provides you the flexibility to tell the crawler where the tables are located, and how you want partitions created. When you specify a Table level, the table is created at that absolute level from the Amazon S3 bucket.
When configuring the crawler on the console, you can specify a value for the Table level crawler option. The value must be a positive integer that indicates the table location (the absolute level in the dataset). The level for the top-level folder is 1. For example, for the path mydataset/year/month/day/hour, if the level is set to 3, the table is created at location mydataset/year/month.
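If you configure the crawler through the API instead, the table level is carried in the Configuration field. The following is a sketch assuming the level of 3 from the example above:

```
{
    "Version": 1.0,
    "Grouping": { "TableLevelConfiguration": 3 }
}
```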
How to specify the maximum number of tables the crawler is allowed to create
You can optionally specify the maximum number of tables the crawler is allowed to create by specifying a TableThreshold via the AWS Glue console or CLI. If the number of tables detected by the crawler during its crawl is greater than this input value, the crawl fails and no data is written to the Data Catalog.
This parameter is useful when the tables that would be detected and created by the crawler are many more than you expect. There can be multiple reasons for this, such as:
- When using an AWS Glue job to populate your Amazon S3 locations, you can end up with empty files at the same level as a folder. When you run a crawler on such an Amazon S3 location, the crawler creates multiple tables due to files and folders present at the same level.
- If you do not configure "TableGroupingPolicy": "CombineCompatibleSchemas", you may end up with more tables than expected.
You specify the TableThreshold as an integer value greater than 0. This value is configured on a per-crawler basis and is applied on every crawl. For example, suppose a crawler has the TableThreshold value set to 5. On each crawl, AWS Glue compares the number of tables detected with this table threshold value (5). If the number of tables detected is less than 5, AWS Glue writes the tables to the Data Catalog; if not, the crawl fails without writing to the Data Catalog.
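For example, you might set the threshold with the AWS CLI as in the following sketch (the crawler name my-crawler is a placeholder); the full configuration string is shown in the CLI section below:

```
aws glue update-crawler \
    --name my-crawler \
    --configuration '{"Version":1.0,"CrawlerOutput":{"Tables":{"AddOrUpdateBehavior":"MergeNewColumns","TableThreshold":5}}}'
```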
Console
To set TableThreshold using the AWS console:
CLI
To set TableThreshold using the AWS CLI, set the Configuration field with a string representation of the following JSON object; for example:

```
{
    "Version": 1.0,
    "CrawlerOutput": {
        "Tables": {
            "AddOrUpdateBehavior": "MergeNewColumns",
            "TableThreshold": 5
        }
    }
}
```
Error messages are logged to help you identify table paths and clean up your data. Example log in your account if the crawler fails because the table count is greater than the table threshold value provided:

```
Table Threshold value = 28, Tables detected - 29
```
In CloudWatch, we log all table locations detected as an INFO message. An error is logged as the reason for the failure.
```
ERROR com.amazonaws.services.glue.customerLogs.CustomerLogService - CustomerLogService received CustomerFacingException with message
The number of tables detected by crawler: 29 is greater than the table threshold value provided: 28. Failing crawler without writing to Data Catalog.
com.amazonaws.services.glue.exceptions.CustomerFacingInternalException: The number of tables detected by crawler: 29 is greater than
the table threshold value provided: 28. Failing crawler without writing to Data Catalog.
```
How to specify configuration options for a Delta Lake data store
When you configure a crawler for a Delta Lake data store, you specify these configuration parameters:
- Connection – Optionally select or add a Network connection to use with this Amazon S3 target. For information about connections, see Connecting to data.
- Create tables for querying – Select how you want to create the Delta Lake tables:
  - Create Native tables: Allow integration with query engines that support querying of the Delta transaction log directly.
  - Create Symlink tables: Create a symlink manifest folder with manifest files partitioned by the partition keys, based on the specified configuration parameters.
- Enable write manifest (configurable only if you've selected Create Symlink tables for a Delta Lake source) – Select whether to detect table metadata or schema changes in the Delta Lake transaction log; when a change is detected, the crawler regenerates the manifest file. You should not choose this option if you configured an automatic manifest update with Delta Lake SET TBLPROPERTIES.
- Include delta lake table path(s) – Specify one or more Amazon S3 paths to Delta tables as s3://bucket/prefix/object.
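For reference, a Delta Lake target can also be defined when creating a crawler with the AWS CLI. The following is a minimal sketch; the crawler name, role, database, and table path are placeholders:

```
aws glue create-crawler \
    --name my-delta-crawler \
    --role AWSGlueServiceRole-Delta \
    --database-name delta_db \
    --targets '{
        "DeltaTargets": [{
            "DeltaTables": ["s3://amzn-s3-demo-bucket/delta/table1/"],
            "CreateNativeDeltaTable": true,
            "WriteManifest": false
        }]
    }'
```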
How to configure a crawler to use Lake Formation credentials
You can configure a crawler to use AWS Lake Formation credentials to access an Amazon S3 data store or a Data Catalog table with an underlying Amazon S3 location within the same AWS account or another AWS account. You can configure an existing Data Catalog table as a crawler's target if the crawler and the Data Catalog table reside in the same account. Currently, only a single catalog target with a single catalog table is allowed when using a Data Catalog table as a crawler's target.
Note
When you are defining a Data Catalog table as a crawler target, make sure that the underlying location of the Data Catalog table is an Amazon S3 location. Crawlers that use Lake Formation credentials only support Data Catalog targets with underlying Amazon S3 locations.
Setup required when the crawler and registered Amazon S3 location or Data Catalog table reside in the same account (in-account crawling)
To allow the crawler to access a data store or Data Catalog table by using Lake Formation credentials, you need to register the data location with Lake Formation. Also, the crawler's IAM role must have permissions to read the data from the destination where the Amazon S3 bucket is registered.
You can complete the following configuration steps using the AWS Management Console or AWS Command Line Interface (AWS CLI).
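As a sketch of what these steps can look like with the AWS CLI (the role names, bucket, database, and account ID are placeholders), you register the location, grant the crawler's IAM role access to it, and enable Lake Formation credentials on the crawler:

```
# Register the Amazon S3 location with Lake Formation
aws lakeformation register-resource \
    --resource-arn arn:aws:s3:::amzn-s3-demo-bucket/data \
    --role-arn arn:aws:iam::111122223333:role/LakeFormationRegistrationRole

# Grant the crawler's IAM role access to the registered location
aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/GlueCrawlerRole \
    --permissions DATA_LOCATION_ACCESS \
    --resource '{"DataLocation":{"ResourceArn":"arn:aws:s3:::amzn-s3-demo-bucket/data"}}'

# Create the crawler with Lake Formation credentials enabled
aws glue create-crawler \
    --name my-lf-crawler \
    --role arn:aws:iam::111122223333:role/GlueCrawlerRole \
    --database-name lf_db \
    --targets '{"S3Targets":[{"Path":"s3://amzn-s3-demo-bucket/data"}]}' \
    --lake-formation-configuration '{"UseLakeFormationCredentials":true}'
```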
Setup required when the crawler and registered Amazon S3 location reside in different accounts (cross-account crawling)
To allow the crawler to access a data store in a different account using Lake Formation credentials, you must first register the Amazon S3 data location with Lake Formation. Then, you grant data location permissions to the crawler's account by taking the following steps.
You can complete the following steps using the AWS Management Console or AWS CLI.
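For example, the data location grant might look like the following sketch, run from the account that owns the registered location; the account IDs and bucket are placeholders:

```
# Grant DATA_LOCATION_ACCESS to the crawler's account (444455556666)
aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=444455556666 \
    --permissions DATA_LOCATION_ACCESS \
    --resource '{"DataLocation":{"CatalogId":"111122223333","ResourceArn":"arn:aws:s3:::amzn-s3-demo-bucket/data"}}'
```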
Note
- A crawler using Lake Formation credentials is only supported for Amazon S3 and Data Catalog targets.
- For targets using Lake Formation credential vending, the underlying Amazon S3 locations must belong to the same bucket. For example, you can use multiple targets (s3://bucket1/folder1, s3://bucket1/folder2) as long as all target locations are under the same bucket (bucket1). Specifying different buckets (s3://bucket1/folder1, s3://bucket2/folder2) is not allowed.
- Currently, for Data Catalog target crawlers, only a single catalog target with a single catalog table is allowed.