Setting crawler configuration options

When a crawler runs, it might encounter changes to your data store that result in a schema or partition that is different from a previous crawl. You can use the AWS Management Console or the AWS Glue API to configure how your crawler processes certain types of changes.

Setting crawler configuration options on the AWS Glue console

When you define a crawler using the AWS Glue console, you have several options for configuring the behavior of your crawler. For more information about using the AWS Glue console to add a crawler, see Working with crawlers on the AWS Glue console.

When a crawler runs against a previously crawled data store, it might discover that a schema has changed or that some objects in the data store have been deleted. The crawler logs changes to a schema. Depending on the source type for the crawler, new tables and partitions might be created regardless of the schema change policy.

To specify what the crawler does when it finds changes in the schema, you can choose one of the following actions on the console:

  • Update the table definition in the Data Catalog – Add new columns, remove missing columns, and modify the definitions of existing columns in the AWS Glue Data Catalog. Remove any metadata that is not set by the crawler. This is the default setting.

  • Add new columns only – For tables that map to an Amazon S3 data store, add new columns as they are discovered, but don't remove or change the type of existing columns in the Data Catalog. Choose this option when the current columns in the Data Catalog are correct and you don't want the crawler to remove or change the type of the existing columns. If a fundamental Amazon S3 table attribute changes, such as classification, compression type, or CSV delimiter, mark the table as deprecated. Maintain input format and output format as they exist in the Data Catalog. Update SerDe parameters only if the parameter is one that is set by the crawler. For all other data stores, modify existing column definitions.

  • Ignore the change and don't update the table in the Data Catalog – Only new tables and partitions are created.

    This is the default setting for incremental crawls.

A crawler might also discover new or changed partitions. By default, new partitions are added and existing partitions are updated if they have changed. In addition, you can set a crawler configuration option to Update all new and existing partitions with metadata from the table on the AWS Glue console. When this option is set, partitions inherit metadata properties—such as their classification, input format, output format, SerDe information, and schema—from their parent table. Any changes to these properties in a table are propagated to its partitions. When this configuration option is set on an existing crawler, existing partitions are updated to match the properties of their parent table the next time the crawler runs.

To specify what the crawler does when it finds a deleted object in the data store, choose one of the following actions:

  • Delete tables and partitions from the Data Catalog

  • Ignore the change and don't update the table in the Data Catalog

    This is the default setting for incremental crawls.

  • Mark the table as deprecated in the Data Catalog – This is the default setting.

Setting crawler configuration options using the API

When you define a crawler using the AWS Glue API, you can choose from several fields to configure your crawler. The SchemaChangePolicy in the crawler API determines what the crawler does when it discovers a changed schema or a deleted object. The crawler logs schema changes as it runs.

When a crawler runs, new tables and partitions are always created regardless of the schema change policy. You can choose one of the following actions in the UpdateBehavior field in the SchemaChangePolicy structure to determine what the crawler does when it finds a changed table schema:

  • UPDATE_IN_DATABASE – Update the table in the AWS Glue Data Catalog. Add new columns, remove missing columns, and modify the definitions of existing columns. Remove any metadata that is not set by the crawler.

  • LOG – Ignore the changes, and don't update the table in the Data Catalog.

    This is the default setting for incremental crawls.
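As an illustrative sketch, the SchemaChangePolicy can be set through the API, for example with the boto3 Python SDK. The crawler name below is a placeholder, and the service call is shown commented out so the snippet stays self-contained:

```python
import json

# SchemaChangePolicy fields as described above. UPDATE_IN_DATABASE updates
# the table in the Data Catalog; LOG only records the change.
schema_change_policy = {
    "UpdateBehavior": "UPDATE_IN_DATABASE",     # or "LOG"
    "DeleteBehavior": "DEPRECATE_IN_DATABASE",  # default for deleted objects
}

# With boto3, this structure would be passed to create_crawler or
# update_crawler, e.g.:
#   import boto3
#   glue = boto3.client("glue")
#   glue.update_crawler(Name="my-crawler",
#                       SchemaChangePolicy=schema_change_policy)

print(json.dumps(schema_change_policy))
```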

You can also override the SchemaChangePolicy structure using a JSON object supplied in the crawler API Configuration field. This JSON object can contain a key-value pair to set the policy to not update existing columns and only add new columns. For example, provide the following JSON object as a string:

{ "Version": 1.0, "CrawlerOutput": { "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" } } }

This option corresponds to the Add new columns only option on the AWS Glue console. It overrides the SchemaChangePolicy structure for tables that result from crawling Amazon S3 data stores only. Choose this option if you want to maintain the metadata as it exists in the Data Catalog (the source of truth). New columns are added as they are encountered, including nested data types. But existing columns are not removed, and their type is not changed. If an Amazon S3 table attribute changes significantly, mark the table as deprecated, and log a warning that an incompatible attribute needs to be resolved.

When a crawler runs against a previously crawled data store, it might discover new or changed partitions. By default, new partitions are added and existing partitions are updated if they have changed. In addition, you can set a crawler configuration option to InheritFromTable (corresponding to the Update all new and existing partitions with metadata from the table option on the AWS Glue console). When this option is set, partitions inherit metadata properties from their parent table, such as their classification, input format, output format, SerDe information, and schema. Any property changes to the parent table are propagated to its partitions.

When this configuration option is set on an existing crawler, existing partitions are updated to match the properties of their parent table the next time the crawler runs. This behavior is set using the crawler API Configuration field. For example, provide the following JSON object as a string:

{ "Version": 1.0, "CrawlerOutput": { "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" } } }

The crawler API Configuration field can set multiple configuration options. For example, to configure the crawler output for both partitions and tables, you can provide a string representation of the following JSON object:

{ "Version": 1.0, "CrawlerOutput": { "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }, "Tables": {"AddOrUpdateBehavior": "MergeNewColumns" } } }
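Because the Configuration field takes a string rather than a JSON object, a common pattern is to build the object in code and serialize it. A minimal sketch in Python (the crawler name is hypothetical, and the boto3 call is commented out):

```python
import json

# Combined crawler output options from the examples above.
config = {
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"},
    },
}

# The API expects a string representation, not a JSON object:
config_str = json.dumps(config)

# e.g. boto3.client("glue").update_crawler(Name="my-crawler",
#                                          Configuration=config_str)
print(config_str)
```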

You can choose one of the following actions to determine what the crawler does when it finds a deleted object in the data store. The DeleteBehavior field in the SchemaChangePolicy structure in the crawler API sets the behavior of the crawler when it discovers a deleted object.

  • DELETE_FROM_DATABASE – Delete tables and partitions from the Data Catalog.

  • LOG – Ignore the change. Don't update the Data Catalog. Write a log message instead.

  • DEPRECATE_IN_DATABASE – Mark the table as deprecated in the Data Catalog. This is the default setting.

How to prevent the crawler from changing an existing schema

If you don't want a crawler to overwrite updates you made to existing fields in an Amazon S3 table definition, choose the option on the console to Add new columns only or set the configuration option MergeNewColumns. This applies to tables and partitions, unless Partitions.AddOrUpdateBehavior is overridden to InheritFromTable.

If you don't want a table schema to change at all when a crawler runs, set the schema change policy to LOG. You can also set a configuration option that sets partition schemas to inherit from the table.

If you are configuring the crawler on the console, you can choose the following actions:

  • Ignore the change and don't update the table in the Data Catalog

  • Update all new and existing partitions with metadata from the table

When you configure the crawler using the API, set the following parameters:

  • Set the UpdateBehavior field in SchemaChangePolicy structure to LOG.

  • Set the Configuration field with a string representation of the following JSON object in the crawler API; for example:

    { "Version": 1.0, "CrawlerOutput": { "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" } } }
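The two API settings above can be sketched together in Python (crawler name hypothetical, boto3 call commented out):

```python
import json

# Freeze the table schema: UpdateBehavior LOG suppresses table updates,
# and the InheritFromTable partition behavior keeps partition schemas
# aligned with the (unchanging) parent table.
params = {
    "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    "Configuration": json.dumps({
        "Version": 1.0,
        "CrawlerOutput": {
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        },
    }),
}

# e.g. boto3.client("glue").update_crawler(Name="my-crawler", **params)
print(params["Configuration"])
```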

How to create a single schema for each Amazon S3 include path

By default, when a crawler defines tables for data stored in Amazon S3, it considers both data compatibility and schema similarity. Data compatibility factors that it considers include whether the data is of the same format (for example, JSON), the same compression type (for example, GZIP), the structure of the Amazon S3 path, and other data attributes. Schema similarity is a measure of how closely the schemas of separate Amazon S3 objects match.

You can configure a crawler to CombineCompatibleSchemas into a common table definition when possible. With this option, the crawler still considers data compatibility, but ignores the similarity of the specific schemas when evaluating Amazon S3 objects in the specified include path.

If you are configuring the crawler on the console, to combine schemas, select the crawler option Create a single schema for each S3 path.

When you configure the crawler using the API, set the following configuration option:

  • Set the Configuration field with a string representation of the following JSON object in the crawler API; for example:

    { "Version": 1.0, "Grouping": { "TableGroupingPolicy": "CombineCompatibleSchemas" } }

To help illustrate this option, suppose that you define a crawler with an include path s3://bucket/table1/. When the crawler runs, it finds two JSON files with the following characteristics:

  • File 1 – s3://bucket/table1/year=2017/data1.json

    File content – {"A": 1, "B": 2}

    Schema – A:int, B:int

  • File 2 – s3://bucket/table1/year=2018/data2.json

    File content – {"C": 3, "D": 4}

    Schema – C:int, D:int

By default, the crawler creates two tables, named year_2017 and year_2018, because the schemas are not sufficiently similar. However, if the option Create a single schema for each S3 path is selected and the data is compatible, the crawler creates one table. The table has the schema A:int,B:int,C:int,D:int and the partition key year:string.
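Conceptually, the single-table result is the union of the per-file schemas, with the partition column inferred from the year=... folders. A toy Python illustration (not the crawler's actual algorithm):

```python
# Per-file schemas from the example above.
file1_schema = {"A": "int", "B": "int"}  # year=2017/data1.json
file2_schema = {"C": "int", "D": "int"}  # year=2018/data2.json

# Union of columns when Create a single schema for each S3 path is selected
# and the data is compatible.
merged_schema = {**file1_schema, **file2_schema}

# Partition key inferred from the folder structure (year=2017, year=2018).
partition_keys = {"year": "string"}

print(merged_schema)   # {'A': 'int', 'B': 'int', 'C': 'int', 'D': 'int'}
print(partition_keys)  # {'year': 'string'}
```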

How to specify the table location and partitioning level

By default, when a crawler defines tables for data stored in Amazon S3, the crawler attempts to merge schemas together and create top-level tables (year=2019). In some cases, you might expect the crawler to create a table for the folder month=Jan, but instead the crawler creates a partition because a sibling folder (month=Mar) was merged into the same table.

The table level crawler option gives you the flexibility to tell the crawler where tables are located and how you want partitions created. When you specify a Table level, the table is created at that absolute level from the Amazon S3 bucket.


        Crawler grouping with table level specified as level 2.

When configuring the crawler on the console, you can specify a value for the Table level crawler option. The value must be a positive integer that indicates the table location (the absolute level in the dataset). The level of the top-level folder is 1. For example, for the path mydataset/a/b, if the level is set to 3, the table is created at location mydataset/a/b.

Console

              Specifying a table level in the crawler configuration.
API

When you configure the crawler using the API, set the Configuration field with a string representation of the following JSON object; for example:

{ "Version": 1.0, "Grouping": { "TableLevelConfiguration": 2 } }
CloudFormation

In this example, you set the Table level option available in the console within your CloudFormation template:

"Configuration": "{ \"Version\":1.0, \"Grouping\":{\"TableLevelConfiguration\":2} }"
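To make the level arithmetic concrete, here is a small hypothetical helper (not part of any AWS SDK) that returns the prefix a table would be created at for a given level:

```python
def table_location(path: str, level: int) -> str:
    """Return the prefix at the given absolute level, where the top-level
    folder is level 1. Hypothetical illustration of the Table level
    option; the crawler performs this grouping internally."""
    parts = path.strip("/").split("/")
    return "/".join(parts[:level])

# For the path mydataset/a/b with the level set to 3, the table is
# created at location mydataset/a/b:
print(table_location("mydataset/a/b", 3))  # mydataset/a/b
print(table_location("mydataset/a/b", 1))  # mydataset
```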

How to specify the maximum number of tables the crawler is allowed to create

You can optionally specify the maximum number of tables the crawler is allowed to create by specifying a TableThreshold value using the AWS Glue console or AWS CLI. If the number of tables detected by the crawler during its crawl is greater than this value, the crawl fails and no data is written to the Data Catalog.

This parameter is useful when the crawler would detect and create many more tables than you expect. There can be multiple reasons for this, such as:

  • When using an AWS Glue job to populate your Amazon S3 locations, you can end up with empty files at the same level as a folder. When you run a crawler on such an Amazon S3 location, the crawler creates multiple tables because of the files and folders present at the same level.

  • If you do not configure "TableGroupingPolicy": "CombineCompatibleSchemas", you might end up with more tables than expected.

You specify the TableThreshold as an integer value greater than 0. This value is configured on a per-crawler basis and is evaluated on every crawl. For example, suppose a crawler has a TableThreshold value of 5. On each crawl, AWS Glue compares the number of tables detected with this threshold. If the number of tables detected is less than the threshold, AWS Glue writes the tables to the Data Catalog; otherwise, the crawl fails without writing to the Data Catalog.
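The threshold check amounts to a simple comparison. A toy sketch, assuming (per the error message logged on failure) that the crawl fails only when the detected count exceeds the threshold:

```python
def crawl_writes_to_catalog(tables_detected: int, table_threshold: int) -> bool:
    """Toy model of the TableThreshold check (not AWS Glue's code): the
    crawl writes to the Data Catalog only if the number of detected
    tables does not exceed the threshold; otherwise it fails."""
    return tables_detected <= table_threshold

print(crawl_writes_to_catalog(4, 5))    # True: tables are written
print(crawl_writes_to_catalog(29, 28))  # False: crawl fails, nothing written
```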

Console

To set TableThreshold using the AWS console:


          The Output and scheduling section of the AWS console showing the Maximum table threshold parameter.

CLI

To set TableThreshold using the AWS CLI:

'{"Version":1.0, "CrawlerOutput": {"Tables":{"AddOrUpdateBehavior":"MergeNewColumns", "TableThreshold":5}}}'

Error messages are logged to help you identify table paths and clean up your data. For example, if the crawler fails because the number of detected tables is greater than the table threshold value provided, a log like the following appears in your account:

Table Threshold value = 28, Tables detected - 29

In CloudWatch, we log all table locations detected as an INFO message. An error is logged as the reason for the failure.

ERROR com.amazonaws.services.glue.customerLogs.CustomerLogService - CustomerLogService received CustomerFacingException with message The number of tables detected by crawler: 29 is greater than the table threshold value provided: 28. Failing crawler without writing to Data Catalog. com.amazonaws.services.glue.exceptions.CustomerFacingInternalException: The number of tables detected by crawler: 29 is greater than the table threshold value provided: 28. Failing crawler without writing to Data Catalog.

How to specify configuration options for a Delta Lake data store

When you configure a crawler for a Delta Lake data store, you specify these configuration parameters:

Connection

Optionally select or add a Network connection to use with this Amazon S3 target. For information about connections, see Defining connections in the AWS Glue Data Catalog.

Enable write manifest

Choose this option if you want the crawler to detect table metadata or schema changes in the Delta Lake transaction log and regenerate the manifest file. Don't choose this option if you configured an automatic manifest update with Delta Lake SET TBLPROPERTIES.

Include delta lake table path(s)

Specify one or more Amazon S3 paths to Delta tables as s3://bucket/prefix/object.


              Specifying crawling a Delta Lake data store.
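These console settings correspond to the DeltaTargets structure in the create-crawler API. A sketch with placeholder bucket, connection, and crawler names (the boto3 call is commented out):

```python
import json

# Hypothetical Delta Lake crawler target; the bucket, connection name,
# and crawler details are placeholders.
targets = {
    "DeltaTargets": [
        {
            "DeltaTables": ["s3://amzn-s3-demo-bucket/prefix/delta-table"],
            "WriteManifest": True,  # the "Enable write manifest" option
            # "ConnectionName": "my-network-connection",  # optional
        }
    ]
}

# e.g. boto3.client("glue").create_crawler(
#     Name="delta-crawler",
#     Role="arn:aws:iam::111122223333:role/GlueRole",
#     DatabaseName="delta_db",
#     Targets=targets,
# )
print(json.dumps(targets))
```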

How to configure a crawler to use Lake Formation credentials

This feature is in preview release and is subject to change. For more information, see the Betas and Previews section in the AWS Service Terms document.

You can configure a crawler to use AWS Lake Formation credentials to access an Amazon S3 data store or a Data Catalog table with an underlying Amazon S3 location within the same AWS account or another AWS account. You can configure an existing Data Catalog table as a crawler's target if the crawler and the Data Catalog table reside in the same account. Currently, only a single catalog target with a single catalog table is allowed when using a Data Catalog table as a crawler's target.

Note

When you are defining a Data Catalog table as a crawler target, make sure that the underlying location of the Data Catalog table is an Amazon S3 location. Crawlers that use Lake Formation credentials only support Data Catalog targets with underlying Amazon S3 locations.

Setup required when the crawler and registered Amazon S3 location or Data Catalog table reside in the same account (in-account crawling)

To allow the crawler to access a data store or Data Catalog table by using Lake Formation credentials, you need to register the data location with Lake Formation. Also, the crawler's IAM role must have permissions to read the data from the destination where the Amazon S3 bucket is registered.

You can complete the following configuration steps using the AWS Management Console or AWS Command Line Interface (AWS CLI).

AWS Management Console
  1. Before configuring a crawler to access the crawler source, register the data location of the data store or the Data Catalog with Lake Formation. In the Lake Formation console (https://console.aws.amazon.com/lakeformation/), register an Amazon S3 location as the root location of your data lake in the AWS account where the crawler is defined. For more information, see Registering an Amazon S3 location.

  2. Grant Data location permissions to the IAM role that's used for the crawler run so that the crawler can read the data from the destination in Lake Formation. For more information, see Granting data location permissions (same account).

  3. Grant the crawler role access permissions (Create, Describe, Alter) to the database, which is specified as the output database. For more information, see Granting database permissions using the Lake Formation console and the named resource method.

  4. In the IAM console (https://console.aws.amazon.com/iam/), create an IAM role for the crawler. Add the lakeformation:GetDataAccess policy to the role.

  5. In the AWS Glue console (https://console.aws.amazon.com/glue/), while configuring the crawler, select the option Use Lake Formation credentials for crawling Amazon S3 data source.

    Note

    The accountId field is optional for in-account crawling.

AWS CLI
aws glue --profile demo create-crawler --debug --cli-input-json '{
    "Name": "prod-test-crawler",
    "Role": "arn:aws:iam::111122223333:role/service-role/AWSGlueServiceRole-prod-test-run-role",
    "DatabaseName": "prod-run-db",
    "Description": "",
    "Targets": {
        "S3Targets": [{ "Path": "s3://crawl-testbucket" }]
    },
    "SchemaChangePolicy": {
        "UpdateBehavior": "LOG",
        "DeleteBehavior": "LOG"
    },
    "RecrawlPolicy": { "RecrawlBehavior": "CRAWL_EVERYTHING" },
    "LineageConfiguration": { "CrawlerLineageSettings": "DISABLE" },
    "LakeFormationConfiguration": {
        "UseLakeFormationCredentials": true,
        "AccountId": "111122223333"
    },
    "Configuration": "{\"Version\": 1.0, \"CrawlerOutput\": {\"Partitions\": {\"AddOrUpdateBehavior\": \"InheritFromTable\"}, \"Tables\": {\"AddOrUpdateBehavior\": \"MergeNewColumns\"}}, \"Grouping\": {\"TableGroupingPolicy\": \"CombineCompatibleSchemas\"}}",
    "CrawlerSecurityConfiguration": "",
    "Tags": { "KeyName": "" }
}'

Setup required when the crawler and registered Amazon S3 location reside in different accounts (cross-account crawling)

To allow the crawler to access a data store in a different account using Lake Formation credentials, you must first register the Amazon S3 data location with Lake Formation. Then, you grant data location permissions to the crawler's account by taking the following steps.

You can complete the following steps using the AWS Management Console or AWS CLI.

AWS Management Console
  1. In the account where the Amazon S3 location is registered (account B):

    1. Register an Amazon S3 path with Lake Formation. For more information, see Registering Amazon S3 location.

    2. Grant Data location permissions to the account (account A) where the crawler will be run. For more information, see Grant data location permissions.

    3. Create an empty database in Lake Formation with the underlying location as the target Amazon S3 location. For more information, see Creating a database.

    4. Grant account A (the account where the crawler will be run) access to the database that you created in the previous step. For more information, see Granting database permissions.

  2. In the account where the crawler is created and will be run (account A):

    1. Using the AWS RAM console, accept the database that was shared from the external account (account B). For more information, see Accepting a resource share invitation from AWS Resource Access Manager.

    2. Create an IAM role for the crawler. Add lakeformation:GetDataAccess policy to the role.

    3. In the Lake Formation console (https://console.aws.amazon.com/lakeformation/), grant Data location permissions on the target Amazon S3 location to the IAM role used for the crawler run so that the crawler can read the data from the destination in Lake Formation. For more information, see Granting data location permissions.

    4. Create a resource link on the shared database. For more information, see Create a resource link.

    5. Grant the crawler role access permissions (Create, Describe, Alter) on the shared database and the resource link. The resource link is specified in the output for the crawler.

    6. In the AWS Glue console (https://console.aws.amazon.com/glue/), while configuring the crawler, select the option Use Lake Formation credentials for crawling Amazon S3 data source.

      For cross-account crawling, specify the AWS account ID where the target Amazon S3 location is registered with Lake Formation. For in-account crawling, the accountId field is optional.

AWS CLI
aws glue --profile demo create-crawler --debug --cli-input-json '{
    "Name": "prod-test-crawler",
    "Role": "arn:aws:iam::111122223333:role/service-role/AWSGlueServiceRole-prod-test-run-role",
    "DatabaseName": "prod-run-db",
    "Description": "",
    "Targets": {
        "S3Targets": [{ "Path": "s3://crawl-testbucket" }]
    },
    "SchemaChangePolicy": {
        "UpdateBehavior": "LOG",
        "DeleteBehavior": "LOG"
    },
    "RecrawlPolicy": { "RecrawlBehavior": "CRAWL_EVERYTHING" },
    "LineageConfiguration": { "CrawlerLineageSettings": "DISABLE" },
    "LakeFormationConfiguration": {
        "UseLakeFormationCredentials": true,
        "AccountId": "111111111111"
    },
    "Configuration": "{\"Version\": 1.0, \"CrawlerOutput\": {\"Partitions\": {\"AddOrUpdateBehavior\": \"InheritFromTable\"}, \"Tables\": {\"AddOrUpdateBehavior\": \"MergeNewColumns\"}}, \"Grouping\": {\"TableGroupingPolicy\": \"CombineCompatibleSchemas\"}}",
    "CrawlerSecurityConfiguration": "",
    "Tags": { "KeyName": "" }
}'
Note
  • A crawler using Lake Formation credentials is only supported for Amazon S3 and Data Catalog targets.

  • For targets using Lake Formation credential vending, the underlying Amazon S3 locations must belong to the same bucket. For example, you can use multiple targets (s3://bucket1/folder1, s3://bucket1/folder2) as long as all target locations are under the same bucket (bucket1). Specifying different buckets (s3://bucket1/folder1, s3://bucket2/folder2) is not allowed.

  • Currently for Data Catalog target crawlers, only a single catalog target with a single catalog table is allowed.
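The same-bucket constraint can be checked up front. A small hypothetical helper, not part of any AWS SDK:

```python
from urllib.parse import urlparse

def share_one_bucket(paths):
    """Return True if all s3:// target paths use the same bucket, as
    required for Lake Formation credential vending (hypothetical
    illustration of the constraint described above)."""
    buckets = {urlparse(p).netloc for p in paths}
    return len(buckets) == 1

print(share_one_bucket(["s3://bucket1/folder1", "s3://bucket1/folder2"]))  # True
print(share_one_bucket(["s3://bucket1/folder1", "s3://bucket2/folder2"]))  # False
```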