AWS CloudFormation
User Guide

AWS::Glue::Crawler

The AWS::Glue::Crawler resource specifies an AWS Glue crawler. For more information, see Cataloging Tables with a Crawler and Crawler Structure in the AWS Glue Developer Guide.

Syntax

To declare this entity in your AWS CloudFormation template, use the following syntax:

JSON

{
  "Type" : "AWS::Glue::Crawler",
  "Properties" : {
    "Classifiers" : [ String, ... ],
    "Configuration" : String,
    "CrawlerSecurityConfiguration" : String,
    "DatabaseName" : String,
    "Description" : String,
    "Name" : String,
    "Role" : String,
    "Schedule" : Schedule,
    "SchemaChangePolicy" : SchemaChangePolicy,
    "TablePrefix" : String,
    "Tags" : Json,
    "Targets" : Targets
  }
}

YAML

Type: AWS::Glue::Crawler
Properties:
  Classifiers:
    - String
  Configuration: String
  CrawlerSecurityConfiguration: String
  DatabaseName: String
  Description: String
  Name: String
  Role: String
  Schedule: Schedule
  SchemaChangePolicy: SchemaChangePolicy
  TablePrefix: String
  Tags: Json
  Targets: Targets

Properties

Classifiers

A list of UTF-8 strings that specify the custom classifiers that are associated with the crawler.

Required: No

Type: List of String

Update requires: No interruption

Configuration

Crawler configuration information. This versioned JSON string allows users to specify aspects of a crawler's behavior. For more information, see Configuring a Crawler.

Required: No

Type: String

Update requires: No interruption
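Because Configuration is a JSON document passed as a string, quoting matters when the template is written in YAML. The fragment below is a minimal sketch, not a complete resource: the logical ID MyCrawler2 and the configuration values are borrowed from the examples in this topic, and the required properties are left out for brevity. It assumes a single-quoted YAML scalar, which lets the JSON keep its double quotes without backslash escapes.

```yaml
MyCrawler2:
  Type: AWS::Glue::Crawler
  Properties:
    # Required properties (Role, DatabaseName, Targets) omitted for brevity.
    # Single quotes keep the JSON's double quotes literal, so no escaping is needed.
    Configuration: '{"Version":1.0,"CrawlerOutput":{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}'
```

In a JSON template, the same value has to be written with each inner double quote escaped as \".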

CrawlerSecurityConfiguration

The name of the SecurityConfiguration structure to be used by this crawler.

Required: No

Type: String

Update requires: No interruption

DatabaseName

The name of the database in which the crawler's output is stored.

Required: Yes

Type: String

Update requires: No interruption

Description

A description of the crawler.

Required: No

Type: String

Update requires: No interruption

Name

The name of the crawler.

Required: No

Type: String

Update requires: Replacement

Role

The Amazon Resource Name (ARN) of an IAM role that's used to access customer resources, such as Amazon Simple Storage Service (Amazon S3) data.

Required: Yes

Type: String

Update requires: No interruption

Schedule

For scheduled crawlers, the schedule on which the crawler runs.

Required: No

Type: Schedule

Update requires: No interruption

SchemaChangePolicy

The policy that specifies update and delete behaviors for the crawler.

Required: No

Type: SchemaChangePolicy

Update requires: No interruption

TablePrefix

The prefix added to the names of tables that are created.

Required: No

Type: String

Update requires: No interruption

Tags

The tags to use with this crawler request. You can use tags to limit access to the crawler. For more information about tags in AWS Glue, see AWS Tags in AWS Glue in the developer guide.

Required: No

Type: Json

Update requires: No interruption
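As a rough illustration of the Json type, the fragment below sketches tags written as a map of tag keys to tag values. The map shape is an assumption based on how AWS Glue tags are commonly declared, and the tag keys and values shown are hypothetical; check the current resource specification before relying on it.

```yaml
MyCrawler2:
  Type: AWS::Glue::Crawler
  Properties:
    # Required properties (Role, DatabaseName, Targets) omitted for brevity.
    Tags:
      Environment: "test"    # hypothetical tag key and value
      Team: "analytics"      # hypothetical tag key and value
```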

Targets

A collection of targets to crawl.

Required: Yes

Type: Targets

Update requires: No interruption

Return Values

Ref

When you pass the logical ID of this resource to the intrinsic Ref function, Ref returns the crawler name.

For more information about using the Ref function, see Ref.
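One way this return value might be used is to expose the crawler name as a stack output. The fragment below is a sketch under that assumption: MyCrawler2 is the logical ID used in the example in this topic, and the output name CrawlerName is hypothetical.

```yaml
Outputs:
  CrawlerName:                 # hypothetical output name
    Description: "Name of the crawler created by this stack"
    Value: !Ref MyCrawler2     # Ref on the crawler's logical ID returns the crawler name
```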

Examples

The following example creates a crawler for an Amazon S3 target.

JSON

{
  "Description": "AWS Glue Crawler Test",
  "Resources": {
    "MyRole": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Service": [ "glue.amazonaws.com" ]
              },
              "Action": [ "sts:AssumeRole" ]
            }
          ]
        },
        "Path": "/",
        "Policies": [
          {
            "PolicyName": "root",
            "PolicyDocument": {
              "Version": "2012-10-17",
              "Statement": [
                {
                  "Effect": "Allow",
                  "Action": "*",
                  "Resource": "*"
                }
              ]
            }
          }
        ]
      }
    },
    "MyDatabase": {
      "Type": "AWS::Glue::Database",
      "Properties": {
        "CatalogId": { "Ref": "AWS::AccountId" },
        "DatabaseInput": {
          "Name": "dbCrawler",
          "Description": "TestDatabaseDescription",
          "LocationUri": "TestLocationUri",
          "Parameters": {
            "key1": "value1",
            "key2": "value2"
          }
        }
      }
    },
    "MyClassifier": {
      "Type": "AWS::Glue::Classifier",
      "Properties": {
        "GrokClassifier": {
          "Name": "CrawlerClassifier",
          "Classification": "wikiData",
          "GrokPattern": "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
        }
      }
    },
    "MyS3Bucket": {
      "Type": "AWS::S3::Bucket",
      "Properties": {
        "BucketName": "crawlertesttarget",
        "AccessControl": "BucketOwnerFullControl"
      }
    },
    "MyCrawler2": {
      "Type": "AWS::Glue::Crawler",
      "Properties": {
        "Name": "testcrawler1",
        "Role": { "Fn::GetAtt": [ "MyRole", "Arn" ] },
        "DatabaseName": { "Ref": "MyDatabase" },
        "Classifiers": [
          { "Ref": "MyClassifier" }
        ],
        "Targets": {
          "S3Targets": [
            { "Path": { "Ref": "MyS3Bucket" } }
          ]
        },
        "SchemaChangePolicy": {
          "UpdateBehavior": "UPDATE_IN_DATABASE",
          "DeleteBehavior": "LOG"
        },
        "Schedule": {
          "ScheduleExpression": "cron(0/10 * ? * MON-FRI *)"
        }
      }
    }
  }
}

YAML

Resources:
  MyRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        - PolicyName: "root"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action: "*"
                Resource: "*"
  MyDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: "dbCrawler"
        Description: "TestDatabaseDescription"
        LocationUri: "TestLocationUri"
        Parameters:
          key1: "value1"
          key2: "value2"
  MyClassifier:
    Type: AWS::Glue::Classifier
    Properties:
      GrokClassifier:
        Name: "CrawlerClassifier"
        Classification: "wikiData"
        GrokPattern: "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
  MyS3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: "crawlertesttarget"
      AccessControl: "BucketOwnerFullControl"
  MyCrawler2:
    Type: AWS::Glue::Crawler
    Properties:
      Name: "testcrawler1"
      Role: !GetAtt MyRole.Arn
      DatabaseName: !Ref MyDatabase
      Classifiers:
        - !Ref MyClassifier
      Targets:
        S3Targets:
          - Path: !Ref MyS3Bucket
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Schedule:
        ScheduleExpression: "cron(0/10 * ? * MON-FRI *)"

Crawler Configuration

The following example specifies a configuration that controls a crawler's behavior.

JSON

{
  "Type": "AWS::Glue::Crawler",
  "Properties": {
    "Role": "role1",
    "Classifiers": [],
    "Description": "example classifier",
    "SchemaChangePolicy": {
      "UpdateBehavior": "UPDATE_IN_DATABASE",
      "DeleteBehavior": "LOG"
    },
    "Schedule": {
      "ScheduleExpression": "cron(0/10 * ? * MON-FRI *)"
    },
    "DatabaseName": "test",
    "Targets": {
      "S3Targets": [
        { "Path": "crawlertesttarget" }
      ]
    },
    "TablePrefix": "test-",
    "Name": "my-crawler",
    "Configuration": "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
  }
}