AWS Glue
Developer Guide

Crawler API

Data Types

Crawler Structure

Specifies a crawler program that examines a data source and uses classifiers to try to determine its schema. If successful, the crawler records metadata concerning the data source in the AWS Glue Data Catalog.

Fields

  • Name – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string pattern.

    The crawler name.

  • Role – UTF-8 string.

    The IAM role (or ARN of an IAM role) used to access customer resources, such as data in Amazon S3.

  • Targets – A CrawlerTargets object.

    A collection of targets to crawl.

  • DatabaseName – UTF-8 string.

    The database where metadata is written by this crawler.

  • Description – Description string, not more than 2048 bytes long, matching the URI address multi-line string pattern.

    A description of the crawler.

  • Classifiers – An array of UTF-8 strings.

    A list of custom classifiers associated with the crawler.

  • SchemaChangePolicy – A SchemaChangePolicy object.

    Sets the behavior when the crawler finds a changed or deleted object.

  • State – UTF-8 string (valid values: READY | RUNNING | STOPPING).

    Indicates whether the crawler is running, or whether a run is pending.

  • TablePrefix – UTF-8 string, not more than 128 bytes long.

    The prefix added to the names of tables that are created.

  • Schedule – A Schedule object.

    For scheduled crawlers, the schedule when the crawler runs.

  • CrawlElapsedTime – Number (long).

    If the crawler is running, contains the total time elapsed since the last crawl began.

  • CreationTime – Timestamp.

    The time when the crawler was created.

  • LastUpdated – Timestamp.

    The time the crawler was last updated.

  • LastCrawl – A LastCrawlInfo object.

    The status of the last crawl, and potentially error information if an error occurred.

  • Version – Number (long).

    The version of the crawler.

  • Configuration – UTF-8 string.

    Crawler configuration information. This versioned JSON string allows users to specify aspects of a crawler's behavior. For more information, see Configuring a Crawler.

  • CrawlerSecurityConfiguration – UTF-8 string, not more than 128 bytes long.

    The name of the SecurityConfiguration structure to be used by this crawler.
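
The Configuration field above holds a versioned JSON string rather than a nested structure. A minimal sketch of building such a string from Python follows; the partition-inheritance option shown is only an illustrative assumption, and the available keys are described in Configuring a Crawler.

    import json

    # Hypothetical crawler Configuration string; the keys shown here are an
    # assumption based on the versioned JSON format in Configuring a Crawler.
    configuration = json.dumps({
        "Version": 1.0,
        "CrawlerOutput": {
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
        }
    })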

Schedule Structure

A scheduling object using a cron statement to schedule an event.

Fields

  • ScheduleExpression – UTF-8 string.

    A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).

  • State – UTF-8 string (valid values: SCHEDULED | NOT_SCHEDULED | TRANSITIONING).

    The state of the schedule.

CrawlerTargets Structure

Specifies data stores to crawl.

Fields

  • S3Targets – An array of S3Target objects.

    Specifies Amazon S3 targets.

  • JdbcTargets – An array of JdbcTarget objects.

    Specifies JDBC targets.

  • DynamoDBTargets – An array of DynamoDBTarget objects.

    Specifies DynamoDB targets.

S3Target Structure

Specifies a data store in Amazon S3.

Fields

  • Path – UTF-8 string.

    The path to the Amazon S3 target.

  • Exclusions – An array of UTF-8 strings.

    A list of glob patterns used to exclude objects from the crawl. For more information, see Catalog Tables with a Crawler.

JdbcTarget Structure

Specifies a JDBC data store to crawl.

Fields

  • ConnectionName – UTF-8 string.

    The name of the connection to use to connect to the JDBC target.

  • Path – UTF-8 string.

    The path of the JDBC target.

  • Exclusions – An array of UTF-8 strings.

    A list of glob patterns used to exclude objects from the crawl. For more information, see Catalog Tables with a Crawler.

DynamoDBTarget Structure

Specifies a DynamoDB table to crawl.

Fields

  • Path – UTF-8 string.

    The name of the DynamoDB table to crawl.
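
Taken together, the target structures above combine into a single nested value when calling the API from Python. A minimal sketch follows; the bucket, connection, and table names are placeholders.

    # Hypothetical CrawlerTargets value combining all three target types.
    targets = {
        "S3Targets": [
            {
                "Path": "s3://example-bucket/sales/",
                "Exclusions": ["**.tmp", "archive/**"],
            }
        ],
        "JdbcTargets": [
            {
                "ConnectionName": "example-jdbc-connection",
                "Path": "exampledb/%",
            }
        ],
        "DynamoDBTargets": [
            {"Path": "ExampleDynamoDBTable"}
        ],
    }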

CrawlerMetrics Structure

Metrics for a specified crawler.

Fields

  • CrawlerName – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string pattern.

    The name of the crawler.

  • TimeLeftSeconds – Number (double).

    The estimated time left to complete a running crawl.

  • StillEstimating – Boolean.

    True if the crawler is still estimating how long it will take to complete this run.

  • LastRuntimeSeconds – Number (double).

    The duration of the crawler's most recent run, in seconds.

  • MedianRuntimeSeconds – Number (double).

    The median duration of this crawler's runs, in seconds.

  • TablesCreated – Number (integer).

    The number of tables created by this crawler.

  • TablesUpdated – Number (integer).

    The number of tables updated by this crawler.

  • TablesDeleted – Number (integer).

    The number of tables deleted by this crawler.

SchemaChangePolicy Structure

Crawler policy for update and deletion behavior.

Fields

  • UpdateBehavior – UTF-8 string (valid values: LOG | UPDATE_IN_DATABASE).

    The update behavior when the crawler finds a changed schema.

  • DeleteBehavior – UTF-8 string (valid values: LOG | DELETE_FROM_DATABASE | DEPRECATE_IN_DATABASE).

    The deletion behavior when the crawler finds a deleted object.
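
For illustration, a SchemaChangePolicy passed from Python might look like the following; the particular combination of values is an arbitrary example.

    # Update changed schemas in the Data Catalog, but only log (rather than
    # delete) tables whose source objects have disappeared.
    schema_change_policy = {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    }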

LastCrawlInfo Structure

Status and error information about the most recent crawl.

Fields

  • Status – UTF-8 string (valid values: SUCCEEDED | CANCELLED | FAILED).

    Status of the last crawl.

  • ErrorMessage – Description string, not more than 2048 bytes long, matching the URI address multi-line string pattern.

    If an error occurred, the error information about the last crawl.

  • LogGroup – UTF-8 string, not less than 1 or more than 512 bytes long, matching the Log group string pattern.

    The log group for the last crawl.

  • LogStream – UTF-8 string, not less than 1 or more than 512 bytes long, matching the Log-stream string pattern.

    The log stream for the last crawl.

  • MessagePrefix – UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string pattern.

    The prefix for a message about this crawl.

  • StartTime – Timestamp.

    The time at which the crawl started.

Operations

CreateCrawler Action (Python: create_crawler)

Creates a new crawler with specified targets, role, configuration, and optional schedule. At least one crawl target must be specified, in the S3Targets field, the JdbcTargets field, or the DynamoDBTargets field.

Request

  • Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string pattern.

    Name of the new crawler.

  • Role – Required: UTF-8 string.

    The IAM role (or ARN of an IAM role) used by the new crawler to access customer resources.

  • DatabaseName – Required: UTF-8 string.

    The AWS Glue database where results are written, such as: arn:aws:daylight:us-east-1::database/sometable/*.

  • Description – Description string, not more than 2048 bytes long, matching the URI address multi-line string pattern.

    A description of the new crawler.

  • Targets – Required: A CrawlerTargets object.

    A collection of targets to crawl.

  • Schedule – UTF-8 string.

    A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).

  • Classifiers – An array of UTF-8 strings.

    A list of custom classifiers that the user has registered. By default, all built-in classifiers are included in a crawl, but these custom classifiers always override the default classifiers for a given classification.

  • TablePrefix – UTF-8 string, not more than 128 bytes long.

    The table prefix used for catalog tables that are created.

  • SchemaChangePolicy – A SchemaChangePolicy object.

    Policy for the crawler's update and deletion behavior.

  • Configuration – UTF-8 string.

    Crawler configuration information. This versioned JSON string allows users to specify aspects of a crawler's behavior. For more information, see Configuring a Crawler.

  • CrawlerSecurityConfiguration – UTF-8 string, not more than 128 bytes long.

    The name of the SecurityConfiguration structure to be used by this crawler.

Response

  • No Response parameters.

Errors

  • InvalidInputException

  • AlreadyExistsException

  • OperationTimeoutException

  • ResourceNumberLimitExceededException
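
A minimal sketch of calling this action through the AWS SDK for Python (Boto3) follows; the crawler name, role, database, schedule, and S3 path are placeholder assumptions.

    import boto3

    glue = boto3.client("glue")

    # Create a crawler with a single Amazon S3 target and a daily schedule
    # (all names and paths here are hypothetical).
    glue.create_crawler(
        Name="example-sales-crawler",
        Role="AWSGlueServiceRole-Example",
        DatabaseName="example_sales_db",
        Description="Crawls the example sales data in Amazon S3.",
        Targets={"S3Targets": [{"Path": "s3://example-bucket/sales/"}]},
        Schedule="cron(15 12 * * ? *)",
        TablePrefix="sales_",
        SchemaChangePolicy={
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    )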

DeleteCrawler Action (Python: delete_crawler)

Removes a specified crawler from the Data Catalog, unless the crawler state is RUNNING.

Request

  • Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string pattern.

    Name of the crawler to remove.

Response

  • No Response parameters.

Errors

  • EntityNotFoundException

  • CrawlerRunningException

  • SchedulerTransitioningException

  • OperationTimeoutException
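
A short Boto3 sketch that handles the case where the crawler is still running; the crawler name is a placeholder.

    import boto3

    glue = boto3.client("glue")

    try:
        # Fails with CrawlerRunningException if the crawler state is RUNNING.
        glue.delete_crawler(Name="example-sales-crawler")
    except glue.exceptions.CrawlerRunningException:
        print("Crawler is still running; stop it before deleting.")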

GetCrawler Action (Python: get_crawler)

Retrieves metadata for a specified crawler.

Request

  • Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string pattern.

    Name of the crawler to retrieve metadata for.

Response

  • Crawler – A Crawler object.

    The metadata for the specified crawler.

Errors

  • EntityNotFoundException

  • OperationTimeoutException
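
For example, a Boto3 call can retrieve the crawler and inspect its state as sketched below; the crawler name is a placeholder.

    import boto3

    glue = boto3.client("glue")

    # The response wraps the Crawler structure described under Data Types.
    response = glue.get_crawler(Name="example-sales-crawler")
    crawler = response["Crawler"]
    print(crawler["Name"], crawler["State"])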

GetCrawlers Action (Python: get_crawlers)

Retrieves metadata for all crawlers defined in the customer account.

Request

  • MaxResults – Number (integer), not less than 1 or more than 1000.

    The number of crawlers to return on each call.

  • NextToken – UTF-8 string.

    A continuation token, if this is a continuation request.

Response

  • Crawlers – An array of Crawler objects.

    A list of crawler metadata.

  • NextToken – UTF-8 string.

    A continuation token, if the returned list has not reached the end of those defined in this customer account.

Errors

  • OperationTimeoutException
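
Because the response is paginated through NextToken, callers typically loop until the token is absent. A minimal Boto3 sketch:

    import boto3

    glue = boto3.client("glue")

    # Page through all crawlers in the account using NextToken.
    crawlers = []
    kwargs = {"MaxResults": 100}
    while True:
        page = glue.get_crawlers(**kwargs)
        crawlers.extend(page["Crawlers"])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token

    print(f"Found {len(crawlers)} crawlers")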

GetCrawlerMetrics Action (Python: get_crawler_metrics)

Retrieves metrics about specified crawlers.

Request

  • CrawlerNameList – An array of UTF-8 strings, not more than 100 strings.

    A list of the names of crawlers about which to retrieve metrics.

  • MaxResults – Number (integer), not less than 1 or more than 1000.

    The maximum size of a list to return.

  • NextToken – UTF-8 string.

    A continuation token, if this is a continuation call.

Response

  • CrawlerMetricsList – An array of CrawlerMetrics objects.

    A list of metrics for the specified crawlers.

  • NextToken – UTF-8 string.

    A continuation token, if the returned list does not contain the last metric available.

Errors

  • OperationTimeoutException
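
A brief Boto3 sketch, using a placeholder crawler name:

    import boto3

    glue = boto3.client("glue")

    # Retrieve metrics for one named crawler.
    response = glue.get_crawler_metrics(CrawlerNameList=["example-sales-crawler"])
    for metrics in response["CrawlerMetricsList"]:
        print(metrics["CrawlerName"],
              metrics.get("TablesCreated"),
              metrics.get("TablesUpdated"),
              metrics.get("TablesDeleted"))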

UpdateCrawler Action (Python: update_crawler)

Updates a crawler. If a crawler is running, you must stop it using StopCrawler before updating it.

Request

  • Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string pattern.

    Name of the crawler to update.

  • Role – UTF-8 string.

    The IAM role (or ARN of an IAM role) used by the crawler to access customer resources.

  • DatabaseName – UTF-8 string.

    The AWS Glue database where results are stored, such as: arn:aws:daylight:us-east-1::database/sometable/*.

  • Description – UTF-8 string, not more than 2048 bytes long, matching the URI address multi-line string pattern.

    A description of the crawler.

  • Targets – A CrawlerTargets object.

    A list of targets to crawl.

  • Schedule – UTF-8 string.

    A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).

  • Classifiers – An array of UTF-8 strings.

    A list of custom classifiers that the user has registered. By default, all built-in classifiers are included in a crawl, but these custom classifiers always override the default classifiers for a given classification.

  • TablePrefix – UTF-8 string, not more than 128 bytes long.

    The table prefix used for catalog tables that are created.

  • SchemaChangePolicy – A SchemaChangePolicy object.

    Policy for the crawler's update and deletion behavior.

  • Configuration – UTF-8 string.

    Crawler configuration information. This versioned JSON string allows users to specify aspects of a crawler's behavior. For more information, see Configuring a Crawler.

  • CrawlerSecurityConfiguration – UTF-8 string, not more than 128 bytes long.

    The name of the SecurityConfiguration structure to be used by this crawler.

Response

  • No Response parameters.

Errors

  • InvalidInputException

  • VersionMismatchException

  • EntityNotFoundException

  • CrawlerRunningException

  • OperationTimeoutException
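
Only the Name parameter is required in the request. A minimal Boto3 sketch that updates the schedule of a hypothetical crawler:

    import boto3

    glue = boto3.client("glue")

    # Change the crawler's schedule to run daily at 06:00 UTC.
    glue.update_crawler(
        Name="example-sales-crawler",
        Schedule="cron(0 6 * * ? *)",
    )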

StartCrawler Action (Python: start_crawler)

Starts a crawl using the specified crawler, regardless of what is scheduled. If the crawler is already running, returns a CrawlerRunningException.

Request

  • Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string pattern.

    Name of the crawler to start.

Response

  • No Response parameters.

Errors

  • EntityNotFoundException

  • CrawlerRunningException

  • OperationTimeoutException
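
A Boto3 sketch that starts a hypothetical crawler and tolerates the case where a crawl is already in progress:

    import boto3

    glue = boto3.client("glue")

    try:
        glue.start_crawler(Name="example-sales-crawler")
    except glue.exceptions.CrawlerRunningException:
        # A crawl is already in progress for this crawler.
        print("Crawler is already running.")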

StopCrawler Action (Python: stop_crawler)

If the specified crawler is running, stops the crawl.

Request

  • Name – Required: UTF-8 string, not less than 1 or more than 255 bytes long, matching the Single-line string pattern.

    Name of the crawler to stop.

Response

  • No Response parameters.

Errors

  • EntityNotFoundException

  • CrawlerNotRunningException

  • CrawlerStoppingException

  • OperationTimeoutException
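
A Boto3 sketch, again with a placeholder crawler name, that stops the crawler and handles the crawler-state errors listed above:

    import boto3

    glue = boto3.client("glue")

    try:
        glue.stop_crawler(Name="example-sales-crawler")
    except glue.exceptions.CrawlerNotRunningException:
        print("Crawler is not running.")
    except glue.exceptions.CrawlerStoppingException:
        print("Crawler is already stopping.")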