
Crawler API

Data Types

Crawler Structure

Specifies a crawler program that examines a data source and uses classifiers to try to determine its schema. If successful, the crawler records metadata concerning the data source in the AWS Glue Data Catalog.

Fields

  • Name – String, matching the Single-line string pattern.

    The Crawler name.

  • Role – String, matching the AWS ARN string pattern.

    The ARN of an IAM role used to access customer resources such as data in S3.

  • Targets – A CrawlerTargets object.

    A collection of targets to crawl.

  • DatabaseName – String.

    The Database where this Crawler's output should be stored.

  • Description – Description string, matching the URI address multi-line string pattern.

    A description of this Crawler and where it should be used.

  • Classifiers – An array of UTF-8 strings.

    A list of custom Classifiers associated with this Crawler.

  • SchemaChangePolicy – A SchemaChangePolicy object.

    Sets policy for the crawler's update and delete behavior.

  • State – String (valid values: READY | RUNNING | STOPPING).

    Indicates whether this Crawler is running, or whether a run is pending.

  • TablePrefix – String.

The table prefix used for catalog tables that are created.

  • Schedule – A Schedule object.

    A Schedule object that specifies the schedule on which this Crawler is to be run.

  • CrawlElapsedTime – Number (long).

    If this Crawler is running, contains the total time elapsed since the last crawl began.

  • CreationTime – Timestamp.

    The time when the Crawler was created.

  • LastUpdated – Timestamp.

    The time the Crawler was last updated.

  • LastCrawl – A LastCrawlInfo object.

    The status of the last crawl, and potentially error information if an error occurred.

  • Version – Number (long).

    The version of the Crawler.
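
For orientation, here is a sketch of how these fields might appear in a Crawler object returned by the API, written as a Python dictionary. All values are illustrative placeholders (the crawler name, role ARN, bucket, and database are invented), not defaults:

    crawler = {
        "Name": "example-crawler",
        "Role": "arn:aws:iam::123456789012:role/ExampleGlueRole",  # placeholder ARN
        "Targets": {"S3Targets": [{"Path": "s3://example-bucket/data/", "Exclusions": []}]},
        "DatabaseName": "example_db",
        "Classifiers": [],  # empty: rely on the built-in classifiers
        "SchemaChangePolicy": {"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
        "State": "READY",
        "Schedule": {"ScheduleExpression": "cron(15 12 * * ? *)", "State": "SCHEDULED"},
        "Version": 1,
    }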

Schedule Structure

A scheduling object using a cron statement to schedule an event.

Fields

  • ScheduleExpression – String.

    A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).

  • State – String (valid values: SCHEDULED | NOT_SCHEDULED | TRANSITIONING).

    The state of the schedule.

CrawlerTargets Structure

Specifies crawler targets. A combined example follows the JdbcTarget structure below.

Fields

  • S3Targets – An array of S3Target objects.

    Specifies targets in Amazon S3.

  • JdbcTargets – An array of JdbcTarget objects.

    Specifies JDBC targets.

S3Target Structure

Specifies a crawler target in Amazon S3.

Fields

  • Path – String.

    The path to the S3 target.

  • Exclusions – An array of UTF-8 strings.

    A list of glob patterns used to exclude objects from the crawl.

JdbcTarget Structure

Specifies a JDBC target for a crawl.

Fields

  • ConnectionName – String.

    The name of the connection to use for the JDBC target.

  • Path – String.

    The path of the JDBC target.

  • Exclusions – An array of UTF-8 strings.

    A list of items to exclude from the crawl.
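
A combined sketch of the CrawlerTargets, S3Target, and JdbcTarget structures as a Python dictionary, suitable for the Targets field of CreateCrawler. The bucket, connection, and path names are placeholders:

    targets = {
        "S3Targets": [
            {
                "Path": "s3://example-bucket/raw/",       # placeholder bucket and prefix
                "Exclusions": ["**.tmp", "**/_SUCCESS"],  # glob patterns to skip
            }
        ],
        "JdbcTargets": [
            {
                "ConnectionName": "example-jdbc-connection",  # an existing Glue Connection
                "Path": "exampledb/%",                        # database/schema path to crawl
                "Exclusions": [],
            }
        ],
    }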

CrawlerMetrics Structure

Metrics for a specified crawler.

Fields

  • CrawlerName – String, matching the Single-line string pattern.

    The name of the crawler.

  • TimeLeftSeconds – Number (double).

    The estimated time left to complete a running crawl.

  • StillEstimating – Boolean.

    True if the crawler is still estimating how long it will take to complete this run.

  • LastRuntimeSeconds – Number (double).

    The duration of the crawler's most recent run, in seconds.

  • MedianRuntimeSeconds – Number (double).

    The median duration of this crawler's runs, in seconds.

  • TablesCreated – Number (integer).

    The number of tables created by this crawler.

  • TablesUpdated – Number (integer).

    The number of tables updated by this crawler.

  • TablesDeleted – Number (integer).

    The number of tables deleted by this crawler.

SchemaChangePolicy Structure

Crawler policy for update and deletion behavior.

Fields

  • UpdateBehavior – String (valid values: LOG | UPDATE_IN_DATABASE).

    The update behavior.

  • DeleteBehavior – String (valid values: LOG | DELETE_FROM_DATABASE | DEPRECATE_IN_DATABASE).

    The deletion behavior.
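
A minimal Python sketch of a SchemaChangePolicy dictionary as it would be passed to CreateCrawler or UpdateCrawler; the particular choice of behaviors here is only an example:

    schema_change_policy = {
        "UpdateBehavior": "UPDATE_IN_DATABASE",    # apply detected schema changes to catalog tables
        "DeleteBehavior": "DEPRECATE_IN_DATABASE", # mark tables whose source vanished as deprecated
    }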

LastCrawlInfo Structure

Status and error information about the most recent crawl.

Fields

  • Status – String (valid values: SUCCEEDED | CANCELLED | FAILED).

    Status of the last crawl.

  • ErrorMessage – Description string, matching the URI address multi-line string pattern.

    Error information about the last crawl, if an error occurred.

  • LogGroup – String, matching the Log group string pattern.

    The log group for the last crawl.

  • LogStream – String, matching the Log-stream string pattern.

    The log stream for the last crawl.

  • MessagePrefix – String, matching the Single-line string pattern.

    The prefix for a message about this crawl.

  • StartTime – Timestamp.

    The time at which the crawl started.

Operations

CreateCrawler Action (Python: create_crawler)

Creates a new Crawler with specified targets, role, configuration, and optional schedule. At least one crawl target must be specified, in either the S3Targets or the JdbcTargets field.

Request

  • Name – String, matching the Single-line string pattern. Required.

    Name of the new Crawler.

  • Role – String, matching the AWS ARN string pattern. Required.

    The AWS ARN of the IAM role used by the new Crawler to access customer resources.

  • DatabaseName – String. Required.

    The AWS Glue Database where the crawler's results will be stored.

  • Description – Description string, matching the URI address multi-line string pattern.

    A description of the new Crawler.

  • Targets – A CrawlerTargets object. Required.

    A collection of targets to crawl.

  • Schedule – String.

    A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).

  • Classifiers – An array of UTF-8 strings.

    A list of custom Classifier names that the user has registered. By default, all AWS classifiers are included in a crawl, but these custom classifiers always override the default classifiers for a given classification.

  • TablePrefix – String.

    The table prefix used for catalog tables that are created.

  • SchemaChangePolicy – A SchemaChangePolicy object.

    Policy for the crawler's update and deletion behavior.

Response

  • No Response parameters.

Errors

  • InvalidInputException

  • AlreadyExistsException

  • OperationTimeoutException

  • ResourceNumberLimitExceededException
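
A minimal boto3 sketch of this call. The crawler name, role ARN, bucket, database, and prefix are placeholders, not values from this reference:

    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="example-crawler",
        Role="arn:aws:iam::123456789012:role/ExampleGlueRole",  # placeholder IAM role ARN
        DatabaseName="example_db",
        Description="Crawls sample data landed in S3",
        Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/"}]},
        Schedule="cron(15 12 * * ? *)",  # every day at 12:15 UTC
        TablePrefix="raw_",
        SchemaChangePolicy={
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    )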

DeleteCrawler Action (Python: delete_crawler)

Removes a specified Crawler from the metadata store, unless the Crawler state is RUNNING.

Request

  • Name – String, matching the Single-line string pattern. Required.

    Name of the Crawler to remove.

Response

  • No Response parameters.

Errors

  • EntityNotFoundException

  • CrawlerRunningException

  • SchedulerTransitioningException

  • OperationTimeoutException
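
A short boto3 sketch, reusing the glue client from the CreateCrawler example; the exception guard reflects the RUNNING-state constraint above:

    try:
        glue.delete_crawler(Name="example-crawler")
    except glue.exceptions.CrawlerRunningException:
        # A running crawler cannot be deleted: stop it, wait for it
        # to return to READY, then retry the delete.
        glue.stop_crawler(Name="example-crawler")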

GetCrawler Action (Python: get_crawler)

Retrieves metadata for a specified Crawler.

Request

  • Name – String, matching the Single-line string pattern. Required.

    Name of the Crawler to retrieve metadata for.

Response

  • Crawler – A Crawler object.

    The metadata for the specified Crawler.

Errors

  • EntityNotFoundException

  • OperationTimeoutException
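
A boto3 sketch, again reusing the glue client from above:

    response = glue.get_crawler(Name="example-crawler")
    crawler = response["Crawler"]
    print(crawler["State"], crawler.get("DatabaseName"))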

GetCrawlers Action (Python: get_crawlers)

Retrieves metadata for all Crawlers defined in the customer account.

Request

  • MaxResults – Number (integer).

    The number of Crawlers to return on each call.

  • NextToken – String.

    A continuation token, if this is a continuation request.

Response

  • Crawlers – An array of Crawler objects.

    A list of Crawler metadata.

  • NextToken – String.

    A continuation token, if the returned list has not reached the end of those defined in this customer account.

Errors

  • OperationTimeoutException
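
A boto3 pagination sketch showing how NextToken is threaded through repeated calls until the full list has been retrieved:

    crawlers = []
    next_token = None
    while True:
        kwargs = {"MaxResults": 50}
        if next_token:
            kwargs["NextToken"] = next_token
        response = glue.get_crawlers(**kwargs)
        crawlers.extend(response["Crawlers"])
        next_token = response.get("NextToken")
        if not next_token:
            break  # no more pages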

GetCrawlerMetrics Action (Python: get_crawler_metrics)

Retrieves metrics about specified crawlers.

Request

  • CrawlerNameList – An array of UTF-8 strings.

    A list of the names of crawlers about which to retrieve metrics.

  • MaxResults – Number (integer).

    The maximum size of a list to return.

  • NextToken – String.

    A continuation token, if this is a continuation call.

Response

  • CrawlerMetricsList – An array of CrawlerMetrics objects.

    A list of metrics for the specified crawlers.

  • NextToken – String.

    A continuation token, if the returned list does not contain the last metric available.

Errors

  • OperationTimeoutException
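
A boto3 sketch that fetches metrics for a single named crawler (the name is a placeholder); the keys match the CrawlerMetrics structure above:

    response = glue.get_crawler_metrics(CrawlerNameList=["example-crawler"])
    for metrics in response["CrawlerMetricsList"]:
        print(
            metrics["CrawlerName"],
            metrics["TablesCreated"],
            metrics["TablesUpdated"],
            metrics["TablesDeleted"],
        )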

UpdateCrawler Action (Python: update_crawler)

Updates a Crawler. If a Crawler is running, you must stop it using StopCrawler before updating it.

Request

  • Name – String, matching the Single-line string pattern. Required.

    Name of the Crawler to update.

  • Role – String, matching the AWS ARN string pattern.

    The AWS ARN of the IAM role used by the Crawler to access customer resources.

  • DatabaseName – String.

    The AWS Glue Database where the crawler's results will be stored.

  • Description – String, matching the URI address multi-line string pattern.

    A description of the Crawler.

  • Targets – A CrawlerTargets object.

    A collection of targets to crawl.

  • Schedule – String.

    A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).

  • Classifiers – An array of UTF-8 strings.

    A list of custom Classifier names that the user has registered. By default, all AWS classifiers are included in a crawl, but these custom classifiers always override the default classifiers for a given classification.

  • TablePrefix – String.

    The table prefix used for catalog tables that are created.

  • SchemaChangePolicy – A SchemaChangePolicy object.

    Policy for the crawler's update and deletion behavior.

Response

  • No Response parameters.

Errors

  • InvalidInputException

  • VersionMismatchException

  • EntityNotFoundException

  • CrawlerRunningException

  • OperationTimeoutException
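
A boto3 sketch that changes only the schedule of an existing crawler, leaving the other fields untouched:

    glue.update_crawler(
        Name="example-crawler",
        Schedule="cron(0 3 * * ? *)",  # move the run to 03:00 UTC daily
    )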

StartCrawler Action (Python: start_crawler)

Starts a crawl using the specified Crawler, regardless of what is scheduled. If the Crawler is already running, returns a CrawlerRunningException.

Request

  • Name – String, matching the Single-line string pattern. Required.

    Name of the Crawler to start.

Response

  • No Response parameters.

Errors

  • EntityNotFoundException

  • CrawlerRunningException

  • OperationTimeoutException
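
A boto3 sketch that starts a crawl and polls until the crawler returns to the READY state; the 30-second polling interval is an arbitrary choice:

    import time

    try:
        glue.start_crawler(Name="example-crawler")
    except glue.exceptions.CrawlerRunningException:
        pass  # a crawl is already in progress; just wait for it

    while glue.get_crawler(Name="example-crawler")["Crawler"]["State"] != "READY":
        time.sleep(30)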

StopCrawler Action (Python: stop_crawler)

If the specified Crawler is running, stops the crawl.

Request

  • Name – String, matching the Single-line string pattern. Required.

    Name of the Crawler to stop.

Response

  • No Response parameters.

Errors

  • EntityNotFoundException

  • CrawlerNotRunningException

  • CrawlerStoppingException

  • OperationTimeoutException
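
A boto3 sketch mirroring the error cases above:

    try:
        glue.stop_crawler(Name="example-crawler")
    except glue.exceptions.CrawlerNotRunningException:
        pass  # nothing to stop
    except glue.exceptions.CrawlerStoppingException:
        pass  # a stop is already in progress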