Working with Crawlers on the AWS Glue Console

A crawler accesses your data store, extracts metadata, and creates table definitions in the AWS Glue Data Catalog. The Crawlers pane in the AWS Glue console lists all the crawlers that you create. The list displays status and metrics from the last run of your crawler.

To add a crawler using the console

  1. Sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/. Choose Crawlers in the navigation pane.

  2. Choose Add crawler, and follow the instructions in the Add crawler wizard.

    Note

    To get step-by-step guidance for adding a crawler, choose Add crawler under Tutorials in the navigation pane. You can also use the Add crawler wizard to create or modify an IAM role with an attached policy that grants permissions for your Amazon Simple Storage Service (Amazon S3) data stores.

    Optionally, you can tag your crawler with a Tag key and optional Tag value. After a crawler is created, its tag keys are read-only. Use tags to help you organize and identify your resources. For more information, see AWS Tags in AWS Glue.

    Optionally, you can add a security configuration to a crawler to specify at-rest encryption options.

When a crawler runs, the provided IAM role must have permission to access the data store that is crawled.
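
If you prefer to create crawlers with code rather than the console wizard, the AWS SDKs expose the same options. The following is a minimal sketch using boto3 (the AWS SDK for Python); the crawler, role, database, bucket, and security configuration names are placeholders, and the tags and security configuration are optional, as in the wizard.

    # Minimal sketch: create and run a crawler for an Amazon S3 data store.
    # All names below are placeholders.
    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="my-s3-crawler",
        Role="AWSGlueServiceRole-MyCrawler",  # must have permission to access the data store
        DatabaseName="my_database",           # Data Catalog database that receives the tables
        Targets={"S3Targets": [{"Path": "s3://my-bucket/my-prefix/"}]},
        Tags={"team": "analytics"},           # tag keys are read-only once created
        CrawlerSecurityConfiguration="my-security-configuration",  # at-rest encryption options
    )

    glue.start_crawler(Name="my-s3-crawler")  # run the crawler on demand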

When you crawl a JDBC data store, a connection is required. For more information, see Adding an AWS Glue Connection. An exclude path is relative to the include path. For example, to exclude a table in your JDBC data store, type the table name in the exclude path.
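
In the API, the connection and any exclude patterns are attached to the JDBC target itself. The following is a hedged boto3 sketch; the connection name, include path, and excluded table are placeholders.

    # Sketch: a crawler for a JDBC data store with an exclude pattern.
    # Connection, path, and table names are placeholders.
    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="my-jdbc-crawler",
        Role="AWSGlueServiceRole-MyCrawler",
        DatabaseName="my_database",
        Targets={
            "JdbcTargets": [
                {
                    "ConnectionName": "my-jdbc-connection",  # required for JDBC data stores
                    "Path": "mydb/%",                        # include path
                    "Exclusions": ["mydb/scratch_table"],    # relative to the include path
                }
            ]
        },
    )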

When you crawl DynamoDB tables, you can choose one table name from the list of DynamoDB tables in your account.
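
Programmatically, the DynamoDB table name goes in the target's Path field. A brief sketch, again with placeholder names:

    # Sketch: a crawler for a single DynamoDB table.
    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="my-dynamodb-crawler",
        Role="AWSGlueServiceRole-MyCrawler",
        DatabaseName="my_database",
        Targets={"DynamoDBTargets": [{"Path": "MyDynamoDBTable"}]},  # one table name per target
    )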

Tip

For more information about configuring crawlers, see Crawler Properties.

Viewing Crawler Results and Details

After the crawler runs successfully, it creates table definitions in the Data Catalog. Choose Tables in the navigation pane to see the tables that were created by your crawler in the database that you specified.
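
You can also list the resulting tables from code. A short boto3 sketch, assuming the database name you specified when creating the crawler:

    # Sketch: list the tables in the database that the crawler populated.
    import boto3

    glue = boto3.client("glue")

    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName="my_database"):
        for table in page["TableList"]:
            print(table["Name"])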

You can view information related to the crawler itself as follows:

  • The Crawlers page on the AWS Glue console displays the following properties for a crawler. (The sketch after this list shows one way to retrieve the same status and metrics programmatically.)

    Name
    When you create a crawler, you must give it a unique name.

    Schedule
    You can run your crawler on demand, or choose a frequency with a schedule. For more information about scheduling a crawler, see Scheduling a Crawler.

    Status
    A crawler can be ready, starting, stopping, scheduled, or schedule paused. A running crawler progresses from starting to stopping. You can resume or pause a schedule attached to a crawler.

    Logs
    Links to any available logs from the last run of the crawler.

    Last runtime
    The amount of time it took the crawler to run when it last ran.

    Median runtime
    The median amount of time it took the crawler to run since it was created.

    Tables updated
    The number of tables in the AWS Glue Data Catalog that were updated by the latest run of the crawler.

    Tables added
    The number of tables added to the AWS Glue Data Catalog by the latest run of the crawler.

  • To view the actions and log messages for a crawler, choose Crawlers in the navigation pane to see the crawlers you created. Find the crawler name in the list and choose the Logs link. This link takes you to CloudWatch Logs, where you can see details about which tables were created in the AWS Glue Data Catalog and any errors that were encountered.

    You can manage your log retention period in the CloudWatch console. The default log retention is Never Expire. For more information about how to change the retention period, see Change Log Data Retention in CloudWatch Logs. The sketch after this list also shows one way to set a retention period programmatically.

    For more information about viewing the log information, see Automated Monitoring Tools in this guide and Querying AWS CloudTrail Logs in the Amazon Athena User Guide. Also, see the blog Easily query AWS service logs using Amazon Athena for information about how to use the Athena Glue Service Logs (AGSlogger) Python library in conjunction with AWS Glue ETL jobs to enable a common framework for processing log data.

  • To see detailed information for a crawler, choose the crawler name in the list. Crawler details include the information you defined when you created the crawler with the Add crawler wizard.
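
The status and run metrics shown on the Crawlers page, and the log retention period mentioned above, can also be read and managed from code. The following is a hedged boto3 sketch; the crawler name is a placeholder, and it assumes /aws-glue/crawlers as the log group that crawlers write to.

    # Sketch: read a crawler's status and run metrics, and set log retention.
    import boto3

    glue = boto3.client("glue")

    # Current state (for example READY, RUNNING, or STOPPING) and last-run status.
    crawler = glue.get_crawler(Name="my-s3-crawler")["Crawler"]
    print(crawler["State"], crawler.get("LastCrawl", {}).get("Status"))

    # Last runtime, median runtime, and tables added/updated by the latest run.
    metrics = glue.get_crawler_metrics(CrawlerNameList=["my-s3-crawler"])
    for m in metrics["CrawlerMetricsList"]:
        print(m["LastRuntimeSeconds"], m["MedianRuntimeSeconds"],
              m["TablesCreated"], m["TablesUpdated"])

    # The default retention for crawler logs is Never Expire; this sets 30 days.
    logs = boto3.client("logs")
    logs.put_retention_policy(logGroupName="/aws-glue/crawlers", retentionInDays=30)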