Working with crawlers on the AWS Glue console

A crawler accesses your data store, extracts metadata, and creates table definitions in the AWS Glue Data Catalog. The Crawlers pane in the AWS Glue console lists all the crawlers that you create. The list displays status and metrics from the last run of your crawler.
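You can retrieve the same list, status, and last-run metrics programmatically. The following is a minimal sketch using the AWS SDK for Python (Boto3); it assumes only that your credentials can call the AWS Glue API, and the metric fields printed are a subset of what the console shows.

    import boto3

    glue = boto3.client("glue")

    # Page through every crawler in the current account and Region.
    crawlers = []
    paginator = glue.get_paginator("get_crawlers")
    for page in paginator.paginate():
        crawlers.extend(page["Crawlers"])

    names = [crawler["Name"] for crawler in crawlers]
    print("Crawlers:", names)

    # Last-run metrics: tables created, updated, and deleted per crawler.
    if names:
        metrics = glue.get_crawler_metrics(CrawlerNameList=names)
        for m in metrics["CrawlerMetricsList"]:
            print(m["CrawlerName"], m.get("TablesCreated"), m.get("TablesUpdated"), m.get("TablesDeleted"))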

Note

If you choose to bring in your own JDBC driver versions, AWS Glue crawlers consume resources in AWS Glue jobs and Amazon S3 buckets to ensure that your provided drivers are run in your environment. The additional resource usage is reflected in your account. Additionally, providing your own JDBC driver does not mean that the crawler can use all of the driver's features. Drivers are limited to the properties described in Adding an AWS Glue connection.

To add a crawler using the console
  1. Sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/. Choose Crawlers in the navigation pane.

  2. Choose Create crawler, and follow the instructions in the Add crawler wizard. The wizard will guide you through the following steps.

    1. Set crawler properties. Enter a name for your crawler and, optionally, a description.

      Optionally, you can tag your crawler with a Tag key and optional Tag value. Once created, tag keys are read-only. Use tags on some resources to help you organize and identify them. For more information, see AWS tags in AWS Glue.

    2. Choose data sources and classifiers. In Data source configuration, choose 'Not yet' or 'Yes' to answer the question 'Is your data mapped to AWS Glue tables?' By default, 'Not yet' is selected.

      If your data is already mapped to AWS Glue tables, choose Add a data source. For more information, see Adding an AWS Glue connection.

      In the Add data source window, choose your data source and choose the appropriate options for your data source.

      (Optional) If you choose JDBC as the data source, you can use your own JDBC drivers by specifying the Connection access where the driver information is stored.

    3. Configure security settings. Choose an existing IAM role or create a new IAM role.

      Note

      In order to add your own JDBC driver, you must grant the following additional permissions. For a programmatic sketch of attaching them, see the example after these wizard steps.

      • Grant permissions for the following job actions: CreateJob, DeleteJob, GetJob, GetJobRun, StartJobRun.

      • Grant permissions for the following Amazon S3 actions: s3:DeleteObject, s3:GetObject, s3:ListBucket, s3:PutObject.

        Note

        The s3:ListBucket permission is not needed if the Amazon S3 bucket policy is disabled.

      • Grant service principal access to bucket/folder in the Amazon S3 policy.

      Example Amazon S3 policy:

      { "Version": "2012-10-17", "Statement": [ { "Sid": "VisualEditor0", "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:DeleteObject" ], "Resource": [ "arn:aws:s3:::bucket-name/driver-parent-folder/driver.jar", "arn:aws:s3:::bucket-name" ] } ] }

      AWS Glue creates the following folders (_crawler and _glue_job_crawler) at the same level as the JDBC driver in your Amazon S3 bucket. For example, if the driver path is <s3-path/driver_folder/driver.jar>, then the following folders are created if they do not already exist:

      • <s3-path/driver_folder/_crawler>

      • <s3-path/driver_folder/_glue_job_crawler>

      Optionally, you can add a security configuration to a crawler to specify at-rest encryption options.

    4. Set output and scheduling. You can choose the target database, add a prefix to table names, and optionally set a maximum table threshold.

      When selecting a crawler schedule, choose the frequency.

    5. Review and create. Choose Edit to make changes to any of the steps in the wizard. When done, choose Create crawler.
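As referenced in the security settings step above, the following sketch grants the extra JDBC-driver permissions with the AWS SDK for Python (Boto3). The role name, policy name, bucket, and driver path are placeholders for illustration, not values the console requires.

    import json

    import boto3

    iam = boto3.client("iam")

    # Placeholder names; substitute your crawler role, bucket, and driver path.
    ROLE_NAME = "MyGlueCrawlerRole"
    BUCKET = "bucket-name"
    DRIVER_KEY = "driver-parent-folder/driver.jar"

    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                # AWS Glue job actions the crawler uses to run your driver.
                "Effect": "Allow",
                "Action": [
                    "glue:CreateJob",
                    "glue:DeleteJob",
                    "glue:GetJob",
                    "glue:GetJobRun",
                    "glue:StartJobRun",
                ],
                # Scope this down to specific job ARNs in production.
                "Resource": "*",
            },
            {
                # Amazon S3 access to the driver and the folders AWS Glue creates.
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:DeleteObject",
                ],
                "Resource": [
                    f"arn:aws:s3:::{BUCKET}/{DRIVER_KEY}",
                    f"arn:aws:s3:::{BUCKET}",
                ],
            },
        ],
    }

    # Attach the statements as an inline policy on the crawler's role.
    iam.put_role_policy(
        RoleName=ROLE_NAME,
        PolicyName="GlueCrawlerCustomJdbcDriver",
        PolicyDocument=json.dumps(policy),
    )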

When you crawl DynamoDB tables, you can choose one table name from the list of DynamoDB tables in your account.
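The wizard steps above also have an API equivalent. The following is a minimal Boto3 sketch under assumed placeholder names (crawler, role, database, bucket, and DynamoDB table); it sets the same properties the console wizard collects, including an optional schedule and tags.

    import boto3

    glue = boto3.client("glue")

    # All names below are placeholders for illustration.
    glue.create_crawler(
        Name="my-crawler",
        Description="Crawls the example sales data",
        Role="MyGlueCrawlerRole",      # IAM role from the security settings step
        DatabaseName="my_database",    # target database for table definitions
        TablePrefix="sales_",          # prefix added to created table names
        Targets={
            "S3Targets": [{"Path": "s3://amzn-s3-demo-bucket/sales/"}],
            "DynamoDBTargets": [{"Path": "my-dynamodb-table"}],
        },
        Schedule="cron(0 12 * * ? *)",  # daily at 12:00 UTC; omit to run on demand
        Tags={"team": "analytics"},     # tag keys are read-only once created
    )

    # Run the crawler immediately instead of waiting for its schedule.
    glue.start_crawler(Name="my-crawler")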

Tip

For more information about configuring crawlers, see Crawler properties.

Viewing crawler results and details

After the crawler runs successfully, it creates table definitions in the Data Catalog. Choose Tables in the navigation pane to see the tables that were created by your crawler in the database that you specified.
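A quick way to confirm the same result from code is to list the table definitions in the target database. The sketch below assumes the placeholder database name used earlier.

    import boto3

    glue = boto3.client("glue")

    # List the table definitions the crawler wrote to the target database.
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName="my_database"):
        for table in page["TableList"]:
            print(table["Name"], table.get("UpdateTime"))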

You can view information related to the crawler itself as follows:

  • The Crawlers page on the AWS Glue console displays the following properties for a crawler:

    Name: When you create a crawler, you must give it a unique name.

    Status: A crawler can be ready, starting, stopping, scheduled, or schedule paused. A running crawler progresses from starting to stopping. You can resume or pause a schedule attached to a crawler.

    Schedule: You can choose to run your crawler on demand, or choose a frequency with a schedule. For more information about scheduling a crawler, see Scheduling a crawler.

    Last run: The date and time of the crawler's most recent run.

    Log: Links to any available logs from the last run of the crawler.

    Table changes from last run: The number of tables in the AWS Glue Data Catalog that were updated by the latest run of the crawler.

  • To view the history of a crawler, choose Crawlers in the navigation pane to see the crawlers you created. Choose a crawler from the list, then view its properties and its run history on the Crawler runs tab.

    The Crawler runs tab displays information about each time the crawler ran, including Start time (UTC), End time (UTC), Duration, Status, DPU hours, and Table changes.

    The Crawler runs tab displays only crawls that have occurred since the launch of the crawler history feature, and retains crawls for up to 12 months; older crawls are not returned. For a programmatic sketch of retrieving this history, see the example after this list.

  • To see additional information, choose a tab on the crawler details page. Each tab displays information related to the crawler.

    • Schedule: Any schedules created for the crawler will be visible here.

    • Data sources: All data sources scanned by the crawler will be visible here.

    • Classifiers: All classifiers assigned to the crawler will be visible here.

    • Tags: Any tags created and assigned to an AWS resource will be visible here.
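The crawler details and run history shown above are also available from the API. The following Boto3 sketch assumes the placeholder crawler name used earlier; ListCrawls returns at most the same 12 months of history that the console displays.

    import boto3

    glue = boto3.client("glue")

    # Current properties: state, schedule, last-crawl status, and so on.
    crawler = glue.get_crawler(Name="my-crawler")["Crawler"]
    print(crawler["State"], crawler.get("Schedule"), crawler.get("LastCrawl"))

    # Run history: status, timing, DPU hours, and a table-change summary.
    runs = glue.list_crawls(CrawlerName="my-crawler")
    for crawl in runs["Crawls"]:
        print(crawl["CrawlId"], crawl["State"], crawl.get("DPUHour"), crawl.get("Summary"))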