Using AWS Glue to Connect to Data Sources in Amazon S3 - Amazon Athena

Athena connects to your data stored in Amazon S3 through the AWS Glue Data Catalog, which stores metadata such as table and column names. After the connection is made, your databases, tables, and views appear in Athena's query editor.

To define schema information for AWS Glue to use, you can create an AWS Glue crawler to retrieve the information automatically, or you can manually add a table and enter the schema information.

Creating an AWS Glue Crawler

You can start creating a crawler in the Athena console, which then opens the AWS Glue console where you complete the setup. When you create the crawler, you specify a data location in Amazon S3 to crawl.

To create a crawler in AWS Glue starting from the Athena console

  1. Open the Athena console at https://console.aws.amazon.com/athena/.

  2. In the query editor, next to Tables and views, choose Create, and then choose AWS Glue crawler.

  3. On the AWS Glue console Add crawler page, follow the steps to create a crawler. For more information, see Using AWS Glue Crawlers in this guide and Populating the AWS Glue Data Catalog in the AWS Glue Developer Guide.
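If you prefer to script the crawler instead of using the console, a roughly equivalent setup can be sketched with the AWS SDK for Python (boto3). This is only a minimal sketch and not part of the console procedure: the crawler name, IAM role ARN, database name, and S3 path are hypothetical placeholders, and the call to AWS assumes that boto3 is installed and credentials are configured.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Assemble keyword arguments for the AWS Glue CreateCrawler API."""
    if not s3_path.startswith("s3://"):
        raise ValueError("s3_path must start with s3://")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        # The Amazon S3 location that the crawler scans.
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler and start one run (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values for illustration only.
request = build_crawler_request(
    name="example-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="example_db",
    s3_path="s3://amzn-s3-demo-bucket/data/",
)
# create_and_start_crawler(request)  # uncomment to call AWS
```

The request is built separately from the AWS call so that you can inspect or log it before anything is created in your account.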

Note

Athena does not recognize exclude patterns that you specify for an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.
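One way to act on this note is to plan which objects should move to a separate prefix before you create the table. The following sketch only computes the old and new object keys; it does not touch Amazon S3. The function name, the key list, and the prefixes are hypothetical and chosen for illustration.

```python
def plan_relocation(keys, excluded_suffix, source_prefix, excluded_prefix):
    """Map each key that has the excluded suffix to a key under excluded_prefix."""
    moves = {}
    for key in keys:
        if key.startswith(source_prefix) and key.lower().endswith(excluded_suffix):
            # Keep the path below the source prefix so the layout is preserved.
            relative = key[len(source_prefix):]
            moves[key] = excluded_prefix + relative
    return moves


# Hypothetical object keys in a single bucket.
keys = ["logs/a.csv", "logs/b.json", "logs/2024/c.json"]
moves = plan_relocation(keys, ".json", "logs/", "logs-json/")

for old_key, new_key in moves.items():
    print(f"{old_key} -> {new_key}")
```

Carrying out the plan would then be a copy followed by a delete for each pair, for example with the boto3 S3 client's `copy_object` and `delete_object` calls. After the move, a table whose location is the `logs/` prefix no longer includes the .json files.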

Adding a Table Using a Form

The following procedure shows you how to use the Athena console to add a table using the Create Table From S3 bucket data form.

To add a table and enter schema information using a form

  1. Open the Athena console at https://console.aws.amazon.com/athena/.

  2. In the query editor, next to Tables and views, choose Create, and then choose S3 bucket data.

  3. On the Create Table From S3 bucket data form, for Table name, enter a name for the table.

  4. For Database configuration, choose an existing database, or create a new one.

  5. For Location of Input Data Set, specify the path in Amazon S3 to the folder that contains the dataset that you want to process.

  6. For Data Format, choose a data format (Apache Web Logs, CSV, TSV, Text File with Custom Delimiters, JSON, Parquet, or ORC).

    • For the Apache Web Logs option, you must also enter a regular expression in the Regex box.

    • For the Text File with Custom Delimiters option, specify a Field terminator (that is, a column delimiter). Optionally, you can specify a Collection terminator for array types or a Map key terminator.

  7. For Column details, specify a column name and the column data type.

    • To add more columns one at a time, choose Add a column.

    • To quickly add more columns, choose Bulk add columns. In the text box, enter a comma-separated list of columns in the format column_name data_type, column_name data_type[, …], and then choose Add.

  8. (Optional) For Partition details, add one or more column names and data types.

  9. The Preview table query box shows the CREATE TABLE statement generated from the information that you entered in the form. The preview statement cannot be edited directly. To change the statement, modify the fields in the form, or write the statement directly in the query editor instead of using the form.

  10. Choose Create table to run the generated statement in the query editor and create the table.
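To make the relationship between the form fields and the generated statement more concrete, the following sketch parses a bulk column list in the column_name data_type format and assembles a CREATE EXTERNAL TABLE statement for a delimited text file. It is an approximation under stated assumptions, not the console's actual generator: the function names, database, table, and S3 location are hypothetical, and the statement that the console produces may differ in its details.

```python
def parse_bulk_columns(text):
    """Parse 'name type, name type' into a list of (name, type) pairs.

    Assumption: types contain no commas. Types such as decimal(10,2) or
    map<string,int> would need a smarter parser than this naive split.
    """
    columns = []
    for entry in text.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, _, data_type = entry.partition(" ")
        if not data_type.strip():
            raise ValueError(f"missing data type in {entry!r}")
        columns.append((name, data_type.strip()))
    return columns


def build_create_table(database, table, columns, location,
                       field_terminator=",", partitions=()):
    """Build a CREATE EXTERNAL TABLE statement for delimited text data."""
    def render(cols):
        return ",\n".join(f"  `{name}` {dtype}" for name, dtype in cols)

    lines = [
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (",
        render(columns),
        ")",
    ]
    if partitions:
        lines += ["PARTITIONED BY (", render(partitions), ")"]
    lines += [
        "ROW FORMAT DELIMITED",
        # The terminator is inserted verbatim; escape special characters yourself.
        f"  FIELDS TERMINATED BY '{field_terminator}'",
        f"LOCATION '{location}'",
    ]
    return "\n".join(lines)


# Hypothetical form input for illustration only.
columns = parse_bulk_columns("id int, name string")
ddl = build_create_table(
    "example_db", "events", columns,
    "s3://amzn-s3-demo-bucket/events/",
    partitions=[("dt", "string")],
)
print(ddl)
```

A statement built this way can be pasted into the query editor and adjusted there, which is the alternative that step 9 mentions for cases where the form's preview cannot be edited.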