Using AWS Glue to Connect to Data Sources in Amazon S3 - Amazon Athena

Using AWS Glue to Connect to Data Sources in Amazon S3

Athena can connect to your data stored in Amazon S3 using the AWS Glue Data Catalog to store metadata such as table and column names. After the connection is made, your databases, tables, and views appear in Athena's query editor.

To define schema information for AWS Glue to use, you can set up an AWS Glue crawler to retrieve the information automatically, or you can manually add a table and enter the schema information.

Setting up a Crawler

You set up a crawler by starting in the Athena console and then using the AWS Glue console in an integrated way. When you create a crawler, you can choose data stores to crawl or point the crawler to existing catalog tables.

Note

The steps for setting up a crawler depend on the options available in the Athena console. If the Connect data source link in Option A is not available, use the procedure in Option B.

Option A

Option A: To set up a crawler in AWS Glue using the Connect data source link

  1. Open the Athena console at https://console.aws.amazon.com/athena/.

  2. Choose Connect data source. If the Connect data source link is not present, use Option B.

    
                            Choose Connect data source.
  3. On the Connect data source page, choose AWS Glue Data Catalog.

  4. Click Next.

  5. On the Connection details page, choose Set up crawler in AWS Glue to retrieve schema information automatically.

  6. Click Connect to AWS Glue.

  7. On the AWS Glue console Add crawler page, follow the steps to create a crawler.

    For more information, see Populating the AWS Glue Data Catalog.

Option B

Use the following procedure to set up a AWS Glue crawler if the Connect data source link in Option A is not available in the Athena console.

Option B: To set up a crawler in AWS Glue from the AWS Glue Data Catalog link

  1. Open the Athena console at https://console.aws.amazon.com/athena/.

  2. Choose AWS Glue Data Catalog.

    
                            Choose AWS Glue Data Catalog.
  3. On the AWS Glue console Tables page, choose Add tables using a crawler.

    
                            Choose Add tables using a
                                    crawler.
  4. On the AWS Glue console Add crawler page, follow the steps to create a crawler.

    For more information, see Populating the AWS Glue Data Catalog.

Note

Athena does not recognize exclude patterns that you specify for an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.

Adding a Schema Table Manually

The following procedure shows you how to use the Athena console to add a table manually.

To add a table and enter schema information manually

  1. Open the Athena console at https://console.aws.amazon.com/athena/.

  2. Choose Connect data source.

  3. On the Connect data source page, choose AWS Glue Data Catalog.

  4. Click Next.

  5. On the Connection details page, choose Add a table and enter schema information manually.

  6. Click Continue to add table.

  7. On the Add table page of the Athena console, for Database, choose an existing database or create a new one.

  8. Enter or choose a table name.

  9. For Location of Input Data Set, specify the path in Amazon S3 to the folder that contains the dataset that you want to process.

  10. Click Next.

  11. For Data Format, choose a data format (Apache Web Logs, CSV, TSV, Text File with Custom Delimiters, JSON, Parquet, or ORC).

    • For the Apache Web Logs option, you must also enter a regex expression in the Regex box.

    • For the Text File with Custom Delimiters option, specify a Field terminator (that is, a column delimiter). Optionally, you can specify a Collection terminator for array types or a Map key terminator.

  12. For Columns, specify a column name and the column data type.

    • To add more columns one at a time, choose Add a column.

    • To quickly add more columns, choose Bulk add columns. In the text box, enter a comma separated list of columns in the format column_name data_type, column_name data_type[, …], and then choose Add.

  13. Choose Next.

  14. (Optional) For Partitions, click Add a partition to add column names and data types.

  15. Choose Create table. The DDL for the table that you specified appears in the Query Editor. The following example shows the DDL generated for a two-column table in CSV format:

    CREATE EXTERNAL TABLE IF NOT EXISTS MyManualDB.MyManualTable ( `cola` string, `colb` string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = ',', 'field.delim' = ',' ) LOCATION 's3://bucket_name/' TBLPROPERTIES ('has_encrypted_data'='false');
  16. Choose Run query to create the table.