
Best Practices When Using Athena with AWS Glue

When using Athena with the AWS Glue Data Catalog, you can use AWS Glue to create databases and tables (schema) to be queried in Athena, or you can use Athena to create schema and then use them in AWS Glue and related services. This topic provides considerations and best practices when using either method.

Under the hood, Athena uses Presto to execute DML statements and Hive to execute the DDL statements that create and modify schema. With these technologies, there are a couple of conventions to follow so that Athena and AWS Glue work well together.


Database, Table, and Column Names

When you create schema in AWS Glue to query in Athena, consider the following:

  • A database name cannot be longer than 252 characters.
  • A table name cannot be longer than 255 characters.
  • A column name cannot be longer than 128 characters.
  • The only acceptable characters for database names, table names, and column names are lowercase letters, numbers, and the underscore character.

You can use the AWS Glue Catalog Manager to rename columns, but at this time table names and database names cannot be changed using the AWS Glue console. To correct database names, you need to create a new database and copy tables to it (in other words, copy the metadata to a new entity). You can follow a similar process for tables. You can use the AWS Glue SDK or AWS CLI to do this.
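
For example, the following is a minimal boto3 sketch of that copy. The database names are placeholders, and the sketch copies table definitions only; partition metadata would need a similar pass with get_partitions and batch_create_partition.

import boto3

glue = boto3.client("glue")

OLD_DB = "old_database_name"   # placeholder
NEW_DB = "new_database_name"   # placeholder

glue.create_database(DatabaseInput={"Name": NEW_DB})

paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName=OLD_DB):
    for table in page["TableList"]:
        # CreateTable accepts only the writable TableInput fields;
        # drop read-only fields such as DatabaseName and CreateTime.
        table_input = {
            k: v for k, v in table.items()
            if k in ("Name", "Description", "Owner", "Retention",
                     "StorageDescriptor", "PartitionKeys",
                     "TableType", "Parameters")
        }
        glue.create_table(DatabaseName=NEW_DB, TableInput=table_input)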

Using AWS Glue Crawlers

AWS Glue crawlers help discover and register the schema for datasets in the AWS Glue Data Catalog. The crawlers go through your data and inspect portions of it to determine the schema. In addition, the crawler can detect and register partitions. For more information, see Cataloging Data with a Crawler in the AWS Glue Developer Guide.

Scheduling a Crawler to Keep the AWS Glue Data Catalog and Amazon S3 in Sync

AWS Glue crawlers can be set up to run on a schedule or on demand. For more information, see Time-Based Schedules for Jobs and Crawlers in the AWS Glue Developer Guide.

If you have data that arrives for a partitioned table at a fixed time, you can set up an AWS Glue crawler to run on a schedule to detect and update table partitions. This can eliminate the need to run a potentially long and expensive MSCK REPAIR TABLE command or to manually run ALTER TABLE ADD PARTITION commands. For more information, see Table Partitions in the AWS Glue Developer Guide.

_images/glue_crawler.png
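
For example, the following is a minimal boto3 sketch that creates a crawler on a daily schedule. The crawler name, IAM role, database, and path are placeholders.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="daily_partition_crawler",                  # placeholder
    Role="AWSGlueServiceRole-example",               # placeholder IAM role
    DatabaseName="mydatabase",                       # placeholder
    Targets={"S3Targets": [{"Path": "s3://bucket1/folder1/table1/"}]},
    Schedule="cron(0 2 * * ? *)",                    # run daily at 02:00 UTC
)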

Using Multiple Data Sources with Crawlers

When an AWS Glue crawler scans Amazon S3 and detects multiple directories, it uses a heuristic to determine where the root for a table is in the directory structure, and which directories are partitions for the table. In some cases, where the schema detected in two or more directories is similar, the crawler may treat them as partitions instead of separate tables. One way to help the crawler discover individual tables is to add each table's root directory as a data store for the crawler.

The following partitions in Amazon S3 are an example:

s3://bucket1/folder1/table1/partition1/file.txt
s3://bucket1/folder1/table1/partition2/file.txt
s3://bucket1/folder1/table1/partition3/file.txt
s3://bucket1/folder1/table2/partition4/file.txt
s3://bucket1/folder1/table2/partition5/file.txt

If the schemas for table1 and table2 are similar, and a single data source is set to s3://bucket1/folder1/ in AWS Glue, the crawler may create a single table with two partition columns: one partition column that contains table1 and table2, and a second partition column that contains partition1 through partition5.

To have the AWS Glue crawler create two separate tables as intended, use the AWS Glue console to set the crawler to have two data sources, s3://bucket1/folder1/table1/ and s3://bucket1/folder1/table2/, as shown in the following procedure. (An equivalent API call is sketched after the procedure.)

  1. In the AWS Glue console, choose Crawlers, select your crawler, and then choose Action, Edit crawler.

    _images/glue_add_data_store0.png
  2. Under Add information about your crawler, choose additional settings as appropriate, and then choose Next.

  3. Under Add a data store, change Include path to the table-level directory. For instance, given the example above, you would change it from s3://bucket1/folder1 to s3://bucket1/folder1/table1/. Choose Next.

    _images/glue_add_data_store1.png
  4. For Add another data store, choose Yes, Next.

  5. For Include path, enter your other table-level directory (for example, s3://bucket1/folder1/table2/) and choose Next.

  6. Repeat steps 3-5 for any additional table-level directories, and then finish the crawler configuration.

The new values for Include path appear under the crawler's data stores.

_images/glue_add_data_store2.png
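
The same change can be scripted. The following is a minimal boto3 sketch that points an existing crawler (the name is a placeholder) at the two table-level directories from the example:

import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="my_crawler",  # placeholder crawler name
    Targets={"S3Targets": [
        {"Path": "s3://bucket1/folder1/table1/"},
        {"Path": "s3://bucket1/folder1/table2/"},
    ]},
)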

Syncing Partition Schema to Avoid "HIVE_PARTITION_SCHEMA_MISMATCH"

For each table within the AWS Glue Data Catalog that has partition columns, the schema is stored at the table level and for each individual partition within the table. The schema for each partition is populated by an AWS Glue crawler based on the sample of data that it reads within that partition. For more information, see Using Multiple Data Sources with Crawlers.

When Athena runs a query, it validates the schema of the table and the schema of any partitions necessary for the query. The validation compares the column data types in order and makes sure that they match for the columns that overlap. This prevents unexpected operations such as adding or removing columns from the middle of a table. If Athena detects that the schema of a partition differs from the schema of the table, Athena may not be able to process the query, and it fails with the error HIVE_PARTITION_SCHEMA_MISMATCH.

There are a few ways to fix this issue. First, if the data was accidentally added, you can remove the data files that cause the difference in schema, drop the partition, and re-crawl the data. Second, you can drop the individual partition and then run MSCK REPAIR TABLE within Athena to re-create the partition using the table's schema. This second option works only if you are confident that the schema applied will continue to read the data correctly.
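
For the second option, the following is a minimal boto3 sketch that drops a partition and re-creates it from the table's schema by submitting DDL to Athena. The database, table, partition key, and output location are placeholders, and start_query_execution is asynchronous, so a real script would poll get_query_execution for completion.

import boto3

athena = boto3.client("athena")

def run_ddl(query):
    return athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "mydatabase"},       # placeholder
        ResultConfiguration={
            "OutputLocation": "s3://MYBUCKET/athena-results/"   # placeholder
        },
    )

run_ddl("ALTER TABLE mytable DROP PARTITION (dt='2018-01-01')")  # placeholders
run_ddl("MSCK REPAIR TABLE mytable")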

Updating Table Metadata

After a crawl, the AWS Glue crawler automatically assigns certain table metadata to help make the table compatible with other external technologies such as Apache Hive, Presto, and Spark. Occasionally, the crawler may assign metadata properties incorrectly. If this happens, manually correct the properties in AWS Glue before querying the table with Athena. For more information, see Viewing and Editing Table Details in the AWS Glue Developer Guide.

AWS Glue may mis-assign metadata when a CSV file has quotes around each data field, getting the serializationLib property wrong. For more information, see CSV Data Enclosed in Quotes.

Working with CSV Files

CSV files occasionally have quotes around the data values intended for each column, and they may include header rows that aren't part of the data to be analyzed. When you use AWS Glue to create schema from these files, follow the guidance in this section.

CSV Data Enclosed in Quotes

If you run a query in Athena against a table created from a CSV file with quoted data values, update the table definition in AWS Glue so that it specifies the right SerDe and SerDe properties. This allows the table definition to use the OpenCSVSerDe. For more information about the OpenCSV SerDe, see CSV (OpenCSVSerDe).

In this case, you need to change the serializationLib property in the SerDeInfo field of the table to org.apache.hadoop.hive.serde2.OpenCSVSerde and enter appropriate values for separatorChar, quoteChar, and escapeChar.

For example, for a CSV file with records such as the following:

"John","Doe","123-555-1231","John said \"hello\""
"Jane","Doe","123-555-9876","Jane said \"hello\""

The separatorChar value is a comma, the quoteChar value is a double quote, and the escapeChar value is a backslash.

You can use the AWS Glue console to edit table details as shown in this example:

_images/glue_edit_serde.png

Alternatively, you can update the table definition in AWS Glue to have a SerDeInfo block such as the following:

"SerDeInfo": {
   "name": "",
   "serializationLib": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
   "parameters": {
      "separatorChar": ","
      "quoteChar": """
      "escapeChar": "\\"
      }
},

For more information, see Viewing and Editing Table Details in the AWS Glue Developer Guide.
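
As a third option, the same SerDe change can be scripted. The following is a minimal boto3 sketch; the database and table names are placeholders, and note that the Glue API names the serializationLib field SerializationLibrary.

import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="mydatabase", Name="mytable")["Table"]

# The API name for serializationLib is SerializationLibrary.
table["StorageDescriptor"]["SerdeInfo"] = {
    "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
    "Parameters": {
        "separatorChar": ",",
        "quoteChar": "\"",
        "escapeChar": "\\",
    },
}

# UpdateTable accepts only the writable TableInput fields.
table_input = {
    k: v for k, v in table.items()
    if k in ("Name", "Description", "Owner", "Retention", "StorageDescriptor",
             "PartitionKeys", "TableType", "Parameters")
}
glue.update_table(DatabaseName="mydatabase", TableInput=table_input)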

CSV Files with Headers

If you are writing CSV files from AWS Glue to query using Athena, you must remove the CSV headers so that the header information is not included in Athena query results. One way to achieve this is to use AWS Glue jobs, which perform extract, transform, and load (ETL) work. You can write scripts in AWS Glue using a language that is an extension of the PySpark Python dialect. For more information, see Authoring Jobs in Glue in the AWS Glue Developer Guide.

The following example shows a call in an AWS Glue script that writes out a dynamic frame using from_options and sets the writeHeader format option to False, which suppresses the header row:

glueContext.write_dynamic_frame.from_options(
    frame = applymapping1,
    connection_type = "s3",
    connection_options = {"path": "s3://MYBUCKET/MYTABLEDATA/"},
    format = "csv",
    format_options = {"writeHeader": False},
    transformation_ctx = "datasink2")
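
For context, the following is a minimal sketch of a complete Glue job around that call. The source database, table, and mapping are placeholders.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source table from the Data Catalog (placeholder names).
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "mydatabase", table_name = "mytable",
    transformation_ctx = "datasource0")

# Placeholder mapping; keep the columns you actually need.
applymapping1 = ApplyMapping.apply(
    frame = datasource0,
    mappings = [("col0", "string", "col0", "string")],
    transformation_ctx = "applymapping1")

# Write CSV without a header row.
glueContext.write_dynamic_frame.from_options(
    frame = applymapping1, connection_type = "s3",
    connection_options = {"path": "s3://MYBUCKET/MYTABLEDATA/"},
    format = "csv", format_options = {"writeHeader": False},
    transformation_ctx = "datasink2")

job.commit()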

Using AWS Glue Jobs for ETL with Athena

AWS Glue jobs perform ETL operations. An AWS Glue job runs a script that extracts data from sources, transforms the data, and loads it into targets. For more information, see Authoring Jobs in Glue in the AWS Glue Developer Guide.

Creating Tables Using Athena for AWS Glue ETL Jobs

Tables that you create from within Athena must have a table property called classification added to them, which identifies the format of the data. This allows AWS Glue to use the tables for ETL jobs. The classification values can be csv, parquet, orc, avro, or json. An example CREATE TABLE statement in Athena follows:

CREATE EXTERNAL TABLE sampleTable (
  column1 INT,
  column2 INT
)
STORED AS PARQUET
TBLPROPERTIES ('classification'='parquet')

If the table property was not added when the table was created, the property can be added using the AWS Glue console.

  1. Choose Edit Table.
    _images/glue_edit_table.png
  2. For Classification, select the file type and choose Apply.
    _images/glue_edit_table_classification.png

For more information, see Working with Tables in the AWS Glue Developer Guide.
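
The classification property can also be set programmatically. The following is a minimal boto3 sketch using the same get_table/update_table pattern shown earlier; the names are placeholders (note that Athena lowercases table names in the catalog).

import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="mydatabase", Name="sampletable")["Table"]
table.setdefault("Parameters", {})["classification"] = "parquet"

table_input = {
    k: v for k, v in table.items()
    if k in ("Name", "Description", "Owner", "Retention", "StorageDescriptor",
             "PartitionKeys", "TableType", "Parameters")
}
glue.update_table(DatabaseName="mydatabase", TableInput=table_input)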

Using ETL Jobs to Optimize Query Performance

AWS Glue jobs can help you transform data to a format that optimizes query performance in Athena. Data formats have a large impact on query performance and query costs in Athena.

We recommend the Parquet and ORC formats. AWS Glue supports writing to both of these data formats, which can make it easier and faster for you to transform data to an optimal format for Athena. For more information about these formats and other ways to improve performance, see Top Performance Tuning tips for Amazon Athena.
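
In a Glue job script, writing Parquet instead of CSV is a small change to the format argument. The following sketch continues the CSV job example from earlier, so glueContext and applymapping1 are assumed from that sketch, and the path is a placeholder.

glueContext.write_dynamic_frame.from_options(
    frame = applymapping1, connection_type = "s3",
    connection_options = {"path": "s3://MYBUCKET/PARQUETDATA/"},  # placeholder
    format = "parquet",
    transformation_ctx = "datasink_parquet")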

Converting SMALLINT and TINYINT Datatypes to INT When Converting to ORC

To reduce the likelihood that Athena is unable to read the SMALLINT and TINYINT data types produced by an AWS Glue ETL job, convert SMALLINT and TINYINT to INT when using the wizard or writing a script for an ETL job.
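
For example, the following is a sketch of an ApplyMapping step in a Glue script that casts placeholder SMALLINT and TINYINT columns to INT:

from awsglue.transforms import ApplyMapping

# datasource0 is a DynamicFrame read from the catalog, as in the job
# sketch earlier; the column names here are placeholders.
resolved = ApplyMapping.apply(
    frame = datasource0,
    mappings = [
        ("id", "smallint", "id", "int"),
        ("flag", "tinyint", "flag", "int"),
        ("name", "string", "name", "string"),
    ],
    transformation_ctx = "resolved")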

Changing Date Data Types to String for Parquet ETL Transformation

Athena currently does not support the DATE data type for Parquet files. Convert DATE data types to STRING when using the wizard or writing a script for an AWS Glue ETL job.
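
The same ApplyMapping pattern handles this conversion; a sketch with a placeholder DATE column follows:

from awsglue.transforms import ApplyMapping

resolved = ApplyMapping.apply(
    frame = datasource0,  # DynamicFrame from the catalog, as earlier
    mappings = [("order_date", "date", "order_date", "string")],  # placeholder
    transformation_ctx = "resolved_dates")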

Automating AWS Glue Jobs for ETL

You can configure AWS Glue ETL jobs to run automatically based on triggers. This feature is ideal when data from outside AWS is being pushed to an S3 bucket in a suboptimal format for querying in Athena. For more information, see Triggering AWS Glue Jobs in the AWS Glue Developer Guide.
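
For example, the following is a minimal boto3 sketch of a conditional trigger that runs a conversion job after a crawler finishes successfully. The trigger, crawler, and job names are placeholders.

import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="run_after_crawl",                        # placeholder
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "CrawlerName": "daily_partition_crawler",  # placeholder
        "CrawlState": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "convert_to_parquet"}],   # placeholder job
    StartOnCreation=True,
)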