Supported SerDes and data formats - Amazon Athena

Athena supports creating tables and querying data in CSV, TSV, custom-delimited, and JSON formats; in Hadoop-related formats such as ORC, Apache Avro, and Parquet; and in log formats from Logstash, AWS CloudTrail, and Apache WebServer.

Note

The formats listed in this section are used by Athena for reading data. For information about formats that Athena uses for writing data when it runs CTAS queries, see Creating a table from query results (CTAS).

To create tables and query data in these formats in Athena, specify a serializer-deserializer class (SerDe) so that Athena knows which format is used and how to parse the data.

The table at the end of this section lists the data formats supported in Athena and their corresponding SerDe libraries.

A SerDe is a custom library that tells the data catalog used by Athena how to handle the data. You specify a SerDe type by listing it explicitly in the ROW FORMAT part of your CREATE TABLE statement in Athena. In some cases, you can omit the SerDe name because Athena uses some SerDe types by default for certain types of data formats.
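The two ways of declaring a SerDe can be sketched as follows. This is a minimal illustration, not a canonical definition: the table names, columns, and the S3 location `s3://amzn-s3-demo-bucket/events/` are hypothetical placeholders, and the example assumes comma-delimited text files.

```sql
-- Hypothetical tables, columns, and bucket; only the ROW FORMAT clause matters here.

-- Explicit form: name the SerDe class in the ROW FORMAT clause.
CREATE EXTERNAL TABLE example_events_explicit (
  event_id   string,
  event_time string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',')
STORED AS TEXTFILE
LOCATION 's3://amzn-s3-demo-bucket/events/';

-- Implicit form: ROW FORMAT DELIMITED omits the SerDe name, and
-- LazySimpleSerDe is used by default for delimited text.
CREATE EXTERNAL TABLE example_events_implicit (
  event_id   string,
  event_time string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://amzn-s3-demo-bucket/events/';
```

Both statements describe the same data; the second relies on the default SerDe for delimited text rather than naming it.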

Supported data formats and SerDes
Each entry gives the data format, a description, and the SerDe types supported in Athena.

Amazon Ion

Amazon Ion is a richly typed, self-describing data format that is a superset of JSON, developed and open-sourced by Amazon.

Use the Amazon Ion Hive SerDe.

Apache Avro

A format for storing data in Hadoop that uses JSON-based schemas for record values.

Use the Avro SerDe.
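As a rough sketch of an Avro table definition, the statement below passes a JSON schema to the Avro SerDe through the `avro.schema.literal` property. The table name, columns, record name, and S3 location are hypothetical, and the schema is assumed to match the files being read.

```sql
-- Hypothetical table, schema, and bucket.
CREATE EXTERNAL TABLE example_avro (
  id   bigint,
  name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  -- JSON-based Avro schema describing each record.
  'avro.schema.literal' = '{
    "type": "record",
    "name": "example_record",
    "fields": [
      {"name": "id",   "type": "long"},
      {"name": "name", "type": "string"}
    ]
  }'
)
STORED AS AVRO
LOCATION 's3://amzn-s3-demo-bucket/avro/';
```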

Apache Parquet

A format for columnar storage of data in Hadoop.

Use the Parquet SerDe and SNAPPY compression.
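A minimal sketch of a Parquet table with SNAPPY compression is shown below. The table name, columns, and S3 location are hypothetical; `STORED AS PARQUET` selects the Parquet SerDe without naming the class explicitly.

```sql
-- Hypothetical table, columns, and bucket.
CREATE EXTERNAL TABLE example_parquet (
  id   bigint,
  name string
)
STORED AS PARQUET
LOCATION 's3://amzn-s3-demo-bucket/parquet/'
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');
```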

Apache WebServer logs

A format for storing logs in Apache WebServer.

Use the Grok SerDe or Regex SerDe.

CloudTrail logs

A format for storing logs in CloudTrail.

CSV (Comma-Separated Values)

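For data in CSV, each line represents a data record, and each record consists of one or more fields, separated by commas.

Use the LazySimpleSerDe for CSV if the data does not include values enclosed in quotes. Use the OpenCSVSerDe when the data includes quoted values.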

Custom-Delimited

For data in this format, each line represents a data record, and the fields in each record are separated by a custom single-character delimiter.

Use the LazySimpleSerDe for CSV, TSV, and custom-delimited files and specify a custom single-character delimiter.

JSON (JavaScript Object Notation)

For JSON data, each line represents a data record, and each record consists of attribute-value pairs and arrays, separated by commas.

Use the Hive JSON SerDe or the OpenX JSON SerDe.
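A table over line-delimited JSON might be declared as in the sketch below. It assumes the OpenX JSON SerDe and one JSON object per line; the table name, columns, and S3 location are hypothetical.

```sql
-- Hypothetical table, columns, and bucket.
-- Assumes each line of each file holds one complete JSON object.
CREATE EXTERNAL TABLE example_json (
  id   bigint,
  name string,
  tags array<string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://amzn-s3-demo-bucket/json/';
```

Column names are matched against the attribute names in each JSON record, so the hypothetical `tags` column would read a JSON array of strings.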

Logstash logs

A format for storing logs in Logstash.

Use the Grok SerDe.

ORC (Optimized Row Columnar)

A format for optimized columnar storage of Hive data.

Use the ORC SerDe and ZLIB compression.
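A minimal sketch of an ORC table with ZLIB compression follows. The table name, columns, and S3 location are hypothetical; `STORED AS ORC` selects the ORC SerDe without naming the class explicitly.

```sql
-- Hypothetical table, columns, and bucket.
CREATE EXTERNAL TABLE example_orc (
  id   bigint,
  name string
)
STORED AS ORC
LOCATION 's3://amzn-s3-demo-bucket/orc/'
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```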

TSV (Tab-Separated Values)

For data in TSV, each line represents a data record, and each record consists of one or more fields, separated by tabs.

Use the LazySimpleSerDe for CSV, TSV, and custom-delimited files and specify the separator character as FIELDS TERMINATED BY '\t'.
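The TSV case can be sketched as below. The table name, columns, and S3 location are hypothetical, and the `skip.header.line.count` property is included only on the assumption that each file starts with a header row; drop it otherwise.

```sql
-- Hypothetical table, columns, and bucket.
CREATE EXTERNAL TABLE example_tsv (
  id   bigint,
  name string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'   -- tab as the field separator
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://amzn-s3-demo-bucket/tsv/'
TBLPROPERTIES ('skip.header.line.count' = '1');  -- assumes one header row per file
```

For another single-character delimiter, such as a pipe, only the `FIELDS TERMINATED BY` value would change.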