Athena Compression Support - Amazon Athena

Athena Compression Support

Athena supports a variety of compression formats for reading and writing data, including reading from a table that uses multiple compression formats. For example, Athena can successfully read the data in a table that uses Parquet file format when some Parquet files are compressed with Snappy and other Parquet files are compressed with GZIP. The same principle applies for ORC, Textfile, and JSON storage formats.

Athena supports the following compression formats:

  • BZIP2 – Format that uses the Burrows-Wheeler algorithm.

    Note

    In rare cases, a known issue in Athena engine version 1 can cause records to be silently dropped when the BZIP2 format is used. For this reason, use of the BZIP2 format in Athena engine version 1 is not recommended.

  • DEFLATE – Compression algorithm based on LZSS and Huffman coding. Deflate is relevant only for the Avro file format.

  • GZIP – Compression algorithm based on Deflate. GZIP is the default write compression format for files in the Parquet and Textfile storage formats. Files in the tar.gz format are not supported.

  • LZ4 – A member of the Lempel-Ziv 77 (LZ77) family that focuses on compression and decompression speed rather than maximum compression of data. LZ4 has the following framing formats:

    • LZ4 Raw/Unframed – An unframed, standard implementation of the LZ4 block compression format. For more information, see the LZ4 Block Format Description on GitHub.

    • LZ4 Framed – The usual framing implementation of LZ4. For more information, see the LZ4 Frame Format Description on GitHub.

    • LZ4 Hadoop-Compatible – The Apache Hadoop implementation of LZ4. This implementation wraps LZ4 compression with the BlockCompressorStream.java class.

  • LZO – Format that uses the Lempel–Ziv–Oberhumer algorithm, which focuses on high compression and decompression speed rather than the maximum compression of data. LZO has two implementations:

    • Standard LZO – For more information, see the LZO abstract on the Oberhumer website.

    • LZO Hadoop-Compatible – This implementation wraps the LZO algorithm with the BlockCompressorStream.java class.

  • SNAPPY – Compression algorithm that is part of the Lempel-Ziv 77 (LZ77) family. Snappy focuses on high compression and decompression speed rather than the maximum compression of data.

    Some implementations of Snappy allow for framing. Framing enables decompression of streaming or file data that cannot be entirely maintained in memory. The following framing implementations are relevant for Athena:

    • Snappy Raw/Unframed – The standard implementation of the Snappy format that does not use framing. For more information, see the Snappy format description on GitHub.

    • Snappy-Framed – The framing implementation of the Snappy format. For more information, see the Snappy framing format description on GitHub.

    • Snappy Hadoop-Compatible – The framing implementation of Snappy that the Apache Hadoop Project uses. For more information, see BlockCompressorStream.java on GitHub.

    For information about the Snappy framing methods that Athena supports for each file format, see the table later on this page.

  • ZLIB – Based on Deflate, ZLIB is the default write compression format for files in the ORC data storage format. For more information, see the zlib page on GitHub.

  • ZSTANDARD – Zstandard is a fast, real-time compression algorithm that provides high compression ratios. The Zstandard library is provided as open source software under a BSD license. Athena supports reading and writing Zstandard-compressed ORC, Parquet, and Textfile data. When writing Zstandard-compressed data, Athena uses Zstandard compression level 3.
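
The list above describes both GZIP and ZLIB as based on Deflate. As a minimal sketch of what that means, the following Python snippet uses only the standard-library zlib module to wrap the same Deflate stream in three different containers. The sample data and the helper names are illustrative and have nothing to do with how Athena itself is implemented.

```python
import zlib

# wbits selects the container around the same Deflate stream:
#   -15 -> raw Deflate, 15 -> zlib wrapper, 31 -> gzip wrapper
WRAPPERS = {"deflate": -15, "zlib": 15, "gzip": 31}

def compress_as(data: bytes, wrapper: str) -> bytes:
    """Compress data with Deflate inside the chosen wrapper."""
    comp = zlib.compressobj(level=6, wbits=WRAPPERS[wrapper])
    return comp.compress(data) + comp.flush()

def decompress_as(blob: bytes, wrapper: str) -> bytes:
    """Reverse compress_as for the same wrapper."""
    return zlib.decompress(blob, wbits=WRAPPERS[wrapper])

# Illustrative sample data (not from the Athena documentation).
sample = b"athena,compression,example\n" * 100
blobs = {name: compress_as(sample, name) for name in WRAPPERS}

for name, blob in blobs.items():
    # Every wrapper round-trips to the same original bytes.
    assert decompress_as(blob, name) == sample
    print(name, len(blob), "bytes")
```

The wrappers differ only in their headers and checksums, which is why a file written with one cannot be read by a reader that expects another.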

Compression Support in Athena by File Format

The following table summarizes the compression format support in Athena for each storage file format. Textfile format includes TSV, CSV, JSON, and custom SerDes for text.

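Format    | Avro                                            | ORC                | Parquet            | Textfile
----------|-------------------------------------------------|--------------------|--------------------|--------------------------------------------------
BZIP2     | Read support only. Write not supported.         | No                 | No                 | Yes
DEFLATE   | Yes                                             | No                 | No                 | No
GZIP      | No                                              | No                 | Yes                | Yes
LZ4       | No                                              | Yes (raw/unframed) | No                 | Hadoop-compatible read support. No write support.
LZO       | No                                              | No                 | Yes                | Hadoop-compatible read support. No write support.
SNAPPY    | Raw/unframed read support. Write not supported. | Yes (raw/unframed) | Yes (raw/unframed) | Yes (Hadoop-compatible framing)
ZLIB      | No                                              | Yes                | No                 | No
ZSTANDARD | No                                              | Yes                | Yes                | Yes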
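
For scripted checks, the table above can be transcribed into a small lookup. The following Python sketch is one possible encoding. It assumes that a plain "Yes" means both read and write support, and it drops the framing details. The names FILE_FORMATS, SUPPORT, and supports are hypothetical.

```python
FILE_FORMATS = ("Avro", "ORC", "Parquet", "Textfile")

# Transcribed from the table above, one tuple entry per file format.
# "rw" = read and write, "r" = read only, "" = not supported.
# Assumption: a plain "Yes" in the table means read and write.
SUPPORT = {
    "BZIP2":     ("r",  "",   "",   "rw"),
    "DEFLATE":   ("rw", "",   "",   ""),
    "GZIP":      ("",   "",   "rw", "rw"),
    "LZ4":       ("",   "rw", "",   "r"),
    "LZO":       ("",   "",   "rw", "r"),
    "SNAPPY":    ("r",  "rw", "rw", "rw"),
    "ZLIB":      ("",   "rw", "",   ""),
    "ZSTANDARD": ("",   "rw", "rw", "rw"),
}

def supports(codec: str, file_format: str, mode: str = "r") -> bool:
    """Return True if the codec supports mode ("r" or "w") for the file format."""
    cell = SUPPORT[codec.upper()][FILE_FORMATS.index(file_format)]
    return mode in cell

print(supports("SNAPPY", "Parquet", "w"))
print(supports("BZIP2", "Avro", "w"))
```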

Specifying Compression Formats

When you write CREATE TABLE or CTAS statements, you can set compression properties that determine the compression type Athena uses when it writes to those tables.
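
As a rough sketch of what such a statement can look like, the Python helper below assembles a CTAS query string. It assumes the format and write_compression table properties from the Athena CTAS documentation, so verify those names against the documentation for your engine version. The helper build_ctas and the table names sales_snappy and sales_raw are hypothetical.

```python
def build_ctas(target: str, source: str, file_format: str, compression: str) -> str:
    """Return a CTAS statement that requests a write compression type."""
    # Assumed property names: format and write_compression.
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (format = '{file_format}', write_compression = '{compression}')\n"
        f"AS SELECT * FROM {source}"
    )

# Hypothetical table names, for illustration only.
query = build_ctas("sales_snappy", "sales_raw", "PARQUET", "SNAPPY")
print(query)
```

The helper interpolates its arguments directly into the query, so it is suitable only for trusted, hard-coded values.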

Notes and Resources

  • For data in CSV, TSV, and JSON, Athena determines the compression type from the file extension. If no file extension is present, Athena treats the data as uncompressed plain text. If your data is compressed, make sure the file name includes the compression extension, such as gz.

  • The ZIP file format is not supported.

  • For querying Amazon Kinesis Data Firehose logs from Athena, supported formats include GZIP compression or ORC files with SNAPPY compression.

  • For more information on using compression, see section 3 ("Compress and split files") of the AWS Big Data Blog post Top 10 Performance Tuning Tips for Amazon Athena.
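
The first note above says that, for text data, the compression type is determined from the file extension. The following Python sketch mimics that rule as a check to run before uploading files. Only the gz extension is named on this page, so the other suffixes in the mapping are assumptions based on common naming conventions. The function codec_from_key is hypothetical.

```python
import os

# Only "gz" is named in the notes above; the other suffixes are
# assumed here for illustration.
EXTENSION_TO_CODEC = {
    ".gz": "GZIP",
    ".bz2": "BZIP2",
    ".zst": "ZSTANDARD",
}

# ZIP and tar.gz are listed on this page as not supported.
UNSUPPORTED_SUFFIXES = (".zip", ".tar.gz")

def codec_from_key(key: str) -> str:
    """Guess the compression type of a text file from its name."""
    name = key.lower()
    if name.endswith(UNSUPPORTED_SUFFIXES):
        raise ValueError(f"unsupported archive format: {key}")
    _, ext = os.path.splitext(name)
    # A missing or unknown extension is treated as uncompressed text.
    return EXTENSION_TO_CODEC.get(ext, "NONE")

print(codec_from_key("s3://bucket/data.csv.gz"))
print(codec_from_key("data/part-0000"))
```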