Data types in Amazon Athena

When you run CREATE TABLE, you specify column names and the data type of each column. The tables that you create are stored in the AWS Glue Data Catalog.

To facilitate interoperability with other query engines, Athena uses Apache Hive data type names for DDL statements like CREATE TABLE. For DML queries like SELECT, CTAS, and INSERT INTO, Athena uses Trino data type names. The following table shows the data types supported in Athena. Where DDL and DML types differ in terms of name, availability, or syntax, they are shown in separate columns.
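
For example, the same data can be declared with a Hive type name in DDL and referenced with a Trino type name in DML. The following minimal sketch (the table name and Amazon S3 location are hypothetical) declares a column as STRING and an integer column as INT, while the query casts to the Trino type VARCHAR:

CREATE EXTERNAL TABLE my_table (
  id INT,
  name STRING
)
LOCATION 's3://amzn-s3-demo-bucket/my_table/'

SELECT CAST(id AS VARCHAR) AS id_string FROM my_table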

| DDL | DML | Description |
|---|---|---|
| BOOLEAN | BOOLEAN | Values are true and false. |
| TINYINT | TINYINT | An 8-bit signed integer in two's complement format, with a minimum value of -2^7 and a maximum value of 2^7 - 1. |
| SMALLINT | SMALLINT | A 16-bit signed integer in two's complement format, with a minimum value of -2^15 and a maximum value of 2^15 - 1. |
| INT, INTEGER | INT, INTEGER | A 32-bit signed integer in two's complement format, with a minimum value of -2^31 and a maximum value of 2^31 - 1. |
| BIGINT | BIGINT | A 64-bit signed integer in two's complement format, with a minimum value of -2^63 and a maximum value of 2^63 - 1. |
| FLOAT | REAL | A 32-bit signed single-precision floating point number. The range is 1.40129846432481707e-45 to 3.40282346638528860e+38, positive or negative. Follows the IEEE Standard for Floating-Point Arithmetic (IEEE 754). |
| DOUBLE | DOUBLE | A 64-bit signed double-precision floating point number. The range is 4.94065645841246544e-324d to 1.79769313486231570e+308d, positive or negative. Follows the IEEE Standard for Floating-Point Arithmetic (IEEE 754). |
| DECIMAL(precision, scale) | DECIMAL(precision, scale) | precision is the total number of digits. scale (optional) is the number of digits in the fractional part, with a default of 0. For example, use these type definitions: decimal(11,5), decimal(15). The maximum value for precision is 38, and the maximum value for scale is 38. |
| CHAR, CHAR(length) | CHAR, CHAR(length) | Fixed length character data, with a specified length between 1 and 255, such as char(10). If length is specified, strings are truncated at the specified length when read. If the underlying data string is longer, the underlying data string remains unchanged. For more information, see CHAR Hive data type. |
| STRING | VARCHAR | Variable length character data. |
| VARCHAR(length) | VARCHAR(length) | Variable length character data with a maximum read length. Strings are truncated at the specified length when read. If the underlying data string is longer, the underlying data string remains unchanged. |
| BINARY | VARBINARY | Variable length binary data. |
| TIME | TIME | A time of day with millisecond precision. |
| Not available | TIME(precision) | A time of day with a specific precision. TIME(3) is equivalent to TIME. |
| Not available | TIME WITH TIME ZONE | A time of day in a time zone. Time zones should be specified as offsets from UTC. |
| DATE | DATE | A calendar date with year, month, and day. |
| TIMESTAMP | TIMESTAMP, TIMESTAMP WITHOUT TIME ZONE | A calendar date and time of day with millisecond precision. |
| Not available | TIMESTAMP(precision), TIMESTAMP(precision) WITHOUT TIME ZONE | A calendar date and time of day with a specific precision. TIMESTAMP(3) is equivalent to TIMESTAMP. |
| Not available | TIMESTAMP WITH TIME ZONE | A calendar date and time of day in a time zone. Time zones can be specified as offsets from UTC, as IANA time zone names, or using UTC, UT, Z, or GMT. |
| Not available | TIMESTAMP(precision) WITH TIME ZONE | A calendar date and time of day with a specific precision, in a time zone. |
| Not available | INTERVAL YEAR TO MONTH | An interval of one or more whole months. |
| Not available | INTERVAL DAY TO SECOND | An interval of one or more seconds, minutes, hours, or days. |
| ARRAY<element_type> | ARRAY(element_type) | An array of values. All values must have the same type. |
| MAP<key_type, value_type> | MAP(key_type, value_type) | A map where values can be looked up by key. All keys must have the same type, and all values must have the same type. |
| STRUCT<field_name_1:field_type_1, field_name_2:field_type_2, …> | ROW(field_name_1 field_type_1, field_name_2 field_type_2, …) | A data structure with named fields and their values. |
| Not available | JSON | A JSON value, which can be a JSON object, a JSON array, a JSON number, a JSON string, true, false, or null. |
| Not available | UUID | A UUID (Universally Unique IDentifier). |
| Not available | IPADDRESS | An IPv4 or IPv6 address. |
| Not available | HyperLogLog, P4HyperLogLog, SetDigest, QDigest, TDigest | These data types support approximate function internals. For more information about each type, see the corresponding entry in the Trino documentation. |
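
The approximate types in the last row are typically produced and consumed by aggregate functions rather than written as literals. As a minimal sketch (the table and column names are hypothetical), the following query produces a HyperLogLog value internally with approx_set and converts it to a distinct-count estimate with cardinality:

SELECT cardinality(approx_set(user_id)) AS approx_distinct_users
FROM events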

Data type examples

The following table shows example literals for DML data types.

| Data type | Examples |
|---|---|
| BOOLEAN | true, false |
| TINYINT | TINYINT '123' |
| SMALLINT | SMALLINT '123' |
| INT, INTEGER | 123456790 |
| BIGINT | BIGINT '1234567890', 2147483648 |
| REAL | REAL '123456.78' |
| DOUBLE | 1.234 |
| DECIMAL(precision, scale) | DECIMAL '123.456' |
| CHAR, CHAR(length) | CHAR 'hello world', CHAR 'hello ''world''!' |
| VARCHAR, VARCHAR(length) | VARCHAR 'hello world', VARCHAR 'hello ''world''!' |
| VARBINARY | X'00 01 02' |
| TIME, TIME(precision) | TIME '10:11:12', TIME '10:11:12.345' |
| TIME WITH TIME ZONE | TIME '10:11:12.345 -06:00' |
| DATE | DATE '2024-03-25' |
| TIMESTAMP, TIMESTAMP WITHOUT TIME ZONE, TIMESTAMP(precision), TIMESTAMP(precision) WITHOUT TIME ZONE | TIMESTAMP '2024-03-25 11:12:13', TIMESTAMP '2024-03-25 11:12:13.456' |
| TIMESTAMP WITH TIME ZONE, TIMESTAMP(precision) WITH TIME ZONE | TIMESTAMP '2024-03-25 11:12:13.456 Europe/Berlin' |
| INTERVAL YEAR TO MONTH | INTERVAL '3' MONTH |
| INTERVAL DAY TO SECOND | INTERVAL '2' DAY |
| ARRAY(element_type) | ARRAY['one', 'two', 'three'] |
| MAP(key_type, value_type) | MAP(ARRAY['one', 'two', 'three'], ARRAY[1, 2, 3]). Note that maps are created from an array of keys and an array of values. |
| ROW(field_name_1 field_type_1, field_name_2 field_type_2, …) | ROW('one', 'two', 'three'). Note that rows created this way have no column names. To add column names, you can use CAST, as in CAST(ROW(1, 2, 3) AS ROW(one INT, two INT, three INT)). |
| JSON | JSON '{"one": 1, "two": 2, "three": 3}' |
| UUID | UUID '12345678-90ab-cdef-1234-567890abcdef' |
| IPADDRESS | IPADDRESS '10.0.0.1', IPADDRESS '2001:db8::1' |

Considerations for data types

CHAR and VARCHAR

A CHAR(n) value always contains exactly n characters. For example, if you cast 'abc' to CHAR(7), 4 trailing spaces are added.
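
For example (the result is shown as a comment):

SELECT CAST('abc' AS CHAR(7))  -- 'abc    ' (4 trailing spaces added)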

Comparisons of CHAR values include leading and trailing spaces.

If a length is specified for CHAR or VARCHAR, strings are truncated at the specified length when read. If the underlying data string is longer, the underlying data string remains unchanged.

To escape a single quote in a CHAR or VARCHAR, use an additional single quote.

To cast a non-string data type to a string in a DML query, cast to the VARCHAR data type.
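
For example, the following expression returns the string '123':

SELECT CAST(123 AS VARCHAR)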

To use the substr function to return a substring of specified length from a CHAR data type, you must first cast the CHAR value to VARCHAR. In the following example, col1 uses the CHAR data type.

substr(CAST(col1 AS VARCHAR), 1, 4)

DECIMAL

To specify decimal values as literals in SELECT queries, such as when selecting rows with a specific decimal value, specify the DECIMAL type and provide the value as a literal in single quotes, as in the following examples.

SELECT * FROM my_table WHERE decimal_value = DECIMAL '0.12'
SELECT DECIMAL '44.6' + DECIMAL '77.2'

Working with timestamp data

This section describes some considerations for working with timestamp data in Athena.

Note

The treatment of timestamps has changed somewhat between Athena engine version 2 and Athena engine version 3. For information about timestamp-related errors that can occur in Athena engine version 3 and suggested solutions, see Timestamp changes in the Athena engine version 3 reference.

Format for writing timestamp data to Amazon S3 objects

The format in which timestamp data should be written into Amazon S3 objects depends on both the column data type and the SerDe library that you use.

  • If you have a table column of type DATE, Athena expects the corresponding column or property of the data to be a string in the ISO format YYYY-MM-DD, or a built-in date type like those for Parquet or ORC.

  • If you have a table column of type TIME, Athena expects the corresponding column or property of the data to be a string in the ISO format HH:MM:SS, or a built-in time type like those for Parquet or ORC.

  • If you have a table column of type TIMESTAMP, Athena expects the corresponding column or property of the data to be a string in the format YYYY-MM-DD HH:MM:SS.SSS (note the space between the date and time), or a built-in time type like those for Parquet, ORC, or Ion.

    Note

    OpenCSVSerDe timestamps are an exception and must be encoded as millisecond-resolution UNIX epoch values.
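
    For example, under this encoding, the timestamp 2024-03-25 11:12:13 UTC would be stored as the value 1711365133000.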

Ensuring that time-partitioned data matches the timestamp field in a record

The producer of the data must make sure that partition values align with the data within the partition. For example, if your data has a timestamp property and you use Firehose to load the data into Amazon S3, you must use dynamic partitioning, because the default partitioning in Firehose is based on wall-clock time rather than on values in the data.

Use string as the data type for partition keys

For performance reasons, use STRING as the data type for partition keys. Although Athena recognizes partition values in the format YYYY-MM-DD as dates when you use the DATE type, doing so can degrade query performance, so we recommend the STRING data type for partition keys instead.
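
For example, the following table definition stores the partition key as a string (the table, column, and bucket names are hypothetical):

CREATE EXTERNAL TABLE events (
  event_time timestamp,
  payload string
)
PARTITIONED BY (event_date string)
LOCATION 's3://amzn-s3-demo-bucket/events/'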

How to write queries for timestamp fields that are also time-partitioned

How you write queries for timestamp fields that are time-partitioned depends on the type of table that you want to query.

Hive tables

With the Hive tables most commonly used in Athena, the query engine has no knowledge of relationships between columns and partition keys. For this reason, you must always add predicates in your queries for both the column and the partition key.

For example, suppose you have an event_time column and an event_date partition key and want to query events between 23:00 and 03:00. In this case, you must include predicates in your query for both the column and the partition key, as in the following example.

WHERE event_time BETWEEN start_time AND end_time AND event_date BETWEEN start_time_date AND end_time_date
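
With concrete literals and the hypothetical events table sketched earlier, a query for events between 23:00 on March 25 and 03:00 on March 26 might look like the following. Because the time range crosses midnight, the partition key predicate must cover both dates.

SELECT *
FROM events
WHERE event_time BETWEEN TIMESTAMP '2024-03-25 23:00:00' AND TIMESTAMP '2024-03-26 03:00:00'
AND event_date BETWEEN '2024-03-25' AND '2024-03-26'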

Iceberg tables

With Iceberg tables, you can use computed partition values, which simplifies your queries. For example, suppose your Iceberg table was created with a PARTITIONED BY clause like the following:

PARTITIONED BY (month(event_time))

In this case, the query engine automatically prunes partitions based on the values of the event_time predicates. Because of this, your query only needs to specify a predicate for event_time, as in the following example.

WHERE event_time BETWEEN start_time AND end_time
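
For context, a complete Iceberg table definition along these lines might look like the following sketch (the table, column, and bucket names are hypothetical):

CREATE TABLE events_iceberg (
  event_time timestamp,
  payload string
)
PARTITIONED BY (month(event_time))
LOCATION 's3://amzn-s3-demo-bucket/events_iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG')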

For more information, see Creating Iceberg tables.