Data types in Amazon Athena
When you run CREATE TABLE, you specify column names and the data type that each column can contain. The tables that you create are stored in the AWS Glue Data Catalog.
To facilitate interoperability with other query engines, Athena uses Apache Hive data type names for DDL statements like CREATE TABLE. For DML queries like SELECT, CTAS, and INSERT INTO, Athena uses Trino data type names.
DDL | DML | Description |
---|---|---|
BOOLEAN | BOOLEAN | Values are true and false. |
TINYINT | TINYINT | An 8-bit signed integer in two's complement format, with a minimum value of -2^7 and a maximum value of 2^7-1. |
SMALLINT | SMALLINT | A 16-bit signed integer in two's complement format, with a minimum value of -2^15 and a maximum value of 2^15-1. |
INT, INTEGER | INTEGER | A 32-bit signed integer in two's complement format, with a minimum value of -2^31 and a maximum value of 2^31-1. |
BIGINT | BIGINT | A 64-bit signed integer in two's complement format, with a minimum value of -2^63 and a maximum value of 2^63-1. |
FLOAT | REAL | A 32-bit signed single-precision floating point number. The range is 1.40129846432481707e-45 to 3.40282346638528860e+38, positive or negative. Follows the IEEE Standard for Floating-Point Arithmetic (IEEE 754). |
DOUBLE | DOUBLE | A 64-bit signed double-precision floating point number. The range is 4.94065645841246544e-324d to 1.79769313486231570e+308d, positive or negative. Follows the IEEE Standard for Floating-Point Arithmetic (IEEE 754). |
DECIMAL(precision, scale) | DECIMAL(precision, scale) | precision is the total number of digits. scale (optional) is the number of digits in the fractional part, with a default of 0. For example, use these type definitions: decimal(11,5), decimal(15). The maximum value for precision is 38, and the maximum value for scale is 38. |
CHAR, CHAR(length) | CHAR(length) | Fixed length character data, with a specified length between 1 and 255, such as char(10). For more information, see the CHAR Hive data type. |
STRING | VARCHAR | Variable length character data. |
VARCHAR(length) | VARCHAR(length) | Variable length character data with a maximum read length. Strings are truncated at the specified length when read. If the underlying data string is longer, the underlying data string remains unchanged. |
BINARY | VARBINARY | Variable length binary data. |
TIME | TIME | A time of day with millisecond precision. |
Not available | TIME(precision) | A time of day with a specific precision. TIME(3) is equivalent to TIME. |
Not available | TIME WITH TIME ZONE | A time of day in a time zone. Time zones should be specified as offsets from UTC. |
DATE | DATE | A calendar date with year, month, and day. |
TIMESTAMP | TIMESTAMP, TIMESTAMP WITHOUT TIME ZONE | A calendar date and time of day with millisecond precision. |
Not available | TIMESTAMP(precision), TIMESTAMP(precision) WITHOUT TIME ZONE | A calendar date and time of day with a specific precision. TIMESTAMP(3) is equivalent to TIMESTAMP. |
Not available | TIMESTAMP WITH TIME ZONE | A calendar date and time of day in a time zone. Time zones can be specified as offsets from UTC, as IANA time zone names, or using UTC, UT, Z, or GMT. |
Not available | TIMESTAMP(precision) WITH TIME ZONE | A calendar date and time of day with a specific precision, in a time zone. |
Not available | INTERVAL YEAR TO MONTH | An interval of one or more whole months. |
Not available | INTERVAL DAY TO SECOND | An interval of one or more seconds, minutes, hours, or days. |
ARRAY<element_type> | ARRAY(element_type) | An array of values. All values must have the same type. |
MAP<key_type, value_type> | MAP(key_type, value_type) | A map where values can be looked up by key. All keys must have the same type, and all values must have the same type. |
STRUCT<field_name_1:field_type_1, field_name_2:field_type_2, …> | ROW(field_name_1 field_type_1, field_name_2 field_type_2, …) | A data structure with named fields and their values. |
Not available | JSON | JSON value type, which can be a JSON object, a JSON array, a JSON number, a JSON string, true, false, or null. |
Not available | UUID | A UUID (Universally Unique IDentifier). |
Not available | IPADDRESS | An IPv4 or IPv6 address. |
Not available | HyperLogLog | These data types support approximate function internals. For more information about each type, visit the link to the corresponding entry in the Trino documentation. |
Not available | P4HyperLogLog | Supports approximate function internals; see the corresponding entry in the Trino documentation. |
Not available | SetDigest | Supports approximate function internals; see the corresponding entry in the Trino documentation. |
Not available | QDigest | Supports approximate function internals; see the corresponding entry in the Trino documentation. |
Not available | TDigest | Supports approximate function internals; see the corresponding entry in the Trino documentation. |
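To illustrate how the DDL column of the table applies, the following sketch shows a hypothetical CREATE TABLE statement that uses Hive type names. The table name, column names, and Amazon S3 location are invented for illustration.

```sql
-- Hypothetical table: names and S3 location are illustrative only.
CREATE EXTERNAL TABLE orders (
  order_id    BIGINT,            -- queried as BIGINT in DML
  is_priority BOOLEAN,
  unit_price  DECIMAL(11,5),
  note        STRING,            -- DDL STRING is read as VARCHAR in DML
  created_at  TIMESTAMP,
  tags        ARRAY<STRING>,     -- DML form: ARRAY(VARCHAR)
  attributes  MAP<STRING,INT>    -- DML form: MAP(VARCHAR, INTEGER)
)
LOCATION 's3://amzn-s3-demo-bucket/orders/';
```

DML queries against this table then refer to the same columns using the Trino type names from the second column of the table above.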
Data type examples
The following table shows example literals for DML data types.
Data type | Examples |
---|---|
BOOLEAN | true, false |
TINYINT | TINYINT '123' |
SMALLINT | SMALLINT '123' |
INT, INTEGER | 123456790 |
BIGINT | BIGINT '123456790' |
REAL | REAL '123456.790' |
DOUBLE | 123456.790 |
DECIMAL(precision, scale) | DECIMAL '123.456' |
CHAR, CHAR(length) | 'hello world' |
VARCHAR, VARCHAR(length) | 'hello world' |
VARBINARY | X'65683F' |
TIME, TIME(precision) | TIME '01:02:03.456' |
TIME WITH TIME ZONE | TIME '01:02:03.456 -08:00' |
DATE | DATE '2001-08-22' |
TIMESTAMP, TIMESTAMP WITHOUT TIME ZONE, TIMESTAMP(precision) | TIMESTAMP '2001-08-22 03:04:05.321' |
TIMESTAMP WITH TIME ZONE, TIMESTAMP(precision) WITH TIME ZONE | TIMESTAMP '2001-08-22 03:04:05.321 America/Los_Angeles' |
INTERVAL YEAR TO MONTH | INTERVAL '3' MONTH |
INTERVAL DAY TO SECOND | INTERVAL '2' DAY |
ARRAY[element_type] | ARRAY[1, 2, 3] |
MAP(key_type, value_type) | MAP(ARRAY['a', 'b'], ARRAY[1, 2]). Note that maps are created from an array of keys and an array of values. |
ROW(field_name_1 field_type_1, field_name_2 field_type_2, …) | ROW(1, 2.0). Note that rows created this way have no column names. To add column names, you can use CAST, as in CAST(ROW(1, 2.0) AS ROW(x INTEGER, y DOUBLE)). |
JSON | JSON '{"a": 1, "b": 2}' |
UUID | UUID '12151fd2-7586-11e9-8f9e-2a86e4085a59' |
IPADDRESS | IPADDRESS '10.0.0.1' |
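Literals like these can be evaluated directly in a SELECT statement without referencing a table, which is a convenient way to check a literal's syntax. The aliases below are illustrative.

```sql
-- Standalone literals; no table is required.
SELECT
  DATE '2001-08-22'                   AS d,
  TIMESTAMP '2001-08-22 03:04:05.321' AS ts,
  DECIMAL '123.456'                   AS dec_val,
  ARRAY[1, 2, 3]                      AS arr,
  MAP(ARRAY['a', 'b'], ARRAY[1, 2])   AS m
```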
Considerations for data types
CHAR and VARCHAR
A CHAR(n) value always has a count of n characters. For example, if you cast 'abc' to CHAR(7), 4 trailing spaces are added.
Comparisons of CHAR values include leading and trailing spaces.
If a length is specified for CHAR or VARCHAR, strings are truncated at the specified length when read. If the underlying data string is longer, the underlying data string remains unchanged.
To escape a single quote in a CHAR or VARCHAR, use an additional single quote.
To cast a non-string data type to a string in a DML query, cast to the VARCHAR data type.
To use the substr function to return a substring of specified length from a CHAR data type, you must first cast the CHAR value as a VARCHAR. In the following example, col1 uses the CHAR data type.
substr(CAST(col1 AS VARCHAR), 1, 4)
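The padding and quote-escaping behaviors described above can be sketched as follows; the values are illustrative.

```sql
-- Casting 'abc' to CHAR(7) pads the value with 4 trailing spaces,
-- and those spaces participate in comparisons.
SELECT CAST('abc' AS CHAR(7)) AS padded

-- Escape a single quote inside a string with an additional single quote.
SELECT 'O''Reilly' AS title
```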
DECIMAL
To specify decimal values as literals in SELECT queries, such as when selecting rows with a specific decimal value, you can specify the DECIMAL type and list the decimal value as a literal in single quotes in your query, as in the following examples.
SELECT * FROM my_table WHERE decimal_value = DECIMAL '0.12'
SELECT DECIMAL '44.6' + DECIMAL '77.2'
Working with timestamp data
This section describes some considerations for working with timestamp data in Athena.
Note
The treatment of timestamps has changed somewhat between Athena engine version 2 and Athena engine version 3. For information about timestamp-related errors that can occur in Athena engine version 3 and suggested solutions, see Timestamp changes in the Athena engine version 3 reference.
Format for writing timestamp data to Amazon S3 objects
The format in which timestamp data should be written into Amazon S3 objects depends on both the column data type and the SerDe library that you use.
- If you have a table column of type DATE, Athena expects the corresponding column or property of the data to be a string in the ISO format YYYY-MM-DD, or a built-in date type like those for Parquet or ORC.
- If you have a table column of type TIME, Athena expects the corresponding column or property of the data to be a string in the ISO format HH:MM:SS, or a built-in time type like those for Parquet or ORC.
- If you have a table column of type TIMESTAMP, Athena expects the corresponding column or property of the data to be a string in the format YYYY-MM-DD HH:MM:SS.SSS (note the space between the date and time), or a built-in time type like those for Parquet, ORC, or Ion.
Note
OpenCSVSerDe timestamps are an exception and must be encoded as millisecond resolution UNIX epochs.
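As a sketch of the OpenCSVSerDe exception: the timestamp 2001-08-22 03:04:05.321 UTC is stored in the CSV data as the millisecond epoch 998449445321. Trino's from_unixtime function, which takes seconds, shows the equivalence.

```sql
-- OpenCSVSerDe stores TIMESTAMP columns as millisecond UNIX epochs,
-- e.g. 998449445321 rather than '2001-08-22 03:04:05.321'.
-- from_unixtime expects seconds, so divide by 1000.
SELECT from_unixtime(998449445321 / 1000.0)
```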
Ensuring that time-partitioned data matches the timestamp field in a record
The producer of the data must make sure partition values align with the data within
the partition. For example, if your data has a timestamp
property and you
use Firehose to load the data into Amazon S3, you must use dynamic partitioning
because the default partitioning of Firehose is wall-clock-based.
Use string as the data type for partition keys
For performance reasons, it is preferable to use STRING as the data type for partition keys. Even though Athena recognizes partition values in the format YYYY-MM-DD as dates when you use the DATE type, this can lead to poor performance. For this reason, we recommend that you use the STRING data type for partition keys instead.
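Because ISO-formatted date strings sort lexicographically in the same order as the dates they represent, range predicates on a STRING partition key still prune partitions correctly. The table and column names below are illustrative.

```sql
-- event_date is a STRING partition key holding ISO dates like '2024-05-01'.
-- String comparison of ISO dates matches chronological order.
SELECT count(*)
FROM events
WHERE event_date BETWEEN '2024-05-01' AND '2024-05-31'
```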
How to write queries for timestamp fields that are also time-partitioned
How you write queries for timestamp fields that are time-partitioned depends on the type of table that you want to query.
Hive tables
With the Hive tables most commonly used in Athena, the query engine has no knowledge of relationships between columns and partition keys. For this reason, you must always add predicates in your queries for both the column and the partition key.
For example, suppose you have an event_time column and an event_date partition key and want to query events between 23:00 and 03:00. In this case, you must include predicates in your query for both the column and the partition key, as in the following example.
WHERE event_time BETWEEN start_time AND end_time
AND event_date BETWEEN start_time_date AND end_time_date
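With concrete values substituted for the placeholders, the pattern might look like this; the table name, column names, and dates are illustrative.

```sql
-- Constrain both the timestamp column and the string partition key
-- so that the engine can prune partitions.
SELECT *
FROM events
WHERE event_time BETWEEN TIMESTAMP '2024-05-01 23:00:00'
                     AND TIMESTAMP '2024-05-02 03:00:00'
  AND event_date BETWEEN '2024-05-01' AND '2024-05-02'
```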
Iceberg tables
With Iceberg tables, you can use computed partition values, which simplifies your
queries. For example, suppose your Iceberg table was created with a
PARTITIONED BY
clause like the following:
PARTITIONED BY (event_date month(event_time))
In this case, the query engine automatically prunes partitions based on the values of the event_time predicates. Because of this, your query only needs to specify a predicate for event_time, as in the following example.
WHERE event_time BETWEEN start_time AND end_time
For more information, see Creating Iceberg tables.