Athena engine version 3
For engine version 3, Athena has introduced a continuous integration approach to open source software management that improves currency with the Trino and PrestoDB projects, so that you get faster access to community improvements, integrated and tuned within the Athena engine.
This release of Athena engine version 3 supports all the features of Athena engine version 2. This document highlights key differences between Athena engine version 2 and Athena engine version 3.
Getting started
To get started, either create a new Athena workgroup that uses Athena engine version 3 or configure an existing workgroup to use version 3. Any Athena workgroup can upgrade from engine version 2 to engine version 3 without interruption in your ability to submit queries.
For more information, see Changing Athena engine versions.
Improvements and new features
The features and updates listed include improvements from Athena itself and from functionality incorporated from open source Trino. For an exhaustive list of SQL query operators and functions, refer to the Trino documentation.
Added features
Apache Spark bucketing algorithm support
Athena can read buckets generated by the Spark hash algorithm. To specify that data was originally written by the Spark hash algorithm, put ('bucketing_format'='spark') in the TBLPROPERTIES clause of your CREATE TABLE statement. If this property is not specified, the Hive hash algorithm is used.
CREATE EXTERNAL TABLE `spark_bucket_table`(
  `id` int,
  `name` string
)
CLUSTERED BY (`name`) INTO 8 BUCKETS
STORED AS PARQUET
LOCATION 's3://path/to/bucketed/table/'
TBLPROPERTIES ('bucketing_format'='spark')
Added functions
The functions in this section are new to Athena engine version 3.
Aggregate functions
listagg(x, separator) – Returns the concatenated input values, separated by the separator string.
SELECT listagg(value, ',') WITHIN GROUP (ORDER BY value) csv_value FROM (VALUES 'a', 'c', 'b') t(value);
Array functions
contains_sequence(x, seq) – Returns true if array x contains all array seq as a sequential subset (all values in the same consecutive order).
SELECT contains_sequence(ARRAY [1,2,3,4,5,6], ARRAY[1,2]);
Binary functions
murmur3(binary) – Computes the 128-bit MurmurHash3 hash of binary.
SELECT murmur3(from_base64('aaaaaa'));
Conversion functions
format_number(number) – Returns a formatted string using a unit symbol.
SELECT format_number(123456); -- '123K'
SELECT format_number(1000000); -- '1M'
Date and time functions
timezone_hour(timestamp) – Returns the hour of the time zone offset from timestamp.
SELECT EXTRACT(TIMEZONE_HOUR FROM TIMESTAMP '2020-05-10 12:34:56 +08:35');
timezone_minute(timestamp) – Returns the minute of the time zone offset from timestamp.
SELECT EXTRACT(TIMEZONE_MINUTE FROM TIMESTAMP '2020-05-10 12:34:56 +08:35');
Geospatial functions
to_encoded_polyline(Geometry) – Encodes a linestring or multipoint to a polyline.
SELECT to_encoded_polyline(ST_GeometryFromText( 'LINESTRING (-120.2 38.5, -120.95 40.7, -126.453 43.252)'));
from_encoded_polyline(varchar) – Decodes a polyline to a linestring.
SELECT ST_AsText(from_encoded_polyline('_p~iF~ps|U_ulLnnqC_mqNvxq`@'));
to_geojson_geometry(SphericalGeography) – Returns the specified spherical geography in GeoJSON format.
SELECT to_geojson_geometry(to_spherical_geography(ST_GeometryFromText( 'LINESTRING (0 0, 1 2, 3 4)')));
from_geojson_geometry(varchar) – Returns the spherical geography type object from the GeoJSON representation, stripping non-geometry key/values. Feature and FeatureCollection are not supported.
SELECT from_geojson_geometry(to_geojson_geometry(to_spherical_geography(ST_GeometryFromText( 'LINESTRING (0 0, 1 2, 3 4)'))));
geometry_nearest_points(Geometry, Geometry) – Returns the points on each geometry that are nearest each other. If either geometry is empty, returns NULL. Otherwise, returns a row of two Point objects that have the minimum distance of any two points on the geometries. The first point is from the first Geometry argument, the second from the second Geometry argument. If there are multiple pairs with the same minimum distance, one pair is chosen arbitrarily.
SELECT geometry_nearest_points(ST_GeometryFromText( 'LINESTRING (50 100, 50 200)'), ST_GeometryFromText( 'LINESTRING (10 10, 20 20)'));
Set Digest functions
make_set_digest(x) – Composes all input values of x into a setdigest.
SELECT make_set_digest(value) FROM (VALUES 1, 2, 3) T(value);
String functions
soundex(char) – Returns a character string that contains the phonetic representation of char.
SELECT name FROM nation WHERE SOUNDEX(name) = SOUNDEX('CHYNA'); -- CHINA
concat_ws(string0, string1, ..., stringN) – Returns the concatenation of string1, string2, ..., stringN using string0 as a separator. If string0 is null, then the return value is null. Any null values provided in the arguments after the separator are skipped.
SELECT concat_ws(',', 'def', 'pqr', 'mno');
Window functions
GROUPS – Adds support for window frames based on groups.
SELECT array_agg(a) OVER( ORDER BY a ASC NULLS FIRST GROUPS BETWEEN 1 PRECEDING AND 2 FOLLOWING) FROM (VALUES 3, 3, 3, 2, 2, 1, null, null) T(a);
Performance improvements
Performance improvements in Athena engine version 3 include the following.
- Faster AWS Glue table metadata retrieval – Queries that involve multiple tables see reduced query planning time.
- Dynamic filtering for RIGHT JOIN – Dynamic filtering is now enabled for right joins that have equality join conditions, as in the following example.
SELECT *
FROM lineitem
RIGHT JOIN tpch.tiny.supplier ON lineitem.suppkey = supplier.suppkey
WHERE supplier.name = 'abc';
Reliability enhancements
General engine memory usage and tracking in Athena engine version 3 have been improved. Large queries are less susceptible to failure from node crashes.
Query syntax enhancements
INTERSECT ALL – Added support for INTERSECT ALL.
SELECT * FROM (VALUES 1, 2, 3, 4) INTERSECT ALL SELECT * FROM (VALUES 3, 4);
EXCEPT ALL – Added support for EXCEPT ALL.
SELECT * FROM (VALUES 1, 2, 3, 4) EXCEPT ALL SELECT * FROM (VALUES 3, 4);
RANGE PRECEDING – Added support for RANGE PRECEDING in window functions.
SELECT sum(x) over (order by x range 1 preceding) FROM (values (1), (1), (2), (2)) t(x);
MATCH_RECOGNIZE – Added support for row pattern matching, as in the following example.
SELECT m.id AS row_id, m.match, m.val, m.label
FROM (VALUES (1, 90), (2, 80), (3, 70), (4, 70)) t(id, value)
MATCH_RECOGNIZE (
  ORDER BY id
  MEASURES
    match_number() AS match,
    RUNNING LAST(value) AS val,
    classifier() AS label
  ALL ROWS PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (() | A)
  DEFINE A AS true
) AS m;
Data format and data type enhancements
Athena engine version 3 has the following data format and data type enhancements.
- LZ4 and ZSTD – Added support for reading LZ4 and ZSTD compressed Parquet data. Added support for writing ZSTD compressed ORC data.
- Symlink-based tables – Added support for creating symlink-based tables on Avro files. An example follows.
CREATE TABLE test_avro_symlink
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
...
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
- SphericalGeography – The SphericalGeography type provides native support for spatial features represented on geographic coordinates (sometimes called geodetic coordinates, lat/lon, or lon/lat). Geographic coordinates are spherical coordinates expressed in angular units (degrees).
The to_spherical_geography function returns geographic (spherical) coordinates from geometric (planar) coordinates, as in the following example.
SELECT to_spherical_geography(ST_GeometryFromText( 'LINESTRING (-40.2 28.9, -40.2 31.9, -37.2 31.9)'));
Miscellaneous enhancements
- Large prepared statements – Increased the default HTTP request/response header size to 2 MB to allow large prepared statements.
Limitations
Athena engine version 3 has the following limitations.
- Query performance – Many queries run faster on Athena engine version 3, but some query plans can differ from Athena engine version 2. As a result, some queries can differ in latency or cost.
- Trino and Presto connectors – Neither Trino nor Presto connectors are supported. Use Amazon Athena Federated Query to connect data sources. For more information, see Using Amazon Athena Federated Query.
- Fault-tolerant execution – Trino fault-tolerant execution (Trino Tardigrade) is not supported.
Breaking changes
When you migrate from Athena engine version 2 to Athena engine version 3, certain changes can affect table schema, syntax, or data type usage. This section lists the associated error messages and provides suggested workarounds.
Query syntax changes
BOOL_OR() and BOOL_AND() cannot be used together with null treatment
Error message: Cannot specify null treatment clause for bool_or function.
Cause: The IGNORE NULLS clause cannot be used with the BOOL_OR() function. This change was made to conform to the ANSI SQL specification.
Suggested solution: Remove the IGNORE NULLS clause from the query strings.
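For illustration, the following sketch assumes the offending query applied IGNORE NULLS to bool_or used as a window function; the table and column names are hypothetical.
-- Fails in Athena engine version 3: null treatment clause specified for bool_or
-- SELECT bool_or(flag) IGNORE NULLS OVER (PARTITION BY id) FROM my_table;

-- Works: remove IGNORE NULLS (aggregate functions skip nulls by default)
SELECT bool_or(flag) OVER (PARTITION BY id) FROM my_table;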
CONCAT function must have two or more arguments
Error message: INVALID_FUNCTION_ARGUMENT: There must be two or more concatenation arguments
Cause: Previously, the CONCAT string function accepted a single argument. In Athena engine version 3, the CONCAT function requires a minimum of two arguments.
Suggested solution: Change occurrences of CONCAT(str) to CONCAT(str, '').
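For example, using a hypothetical single-argument call against the nation table referenced elsewhere in this topic:
-- Fails in Athena engine version 3: CONCAT requires at least two arguments
-- SELECT concat(name) FROM nation;

-- Works: add an empty string as a second argument
SELECT concat(name, '') FROM nation;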
Geospatial function does not support varbinary input
Error message: FUNCTION_NOT_FOUND for st_XXX
Cause: A few geospatial functions no longer support the legacy VARBINARY input type or text-related function signatures.
Suggested solution: Use geospatial functions to convert the input types to types that are supported. Supported input types are indicated in the error message.
Minute() function does not support interval year to month
Error message: Unexpected parameters (interval year to month) for function minute. Expected: minute(timestamp with time zone), minute(time with time zone), minute(timestamp), minute(time), minute(interval day to second).
Cause: In Athena engine version 3, type checks have been made more precise for EXTRACT in accordance with the ANSI SQL specification.
Suggested solution: Update the queries so that the argument types match the supported function signatures.
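As a sketch, the supported signatures accept timestamps, times, and day-to-second intervals, but not year-to-month intervals:
-- Works: these argument types match supported minute() signatures
SELECT minute(TIMESTAMP '2020-05-10 12:34:56');   -- 34
SELECT minute(INTERVAL '10' MINUTE);               -- 10

-- Fails: no minute() signature accepts an interval year to month
-- SELECT minute(INTERVAL '3' MONTH);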
ORDER BY expressions must appear in SELECT list
Error message: For SELECT DISTINCT, ORDER BY expressions must appear in SELECT list
Cause: Incorrect table aliasing is used in a SELECT clause.
Suggested solution: Double check that all columns in the ORDER BY expression have proper references in the SELECT DISTINCT clause.
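The following sketch, which reuses the nation table from the earlier soundex example, shows a query shape that triggers the error and a corrected form:
-- Fails: the ORDER BY expression does not appear in the SELECT DISTINCT list
-- SELECT DISTINCT upper(name) FROM nation ORDER BY name;

-- Works: order by the same expression that appears in the SELECT list
SELECT DISTINCT upper(name) FROM nation ORDER BY upper(name);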
SKIP is a reserved word for DML queries
The word SKIP is now a reserved word for DML queries like SELECT. To use SKIP as an identifier in a DML query, enclose it in double quotes.
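For example, assuming a hypothetical table with a column named skip:
-- Fails in Athena engine version 3: SKIP is a reserved word
-- SELECT skip FROM my_table;

-- Works: enclose the identifier in double quotes
SELECT "skip" FROM my_table;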
For more information about reserved words in Athena, see Reserved keywords.
SYSTEM_TIME and SYSTEM_VERSION clauses deprecated for time travel
Error message: mismatched input 'SYSTEM_TIME'. Expecting: 'TIMESTAMP', 'VERSION'
Cause: In Athena engine version 2, Iceberg tables used the FOR SYSTEM_TIME AS OF and FOR SYSTEM_VERSION AS OF clauses for timestamp and version time travel. Athena engine version 3 uses the FOR TIMESTAMP AS OF and FOR VERSION AS OF clauses.
Suggested solution: Update the SQL query to use the TIMESTAMP AS OF and VERSION AS OF clauses for time travel operations, as in the following examples.
Time travel by timestamp:
SELECT * FROM TABLE FOR TIMESTAMP AS OF (current_timestamp - interval '1' day)
Time travel by version:
SELECT * FROM TABLE FOR VERSION AS OF 949530903748831860
Zero-length delimited identifier not allowed
Error message: Zero-length delimited identifier not allowed.
Cause: A query used an empty string as a column alias.
Suggested solution: Update the query to use a non-empty alias for the column.
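For example, using the nation table referenced earlier in this topic:
-- Fails: an empty string is used as a column alias
-- SELECT name AS "" FROM nation;

-- Works: use a non-empty alias
SELECT name AS nation_name FROM nation;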
Data processing changes
Casting a struct to JSON now returns field names
When you cast a struct to JSON in a SELECT query in Athena engine version 3, the cast now returns both the field names and the values (for example, "useragent":null) instead of just the values (for example, null).
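A minimal sketch of the behavior change, assuming a hypothetical table request_data with a row-typed column named detail that contains a useragent field:
SELECT CAST(detail AS JSON) FROM request_data;
-- Athena engine version 2 returned only the values, for example: [null]
-- Athena engine version 3 returns field names and values, for example: {"useragent":null}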
Iceberg table column level security enforcement change
Error message: Access Denied: Cannot select from columns
Cause: The Iceberg table was created outside Athena and uses an older Apache Iceberg SDK that did not populate the column information in AWS Glue.
Suggested solution: Perform an update using the Athena ALTER TABLE SET PROPERTIES statement or use the latest Iceberg SDK to fix the table and update the column information in AWS Glue.
Nulls in List data types are now propagated to UDFs
Error message: Null Pointer Exception
Cause: This issue can affect you if you use the UDF connector and have implemented a user defined Lambda function.
Athena engine version 2 filtered out the nulls in List data types that were passed to a user defined function. In Athena engine version 3, the nulls are now preserved and passed on to the UDF. This can cause a null pointer exception if the UDF attempts to dereference the null element without checking.
For example, if you have the data [null, 1, null, 2, 3, 4] in an originating data source like DynamoDB, the following are passed to the user-defined Lambda function:
Athena engine version 2: [1, 2, 3, 4]
Athena engine version 3: [null, 1, null, 2, 3, 4]
Suggested solution: Ensure that your user-defined Lambda function handles null elements in list data types.
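If updating the Lambda function is not immediately possible, one possible SQL-side workaround (shown here only as a sketch) is to strip nulls with the filter lambda before passing the list to the UDF:
-- Remove nulls on the SQL side to reproduce the engine version 2 behavior
SELECT filter(ARRAY[null, 1, null, 2, 3, 4], x -> x IS NOT NULL);
-- returns [1, 2, 3, 4]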
Substrings from character arrays no longer contain padded spaces
Error message: No error is thrown, but the string returned no longer contains padded spaces. For example, substr(char[20],1,100) now returns a string with length 20 instead of 100.
Suggested solution: No action is required.
Unsupported decimal column type coercion
Error message: Unsupported column type (varchar) for Parquet column ([XXX] optional binary column
Cause: When doing conversions, Athena engine version 2 read values at the byte level. Athena engine version 3 introduces type validation and checks that the type is compatible before trying to read the value.
Suggested solution: Update your AWS Glue schema and make sure that decimal type columns in Parquet files are not defined as VARCHAR in AWS Glue.
Timestamp changes
Casting a Timestamp with time zone to varchar behavior change
In Athena engine version 2, casting a Timestamp with time zone to varchar caused some time zone literals to change (for example, US/Eastern changed to America/New_York). This behavior does not occur in Athena engine version 3.
Casting from TimestampTZ to Timestamp is not supported
Error message: Casting a Timestamp with Time Zone to Timestamp is not supported.
Suggested solution: Any explicit or implicit cast from TimestampTZ to Timestamp throws the exception. If possible, remove the cast and use a different data type.
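For example, a cast of the following shape now fails; keeping the value as a timestamp with time zone avoids the error. The literal is illustrative.
-- Fails in Athena engine version 3: timestamp with time zone cannot be cast to timestamp
-- SELECT CAST(TIMESTAMP '2021-01-01 00:00:00 UTC' AS TIMESTAMP);

-- Works: keep the value as timestamp with time zone instead of casting
SELECT TIMESTAMP '2021-01-01 00:00:00 UTC';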
Date timestamp overflow throws error
Error message: Millis overflow: XXX
Cause: Because ISO 8601 dates were not checked for overflow in Athena engine version 2, some dates produced a negative timestamp. Athena engine version 3 checks for this overflow and throws an exception.
Suggested Solution: Make sure the timestamp is within range.
Implicit type conversion from Timestamp type to the Int or BigInt type is not supported
Error message: GENERIC_INTERNAL_ERROR: class io.trino.spi.type.BigIntType cannot be cast to class io.trino.spi.type.TimestampType
Cause: A type mismatch between a Parquet column type (Timestamp) and a schema column type (Int or BigInt).
Suggested solution: Change the Parquet column type or the schema column type so that both use the same data type. Because changing a Parquet column type requires a data update, changing the schema column type is preferable.
Political time zones with TIME not supported
Error message: INVALID LITERAL
Cause: Queries like SELECT TIME '13:21:32.424 America/Los_Angeles'.
Suggested solution: Avoid using political time zones with TIME.
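One possible workaround, if your use case allows it, is to use a fixed offset with TIME or to move the political time zone onto a timestamp value; the literals below are illustrative.
-- Fails: political time zone with TIME
-- SELECT TIME '13:21:32.424 America/Los_Angeles';

-- Works: fixed offset with TIME, or a timestamp with the political time zone
SELECT TIME '13:21:32.424 -08:00';
SELECT TIMESTAMP '2023-01-15 13:21:32.424 America/Los_Angeles';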
Precision mismatch in Timestamp columns causes serialization error
Error message: SERIALIZATION_ERROR: Could not serialize column 'COLUMNZ' of type 'timestamp(3)' at position X:Y
COLUMNZ is the output name of the column that causes the issue. The numbers X:Y indicate the position of the column in the output.
Cause: Athena engine version 3 checks to make sure that the precision of timestamps in the data is the same as the precision specified for the column data type in the table specification. Currently, this precision is always 3. If the data has a precision greater than this, queries fail with the error noted.
Suggested solution: Check your data to make sure that your timestamps have millisecond precision.
Reading the Long type as Timestamp in ORC files now causes a malformed ORC file error
Error message: Error opening Hive split 'FILE (SPLIT POSITION)' Malformed ORC file. Cannot read SQL type timestamp from ORC stream .long_type of type LONG
Cause: Athena engine version 3 now rejects implicit coercion from the Long data type to Timestamp. Previously, Long values were implicitly converted into timestamps as if they were epoch milliseconds.
Suggested solution: Use the from_unixtime function to explicitly cast the column, or use the from_unixtime function to create an additional column for future queries.
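A minimal sketch, assuming the column (here called long_type, as in the error message) holds epoch milliseconds and the table name is a placeholder:
-- Convert the BIGINT epoch-millisecond value explicitly instead of relying on implicit coercion
SELECT from_unixtime(long_type / 1000.0) AS long_type_as_timestamp FROM my_orc_table;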
Time and interval year to month not supported
Error message: TYPE MISMATCH
Cause: Athena engine version 3 does not support adding a time and an interval year to month (for example, SELECT TIME '01:00' + INTERVAL '3' MONTH).
Timestamp overflow for int96 Parquet format
Error message: Invalid timeOfDayNanos
Cause: A timestamp overflow for the int96 Parquet format.
Suggested solution: Identify the specific files that have the issue. Then generate the data file again with an up-to-date, well known Parquet library, or use Athena CTAS. If the issue persists, contact Athena support and let us know how the data files are generated.
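The following is one possible CTAS shape for rewriting the affected data with Athena's own Parquet writer; the table names and the S3 location are placeholders.
CREATE TABLE rewritten_table
WITH (format = 'PARQUET', external_location = 's3://path/to/new/location/')
AS SELECT * FROM affected_table;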
Timestamp values with precision greater than 3 are rounded
Error message: There is no error message, but values that were trimmed in Athena engine version 2 are rounded in Athena engine version 3.
Suggested solution: Exercise care when using Timestamp values with a precision greater than 3.