Athena Engine Version Reference
This section lists the changes to the Athena query engine since Athena engine version
1. Athena engine version 1 was
Athena's initial engine version and was based on Presto 0.172
Athena engine version 2
Athena engine version 2, which is based on Presto
0.217
Improvements and New Features
-
Federated queries – Federated queries are supported in Athena engine version 2. For more information, see Using Amazon Athena Federated Query.
-
Geospatial functions – More than 25 geospatial functions have been added. For more information, see New Geospatial Functions in Athena engine version 2.
-
Nested schema – Support has been added for reading nested schema, which reduces cost.
-
Schema evolution support – Schema evolution support has been added for data in Parquet format.
-
Added support for reading array, map, or row type columns from partitions where the partition schema is different from the table schema. This can occur when the table schema was updated after the partition was created. The changed column types must be compatible. For row types, trailing fields may be added or dropped, but the corresponding fields (by ordinal) must have the same name.
-
ORC files can now have struct columns with missing fields. This allows the table schema to be changed without rewriting the ORC files.
-
ORC struct columns are now mapped by name rather than ordinal. This correctly handles missing or extra struct fields in the ORC file.
-
Grouping, Join, and Subquery Improvements
-
Complex grouping – Added support for complex grouping operations.
-
Correlated subqueries – Added support for correlated subqueries in
IN
predicates and for correlated subqueries that require coercions. -
CROSS JOIN – Added support for
CROSS JOIN
againstLATERAL
derived tables. -
GROUPING SETS – Added support for
ORDER BY
clauses in aggregations for queries that useGROUPING SETS
. -
Lambda expressions – Added support for dereferencing row fields in Lambda expressions.
-
Null values in semijoins – Added support for null values on the left-hand side of a semijoin (that is, an
IN
predicate with subqueries). -
Spatial joins – Added support for broadcast spatial joins and spatial left joins.
-
Spill to disk – For memory intensive
INNER JOIN
andLEFT JOIN
operations, Athena offloads intermediate operation results to disk. This enables execution of queries that require large amounts of memory.
Datatype Enhancements
-
INT for INTEGER – Added support for
INT
as an alias for theINTEGER
data type. -
INTERVAL types – Added support for casting to
INTERVAL
types. -
IPADDRESS – Added a new
IPADDRESS
type to represent IP addresses. Added support for casting between theVARBINARY
type andIPADDRESS
type. -
IS DISTINCT FROM – Added
IS DISTINCT FROM
support for theJSON
andIPADDRESS
types. -
Null equality checks – Equality checks for null values in
ARRAY
,MAP
, andROW
data structures are now supported. For example, the expressionARRAY ['1', '3', null] = ARRAY ['1', '2', null]
returnsfalse
. Previously, a null element returned the error messagecomparison not supported
. -
Row type coercion – Coercion between row types regardless of field names is now allowed. Previously, a row type was coercible to another only if the field name in the source type matched the target type, or when the target type had an anonymous field name.
-
Time subtraction – Implemented subtraction for all
TIME
andTIMESTAMP
types. -
Unicode – Added support for escaped Unicode sequences in string literals.
-
VARBINARY concatenation – Added support for concatenation of
VARBINARY
values.
Additional Input Types for Functions
The following functions now accept additional input types. For more information about each function, visit the corresponding link to the Presto documentation.
-
approx_distinct() – The approx_distinct()
function now supports the following types: INTEGER
,SMALLINT
,TINYINT
,DECIMAL
,REAL
,DATE
,TIMESTAMP
,TIMESTAMP WITH TIME ZONE
,TIME
,TIME WITH TIME ZONE
,IPADDRESS
, andCHAR
. -
avg(), sum() – The avg()
and sum() aggregate functions now support the INTERVAL
data type. -
lpad(), rpad() – The lpad
and rpad functions now work on VARBINARY
inputs. -
min(), max() – The min()
and max() aggregation functions now allow unknown input types at query analysis time so that you can use the functions with NULL
literals. -
regexp_replace() – Variant of the regexp_replace()
function added that can execute a Lambda function for each replacement. -
sequence() – Added
DATE
variants for the sequence()function, including variant with an implicit one-day step increment. -
ST_Area() – The ST_Area()
geospatial function now supports all geometry types. -
substr() – The substr
function now works on VARBINARY
inputs. -
zip_with() – Arrays of mismatched length can now be used with zip_with()
. Missing positions are filled with null. Previously, an error was raised when arrays of differing lengths were passed. This change may make it difficult to distinguish between values that were originally null from values that were added to pad the arrays to the same length.
Added Functions
The following list contains functions that are new in Athena engine version 2. The list does not include geospatial functions. For a list of geospatial functions, see New Geospatial Functions in Athena engine version 2.
For more information about each function, visit the corresponding link to the Presto documentation.
Aggregate Functions
Array Functions and Operators
array_sort()
Binary Functions and Operators
Date and Time Functions and Operators
Map Functions and Operators
Mathematical Functions and Operators
Quantile Digest Functions
quantile
digest functionsqdigest
quantile digest type
added.
String Functions and Operators
Performance Improvements
Performance of the following features has improved in Athena engine version 2.
Query Performance
-
Bucketed tables – Improved performance for writing to bucketed tables when the data being written is already partitioned appropriately (for example, when the output is from a bucketed join).
-
DISTINCT – Improved performance for some queries that use
DISTINCT
. -
Filter and projection operations – Filter and projection operations are now always processed by columns if possible. The engine automatically takes advantage of dictionary encodings where effective.
-
Gathering exchanges – Improved performance for queries with gathering exchanges.
-
Global aggregations – Improved performance for some queries that perform filtered global aggregations.
-
GROUPING SETS, CUBE, ROLLUP – Improved performance for queries involving
GROUPING SETS
,CUBE
orROLLUP
. -
Highly selective filters – Improved the performance of queries with highly selective filters.
-
JOIN and AGGREGATE operations – The performance of
JOIN
andAGGREGATE
operations has been enhanced. -
LIKE – Improved the performance of queries that use
LIKE
predicates on the columns ofinformation_schema
tables. -
ORDER BY and LIMIT – Improved plans, performance, and memory usage for queries involving
ORDER BY
andLIMIT
to avoid unnecessary data exchanges. -
ORDER BY –
ORDER BY
operations are now distributed by default, enabling largerORDER BY
clauses to be used. -
ROW type conversions – Improved performance when converting between
ROW
types. -
Structural types – Improved performance of queries that process structural types and contain scan, joins, aggregations, or table writes.
-
UNION – Improved performance for
UNION
queries.
Query Planning Performance
-
Planning performance – Improved planning performance for queries that join multiple tables with a large number of columns.
-
Predicate evaluations – Improved predicate evaluation performance during predicate pushdown in planning.
-
Predicate pushdown support for casting – Support predicate pushdown for the
<column>
IN
<values list>
predicate where values in the values list require casting to match the type of column. -
Predicate inference and pushdown – Predicate inference and pushdown extended for queries that use a
<symbol>
IN
<subquery>
predicate.
Join Performance
-
Joins with map columns – Improved the performance of joins and aggregations that include map columns.
-
Joins with solely non-equality conditions – Improved the performance of joins with only non-equality conditions by using a nested loop join instead of a hash join.
-
Outer joins – The join distribution type is now automatically selected for queries involving outer joins.
-
Range over a function joins – Improved performance of joins where the condition is a range over a function (for example,
a JOIN b ON b.x < f(a.x) AND b.x > g(a.x)
).
Subquery Performance
-
Correlated EXISTS subqueries – Improved performance of correlated
EXISTS
subqueries. -
Correlated subqueries with equalities – Improved support for correlated subqueries containing equality predicates.
-
Correlated subqueries with inequalities – Improved performance for correlated subqueries that contain inequalities.
-
count(*) aggregations over subqueries – Improved performance of
count(*)
aggregations over subqueries with known constant cardinality. -
Outer query filter propagation – Improved performance of correlated subqueries when filters from the outer query can be propagated to the subquery.
Function Performance
-
Aggregate window functions – Improved performance of aggregate window functions.
-
element_at() – Improved performance of
element_at()
for maps to be constant time rather than proportional to the size of the map. -
grouping() – Improved performance for queries involving
grouping()
. -
JSON casting – Improved the performance of casting from
JSON
toARRAY
orMAP
types. -
Map-returning functions – Improved performance of functions that return maps.
-
Map-to-map casting – Improved the performance of map-to-map cast.
-
min() and max() – The
min()
andmax()
functions have been optimized to avoid unnecessary object creation, thus reducing garbage collection overhead. -
row_number() – Improved performance and memory usage for queries using
row_number()
followed by a filter on the row numbers generated. -
Window functions – Improved performance of queries containing window functions with identical
PARTITION BY
andORDER BY
clauses. -
Window functions – Improved performance of certain window functions (for example,
LAG
) that have similar specifications.
Geospatial Performance
-
Geometry serialization – Improved the serialization performance of geometry values.
-
Geospatial functions – Improved the performance of
ST_Intersects()
,ST_Contains()
,ST_Touches()
,ST_Within()
,ST_Overlaps()
,ST_Disjoint()
,transform_values()
,ST_XMin()
,ST_XMax()
,ST_YMin()
,ST_YMax()
,ST_Crosses()
, andarray_intersect()
. -
ST_Distance() – Improved performance of join queries involving the
ST_Distance()
function. -
ST_Intersection() – Optimized the
ST_Intersection()
function for rectangles aligned with coordinate axes (for example, polygons produced by theST_Envelope()
andbing_tile_polygon()
functions).
JSON-Related Improvements
Map Functions
-
Improved performance of map subscript from
O(n)
toO(1)
in all cases. Previously, only maps produced by certain functions and readers took advantage of this improvement. -
Added the
map_from_entries()
andmap_entries()
functions.
Casting
-
Added ability to cast to
JSON
fromREAL
,TINYINT
orSMALLINT
. -
You can now cast
JSON
toROW
even if theJSON
does not contain every field in theROW
. -
Improved performance of
CAST(json_parse(...) AS ...)
. -
Improved the performance of casting from
JSON
toARRAY
orMAP
types.
New JSON Functions
Breaking Changes
Breaking changes include bug fixes, changes to geospatial functions, replaced functions, and the introduction of limits. Improvements in ANSI SQL compliance may break queries that depended on non-standard behavior.
Bug Fixes
The following changes correct behavioral issues that caused queries to run successfully, but with inaccurate results.
-
json_parse() no longer ignores trailing characters – Previously, inputs such as
[1,2]abc
would successfully parse as[1,2]
. Using trailing characters now produces the error messageCannot convert '[1, 2]abc' to JSON
. -
round() decimal precision corrected –
round(x, d)
now correctly roundsx
whenx
is a DECIMAL or whenx
is a DECIMAL with scale 0 andd
is a negative integer. Previously, no rounding occurred in these cases. -
round(x, d) and truncate(x, d) – The parameter
d
in the signature of functionsround(x, d)
andtruncate(x, d)
is now of typeINTEGER
. Previously,d
could be of typeBIGINT
. -
map() with duplicate keys –
map()
now raises an error on duplicate keys rather than silently producing a corrupted map. Queries that currently construct map values using duplicate keys now fail with an error. -
map_from_entries() raises an error with null entries –
map_from_entries()
now raises an error when the input array contains a null entry. Queries that construct a map by passingNULL
as a value now fail. -
Tables – Tables that have unsupported partition types can no longer be created.
-
Improved numerical stability in statistical functions – The numerical stability for the statistical functions
corr()
,covar_samp()
,regr_intercept()
, andregr_slope()
has been improved. -
Time zone information – Time zone information is now calculated using the java.time
package of the Java 1.8 SDK. -
SUM of INTERVAL_DAY_TO_SECOND and INTERVAL_YEAR_TO_MONTH datatypes – You can no longer use
SUM(NULL)
directly. In order to useSUM(NULL)
, castNULL
to a data type likeBIGINT
,DECIMAL
,REAL
,DOUBLE
,INTERVAL_DAY_TO_SECOND
orINTERVAL_YEAR_TO_MONTH
.
Changes to Geospatial Functions
Changes made to geospatial functions include the following.
-
Function name changes – Some function names have changed. For more information, see Geospatial Function Name Changes in Athena engine version 2.
-
VARBINARY input – The
VARBINARY
type is no longer directly supported for input to geospatial functions. For example, to calculate the area of a geometry directly, the geometry must now be input in eitherVARCHAR
orGEOMETRY
format. The workaround is to use transform functions, as in the following examples.-
To use
ST_area()
to calculate the area forVARBINARY
input in Well-Known Binary (WKB) format, pass the input toST_GeomFromBinary()
first, for example:ST_area(ST_GeomFromBinary(
<wkb_varbinary_value>
)) -
To use
ST_area()
to calculate the area forVARBINARY
input in legacy binary format, pass the same input to theST_GeomFromLegacyBinary()
function first, for example:ST_area(ST_GeomFromLegacyBinary(
<legacy_varbinary_value>
))
-
-
ST_ExteriorRing() and ST_Polygon() – ST_ExteriorRing() and ST_Polygon() now accept only polygons as inputs. Previously, these functions erroneously accepted other geometries.
-
ST_Distance() – As required by the SQL/MM specification
, the ST_Distance() function now returns NULL
if one of the inputs is an empty geometry. Previously,NaN
was returned.
ANSI SQL Compliance
The following syntax and behavioral issues have been corrected to follow the ANSI SQL standard.
-
Cast() operations – Cast() operations from REAL or DOUBLE to DECIMAL now conform to the SQL standard. For example,
cast (double '100000000000000000000000000000000' as decimal(38))
previously returned100000000000000005366162204393472
but now returns100000000000000000000000000000000
. -
JOIN ... USING –
JOIN ... USING
now conforms to standard SQL semantics. Previously,JOIN ... USING
required qualifying the table name in columns, and the column from both tables would be present in the output. Table qualifications are now invalid and the column is present only once in the output. -
ROW type literals removed – The ROW type literal format
ROW<int, int>(1, 2)
is no longer supported. Use the syntaxROW(1 int, 2 int)
instead. -
log() function – Previously, in violation of the SQL standard, the order of the arguments in the
log()
function was reversed. This causedlog()
to return incorrect results when queries were translated to or from other SQL implementations. The equivalent tolog(x, b)
is now correctlyln(x) / ln(b)
. -
Grouped aggregation semantics – Grouped aggregations use
IS NOT DISTINCT FROM
semantics rather than equality semantics. Grouped aggregations now return correct results and show improved performance when grouping onNaN
floating point values. Grouping on map, list, and row types that contain nulls is supported. -
Types with quotation marks are no longer allowed – In accordance with the ANSI SQL standard, data types can no longer be enclosed in quotation marks. For example,
SELECT "date" '2020-02-02'
is no longer a valid query. Instead, use the syntaxSELECT date '2020-02-02'
. -
Anonymous row field access – Anonymous row fields can no longer be accessed by using the syntax [
.field0, .field1, ...
].
Replaced Functions
The following functions are no longer supported and have been replaced by syntax that produces the same results.
-
information_schema.__internal_partitions__ – The usage of
__internal_partitions__
is no longer supported. For equivalent syntax, useSELECT * FROM "
or<table_name>
$partitions"SHOW PARTITIONS
. For more information, see Listing Partitions for a Specific Table. -
Replaced geospatial functions – For a list of geospatial functions whose names have changed, see Geospatial Function Name Changes in Athena engine version 2.
Limits
The following limits were introduced in Athena engine version 2 to ensure that queries do not fail due to resource limitations. These limits are not configurable by users.
-
Number of result elements – The number of result elements
n
is restricted to 10,000 or less for the following functions:min(col, n)
,max(col, n)
,min_by(col1, col2, n)
, andmax_by(col1, col2, n)
. -
GROUPING SETS – The maximum number of slices in a grouping set is 2048.
-
Maximum text file line length – The default maximum line length for text files is 100 MB.
-
Sequence function maximum result size – The maximum result size of a sequence function is 50000 entries. For example,
SELECT sequence(0,45000,1)
succeeds, butSELECT sequence(0,55000,1)
fails with the error messageThe result of the sequence function must not have more than 50000 entries
. This limit applies to all input types for sequence functions, including timestamps.