Querying Linux Foundation Delta Lake tables
Linux Foundation Delta Lake is a table format for big data analytics. You can use Amazon Athena to read Delta Lake tables stored in Amazon S3 directly, without having to generate manifest files or run the MSCK REPAIR statement.
The Delta Lake format stores the minimum and maximum values per column of each data file. The Athena implementation uses this information to skip files whose values cannot match a query's predicates.
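For example, in a query with a selective predicate like the following (the database, table, and column names here are hypothetical), Athena can skip any data file whose stored minimum and maximum values for order_date exclude the filter value:

-- Hypothetical names; files whose min/max range for order_date
-- does not include 2024-01-15 are skipped entirely
SELECT order_id, total
FROM my_delta_db.orders
WHERE order_date = DATE '2024-01-15'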
Considerations and limitations
Delta Lake support in Athena has the following considerations and limitations:
- Tables with AWS Glue catalog only – Native Delta Lake support is available only for tables registered with AWS Glue. If you have a Delta Lake table that is registered with another metastore, you can still keep it and treat it as your primary metastore. Because Delta Lake metadata is stored in the file system (for example, in Amazon S3) rather than in the metastore, Athena requires only the location property in AWS Glue to read from your Delta Lake tables.
- V3 engine only – Delta Lake queries are supported only on Athena engine version 3. Make sure that the workgroup you create is configured to use Athena engine version 3.
- Delta Lake version – Athena uses Delta Lake version 2.0.2.
- No time travel support – Queries that use Delta Lake's time travel capabilities are not supported.
- Read only – Write DML statements such as UPDATE, INSERT, and DELETE are not supported.
- Lake Formation support – Lake Formation integration is available for Delta Lake tables whose schema is in sync with AWS Glue. For more information, see Using AWS Lake Formation with Amazon Athena and Set up permissions for a Delta Lake table in the AWS Lake Formation Developer Guide.
- Limited DDL support – The following DDL statements are supported: CREATE EXTERNAL TABLE, SHOW COLUMNS, SHOW TBLPROPERTIES, SHOW PARTITIONS, SHOW CREATE TABLE, and DESCRIBE. For information on using the CREATE EXTERNAL TABLE statement, see the Getting started section. A brief usage sketch of these statements follows this list.
- Skipping S3 Glacier objects not supported – If objects in the Linux Foundation Delta Lake table are in an Amazon S3 Glacier storage class, setting the read_restored_glacier_objects table property to false has no effect. For example, suppose you issue the following command:

  ALTER TABLE table_name SET TBLPROPERTIES ('read_restored_glacier_objects' = 'false')

  For Iceberg and Delta Lake tables, the command produces the error Unsupported table property key: read_restored_glacier_objects. For Hudi tables, the ALTER TABLE command does not produce an error, but Amazon S3 Glacier objects are still not skipped. Running SELECT queries after the ALTER TABLE command continues to return all objects.
Supported non-partition column data types
For non-partition columns, all data types that Athena supports are supported except CHAR (CHAR is not supported in the Delta Lake protocol itself). Supported data types include:

boolean, tinyint, smallint, integer, bigint, double, float, decimal, varchar, string, binary, date, timestamp, array, map, struct
Supported partition column data types
For partition columns, Athena supports tables with the following data types:
boolean, integer, smallint, tinyint, bigint, decimal, float, double, date, timestamp, varchar
For more information about the data types in Athena, see Data types in Amazon Athena.
Getting started
To be queryable, your Delta Lake table must exist in AWS Glue. If your table is in Amazon S3
but not in AWS Glue, run a CREATE EXTERNAL TABLE
statement using the following
syntax. If your table already exists in AWS Glue (for example, because you are using Apache
Spark or another engine with AWS Glue), you can skip this step.
CREATE EXTERNAL TABLE [db_name.]table_name
LOCATION 's3://DOC-EXAMPLE-BUCKET/your-folder/'
TBLPROPERTIES ('table_type' = 'DELTA')
Note the omission of column definitions, SerDe library, and other table properties. Unlike traditional Hive tables, Delta Lake table metadata is inferred from the Delta Lake transaction log and synchronized directly to AWS Glue.
Note
For Delta Lake tables, CREATE TABLE statements that include more than the LOCATION clause and the table_type property are not allowed.
Reading Delta Lake tables
To query a Delta Lake table, use standard SQL SELECT syntax:
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expression [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition ]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
[ OFFSET count [ ROW | ROWS ] ]
[ LIMIT [ count | ALL ] ]
For more information about SELECT syntax, see SELECT in the Athena documentation.
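As one concrete instance of this syntax (the database, table, and column names are hypothetical), a query against a Delta Lake table looks the same as a query against any other Athena table:

-- Hypothetical names; counts orders per customer in a Delta Lake table
SELECT customer_id, COUNT(*) AS order_count
FROM my_delta_db.orders
WHERE order_date >= DATE '2024-01-01'
GROUP BY customer_id
ORDER BY order_count DESC
LIMIT 10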
Synchronizing Delta Lake metadata
Athena synchronizes table metadata, including schema, partition columns, and table properties, to AWS Glue if you use Athena to create your Delta Lake table. As time passes, this metadata can fall out of sync with the table's actual metadata in the transaction log. To keep your table up to date, choose one of the following options:
- Use the AWS Glue crawler for Delta Lake tables. For more information, see Introducing native Delta Lake table support with AWS Glue crawlers in the AWS Big Data Blog and Scheduling an AWS Glue crawler in the AWS Glue Developer Guide.
- Drop and recreate the table in Athena (see the sketch after this list).
- Use the SDK, CLI, or AWS Glue console to manually update the schema in AWS Glue.
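A minimal sketch of the drop-and-recreate option, assuming a hypothetical database and table name and the s3://DOC-EXAMPLE-BUCKET/your-folder/ location used elsewhere on this page:

-- Hypothetical names; recreating the table causes Athena to re-infer
-- schema, partition columns, and table properties from the transaction log
DROP TABLE my_delta_db.my_delta_table

CREATE EXTERNAL TABLE my_delta_db.my_delta_table
LOCATION 's3://DOC-EXAMPLE-BUCKET/your-folder/'
TBLPROPERTIES ('table_type' = 'DELTA')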
Note that the following features require your AWS Glue schema to always match the schema in the transaction log:

- Lake Formation
- Views
- Row and column filters
If your workflow does not require any of this functionality, and you prefer not to maintain this compatibility, you can use CREATE TABLE DDL in Athena and then add the Amazon S3 path as a SerDe parameter in AWS Glue, as in the following procedure.
To create a Delta Lake table using the Athena and AWS Glue consoles

1. Open the Athena console at https://console.aws.amazon.com/athena/.
2. In the Athena query editor, use the following DDL to create your Delta Lake table. Note that when using this method, the value for TBLPROPERTIES must be 'spark.sql.sources.provider' = 'delta' and not 'table_type' = 'delta'. This same schema (with a single column named col of type array<string>) is inserted when you use Apache Spark (Athena for Apache Spark) or most other engines to create your table.

   CREATE EXTERNAL TABLE [db_name.]table_name(col array<string>)
   LOCATION 's3://DOC-EXAMPLE-BUCKET/your-folder/'
   TBLPROPERTIES ('spark.sql.sources.provider' = 'delta')

3. Open the AWS Glue console at https://console.aws.amazon.com/glue/.
4. In the navigation pane, choose Data Catalog, Tables.
5. In the list of tables, choose the link for your table.
6. On the page for the table, choose Actions, Edit table.
7. In the Serde parameters section, add the key path with the value s3://DOC-EXAMPLE-BUCKET/your-folder/.
8. Choose Save.
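After you save the path SerDe parameter, you can confirm that the table is queryable from Athena; the database and table names here are hypothetical:

-- Hypothetical names; a quick smoke test of the newly registered table
SELECT * FROM my_delta_db.my_delta_table LIMIT 10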
See also
For a discussion of using Delta Lake tables with AWS Glue and querying them with Athena, see Handle UPSERT data operations using open-source Delta Lake and AWS Glue in the AWS Big Data Blog.