Querying Linux Foundation Delta Lake tables - Amazon Athena

Querying Linux Foundation Delta Lake tables

Linux Foundation Delta Lake is a table format for big data analytics. You can use Amazon Athena to read Delta Lake tables stored in Amazon S3 directly without having to generate manifest files or run the MSCK REPAIR statement.

The Delta Lake format stores the minimum and maximum values per column of each data file. The Athena implementation makes use of this information to enable file-skipping on predicates to eliminate unwanted files from consideration.

Considerations and limitations

Delta Lake support in Athena has the following considerations and limitations:

  • Tables with AWS Glue catalog only – Native Delta Lake support is supported only through tables registered with AWS Glue. If you have a Delta Lake table that is registered with another metastore, you can still keep it and treat it as your primary metastore. Because Delta Lake metadata is stored in the file system (for example, in Amazon S3) rather than in the metastore, Athena requires only the location property in AWS Glue to read from your Delta Lake tables.

  • V3 engine only – Delta Lake queries are supported only on Athena engine version 3. You must ensure that the workgroup you create is configured to use Athena engine version 3.

  • Delta Lake version – Athena uses Delta Lake version 2.0.2.

  • No time travel support – There is no support for queries that use Delta Lake’s time travel capabilities.

  • Read only – Write DML statements like UPDATE, INSERT, or DELETE are not supported.

  • Lake Formation support – Lake Formation integration is available for Delta Lake tables with their schema in sync with AWS Glue. For more information, see Using AWS Lake Formation with Amazon Athena and Set up permissions for a Delta Lake table in the AWS Lake Formation Developer Guide.

  • Limited DDL support – The following DDL statements are supported: CREATE EXTERNAL TABLE, SHOW COLUMNS, SHOW TBLPROPERTIES, SHOW PARTITIONS, SHOW CREATE TABLE, and DESCRIBE. For information on using the CREATE EXTERNAL TABLE statement, see the Getting started section.

  • Skipping S3 Glacier objects not supported – If objects in the Linux Foundation Delta Lake table are in an Amazon S3 Glacier storage class, setting the read_restored_glacier_objects table property to false has no effect.

    For example, suppose you issue the following command:

    ALTER TABLE table_name SET TBLPROPERTIES ('read_restored_glacier_objects' = 'false')

    For Iceberg and Delta Lake tables, the command produces the error Unsupported table property key: read_restored_glacier_objects. For Hudi tables, the ALTER TABLE command does not produce an error, but Amazon S3 Glacier objects are still not skipped. Running SELECT queries after the ALTER TABLE command continues to return all objects.

Supported non-partition column data types

For non-partition columns, all data types that Athena supports except CHAR are supported (CHAR is not supported in the Delta Lake protocol itself). Supported data types include:

boolean tinyint smallint integer bigint double float decimal varchar string binary date timestamp array map struct

Supported partition column data types

For partition columns, Athena supports tables with the following data types:

boolean integer smallint tinyint bigint decimal float double date timestamp varchar

For more information about the data types in Athena, see Data types in Amazon Athena.

Getting started

To be queryable, your Delta Lake table must exist in AWS Glue. If your table is in Amazon S3 but not in AWS Glue, run a CREATE EXTERNAL TABLE statement using the following syntax. If your table already exists in AWS Glue (for example, because you are using Apache Spark or another engine with AWS Glue), you can skip this step.

CREATE EXTERNAL TABLE [db_name.]table_name LOCATION 's3://DOC-EXAMPLE-BUCKET/your-folder/' TBLPROPERTIES ('table_type' = 'DELTA')

Note the omission of column definitions, SerDe library, and other table properties. Unlike traditional Hive tables, Delta Lake table metadata are inferred from the Delta Lake transaction log and synchronized directly to AWS Glue.

Note

For Delta Lake tables, CREATE TABLE statements that include more than the LOCATION and table_type property are not allowed.

Reading Delta Lake tables

To query a Delta Lake table, use standard SQL SELECT syntax:

[ WITH with_query [, ...] ]SELECT [ ALL | DISTINCT ] select_expression [, ...] [ FROM from_item [, ...] ] [ WHERE condition ] [ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ] [ HAVING condition ] [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ] [ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ] [ OFFSET count [ ROW | ROWS ] ] [ LIMIT [ count | ALL ] ]

For more information about SELECT syntax, see SELECT in the Athena documentation.

The Delta Lake format stores the minimum and maximum values per column of each data file. Athena makes use of this information to enable file skipping on predicates to eliminate unnecessary files from consideration.

Synchronizing Delta Lake metadata

Athena synchronizes table metadata, including schema, partition columns, and table properties, to AWS Glue if you use Athena to create your Delta Lake table. As time passes, this metadata can lose its synchronization with the underlying table metadata in the transaction log. To keep your table up to date, you can choose one of the following options:

Note that the following features require your AWS Glue schema to always have the same schema as the transaction log:

  • Lake Formation

  • Views

  • Row and column filters

If your workflow does not require any of this functionality, and you prefer not to maintain this compatibility, you can use CREATE TABLE DDL in Athena and then add the Amazon S3 path as a SerDe parameter in AWS Glue.

To create a Delta Lake table using the Athena and AWS Glue consoles
  1. Open the Athena console at https://console.aws.amazon.com/athena/.

  2. In the Athena query editor, use the following DDL to create your Delta Lake table. Note that when using this method, the value for TBLPROPERTIES must be 'spark.sql.sources.provider' = 'delta' and not 'table_type' = 'delta'.

    Note that this same schema (with a single of column named col of type array<string>) is inserted when you use Apache Spark (Athena for Apache Spark) or most other engines to create your table.

    CREATE EXTERNAL TABLE [db_name.]table_name(col array<string>) LOCATION 's3://DOC-EXAMPLE-BUCKET/your-folder/' TBLPROPERTIES ('spark.sql.sources.provider' = 'delta')
  3. Open the AWS Glue console at https://console.aws.amazon.com/glue/.

  4. In the navigation pane, choose Data Catalog, Tables.

  5. In the list of tables, choose the link for your table.

  6. On the page for the table, choose Actions, Edit table.

  7. In the Serde parameters section, add the key path with the value s3://DOC-EXAMPLE-BUCKET/your-folder/.

  8. Choose Save.

See also

For a discussion of using Delta Lake tables with AWS Glue and querying them with Athena, see Handle UPSERT data operations using open-source Delta Lake and AWS Glue in the AWS Big Data Blog.