Dynamic ID partitioning - Amazon Athena

Dynamic ID partitioning

When your data is partitioned by a property with high cardinality or when the values cannot be known in advance, you can use the injected projection type. Examples of such properties are user names, and IDs of devices or products. When you use the injected projection type to configure a partition key, Athena uses values from the query itself to compute the set of partitions that will be read.

For Athena to be able to run a query on a table that has a partition key configured with the injected projection type, the following must be true:

  • Your query must include at least one value for the partition key.

  • The value(s) must be literals or expressions that can be evaluated without reading any data.

If any of these criteria are not met, your query fails with the following error:

CONSTRAINT_VIOLATION: For the injected projected partition column column_name, the WHERE clause must contain only static equality conditions, and at least one such condition must be present.

Using the injected projection type

Imagine you have a data set that consists of events from IoT devices, partitioned on the devices' IDs. This data set has the following characteristics:

  • The device IDs are generated randomly.

  • New devices are provisioned frequently.

  • There are currently hundreds of thousands of devices, and in the future there will be millions.

This data set is difficult to manage using traditional metastores. It is difficult to keep the partitions in sync between the data storage and the metastore, and filtering partitions can be slow during query planning. But if you configure a table to use partition projection and use the injected projection type, you have two advantages: you don't have to manage partitions in the metastore, and your queries don't have to look up partition metadata.

The following CREATE TABLE example creates a table for the device event data set just described. The table uses the injected projection type.

CREATE EXTERNAL TABLE device_events ( event_time TIMESTAMP, data STRING, battery_level INT ) PARTITIONED BY ( device_id STRING ) LOCATION "s3://DOC-EXAMPLE-BUCKET/prefix/" TBLPROPERTIES ( "projection.enabled" = "true", "projection.device_id.type" = "injected", "storage.location.template" = "s3://DOC-EXAMPLE-BUCKET/prefix/${device_id}" )

The following example query looks up the number of events received from three specific devices over the course of 12 hours.

SELECT device_id, COUNT(*) AS events FROM device_events WHERE device_id IN ( '4a770164-0392-4a41-8565-40ed8cec737e', 'f71d12cf-f01f-4877-875d-128c23cbde17', '763421d8-b005-47c3-ba32-cc747ab32f9a' ) AND event_time BETWEEN TIMESTAMP '2023-11-01 20:00' AND TIMESTAMP '2023-11-02 08:00' GROUP BY device_id

When you run this query, Athena sees the three values for the device_id partition key and uses them to compute the partition locations. Athena uses the value for the storage.location.template property to generate the following locations:

  • s3://DOC-EXAMPLE-BUCKET/prefix/4a770164-0392-4a41-8565-40ed8cec737e

  • s3://DOC-EXAMPLE-BUCKET/prefix/f71d12cf-f01f-4877-875d-128c23cbde17

  • s3://DOC-EXAMPLE-BUCKET/prefix/763421d8-b005-47c3-ba32-cc747ab32f9a

If you leave out the storage.location.template property from the partition projection configuration, Athena uses Hive-style partitioning to project partition locations based on the value in LOCATION (for example, s3://DOC-EXAMPLE-BUCKET/prefix/device_id=4a770164-0392-4a41-8565-40ed8cec737e).