Amazon Data Firehose example - Amazon Athena

Amazon Data Firehose example

When you use Firehose to deliver data to Amazon S3, the default configuration writes objects with keys that look like the following example:

s3://bucket/prefix/yyyy/MM/dd/HH/file.extension

To create an Athena table that finds the partitions automatically at query time, instead of having to add them to the AWS Glue Data Catalog as new data arrives, you can use partition projection.

The following CREATE TABLE example uses the default Firehose configuration.

CREATE EXTERNAL TABLE my_ingested_data ( ... ) ... PARTITIONED BY ( datehour STRING ) LOCATION "s3://DOC-EXAMPLE-BUCKET/prefix/" TBLPROPERTIES ( "projection.enabled" = "true", "projection.datehour.type" = "date", "projection.datehour.format" = "yyyy/MM/dd/HH", "projection.datehour.range" = "2021/01/01/00,NOW", "projection.datehour.interval" = "1", "projection.datehour.interval.unit" = "HOURS", "storage.location.template" = "s3://DOC-EXAMPLE-BUCKET/prefix/${datehour}/" )

The TBLPROPERTIES clause in the CREATE TABLE statement tells Athena the following:

  • Use partition projection when querying the table

  • The partition key datehour is of type date (which includes an optional time)

  • How the dates are formatted

  • The range of date times. Note that the values must be separated by a comma, not a hyphen.

  • Where to find the data on Amazon S3.

When you query the table, Athena calculates the values for datehour and uses the storage location template to generate a list of partition locations.

Using the date type

When you use the date type for a projected partition key, you must specify a range. Because you have no data for dates before the Firehose delivery stream was created, you can use the date of creation as the start. And because you do not have data for dates in the future, you can use the special token NOW as the end.

In the CREATE TABLE example, the start date is specified as January 1, 2021 at midnight UTC.

Note

Configure a range that matches your data as closely as possible so that Athena looks only for existing partitions.

When a query is run on the sample table, Athena uses the conditions on the datehour partition key in combination with the range to generate values. Consider the following query:

SELECT * FROM my_ingested_data WHERE datehour >= '2020/12/15/00' AND datehour < '2021/02/03/15'

The first condition in the SELECT query uses a date that is before the start of the date range specified by the CREATE TABLE statement. Because the partition projection configuration specifies no partitions for dates before January 1, 2021, Athena looks for data only in the following locations, and ignores the earlier dates in the query.

s3://bucket/prefix/2021/01/01/00/ s3://bucket/prefix/2021/01/01/01/ s3://bucket/prefix/2021/01/01/02/ ... s3://bucket/prefix/2021/02/03/12/ s3://bucket/prefix/2021/02/03/13/ s3://bucket/prefix/2021/02/03/14/

Similarly, if the query ran at a date and time before February 3, 2021 at 15:00, the last partition would reflect the current date and time, not the date and time in the query condition.

If you want to query for the most recent data, you can take advantage of the fact that Athena does not generate future dates and specify only a beginning datehour, as in the following example.

SELECT * FROM my_ingested_data WHERE datehour >= '2021/11/09/00'

Choosing partition keys

You can specify how partition projection maps the partition locations to partition keys. In the CREATE TABLE example in the previous section, the date and hour were combined into one partition key called datehour, but other schemes are possible. For example, you could also configure a table with separate partition keys for the year, month, day, and hour.

However, splitting dates into year, month, and day means that the date partition projection type can't be used. An alternative is to separate the date from the hour to still leverage the date partition projection type, but make queries that specify hour ranges easier to read.

With that in mind, the following CREATE TABLE example separates the date from the hour. Because date is a reserved word in SQL, the example uses day as the name for the partition key that represents the date.

CREATE EXTERNAL TABLE my_ingested_data2 ( ... ) ... PARTITIONED BY ( day STRING, hour INT ) LOCATION "s3://DOC-EXAMPLE-BUCKET/prefix/" TBLPROPERTIES ( "projection.enabled" = "true", "projection.day.type" = "date", "projection.day.format" = "yyyy/MM/dd", "projection.day.range" = "2021/01/01,NOW", "projection.day.interval" = "1", "projection.day.interval.unit" = "DAYS", "projection.hour.type" = "integer", "projection.hour.range" = "0,23", "projection.hour.digits" = "2", "storage.location.template" = "s3://DOC-EXAMPLE-BUCKET/prefix/${day}/${hour}/" )

In the example CREATE TABLE statement, the hour is a separate partition key, configured as an integer. The configuration for the hour partition key specifies the range 0 to 23, and that the hour should be formatted with two digits when Athena generates the partition locations.

A query for the my_ingested_data2 table might look like this:

SELECT * FROM my_ingested_data2 WHERE day = '2021/11/09' AND hour > 3

Partition key types and partition projection types

Note that datehour key in the first CREATE TABLE example is configured as date in the partition projection configuration, but the type of the partition key is string. The same is true for day in the second example. The types in the partition projection configuration only tell Athena how to format the values when it generates the partition locations. The types that you specify do not change the type of the partition key — in queries, datehour and day are of type string.

When a query includes a condition like day = '2021/11/09', Athena parses the string on the right side of the expression using the date format specified in the partition projection configuration. After Athena verifies that the date is within the configured range, it uses the date format again to insert the date as a string into the storage location template.

Similarly, for a query condition like day > '2021/11/09', Athena parses the right side and generates a list of all matching dates within the configured range. It then uses the date format to insert each date into the storage location template to create the list of partition locations.

Writing the same condition as day > '2021-11-09' or day > DATE '2021-11-09' does not work. In the first case, the date format does not match (note the hyphens instead of the forward slashes), and in the second case, the data types do not match.

Using custom prefixes and dynamic partitioning

Firehose can be configured with custom prefixes and dynamic partitioning. Using these features, you can configure the Amazon S3 keys and set up partitioning schemes that better support your use case. You can also use partition projection with these partitioning schemes and configure them accordingly.

For example, you could use the custom prefix feature to get Amazon S3 keys that have ISO formatted dates instead of the default yyyy/MM/dd/HH scheme.

You can also combine custom prefixes with dynamic partitioning to extract a property like customer_id from Firehose messages, as in the following example.

prefix/!{timestamp:yyyy}-!{timestamp:MM}-!{timestamp:dd}/!{partitionKeyFromQuery:customer_id}/

With that Amazon S3 prefix, the Firehose delivery stream would write objects to keys such as s3://bucket/prefix/2021-11-01/customer-1234/file.extension. For a property like customer_id, where the values may not be known in advance, you can use the partition projection type injected and use a CREATE TABLE statement like the following:

CREATE EXTERNAL TABLE my_ingested_data3 ( ... ) ... PARTITIONED BY ( day STRING, customer_id STRING ) LOCATION "s3://DOC-EXAMPLE-BUCKET/prefix/" TBLPROPERTIES ( "projection.enabled" = "true", "projection.day.type" = "date", "projection.day.format" = "yyyy-MM-dd", "projection.day.range" = "2021-01-01,NOW", "projection.day.interval" = "1", "projection.day.interval.unit" = "DAYS", "projection.customer_id.type" = "injected", "storage.location.template" = "s3://DOC-EXAMPLE-BUCKET/prefix/${day}/${customer_id}/" )

When you query a table that has a partition key of type injected, your query must include a value for that partition key. A query for the my_ingested_data3 table might look like this:

SELECT * FROM my_ingested_data3 WHERE day BETWEEN '2021-11-01' AND '2021-11-30' AND customer_id = 'customer-1234'

ISO formatted dates

Because the values for the day partition key are ISO formatted, you can also use the DATE type for the day partition key instead of STRING, as in the following example:

PARTITIONED BY (day DATE, customer_id STRING)

When you query, this strategy allows you to use date functions on the partition key without parsing or casting, as in the following example:

SELECT * FROM my_ingested_data3 WHERE day > CURRENT_DATE - INTERVAL '7' DAY AND customer_id = 'customer-1234'
Note

Specifying a partition key of the DATE type assumes that you have used the custom prefix feature to create Amazon S3 keys that have ISO formatted dates. If you are using the default Firehose format of yyyy/MM/dd/HH, you must specify the partition key as type string even though the corresponding table property is of type date, as in the following example:

PARTITIONED BY ( `mydate` string) TBLPROPERTIES ( 'projection.enabled'='true', ... 'projection.mydate.type'='date', 'storage.location.template'='s3://DOC-EXAMPLE-BUCKET/prefix/${mydate}')