Example metadata table queries
The following examples show how you can get different types information from your S3 Metadata tables by using standard SQL queries.
Remember when using these examples:
-
The examples are written to work with Amazon Athena. You might have to modify the examples to work with a different query engine.
-
Make sure that you understand how to optimize your queries.
-
Replace
b_
with the name of your namespace.general-purpose-bucket-name
-
For a full list of supported columns, see the S3 Metadata journal tables schema and S3 Metadata live inventory tables schema.
Contents
Journal table example queries
You can use the following example queries to query your journal tables.
Finding objects by file extension
The following query returns objects with a specific file extension (.jpg
in this
case):
SELECT key FROM "s3tablescatalog/aws-s3"."
b_
"."journal" WHERE key LIKE '%.jpg' AND record_type = 'CREATE'general-purpose-bucket-name
Listing object deletions
The following query returns object deletion events, including the AWS account ID or AWS service principal that made the request:
SELECT DISTINCT bucket, key, sequence_number, record_type, record_timestamp, requester, source_ip_address, version_id FROM "s3tablescatalog/aws-s3"."
b_
"."journal" WHERE record_type = 'DELETE';general-purpose-bucket-name
Listing AWS KMS encryption keys used by your objects
The following query returns the ARNs of the AWS Key Management Service (AWS KMS) keys encrypting your objects:
SELECT DISTINCT kms_key_arn FROM "s3tablescatalog/aws-s3"."
b_
"."journal";general-purpose-bucket-name
Listing objects that don't use KMS keys
The following query returns objects that aren't encrypted with AWS KMS keys:
SELECT DISTINCT kms_key_arn FROM "s3tablescatalog/aws-s3"."
b_
"."journal" WHERE encryption_status NOT IN ('SSE-KMS', 'DSSE-KMS') AND record_type = 'CREATE';general-purpose-bucket-name
Listing AWS KMS encryption
keys used for PUT
operations in the last 7 days
The following query returns the ARNs of the AWS Key Management Service (AWS KMS) keys encrypting your objects:
SELECT DISTINCT kms_key_arn FROM "s3tablescatalog/aws-s3"."
b_
"."journal" WHERE record_timestamp > (current_date - interval '7' day) AND kms_key_arn is NOT NULL;general-purpose-bucket-name
Listing objects deleted in the last 24 hours by S3 Lifecycle
The following query returns lists the objects expired in the last day by S3 Lifecycle:
SELECT bucket, key, version_id, last_modified_date, record_timestamp, requester FROM "s3tablescatalog/aws-s3"."
b_
"."journal" WHERE requester = 's3.amazonaws.com' AND record_type = 'DELETE' AND record_timestamp > (current_date - interval '1' day)general-purpose-bucket-name
Viewing metadata provided by Amazon Bedrock
Some AWS services (such as Amazon Bedrock),
upload objects to Amazon S3. You can query the object metadata provided by these services. For example,
the following query includes the user_metadata
column to determine if there are objects
uploaded by Amazon Bedrock to a general purpose bucket:
SELECT DISTINCT bucket, key, sequence_number, record_type, record_timestamp, user_metadata FROM "s3tablescatalog/aws-s3"."
b_
"."journal" WHERE record_type = 'CREATE' AND user_metadata['content-source'] = 'AmazonBedrock';general-purpose-bucket-name
If Amazon Bedrock uploaded an object to your bucket, the user_metadata
column will display
the following metadata associated with the object in the query result:
user_metadata {content-additional-params -> requestid="CVK8FWYRW0M9JW65", signedContentSHA384="38b060a751ac96384cd9327eb1b1e36a21fdb71114be07434c0cc7bf63f6e1da274edebfe76f65fbd51ad2f14898b95b", content-model-id -> bedrock-model-arn, content-source -> AmazonBedrock}
Understanding the current state of your objects
The following query can help you determine the current state of your objects. The query
identifies the most recent version of each object, filters out deleted objects, and marks the latest
version of each object based on sequence numbers. Results are ordered by the bucket
,
key
, and sequence_number
columns.
WITH records_of_interest as ( -- Start with a query that can narrow down the records of interest. SELECT * from "s3tablescatalog/aws-s3"."
b_
"."journal" ), version_stacks as ( SELECT *, -- Introduce a column called 'next_sequence_number', which is the next larger -- sequence_number for the same key version_id in sorted order. LEAD(sequence_number, 1) over (partition by (bucket, key, coalesce(version_id, '')) order by sequence_number ASC) as next_sequence_number from records_of_interest ), -- Pick the 'tip' of each version stack triple: (bucket, key, version_id). -- The tip of the version stack is the row of that triple with the largest sequencer. -- Selecting only the tip filters out any row duplicates. -- This isn't typical, but some events can be delivered more than once to the table -- and include rows that might no longer exist in the bucket (since the -- table contains rows for both extant and extinct objects). -- In the next subquery, eliminate the rows that contain deleted objects. current_versions as ( SELECT * from version_stacks where next_sequence_number is NULL ), -- Eliminate the rows that are extinct from the bucket by filtering with -- record_type. An object version has been deleted from the bucket if its tip is -- record_type==DELETE. existing_current_versions as ( SELECT * from current_versions where not (record_type = 'DELETE' and is_delete_marker = FALSE) ), -- Optionally, to determine which of several object versions is the 'latest', -- you can compare their sequence numbers. A version_id is the latest if its -- tip's sequencer is the largest among all other tips in the same key. with_is_latest as ( SELECT *, -- Determine if the sequence_number of this row is the same as the largest sequencer for the key that still exists. sequence_number = (MAX(sequence_number) over (partition by (bucket, key))) as is_latest_version FROM existing_current_versions ) SELECT * from with_is_latest ORDER BY bucket, key, sequence_number;general-purpose-bucket-name
Inventory table example queries
You can use the following example queries to query your inventory tables.
Discovering datasets that use specific tags
The following query returns the dataset that uses the specified tags:
SELECT * FROM "s3tablescatalog/aws-s3"."
b_
"."inventory" WHERE object_tags['key1'] = 'value1' AND object_tags['key2'] = 'value2';general-purpose-bucket-name
Listing objects not encrypted with SSE-KMS
The following query returns objects that aren't encrypted with SSE-KMS:
SELECT key, encryption_status FROM "s3tablescatalog/aws-s3"."
b_
"."inventory" WHERE encryption_status != 'SSE-KMS';general-purpose-bucket-name
Listing objects that aren't encrypted
The following query returns objects that aren't encrypted:
SELECT bucket, key, version_id FROM "s3tablescatalog/aws-s3"."
b_
"."inventory" WHERE encryption_status IS NULL;general-purpose-bucket-name
Listing objects generated by Amazon Bedrock
The following query lists objects that were generated by Amazon Bedrock:
SELECT DISTINCT bucket, key, sequence_number, user_metadata FROM "s3tablescatalog/aws-s3"."
b_
"."inventory" WHERE user_metadata['content-source'] = 'AmazonBedrock';general-purpose-bucket-name
Reconciling the inventory table with the journal table
The following query generates an inventory-table-like list that's up to date with the current contents of the bucket. More precisely, the resulting list combines the latest snapshot of the inventory table with the latest events in the journal table.
For this query to produce the most accurate results, both the journal and inventory tables must be in Active status.
We recommend using this query for general purpose buckets containing fewer than a billion (10^9) objects.
This example query applies the following simplifications to the list results (compared to the inventory table):
-
Column omissions – The columns
bucket
,is_multipart
,encryption_status
,is_bucket_key_enabled
,kms_key_arn
, andchecksum_algorithm
aren't part of the final results. Keeping the set of optional columns to a minimum improves performance. -
Inclusion of all records – The query returns all object keys and versions, including the null version (in unversioned or versioning-suspended buckets) and delete markers. For examples of how to filter the results to show only the keys that you're interested in, see the
WHERE
clause at the end of the query. -
Accelerated reconciliation – The query could, in rare cases, temporarily report objects that are no longer in the bucket. Those discrepancies are eliminated as soon as the next snapshot of the inventory table becomes available. This behavior is a tradeoff between performance and accuracy.
WITH inventory_time_cte AS ( SELECT COALESCE(inventory_time_from_property, inventory_time_default) AS inventory_time FROM ( SELECT * FROM (VALUES (TIMESTAMP '2024-12-01 00:00')) AS T (inventory_time_default) LEFT OUTER JOIN ( SELECT from_unixtime(CAST(value AS BIGINT) / 1000.0) AS inventory_time_from_property FROM "journal$properties" WHERE key = 'aws.s3metadata.oldest-uncoalesced-record-timestamp' LIMIT 1 ) ON TRUE ) ), working_set AS ( SELECT key, sequence_number, version_id, is_delete_marker, size, COALESCE(last_modified_date, record_timestamp) AS last_modified_date, e_tag, storage_class, object_tags, user_metadata, (record_type = 'DELETE' AND NOT COALESCE(is_delete_marker, FALSE)) AS _is_perm_delete FROM journal j CROSS JOIN inventory_time_cte t WHERE j.record_timestamp > (t.inventory_time - interval '15' minute) UNION ALL SELECT key, sequence_number, version_id, is_delete_marker, size, last_modified_date, e_tag, storage_class, object_tags, user_metadata, FALSE AS _is_perm_delete FROM inventory i ), updated_inventory AS ( SELECT * FROM ( SELECT *, MAX(sequence_number) OVER (PARTITION BY key, version_id) AS _supremum_sn FROM working_set ) WHERE sequence_number = _supremum_sn ) SELECT key, sequence_number, version_id, is_delete_marker, size, last_modified_date, e_tag, storage_class, object_tags, user_metadata FROM updated_inventory -- This filter omits only permanent deletes from the results. Delete markers will still be shown. WHERE NOT _is_perm_delete -- You can add additional filters here. Examples: -- AND object_tags['department'] = 'billing' -- AND starts_with(key, 'reports/') ORDER BY key ASC, sequence_number DESC