Example metadata table queries - Amazon Simple Storage Service

Example metadata table queries

The following examples show how you can get different types information from your S3 Metadata tables by using standard SQL queries.

Remember when using these examples:

Journal table example queries

You can use the following example queries to query your journal tables.

Finding objects by file extension

The following query returns objects with a specific file extension (.jpg in this case):

SELECT key FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal" WHERE key LIKE '%.jpg' AND record_type = 'CREATE'

Listing object deletions

The following query returns object deletion events, including the AWS account ID or AWS service principal that made the request:

SELECT DISTINCT bucket, key, sequence_number, record_type, record_timestamp, requester, source_ip_address, version_id FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal" WHERE record_type = 'DELETE';

Listing AWS KMS encryption keys used by your objects

The following query returns the ARNs of the AWS Key Management Service (AWS KMS) keys encrypting your objects:

SELECT DISTINCT kms_key_arn FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal";

Listing objects that don't use KMS keys

The following query returns objects that aren't encrypted with AWS KMS keys:

SELECT DISTINCT kms_key_arn FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal" WHERE encryption_status NOT IN ('SSE-KMS', 'DSSE-KMS') AND record_type = 'CREATE';

Listing AWS KMS encryption keys used for PUT operations in the last 7 days

The following query returns the ARNs of the AWS Key Management Service (AWS KMS) keys encrypting your objects:

SELECT DISTINCT kms_key_arn FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal" WHERE record_timestamp > (current_date - interval '7' day) AND kms_key_arn is NOT NULL;

Listing objects deleted in the last 24 hours by S3 Lifecycle

The following query returns lists the objects expired in the last day by S3 Lifecycle:

SELECT bucket, key, version_id, last_modified_date, record_timestamp, requester FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal" WHERE requester = 's3.amazonaws.com' AND record_type = 'DELETE' AND record_timestamp > (current_date - interval '1' day)

Viewing metadata provided by Amazon Bedrock

Some AWS services (such as Amazon Bedrock), upload objects to Amazon S3. You can query the object metadata provided by these services. For example, the following query includes the user_metadata column to determine if there are objects uploaded by Amazon Bedrock to a general purpose bucket:

SELECT DISTINCT bucket, key, sequence_number, record_type, record_timestamp, user_metadata FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal" WHERE record_type = 'CREATE' AND user_metadata['content-source'] = 'AmazonBedrock';

If Amazon Bedrock uploaded an object to your bucket, the user_metadata column will display the following metadata associated with the object in the query result:

user_metadata {content-additional-params -> requestid="CVK8FWYRW0M9JW65", signedContentSHA384="38b060a751ac96384cd9327eb1b1e36a21fdb71114be07434c0cc7bf63f6e1da274edebfe76f65fbd51ad2f14898b95b", content-model-id -> bedrock-model-arn, content-source -> AmazonBedrock}

Understanding the current state of your objects

The following query can help you determine the current state of your objects. The query identifies the most recent version of each object, filters out deleted objects, and marks the latest version of each object based on sequence numbers. Results are ordered by the bucket, key, and sequence_number columns.

WITH records_of_interest as ( -- Start with a query that can narrow down the records of interest. SELECT * from "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal" ), version_stacks as ( SELECT *, -- Introduce a column called 'next_sequence_number', which is the next larger -- sequence_number for the same key version_id in sorted order. LEAD(sequence_number, 1) over (partition by (bucket, key, coalesce(version_id, '')) order by sequence_number ASC) as next_sequence_number from records_of_interest ), -- Pick the 'tip' of each version stack triple: (bucket, key, version_id). -- The tip of the version stack is the row of that triple with the largest sequencer. -- Selecting only the tip filters out any row duplicates. -- This isn't typical, but some events can be delivered more than once to the table -- and include rows that might no longer exist in the bucket (since the -- table contains rows for both extant and extinct objects). -- In the next subquery, eliminate the rows that contain deleted objects. current_versions as ( SELECT * from version_stacks where next_sequence_number is NULL ), -- Eliminate the rows that are extinct from the bucket by filtering with -- record_type. An object version has been deleted from the bucket if its tip is -- record_type==DELETE. existing_current_versions as ( SELECT * from current_versions where not (record_type = 'DELETE' and is_delete_marker = FALSE) ), -- Optionally, to determine which of several object versions is the 'latest', -- you can compare their sequence numbers. A version_id is the latest if its -- tip's sequencer is the largest among all other tips in the same key. with_is_latest as ( SELECT *, -- Determine if the sequence_number of this row is the same as the largest sequencer for the key that still exists. sequence_number = (MAX(sequence_number) over (partition by (bucket, key))) as is_latest_version FROM existing_current_versions ) SELECT * from with_is_latest ORDER BY bucket, key, sequence_number;

Inventory table example queries

You can use the following example queries to query your inventory tables.

Discovering datasets that use specific tags

The following query returns the dataset that uses the specified tags:

SELECT * FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."inventory" WHERE object_tags['key1'] = 'value1' AND object_tags['key2'] = 'value2';

Listing objects not encrypted with SSE-KMS

The following query returns objects that aren't encrypted with SSE-KMS:

SELECT key, encryption_status FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."inventory" WHERE encryption_status != 'SSE-KMS';

Listing objects that aren't encrypted

The following query returns objects that aren't encrypted:

SELECT bucket, key, version_id FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."inventory" WHERE encryption_status IS NULL;

Listing objects generated by Amazon Bedrock

The following query lists objects that were generated by Amazon Bedrock:

SELECT DISTINCT bucket, key, sequence_number, user_metadata FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."inventory" WHERE user_metadata['content-source'] = 'AmazonBedrock';

Reconciling the inventory table with the journal table

The following query generates an inventory-table-like list that's up to date with the current contents of the bucket. More precisely, the resulting list combines the latest snapshot of the inventory table with the latest events in the journal table.

For this query to produce the most accurate results, both the journal and inventory tables must be in Active status.

We recommend using this query for general purpose buckets containing fewer than a billion (10^9) objects.

This example query applies the following simplifications to the list results (compared to the inventory table):

  • Column omissions – The columns bucket, is_multipart, encryption_status, is_bucket_key_enabled, kms_key_arn, and checksum_algorithm aren't part of the final results. Keeping the set of optional columns to a minimum improves performance.

  • Inclusion of all records – The query returns all object keys and versions, including the null version (in unversioned or versioning-suspended buckets) and delete markers. For examples of how to filter the results to show only the keys that you're interested in, see the WHERE clause at the end of the query.

  • Accelerated reconciliation – The query could, in rare cases, temporarily report objects that are no longer in the bucket. Those discrepancies are eliminated as soon as the next snapshot of the inventory table becomes available. This behavior is a tradeoff between performance and accuracy.

WITH inventory_time_cte AS ( SELECT COALESCE(inventory_time_from_property, inventory_time_default) AS inventory_time FROM ( SELECT * FROM (VALUES (TIMESTAMP '2024-12-01 00:00')) AS T (inventory_time_default) LEFT OUTER JOIN ( SELECT from_unixtime(CAST(value AS BIGINT) / 1000.0) AS inventory_time_from_property FROM "journal$properties" WHERE key = 'aws.s3metadata.oldest-uncoalesced-record-timestamp' LIMIT 1 ) ON TRUE ) ), working_set AS ( SELECT key, sequence_number, version_id, is_delete_marker, size, COALESCE(last_modified_date, record_timestamp) AS last_modified_date, e_tag, storage_class, object_tags, user_metadata, (record_type = 'DELETE' AND NOT COALESCE(is_delete_marker, FALSE)) AS _is_perm_delete FROM journal j CROSS JOIN inventory_time_cte t WHERE j.record_timestamp > (t.inventory_time - interval '15' minute) UNION ALL SELECT key, sequence_number, version_id, is_delete_marker, size, last_modified_date, e_tag, storage_class, object_tags, user_metadata, FALSE AS _is_perm_delete FROM inventory i ), updated_inventory AS ( SELECT * FROM ( SELECT *, MAX(sequence_number) OVER (PARTITION BY key, version_id) AS _supremum_sn FROM working_set ) WHERE sequence_number = _supremum_sn ) SELECT key, sequence_number, version_id, is_delete_marker, size, last_modified_date, e_tag, storage_class, object_tags, user_metadata FROM updated_inventory -- This filter omits only permanent deletes from the results. Delete markers will still be shown. WHERE NOT _is_perm_delete -- You can add additional filters here. Examples: -- AND object_tags['department'] = 'billing' -- AND starts_with(key, 'reports/') ORDER BY key ASC, sequence_number DESC