Journal table example queries Inventory table example queries

Example metadata table queries

The following examples show how you can get different types information from your S3 Metadata tables by using standard SQL queries.

Remember when using these examples:

The examples are written to work with Amazon Athena. You might have to modify the examples to work with a different query engine.
Make sure that you understand how to optimize your queries.
Replace b_general-purpose-bucket-name with the name of your namespace.
For a full list of supported columns, see the S3 Metadata journal tables schema and S3 Metadata live inventory tables schema.

Journal table example queries

You can use the following example queries to query your journal tables.

Finding objects by file extension

The following query returns objects with a specific file extension (.jpg in this case):


SELECT key FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal"
WHERE key LIKE '%.jpg'
AND record_type = 'CREATE'

Listing object deletions

The following query returns object deletion events, including the AWS account ID or AWS service principal that made the request:


SELECT DISTINCT bucket, key, sequence_number, record_type, record_timestamp, requester, source_ip_address, version_id
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal"
WHERE record_type = 'DELETE';

Listing AWS KMS encryption keys used by your objects

The following query returns the ARNs of the AWS Key Management Service (AWS KMS) keys encrypting your objects:


SELECT DISTINCT kms_key_arn
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal";

Listing objects that don't use KMS keys

The following query returns objects that aren't encrypted with AWS KMS keys:


SELECT DISTINCT kms_key_arn
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal"
WHERE encryption_status NOT IN ('SSE-KMS', 'DSSE-KMS')
AND record_type = 'CREATE';

Listing AWS KMS encryption keys used for `PUT` operations in the last 7 days

The following query returns the ARNs of the AWS Key Management Service (AWS KMS) keys encrypting your objects:


SELECT DISTINCT kms_key_arn 
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal"
WHERE record_timestamp > (current_date - interval '7' day)
AND kms_key_arn is NOT NULL;

Listing objects deleted in the last 24 hours by S3 Lifecycle

The following query returns lists the objects expired in the last day by S3 Lifecycle:


SELECT bucket, key, version_id, last_modified_date, record_timestamp, requester
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal"
WHERE requester = 's3.amazonaws.com'
AND record_type = 'DELETE' 
AND record_timestamp > (current_date - interval '1' day)

Viewing metadata provided by Amazon Bedrock

Some AWS services (such as Amazon Bedrock), upload objects to Amazon S3. You can query the object metadata provided by these services. For example, the following query includes the user_metadata column to determine if there are objects uploaded by Amazon Bedrock to a general purpose bucket:


SELECT DISTINCT bucket, key, sequence_number, record_type, record_timestamp, user_metadata
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal"
WHERE record_type = 'CREATE'
AND user_metadata['content-source'] = 'AmazonBedrock';

If Amazon Bedrock uploaded an object to your bucket, the user_metadata column will display the following metadata associated with the object in the query result:


user_metadata
{content-additional-params -> requestid="CVK8FWYRW0M9JW65", signedContentSHA384="38b060a751ac96384cd9327eb1b1e36a21fdb71114be07434c0cc7bf63f6e1da274edebfe76f65fbd51ad2f14898b95b", content-model-id -> bedrock-model-arn, content-source -> AmazonBedrock}

Understanding the current state of your objects

The following query can help you determine the current state of your objects. The query identifies the most recent version of each object, filters out deleted objects, and marks the latest version of each object based on sequence numbers. Results are ordered by the bucket, key, and sequence_number columns.


WITH records_of_interest as (
   -- Start with a query that can narrow down the records of interest.
    SELECT * from "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal"
),

version_stacks as (
   SELECT *,
          -- Introduce a column called 'next_sequence_number', which is the next larger
          -- sequence_number for the same key version_id in sorted order.
          LEAD(sequence_number, 1) over (partition by (bucket, key, coalesce(version_id, '')) order by sequence_number ASC) as next_sequence_number
   from records_of_interest
),

-- Pick the 'tip' of each version stack triple: (bucket, key, version_id).
-- The tip of the version stack is the row of that triple with the largest sequencer.
-- Selecting only the tip filters out any row duplicates.
-- This isn't typical, but some events can be delivered more than once to the table
-- and include rows that might no longer exist in the bucket (since the
-- table contains rows for both extant and extinct objects).
-- In the next subquery, eliminate the rows that contain deleted objects.
current_versions as (
    SELECT * from version_stacks where next_sequence_number is NULL
),

-- Eliminate the rows that are extinct from the bucket by filtering with
-- record_type. An object version has been deleted from the bucket if its tip is
-- record_type==DELETE.
existing_current_versions as (
    SELECT * from current_versions where not (record_type = 'DELETE' and is_delete_marker = FALSE)
),

-- Optionally, to determine which of several object versions is the 'latest',
-- you can compare their sequence numbers. A version_id is the latest if its
-- tip's sequencer is the largest among all other tips in the same key.
with_is_latest as (
    SELECT *,
           -- Determine if the sequence_number of this row is the same as the largest sequencer for the key that still exists.
           sequence_number = (MAX(sequence_number) over (partition by (bucket, key))) as is_latest_version
    FROM existing_current_versions
)

SELECT * from with_is_latest
ORDER BY bucket, key, sequence_number;

Inventory table example queries

You can use the following example queries to query your inventory tables.

Discovering datasets that use specific tags

The following query returns the dataset that uses the specified tags:


SELECT * 
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."inventory"
WHERE object_tags['key1'] = 'value1'
AND object_tags['key2'] = 'value2';

Listing objects not encrypted with SSE-KMS

The following query returns objects that aren't encrypted with SSE-KMS:


SELECT key, encryption_status 
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."inventory"
WHERE encryption_status != 'SSE-KMS';

Listing objects that aren't encrypted

The following query returns objects that aren't encrypted:


SELECT bucket, key, version_id  
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."inventory"
WHERE encryption_status IS NULL;

Listing objects generated by Amazon Bedrock

The following query lists objects that were generated by Amazon Bedrock:


SELECT DISTINCT bucket, key, sequence_number, user_metadata
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."inventory"
WHERE user_metadata['content-source'] = 'AmazonBedrock';

Reconciling the inventory table with the journal table

The following query generates an inventory-table-like list that's up to date with the current contents of the bucket. More precisely, the resulting list combines the latest snapshot of the inventory table with the latest events in the journal table.

For this query to produce the most accurate results, both the journal and inventory tables must be in Active status.

We recommend using this query for general purpose buckets containing fewer than a billion (10^9) objects.

This example query applies the following simplifications to the list results (compared to the inventory table):

Column omissions – The columns bucket, is_multipart, encryption_status, is_bucket_key_enabled, kms_key_arn, and checksum_algorithm aren't part of the final results. Keeping the set of optional columns to a minimum improves performance.
Inclusion of all records – The query returns all object keys and versions, including the null version (in unversioned or versioning-suspended buckets) and delete markers. For examples of how to filter the results to show only the keys that you're interested in, see the WHERE clause at the end of the query.
Accelerated reconciliation – The query could, in rare cases, temporarily report objects that are no longer in the bucket. Those discrepancies are eliminated as soon as the next snapshot of the inventory table becomes available. This behavior is a tradeoff between performance and accuracy.

To run this query in Amazon Athena, make sure to select the s3tablescatalog/aws-s3 catalog and the b_general-purpose-bucket-name database for the general purpose bucket metadata configuration that contains your journal and inventory tables.


WITH inventory_time_cte AS (
    SELECT COALESCE(inventory_time_from_property, inventory_time_default) AS inventory_time FROM
    (
      SELECT * FROM
        (VALUES (TIMESTAMP '2024-12-01 00:00')) AS T (inventory_time_default)
      LEFT OUTER JOIN
        (
         SELECT from_unixtime(CAST(value AS BIGINT) / 1000.0) AS inventory_time_from_property FROM "journal$properties"
         WHERE key = 'aws.s3metadata.oldest-uncoalesced-record-timestamp' LIMIT 1
        )
      ON TRUE
    )
),

working_set AS (
    SELECT
        key,
        sequence_number,
        version_id,
        is_delete_marker,
        size,
        COALESCE(last_modified_date, record_timestamp) AS last_modified_date,
        e_tag,
        storage_class,
        object_tags,
        user_metadata,
        (record_type = 'DELETE' AND NOT COALESCE(is_delete_marker, FALSE)) AS _is_perm_delete
    FROM journal j
    CROSS JOIN inventory_time_cte t
    WHERE j.record_timestamp > (t.inventory_time - interval '15' minute)

    UNION ALL

    SELECT
        key,
        sequence_number,
        version_id,
        is_delete_marker,
        size,
        last_modified_date,
        e_tag,
        storage_class,
        object_tags,
        user_metadata,
        FALSE AS _is_perm_delete
    FROM inventory i
),

updated_inventory AS (
    SELECT * FROM (
        SELECT *,
            MAX(sequence_number) OVER (PARTITION BY key, version_id) AS _supremum_sn
        FROM working_set
    )
    WHERE sequence_number = _supremum_sn
)

SELECT
    key,
    sequence_number,
    version_id,
    is_delete_marker,
    size,
    last_modified_date,
    e_tag,
    storage_class,
    object_tags,
    user_metadata
FROM updated_inventory
-- This filter omits only permanent deletes from the results. Delete markers will still be shown.
WHERE NOT _is_perm_delete
-- You can add additional filters here. Examples:
--    AND object_tags['department'] = 'billing'
--    AND starts_with(key, 'reports/')
ORDER BY key ASC, sequence_number DESC;

Finding the current versions of your objects

The following query uses the inventory table to generate a new output table that shows which object versions are current. The output table is intentionally similar to an S3 Inventory report. The output table includes an is_latest field, which indicates if an object is the current version. The is_latest field is equivalent to the IsLatest field in an S3 Inventory report.

This query works for general purpose buckets with S3 Versioning in a versioning-enabled or versioning-suspended state.

Prerequisites

The query outputs the results to a new S3 table to support further queries and for higher performance versus outputting rows on screen. Therefore, before running this query, make sure you've met the following conditions. If you choose not to output the results to a new table, you can skip these steps.

You must have an existing customer-managed table bucket with an existing namespace as a place to output the new table. For more information, see Creating a table bucket and Creating a namespace.
To query your new output table, you must set up an access method for querying it. For more information, see Accessing table data. If you want to query the output table with AWS analytics services such as Amazon Athena, your customer-managed table bucket must be integrated with AWS analytics services. For more information, see Amazon S3 Tables integration with AWS analytics services overview.

To use this query, replace amzn-s3-demo-table-bucket with the name of the existing customer-managed table bucket where you want the new output table to be created. Replace existing_namespace with the name of the namespace where you want the output table to be created in your table bucket. Replace new_table with the name that you want to use for the output table. Make sure that the name of your output table follows the table naming rules.


-- If you don't want to output the results to a new table, remove the following two lines 
-- (everything before the WITH clause). 
CREATE TABLE "s3tablescatalog/amzn-s3-demo-table-bucket"."existing_namespace"."new_table" 
as (
WITH 
my_inventory AS (
  SELECT 
        bucket,
        key,
        version_id,
        sequence_number,
        is_delete_marker,
        size,
        last_modified_date,
        storage_class
  FROM inventory
-- For prefix filtering, use a WHERE clause with % at the end.
--     WHERE key LIKE 'prefix%'
  ),
 
inventory_with_is_latest as (
SELECT *,
       ROW_NUMBER() OVER (
         PARTITION BY key 
         ORDER BY sequence_number DESC
       ) = 1 AS is_latest
FROM my_inventory
    )

SELECT
        bucket,
        key,
        version_id,
        sequence_number,
        is_delete_marker,
        size,
        last_modified_date,
        storage_class,
        is_latest

FROM inventory_with_is_latest

-- If you want only the current version of each key, uncomment the following WHERE clause.
-- WHERE is_latest = TRUE
-- If you aren't outputting the results to a new table, remove the next line: 
);

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Optimizing query performance

Joining custom metadata