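Example metadata table queries
The following examples show how to get different types of information from your S3 Metadata tables by using standard SQL queries.
Keep the following in mind when you use these examples: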
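- The examples are written to work with Amazon Athena. You might have to modify them to work with a different query engine.
- Make sure that you understand how to optimize your queries.
- Replace b_general-purpose-bucket-name with the name of your namespace.
- For a full list of supported columns, see the S3 Metadata journal tables schema and the S3 Metadata live inventory tables schema.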
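If you generate these queries from a script, the placeholder substitution described above can be sketched as follows. This is a minimal illustration, not part of any AWS SDK: the helper name metadata_table is hypothetical, and it assumes that the namespace is the prefix b_ followed by the general purpose bucket name, as the placeholder suggests.

```python
# Hypothetical helper: builds the quoted three-part table identifier
# used by the example queries in this topic.
CATALOG = "s3tablescatalog/aws-s3"


def metadata_table(bucket_name: str, table: str = "journal") -> str:
    """Return the quoted identifier for the journal or inventory table.

    Assumption: the namespace is "b_" followed by the bucket name.
    """
    if table not in ("journal", "inventory"):
        raise ValueError("table must be 'journal' or 'inventory'")
    return f'"{CATALOG}"."b_{bucket_name}"."{table}"'


# amzn-s3-demo-bucket is a placeholder bucket name.
query = (
    f"SELECT key FROM {metadata_table('amzn-s3-demo-bucket')} "
    "WHERE key LIKE '%.jpg' AND record_type = 'CREATE'"
)
```

The resulting string can be submitted through whichever Athena client you already use.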
Journal table example queries
You can use the following example queries to query your journal tables.
Finding objects by file extension
The following query returns objects with a specific file extension (.jpg in this example):
SELECT key
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal"
WHERE key LIKE '%.jpg'
  AND record_type = 'CREATE';
Listing object deletions
The following query returns object deletion events, including the AWS account ID or AWS service principal that made each request:
SELECT DISTINCT bucket, key, sequence_number, record_type, record_timestamp,
       requester, source_ip_address, version_id
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal"
WHERE record_type = 'DELETE';
Listing AWS KMS encryption keys used by your objects
The following query returns the ARNs of the AWS Key Management Service (AWS KMS) keys that encrypt your objects:
SELECT DISTINCT kms_key_arn
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal";
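Listing objects that don't use KMS keys
The following query returns objects that aren't encrypted with AWS KMS keys. It selects the object identifiers and the encryption status rather than kms_key_arn, because that column carries no key for these objects:
SELECT DISTINCT bucket, key, version_id, encryption_status
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal"
WHERE encryption_status NOT IN ('SSE-KMS', 'DSSE-KMS')
  AND record_type = 'CREATE';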
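Listing AWS KMS encryption keys used for PUT operations in the last 7 days
The following query returns the ARNs of the AWS Key Management Service (AWS KMS) keys that appear in journal records from the last 7 days. Records without a KMS key are excluded:
SELECT DISTINCT kms_key_arn
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal"
WHERE record_timestamp > (current_date - interval '7' day)
  AND kms_key_arn IS NOT NULL;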
Listing objects deleted in the last 24 hours by S3 Lifecycle
The following query returns the objects that S3 Lifecycle expired in the last day:
SELECT bucket, key, version_id, last_modified_date, record_timestamp, requester
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal"
WHERE requester = 's3.amazonaws.com'
  AND record_type = 'DELETE'
  AND record_timestamp > (current_date - interval '1' day);
Viewing metadata provided by Amazon Bedrock
Some AWS services, such as Amazon Bedrock, upload objects to Amazon S3. You can query the object metadata that these services provide. For example, the following query uses the user_metadata column to determine whether Amazon Bedrock has uploaded any objects to a general purpose bucket:
SELECT DISTINCT bucket, key, sequence_number, record_type, record_timestamp, user_metadata
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal"
WHERE record_type = 'CREATE'
  AND user_metadata['content-source'] = 'AmazonBedrock';
If Amazon Bedrock uploaded an object to your bucket, the user_metadata column in the query result shows the following metadata associated with that object:
user_metadata
{content-additional-params -> requestid="CVK8FWYRW0M9JW65", signedContentSHA384="38b060a751ac96384cd9327eb1b1e36a21fdb71114be07434c0cc7bf63f6e1da274edebfe76f65fbd51ad2f14898b95b", content-model-id -> bedrock-model-arn, content-source -> AmazonBedrock}
Understanding the current state of your objects
The following query can help you determine the current state of your objects. It identifies the most recent record for each object version, filters out deleted objects, and marks the latest version of each object based on sequence numbers. Results are ordered by the bucket, key, and sequence_number columns.
WITH records_of_interest AS (
    -- Start with a query that can narrow down the records of interest.
    SELECT * FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."journal"
),
version_stacks AS (
    SELECT *,
        -- Introduce a column called 'next_sequence_number', which is the next larger
        -- sequence_number for the same key version_id in sorted order.
        LEAD(sequence_number, 1) OVER (
            PARTITION BY (bucket, key, coalesce(version_id, ''))
            ORDER BY sequence_number ASC
        ) AS next_sequence_number
    FROM records_of_interest
),
-- Pick the 'tip' of each version stack triple: (bucket, key, version_id).
-- The tip of the version stack is the row of that triple with the largest sequence_number.
-- Selecting only the tip filters out any row duplicates.
-- This isn't typical, but some events can be delivered more than once to the table
-- and include rows that might no longer exist in the bucket (since the
-- table contains rows for both extant and extinct objects).
-- In the next subquery, eliminate the rows that contain deleted objects.
current_versions AS (
    SELECT * FROM version_stacks WHERE next_sequence_number IS NULL
),
-- Eliminate the rows that are extinct from the bucket by filtering with
-- record_type. An object version has been deleted from the bucket if its tip is
-- record_type==DELETE.
existing_current_versions AS (
    SELECT * FROM current_versions
    WHERE NOT (record_type = 'DELETE' AND is_delete_marker = FALSE)
),
-- Optionally, to determine which of several object versions is the 'latest',
-- you can compare their sequence numbers. A version_id is the latest if its
-- tip's sequence_number is the largest among all other tips in the same key.
with_is_latest AS (
    SELECT *,
        -- Determine if the sequence_number of this row is the same as the largest
        -- sequence_number for the key that still exists.
        sequence_number = (MAX(sequence_number) OVER (PARTITION BY (bucket, key))) AS is_latest_version
    FROM existing_current_versions
)
SELECT * FROM with_is_latest
ORDER BY bucket, key, sequence_number;
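The steps in this query (take the tip of each version stack, drop permanently deleted versions, then flag the latest version per key) can be sketched outside SQL as follows. This is an illustrative sketch over in-memory rows, not how Athena evaluates the query. The function name current_state and the sample records are hypothetical, and the sketch assumes that every record carries a Boolean is_delete_marker value and that sequence numbers compare correctly as strings.

```python
def current_state(records):
    # Tip of each (bucket, key, version_id) stack: the row with the largest
    # sequence_number. Duplicate deliveries collapse into a single tip.
    tips = {}
    for r in records:
        stack = (r["bucket"], r["key"], r.get("version_id") or "")
        if stack not in tips or r["sequence_number"] > tips[stack]["sequence_number"]:
            tips[stack] = r

    # Drop stacks whose tip is a DELETE that is not a delete marker.
    # Assumption: is_delete_marker is always a bool in these sample rows.
    alive = [
        r for r in tips.values()
        if not (r["record_type"] == "DELETE" and r["is_delete_marker"] is False)
    ]

    # Largest surviving sequence_number per (bucket, key).
    newest = {}
    for r in alive:
        k = (r["bucket"], r["key"])
        newest[k] = max(newest.get(k, ""), r["sequence_number"])

    out = [
        dict(r, is_latest_version=(r["sequence_number"] == newest[(r["bucket"], r["key"])]))
        for r in alive
    ]
    return sorted(out, key=lambda r: (r["bucket"], r["key"], r["sequence_number"]))


def rec(key, version_id, seq, record_type):
    # Hypothetical sample row with only the columns the sketch needs.
    return {"bucket": "b", "key": key, "version_id": version_id,
            "sequence_number": seq, "record_type": record_type,
            "is_delete_marker": False}


records = [
    rec("a.txt", "v1", "01", "CREATE"),
    rec("a.txt", "v2", "02", "CREATE"),
    rec("a.txt", "v1", "03", "DELETE"),   # v1 is permanently deleted
    rec("c.txt", None, "04", "CREATE"),
    rec("c.txt", None, "04", "CREATE"),   # duplicate delivery
]
result = current_state(records)
```

In this sample, only version v2 of a.txt and the single c.txt row survive, and both are flagged as the latest version of their key.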
Inventory table example queries
You can use the following example queries to query your inventory tables.
Discovering datasets that use specific tags
The following query returns datasets that use the specified tags:
SELECT *
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."inventory"
WHERE object_tags['key1'] = 'value1'
  AND object_tags['key2'] = 'value2';
Listing objects not encrypted with SSE-KMS
The following query returns objects that aren't encrypted with SSE-KMS:
SELECT key, encryption_status
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."inventory"
WHERE encryption_status != 'SSE-KMS';
Listing objects generated by Amazon Bedrock
The following query lists objects that were generated by Amazon Bedrock:
SELECT DISTINCT bucket, key, sequence_number, user_metadata
FROM "s3tablescatalog/aws-s3"."b_general-purpose-bucket-name"."inventory"
WHERE user_metadata['content-source'] = 'AmazonBedrock';
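Generating the latest inventory table
The following query generates the latest inventory table. It works only while your metadata configuration is active, and it requires both the journal table and the inventory table to be active. If your inventory table isn't up to date because of permissions or other issues, this query might not work.
We recommend using this query for general purpose buckets that contain fewer than one billion objects.
This query reconciles the contents of the inventory table with the recent events in the journal table. Once the journal has caught up with all the changes that occurred in your bucket, the query result matches the contents of the bucket.
This example limits the output to keys that end in .txt (the LIKE pattern '%.txt'). To query a different subset, adjust the common table expression named "working_set_of_interest".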
WITH inventory_time_cte AS (
    -- Reveal the extent of the journal table that has not yet been reflected in the inventory table.
    SELECT COALESCE(inventory_time_from_property, inventory_time_default) AS inventory_time
    FROM (
        SELECT * FROM
            -- The fallback default includes the entirety of the journal table.
            (VALUES (TIMESTAMP '2024-12-01 00:00')) AS T (inventory_time_default)
        LEFT OUTER JOIN
            -- This side queries the Iceberg table property and loads it up in
            -- a column. If the property doesn't exist, then you get 0 rows.
            (
                SELECT from_unixtime(CAST(value AS BIGINT)) AS inventory_time_from_property
                FROM "journal$properties"
                WHERE key = 'aws.s3metadata.oldest-uncoalesced-record-timestamp'
                LIMIT 1
            )
        -- Force an unequivocal join.
        ON TRUE
    )
),
-- Select only those journal table events not yet reflected in the inventory table.
my_new_events AS (
    SELECT journal.*
    FROM (
        journal JOIN inventory_time_cte
        -- Include only those rows that have yet to be merged with the inventory table.
        -- Allow some overlap to account for clock skew.
        ON record_timestamp > (inventory_time - interval '1' hour)
    )
),
-- Bring the "journal" and "inventory" table rows to a common inventory schema.
working_set AS (
    (
        SELECT
            -- Keep the inventory table columns, but drop these journal table columns:
            -- "record_type", "requester", "source_ip_address", "request_id"
            bucket, key, sequence_number, version_id, is_delete_marker, size,
            COALESCE(last_modified_date, record_timestamp) AS last_modified_date,
            e_tag, storage_class, is_multipart, encryption_status, is_bucket_key_enabled,
            kms_key_arn, checksum_algorithm, object_tags, user_metadata,
            -- Temporary columns required to align the two tables.
            record_timestamp AS _log_ts,
            (record_type = 'DELETE' AND NOT COALESCE(is_delete_marker, FALSE)) AS _is_perm_delete
        FROM my_new_events
    )
    UNION
    (
        SELECT *, last_modified_date AS _log_ts, FALSE AS _is_perm_delete
        FROM "inventory"
    )
),
-- You can apply a filter over key, tags, or metadata here to restrict your view to a subset of all keys.
working_set_of_interest AS (
    SELECT * FROM working_set
    WHERE key LIKE '%.txt'
),
most_recent_changes AS (
    -- For each (bucket, key, version_id) stack, find the event that should have
    -- been the ultimate to arrive in the journal table, and confine the results to the
    -- 1-hour window of events (for that key) that preceded that arrival.
    --
    -- This gives preferential treatment to events that arrived later in the journal table
    -- order, and handles cases with uploads that were completed much later after they were
    -- initiated.
    SELECT * FROM (
        SELECT *,
            -- Do not confuse this MAX() with the aggregate function. This is the MAX window function.
            MAX(_log_ts) OVER (PARTITION BY bucket, key, version_id) AS _supremum_ts
        FROM working_set_of_interest
    )
    WHERE _log_ts >= (_supremum_ts - interval '1' hour)
),
-- Among each "1-hour window of most recent mutations" for a given key, identify
-- the one that is reflected in the general purpose bucket.
updated_inventory AS (
    SELECT * FROM (
        SELECT *,
            MAX(sequence_number) OVER (PARTITION BY bucket, key, version_id) AS _supremum_sn
        FROM most_recent_changes
    )
    WHERE sequence_number = _supremum_sn
    -- Again here, use QUALIFY clause if your planner supports it.
)
-- Finally, project the resulting rows onto the inventory table schema.
SELECT bucket, key, sequence_number, version_id, is_delete_marker, size, last_modified_date,
       e_tag, storage_class, is_multipart, encryption_status, is_bucket_key_enabled,
       kms_key_arn, checksum_algorithm, object_tags, user_metadata
FROM updated_inventory
WHERE NOT _is_perm_delete
ORDER BY bucket, key ASC, sequence_number ASC
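The core of this reconciliation can be sketched in a few lines over in-memory rows: merge inventory rows with journal events, keep each stack's most recent 1-hour window, take the row with the largest sequence number, and drop permanent deletes. This is a simplified illustration only. The function name reconcile and the sample rows are hypothetical, and the sketch omits the inventory_time cutoff and the key filter that the query applies.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)


def reconcile(inventory_rows, journal_events):
    working = []
    for e in journal_events:
        # A DELETE that is not a delete marker removes the object version.
        perm = e["record_type"] == "DELETE" and not e.get("is_delete_marker")
        working.append({**e, "_log_ts": e["record_timestamp"], "_is_perm_delete": perm})
    for row in inventory_rows:
        working.append({**row, "_log_ts": row["last_modified_date"], "_is_perm_delete": False})

    # Group rows into (bucket, key, version_id) stacks.
    stacks = {}
    for r in working:
        stacks.setdefault((r["bucket"], r["key"], r.get("version_id")), []).append(r)

    result = []
    for rows in stacks.values():
        newest_ts = max(r["_log_ts"] for r in rows)
        # Keep only the 1-hour window that precedes the newest event.
        recent = [r for r in rows if r["_log_ts"] >= newest_ts - WINDOW]
        winner = max(recent, key=lambda r: r["sequence_number"])
        if not winner["_is_perm_delete"]:
            result.append(winner)
    return sorted(result, key=lambda r: (r["bucket"], r["key"], r["sequence_number"]))


# Hypothetical sample data.
inventory = [
    {"bucket": "b", "key": "a.txt", "version_id": None, "sequence_number": "01",
     "last_modified_date": datetime(2025, 1, 1, 10, 0)},
    {"bucket": "b", "key": "c.txt", "version_id": None, "sequence_number": "02",
     "last_modified_date": datetime(2025, 1, 1, 11, 0)},
]
journal = [
    {"bucket": "b", "key": "a.txt", "version_id": None, "sequence_number": "05",
     "record_type": "DELETE", "is_delete_marker": False,
     "record_timestamp": datetime(2025, 1, 2, 9, 0)},
    {"bucket": "b", "key": "b.txt", "version_id": None, "sequence_number": "06",
     "record_type": "CREATE", "is_delete_marker": False,
     "record_timestamp": datetime(2025, 1, 2, 9, 5)},
]
latest = reconcile(inventory, journal)
```

With this sample data, a.txt drops out because its newest event is a permanent delete, b.txt is added from the journal, and c.txt is carried over unchanged from the inventory.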