Amazon Athena TPC benchmark DS (TPC-DS) connector - Amazon Athena

Amazon Athena TPC benchmark DS (TPC-DS) connector

The Amazon Athena TPC-DS connector enables Amazon Athena to communicate with a source of randomly generated TPC Benchmark DS data for use in benchmarking and functional testing of Athena Federation. The Athena TPC-DS connector generates a TPC-DS compliant database at one of four scale factors. We do not recommend the use of this connector as an alternative to Amazon S3-based data lake performance tests.

Prerequisites

Parameters

Use the Lambda environment variables in this section to configure the TPC-DS connector.

  • spill_bucket – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.

  • spill_prefix – (Optional) Defaults to a subfolder in the specified spill_bucket called athena-federation-spill. We recommend that you configure an Amazon S3 storage lifecycle on this location to delete spills older than a predetermined number of days or hours.

  • spill_put_request_headers – (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see PutObject in the Amazon Simple Storage Service API Reference.

  • kms_key_id – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like a7e63k4b-8loc-40db-a2a1-4d0en2cd8331, you can specify a KMS key ID.

  • disable_spill_encryption – (Optional) When set to True, disables spill encryption. Defaults to False so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses server-side encryption.

Test databases and tables

The Athena TPC-DS connector generates a TPC-DS compliant database at one of the four scale factors tpcds1, tpcds10, tpcds100, tpcds250, or tpcds1000.

Summary of tables

For a complete list of the test data tables and columns, run the SHOW TABLES or DESCRIBE TABLE queries. The following summary of tables is provided for convenience.

  1. call_center

  2. catalog_page

  3. catalog_returns

  4. catalog_sales

  5. customer

  6. customer_address

  7. customer_demographics

  8. date_dim

  9. dbgen_version

  10. household_demographics

  11. income_band

  12. inventory

  13. item

  14. promotion

  15. reason

  16. ship_mode

  17. store

  18. store_returns

  19. store_sales

  20. time_dim

  21. warehouse

  22. web_page

  23. web_returns

  24. web_sales

  25. web_site

For TPC-DS queries that are compatible with this generated schema and data, see the athena-tpcds/src/main/resources/queries/ directory on GitHub.

Example query

The following SELECT query example queries the tpcds catalog for customer demographics in specific counties.

SELECT cd_gender, cd_marital_status, cd_education_status, count(*) cnt1, cd_purchase_estimate, count(*) cnt2, cd_credit_rating, count(*) cnt3, cd_dep_count, count(*) cnt4, cd_dep_employed_count, count(*) cnt5, cd_dep_college_count, count(*) cnt6 FROM "lambda:tpcds".tpcds1.customer c, "lambda:tpcds".tpcds1.customer_address ca, "lambda:tpcds".tpcds1.customer_demographics WHERE c.c_current_addr_sk = ca.ca_address_sk AND ca_county IN ('Rush County', 'Toole County', 'Jefferson County', 'Dona Ana County', 'La Porte County') AND cd_demo_sk = c.c_current_cdemo_sk AND exists(SELECT * FROM "lambda:tpcds".tpcds1.store_sales, "lambda:tpcds".tpcds1.date_dim WHERE c.c_customer_sk = ss_customer_sk AND ss_sold_date_sk = d_date_sk AND d_year = 2002 AND d_moy BETWEEN 1 AND 1 + 3) AND (exists(SELECT * FROM "lambda:tpcds".tpcds1.web_sales, "lambda:tpcds".tpcds1.date_dim WHERE c.c_customer_sk = ws_bill_customer_sk AND ws_sold_date_sk = d_date_sk AND d_year = 2002 AND d_moy BETWEEN 1 AND 1 + 3) OR exists(SELECT * FROM "lambda:tpcds".tpcds1.catalog_sales, "lambda:tpcds".tpcds1.date_dim WHERE c.c_customer_sk = cs_ship_customer_sk AND cs_sold_date_sk = d_date_sk AND d_year = 2002 AND d_moy BETWEEN 1 AND 1 + 3)) GROUP BY cd_gender, cd_marital_status, cd_education_status, cd_purchase_estimate, cd_credit_rating, cd_dep_count, cd_dep_employed_count, cd_dep_college_count ORDER BY cd_gender, cd_marital_status, cd_education_status, cd_purchase_estimate, cd_credit_rating, cd_dep_count, cd_dep_employed_count, cd_dep_college_count LIMIT 100

Required Permissions

Review the Policies section of the athena-tpcds.yaml file for full details on the IAM policies that this connector requires. The following is a brief summary.

  • Amazon S3 write access – To successfully handle large queries, the connector requires write access to a location in Amazon S3.

  • Athena GetQueryExecution – The connector uses this permission to fast-fail when the upstream Athena query has terminated.

Performance

The Athena TPC-DS connector attempts to parallelize queries based on the scale factor that you choose. Predicate pushdown is performed within the Lambda function.

License information

The Amazon Athena TPC-DS connector project is licensed under the Apache-2.0 License.

See also

For additional information about this connector, visit the corresponding site on GitHub.com.