AWS Glue Data Catalog views - Amazon Redshift

AWS Glue Data Catalog views

This topic describes how to create views in the AWS Glue Data Catalog. You can use views in the Data Catalog to access data in different data sources using the same schema.

By creating views in the Data Catalog, you can create a single common view schema and metadata object to use across engines such as Amazon Athena and Amazon EMR Spark. Doing so lets you use the same views across your data lakes and data warehouses to fit your use cases. Views in the Data Catalog are special in that they are categorized as definer views, where access permissions are defined by the user who created the view instead of the user querying the view. The following are some use cases and benefits of creating views in the Data Catalog:

  • Create a view that restricts data access based on the permissions the user needs. For example, you can use views in the Data Catalog to prevent employees who don’t work in the HR department from seeing personally identifiable information (PII).

  • Make sure that users can’t access incomplete records. By applying certain filters onto your view in the Data Catalog, you make sure that data records inside a view in the Data Catalog are always complete.

  • Data Catalog views have an included security benefit of making sure that the query definition used to create the view must complete to create the view. This security benefit means that views in the Data Catalog are not susceptible to SQL commands from malicious players.

  • Views in the Data Catalog support the same advantages as normal views, such as letting users access a view without making the underlying table available to users.

To create a view in the Data Catalog, you must have a Spectrum external table, an object that’s contained within a Lake Formation-managed datashare, or an Apache Iceberg table.

Definitions of Data Catalog views are stored in the AWS Glue Data Catalog. Use AWS Lake Formation to grant access through resource grants, column grants, or tag-based access controls. For more information about granting and revoking access in Lake Formation, see Granting and revoking permissions on Data Catalog resources.

Prerequisites

Before you can create a view in the Data Catalog, make sure that you have the following prerequisites completed:

  • Make sure that your IAM role has the following trust policy.

    { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "glue.amazonaws.com", "lakeformation.amazonaws.com" ] }, "Action": "sts:AssumeRole" } ] }
  • You also need the following pass role policy.

    { "Version": "2012-10-17", "Statement": [ { "Sid": "Stmt1", "Action": [ "iam:PassRole" ], "Effect": "Allow", "Resource": "*", "Condition": { "StringEquals": { "iam:PassedToService": [ "glue.amazonaws.com", "lakeformation.amazonaws.com" ] } } } ] }
  • Finally, you also need the following permissions.

    • Glue:GetDatabase

    • Glue:GetDatabases

    • Glue:CreateTable

    • Glue:GetTable

    • Glue:UpdateTable

    • Glue:DeleteTable

    • Glue:GetTables

    • Glue:SearchTables

    • Glue:BatchGetPartition

    • Glue:GetPartitions

    • Glue:GetPartition

    • Glue:GetTableVersion

    • Glue:GetTableVersions

End-to-end example

Start by creating an external schema based on your Data Catalog database.

CREATE EXTERNAL SCHEMA IF NOT EXISTS external_schema FROM DATA CATALOG DATABASE 'external_data_catalog_db' IAM_ROLE 'arn:aws:iam::123456789012:role/sample-role';

You can now create a Data Catalog view.

CREATE EXTERNAL PROTECTED VIEW external_schema.remote_view AS SELECT * FROM external_schema.remote_table;

You can then start querying your view.

SELECT * FROM external_schema.remote_view;

For more information about the SQL commands related to views in the Data Catalog, see CREATE EXTERNAL VIEW, ALTER EXTERNAL VIEW, and DROP EXTERNAL VIEW.

Secure logging

Redshift masks metadata logged in the Redshift system logs when a query references a multi-dialect glue view. Multi-dialect means that the view supports the SQL dialects of different query engines, such as Redshift and Amazon EMR. Data in the following tables is masked for all queries with the same query id. The following table lists system views and columns with secure logging applied.

System table Sensitive columns
SYS_EXTERNAL_QUERY_DETAIL

Columns: source_type, total_partitions, qualified_partitions, scanned_files, returned_rows, returned_bytes, file_format, file_location, external_query_text, warning_message. For more information, see SYS_EXTERNAL_QUERY_DETAIL.

SYS_EXTERNAL_QUERY_ERROR

Columns: file_location, rowid, column_name, original_value, modified_value, trigger, action, action_value, error_code. For more information, see SYS_EXTERNAL_QUERY_ERROR.

SYS_QUERY_DETAIL

Columns: step_name, table_id, table_name, input_bytes, input_rows, output_bytes, output_rows, blocks_read, blocks_write, local_read_IO, remote_read_IO, spilled_block_local_disk, spilled_block_remote_disk. For more information, see SYS_QUERY_DETAIL.

SYS_QUERY_HISTORY

Columns: returned_rows, returned_bytes. For more information, see SYS_QUERY_HISTORY.

STL_AGGR

Columns: rows, bytes, tbl, type. For more information, see STL_AGGR.

STL_BCAST

Columns: rows, bytes, packets. For more information, see STL_BCAST.

STL_DDLTEXT

Columns: text. For more information, see STL_DDLTEXT.

STL_DELETE

Columns: rows, tbl. For more information, see STL_DELETE.

STL_DIST

Columns: rows, bytes, packets. For more information, see STL_DIST.

STL_EXPLAIN

Columns: plannode, info. For more information, see STL_EXPLAIN.

STL_HASH

Columns: rows, bytes, tbl, est_rows. For more information, see STL_HASH.

STL_HASHJOIN

Columns: rows, tbl, num_parts, join_type. For more information, see STL_HASHJOIN.

STL_INSERT

Columns: rows, tbl. For more information, see STL_INSERT.

STL_LIMIT

Columns: rows. For more information, see STL_LIMIT.

STL_MERGE

Columns: rows. For more information, see STL_MERGE.

STL_MERGEJOIN

Columns: rows, tbl. For more information, see STL_MERGEJOIN.

STL_NESTLOOP

Columns: rows, tbl. For more information, see STL_NESTLOOP.

STL_PARSE

Columns: rows. For more information, see STL_PARSE.

STL_PLAN_INFO

Columns: rows, bytes. For more information, see STL_PLAN_INFO.

STL_PROJECT

Columns: rows, tbl. For more information, see STL_PROJECT.

STL_QUERY

Columns: querytxt. For more information, see STL_QUERY.

STL_QUERY_METRICS

Columns: max_rows, rows, max_blocks_read, blocks_read, max_blocks_to_disk, blocks_to_disk, max_query_scan_size, query_scan_size. For more information, see STL_QUERY_METRICS.

STL_QUERYTEXT

Columns: text. For more information, see STL_QUERYTEXT.

STL_RETURN

Columns: rows, bytes. For more information, see STL_RETURN.

STL_SAVE

Columns: rows, bytes, tbl. For more information, see STL_SAVE.

STL_SCAN

Columns: rows, bytes, fetches, type, tbl, rows_pre_filter, perm_table_name, scanned_mega_value. For more information, see STL_SCAN.

STL_SORT

Columns: rows, bytes, tbl. For more information, see STL_SORT.

STL_TR_CONFLICT

Columns: table_id. For more information, see STL_TR_CONFLICT.

STL_UNDONE

Columns: table_id. For more information, see STL_UNDONE.

STL_UNIQUE

Columns: rows, type, bytes. For more information, see STL_UNIQUE.

STL_UTILITYTEXT

Columns: text. For more information, see STL_UTILITYTEXT.

STL_WINDOW

Columns: rows. For more information, see STL_WINDOW.

STV_BLOCKLIST

Columns: col, tbl, num_values, minvalue, maxvalue. For more information, see STV_BLOCKLIST.

STV_EXEC_STATE

Columns: rows, bytes, label. For more information, see STV_EXEC_STATE.

STV_LOCKS

Columns: table_id. For more information, see STV_LOCKS.

STV_QUERY_METRICS

Columns: rows, max_rows, blocks_read, max_blocks_read, max_blocks_to_disk, blocks_to_disk, max_query_scan_size, query_scan_size. For more information, see STV_QUERY_METRICS.

STV_STARTUP_RECOVERY_STATE

Columns: table_id, table_name. For more information, see STV_STARTUP_RECOVERY_STATE.

STV_TBL_PERM

Columns: id, rows, sorted_rows, temp, block_count, query_scan_size. For more information, see STV_TBL_PERM.

STV_TBL_TRANS

Columns: id, rows, size. For more information, see STV_TBL_TRANS.

SVCS_EXPLAIN

Columns: plannode, info. For more information, see SVCS_EXPLAIN.

SVCS_PLAN_INFO

Columns: rows, bytes. For more information, see SVCS_PLAN_INFO.

SVCS_QUERY_SUMMARY

Columns: step, rows, bytes, rate_row, rate_byte, label, rows_pre_filter. For more information, see SVCS_QUERY_SUMMARY.

SVCS_S3LIST

Columns: bucket, prefix, max_file_size, avg_file_size. For more information, see SVCS_QUERY_SUMMARY.

SVCS_S3LOG

Columns: message. For more information, see SVCS_QUERY_SUMMARY.

SVCS_S3PARTITION_SUMMARY

Columns: total_partitions, qualified_partitions, min_assigned_partitions, max_assigned_partitions, avg_assigned_partitions. For more information, see SVCS_S3PARTITION_SUMMARY.

SVCS_S3QUERY_SUMMARY

Columns: external_table_name, file_format, s3_scanned_rows, s3_scanned_bytes, s3query_returned_rows, s3query_returned_bytes. For more information, see SVCS_S3QUERY_SUMMARY.

SVL_QUERY_METRICS

Columns: step_label, scan_row_count, join_row_count, nested_loop_join_row_count, return_row_count, spectrum_scan_row_count, spectrum_scan_size_mb. For more information, see SVL_QUERY_METRICS.

SVL_QUERY_METRICS_SUMMARY

Columns: step_label, scan_row_count, join_row_count, nested_loop_join_row_count, return_row_count, spectrum_scan_row_count, spectrum_scan_size_mb. For more information, see SVL_QUERY_METRICS_SUMMARY.

SVL_QUERY_REPORT

Columns: rows, bytes, label, rows_pre_filter. For more information, see SVL_QUERY_REPORT.

SVL_QUERY_SUMMARY

Columns: rows, bytes, rows_pre_filter. For more information, see SVL_QUERY_SUMMARY.

SVL_S3LIST

Columns: bucket, prefix, max_file_size, avg_file_size. For more information, see SVL_S3LIST.

SVL_S3LOG

Columns: message. For more information, see SVL_S3LOG.

SVL_S3PARTITION

Columns: rows, bytes, label, rows_pre_filter. For more information, see SVL_S3PARTITION.

SVL_S3PARTITION_SUMMARY

Columns: total_partitions, qualified_partitions, min_assigned_partitions, max_assigned_partitions, avg_assigned_partitions. For more information, see SVL_S3PARTITION_SUMMARY.

SVL_S3QUERY

Columns: external_table_name, file_format, s3_scanned_rows, s3_scanned_bytes, s3query_returned_rows, s3query_returned_bytes. For more information, see SVL_S3QUERY.

SVL_S3QUERY_SUMMARY

Columns: external_table_name, file_format, s3_scanned_rows, s3_scanned_bytes, s3query_returned_rows, s3query_returned_bytes. For more information, see SVL_S3QUERY_SUMMARY.

SVL_S3RETRIES

Columns: file_size, location, message. For more information, see SVL_S3RETRIES.

SVL_SPECTRUM_SCAN_ERROR

Columns: location, rowid, colname, original_value, modified_value. For more information, see SVL_SPECTRUM_SCAN_ERROR.

SVL_STATEMENTTEXT

Columns: type, text. For more information, see SVL_STATEMENTTEXT.

SVL_STORED_PROC_CALL

Columns: querytxt. For more information, see SVL_STORED_PROC_CALL.

SVL_STORED_PROC_MESSAGES

Columns: querytext. For more information, see SVL_STORED_PROC_MESSAGES.

SVL_UDF_LOG

Columns: funcname. For more information, see SVL_UDF_LOG.

SVV_DISKUSAGE

Columns: name, col, tbl, blocknum, num_values, minvalue, maxvalue. For more information, see SVV_DISKUSAGE.

SVV_QUERY_STATE

Columns: rows, bytes, label. For more information, see SVV_QUERY_STATE.

SVV_TABLE_INFO

Columns: table_id, table. For more information, see SVV_TABLE_INFO.

SVV_TRANSACTIONS

Columns: relation. For more information, see SVV_TRANSACTIONS.

Considerations and limitations

The following are considerations and limitations that apply to views created in the Data Catalog.

  • You can’t create a Data Catalog view that is based off of another view.

  • You can only have 10 base tables in a Data Catalog view.

  • The definer of the view must have full SELECT GRANTABLE permissions on the base tables.

  • Views can only contain Lake Formation objects and built-ins. The following objects are not permitted inside of a view.

    • System tables

    • User-defined functions (UDFs)

    • Redshift tables, views, materialized views, and late binding views that aren’t in a Lake Formation managed data share.

  • Views can’t contain nested Redshift Spectrum tables.

  • AWS Glue representations of the base objects of a view must be in the same AWS account and Region as the view.