AWS Glue Data Catalog views
This topic describes how to create views in the AWS Glue Data Catalog. You can use views in the Data Catalog to access data in different data sources using the same schema.
By creating views in the Data Catalog, you can create a single common view schema and metadata object to use across engines such as Amazon Athena and Amazon EMR Spark. Doing so lets you use the same views across your data lakes and data warehouses to fit your use cases. Views in the Data Catalog are special in that they are categorized as definer views, where access permissions are defined by the user who created the view instead of the user querying the view. The following are some use cases and benefits of creating views in the Data Catalog:
- Create a view that restricts data access based on the permissions that each user needs. For example, you can use views in the Data Catalog to prevent employees who don't work in the HR department from seeing personally identifiable information (PII).
- Make sure that users can't access incomplete records. By applying filters to your view in the Data Catalog, you make sure that data records inside the view are always complete.
- Benefit from built-in security: the query definition used to create the view must complete successfully at creation time, so views in the Data Catalog aren't susceptible to SQL commands from malicious actors.
- Take advantage of the same benefits as normal views, such as letting users query a view without making the underlying base tables available to them.
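As a sketch of the first use case, the following view exposes only non-sensitive columns. The schema, table, and column names here are illustrative assumptions, not objects in your account; the syntax follows the `CREATE EXTERNAL PROTECTED VIEW` form shown later in this topic.

```sql
-- Hypothetical example: expose only non-PII columns to general users.
-- external_schema.employees and its columns are illustrative names.
CREATE EXTERNAL PROTECTED VIEW external_schema.employee_directory AS
SELECT employee_id, department, job_title  -- no name, address, or salary columns
FROM external_schema.employees;
```

Because access permissions are defined by the view's creator, users granted access to this view can query it without ever holding permissions on the underlying table.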
To create a view in the Data Catalog, you must have an Amazon Redshift Spectrum external table, an object that's contained within a Lake Formation-managed datashare, or an Apache Iceberg table.
Definitions of Data Catalog views are stored in the AWS Glue Data Catalog. Use AWS Lake Formation to grant access through resource grants, column grants, or tag-based access controls. For more information about granting and revoking access in Lake Formation, see Granting and revoking permissions on Data Catalog resources.
Prerequisites
Before you can create a view in the Data Catalog, make sure that you have the following prerequisites completed:
- Make sure that your IAM role has the following trust policy.

  ```json
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": {
                  "Service": [
                      "glue.amazonaws.com",
                      "lakeformation.amazonaws.com"
                  ]
              },
              "Action": "sts:AssumeRole"
          }
      ]
  }
  ```
- You also need the following pass role policy.

  ```json
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Sid": "Stmt1",
              "Action": [
                  "iam:PassRole"
              ],
              "Effect": "Allow",
              "Resource": "*",
              "Condition": {
                  "StringEquals": {
                      "iam:PassedToService": [
                          "glue.amazonaws.com",
                          "lakeformation.amazonaws.com"
                      ]
                  }
              }
          }
      ]
  }
  ```
- Finally, you also need the following AWS Glue permissions.

  - glue:GetDatabase
  - glue:GetDatabases
  - glue:CreateTable
  - glue:GetTable
  - glue:UpdateTable
  - glue:DeleteTable
  - glue:GetTables
  - glue:SearchTables
  - glue:BatchGetPartition
  - glue:GetPartitions
  - glue:GetPartition
  - glue:GetTableVersion
  - glue:GetTableVersions
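Taken together, these permissions can be attached as a single identity-based policy. The following is a minimal sketch: the `Resource` value is intentionally broad, and you may want to scope it down to specific catalog, database, and table ARNs for your account.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:CreateTable",
                "glue:GetTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetTables",
                "glue:SearchTables",
                "glue:BatchGetPartition",
                "glue:GetPartitions",
                "glue:GetPartition",
                "glue:GetTableVersion",
                "glue:GetTableVersions"
            ],
            "Resource": "*"
        }
    ]
}
```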
End-to-end example
Start by creating an external schema based on your Data Catalog database.

```sql
CREATE EXTERNAL SCHEMA IF NOT EXISTS external_schema
FROM DATA CATALOG
DATABASE 'external_data_catalog_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/sample-role';
```
You can now create a Data Catalog view.

```sql
CREATE EXTERNAL PROTECTED VIEW external_schema.remote_view AS
SELECT * FROM external_schema.remote_table;
```
You can then start querying your view.

```sql
SELECT * FROM external_schema.remote_view;
```
For more information about the SQL commands related to views in the Data Catalog, see CREATE EXTERNAL VIEW, ALTER EXTERNAL VIEW, and DROP EXTERNAL VIEW.
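Building on this example, you can also apply a filter in the view definition so that consumers only ever see complete records, as described earlier in this topic. The following is a sketch that assumes `remote_table` has a `status` column; that column name is illustrative, so substitute one from your own table.

```sql
-- Hypothetical: expose only finalized records through the view.
-- The status column is an assumed example, not part of the sample table above.
CREATE EXTERNAL PROTECTED VIEW external_schema.complete_records AS
SELECT *
FROM external_schema.remote_table
WHERE status = 'COMPLETE';
```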
Secure logging
Amazon Redshift masks metadata logged in the Redshift system logs when a query references a multi-dialect AWS Glue Data Catalog view. Multi-dialect means that the view supports the SQL dialects of different query engines, such as Amazon Redshift and Amazon EMR Spark. Data in the following tables is masked for all queries that share the same query ID. The following table lists the system views and columns that have secure logging applied.
| System table | Sensitive columns |
| --- | --- |
SYS_EXTERNAL_QUERY_DETAIL | Columns: source_type, total_partitions, qualified_partitions, scanned_files, returned_rows, returned_bytes, file_format, file_location, external_query_text, warning_message. For more information, see SYS_EXTERNAL_QUERY_DETAIL. |
SYS_EXTERNAL_QUERY_ERROR | Columns: file_location, rowid, column_name, original_value, modified_value, trigger, action, action_value, error_code. For more information, see SYS_EXTERNAL_QUERY_ERROR. |
SYS_QUERY_DETAIL | Columns: step_name, table_id, table_name, input_bytes, input_rows, output_bytes, output_rows, blocks_read, blocks_write, local_read_IO, remote_read_IO, spilled_block_local_disk, spilled_block_remote_disk. For more information, see SYS_QUERY_DETAIL. |
SYS_QUERY_HISTORY | Columns: returned_rows, returned_bytes. For more information, see SYS_QUERY_HISTORY. |
STL_AGGR | Columns: rows, bytes, tbl, type. For more information, see STL_AGGR. |
STL_BCAST | Columns: rows, bytes, packets. For more information, see STL_BCAST. |
STL_DDLTEXT | Columns: text. For more information, see STL_DDLTEXT. |
STL_DELETE | Columns: rows, tbl. For more information, see STL_DELETE. |
STL_DIST | Columns: rows, bytes, packets. For more information, see STL_DIST. |
STL_EXPLAIN | Columns: plannode, info. For more information, see STL_EXPLAIN. |
STL_HASH | Columns: rows, bytes, tbl, est_rows. For more information, see STL_HASH. |
STL_HASHJOIN | Columns: rows, tbl, num_parts, join_type. For more information, see STL_HASHJOIN. |
STL_INSERT | Columns: rows, tbl. For more information, see STL_INSERT. |
STL_LIMIT | Columns: rows. For more information, see STL_LIMIT. |
STL_MERGE | Columns: rows. For more information, see STL_MERGE. |
STL_MERGEJOIN | Columns: rows, tbl. For more information, see STL_MERGEJOIN. |
STL_NESTLOOP | Columns: rows, tbl. For more information, see STL_NESTLOOP. |
STL_PARSE | Columns: rows. For more information, see STL_PARSE. |
STL_PLAN_INFO | Columns: rows, bytes. For more information, see STL_PLAN_INFO. |
STL_PROJECT | Columns: rows, tbl. For more information, see STL_PROJECT. |
STL_QUERY | Columns: querytxt. For more information, see STL_QUERY. |
STL_QUERY_METRICS | Columns: max_rows, rows, max_blocks_read, blocks_read, max_blocks_to_disk, blocks_to_disk, max_query_scan_size, query_scan_size. For more information, see STL_QUERY_METRICS. |
STL_QUERYTEXT | Columns: text. For more information, see STL_QUERYTEXT. |
STL_RETURN | Columns: rows, bytes. For more information, see STL_RETURN. |
STL_SAVE | Columns: rows, bytes, tbl. For more information, see STL_SAVE. |
STL_SCAN | Columns: rows, bytes, fetches, type, tbl, rows_pre_filter, perm_table_name, scanned_mega_value. For more information, see STL_SCAN. |
STL_SORT | Columns: rows, bytes, tbl. For more information, see STL_SORT. |
STL_TR_CONFLICT | Columns: table_id. For more information, see STL_TR_CONFLICT. |
STL_UNDONE | Columns: table_id. For more information, see STL_UNDONE. |
STL_UNIQUE | Columns: rows, type, bytes. For more information, see STL_UNIQUE. |
STL_UTILITYTEXT | Columns: text. For more information, see STL_UTILITYTEXT. |
STL_WINDOW | Columns: rows. For more information, see STL_WINDOW. |
STV_BLOCKLIST | Columns: col, tbl, num_values, minvalue, maxvalue. For more information, see STV_BLOCKLIST. |
STV_EXEC_STATE | Columns: rows, bytes, label. For more information, see STV_EXEC_STATE. |
STV_LOCKS | Columns: table_id. For more information, see STV_LOCKS. |
STV_QUERY_METRICS | Columns: rows, max_rows, blocks_read, max_blocks_read, max_blocks_to_disk, blocks_to_disk, max_query_scan_size, query_scan_size. For more information, see STV_QUERY_METRICS. |
STV_STARTUP_RECOVERY_STATE | Columns: table_id, table_name. For more information, see STV_STARTUP_RECOVERY_STATE. |
STV_TBL_PERM | Columns: id, rows, sorted_rows, temp, block_count, query_scan_size. For more information, see STV_TBL_PERM. |
STV_TBL_TRANS | Columns: id, rows, size. For more information, see STV_TBL_TRANS. |
SVCS_EXPLAIN | Columns: plannode, info. For more information, see SVCS_EXPLAIN. |
SVCS_PLAN_INFO | Columns: rows, bytes. For more information, see SVCS_PLAN_INFO. |
SVCS_QUERY_SUMMARY | Columns: step, rows, bytes, rate_row, rate_byte, label, rows_pre_filter. For more information, see SVCS_QUERY_SUMMARY. |
SVCS_S3LIST | Columns: bucket, prefix, max_file_size, avg_file_size. For more information, see SVCS_S3LIST. |
SVCS_S3LOG | Columns: message. For more information, see SVCS_S3LOG. |
SVCS_S3PARTITION_SUMMARY | Columns: total_partitions, qualified_partitions, min_assigned_partitions, max_assigned_partitions, avg_assigned_partitions. For more information, see SVCS_S3PARTITION_SUMMARY. |
SVCS_S3QUERY_SUMMARY | Columns: external_table_name, file_format, s3_scanned_rows, s3_scanned_bytes, s3query_returned_rows, s3query_returned_bytes. For more information, see SVCS_S3QUERY_SUMMARY. |
SVL_QUERY_METRICS | Columns: step_label, scan_row_count, join_row_count, nested_loop_join_row_count, return_row_count, spectrum_scan_row_count, spectrum_scan_size_mb. For more information, see SVL_QUERY_METRICS. |
SVL_QUERY_METRICS_SUMMARY | Columns: step_label, scan_row_count, join_row_count, nested_loop_join_row_count, return_row_count, spectrum_scan_row_count, spectrum_scan_size_mb. For more information, see SVL_QUERY_METRICS_SUMMARY. |
SVL_QUERY_REPORT | Columns: rows, bytes, label, rows_pre_filter. For more information, see SVL_QUERY_REPORT. |
SVL_QUERY_SUMMARY | Columns: rows, bytes, rows_pre_filter. For more information, see SVL_QUERY_SUMMARY. |
SVL_S3LIST | Columns: bucket, prefix, max_file_size, avg_file_size. For more information, see SVL_S3LIST. |
SVL_S3LOG | Columns: message. For more information, see SVL_S3LOG. |
SVL_S3PARTITION | Columns: rows, bytes, label, rows_pre_filter. For more information, see SVL_S3PARTITION. |
SVL_S3PARTITION_SUMMARY | Columns: total_partitions, qualified_partitions, min_assigned_partitions, max_assigned_partitions, avg_assigned_partitions. For more information, see SVL_S3PARTITION_SUMMARY. |
SVL_S3QUERY | Columns: external_table_name, file_format, s3_scanned_rows, s3_scanned_bytes, s3query_returned_rows, s3query_returned_bytes. For more information, see SVL_S3QUERY. |
SVL_S3QUERY_SUMMARY | Columns: external_table_name, file_format, s3_scanned_rows, s3_scanned_bytes, s3query_returned_rows, s3query_returned_bytes. For more information, see SVL_S3QUERY_SUMMARY. |
SVL_S3RETRIES | Columns: file_size, location, message. For more information, see SVL_S3RETRIES. |
SVL_SPECTRUM_SCAN_ERROR | Columns: location, rowid, colname, original_value, modified_value. For more information, see SVL_SPECTRUM_SCAN_ERROR. |
SVL_STATEMENTTEXT | Columns: type, text. For more information, see SVL_STATEMENTTEXT. |
SVL_STORED_PROC_CALL | Columns: querytxt. For more information, see SVL_STORED_PROC_CALL. |
SVL_STORED_PROC_MESSAGES | Columns: querytext. For more information, see SVL_STORED_PROC_MESSAGES. |
SVL_UDF_LOG | Columns: funcname. For more information, see SVL_UDF_LOG. |
SVV_DISKUSAGE | Columns: name, col, tbl, blocknum, num_values, minvalue, maxvalue. For more information, see SVV_DISKUSAGE. |
SVV_QUERY_STATE | Columns: rows, bytes, label. For more information, see SVV_QUERY_STATE. |
SVV_TABLE_INFO | Columns: table_id, table. For more information, see SVV_TABLE_INFO. |
SVV_TRANSACTIONS | Columns: relation. For more information, see SVV_TRANSACTIONS. |
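If you want to confirm how secure logging affects your workload, you can inspect a few of the affected columns for recent queries. The following is a sketch using SYS_QUERY_HISTORY; how masked values are rendered depends on your Redshift version.

```sql
-- Inspect secure-logged columns for recent queries.
-- For queries against a multi-dialect Data Catalog view,
-- returned_rows and returned_bytes are masked.
SELECT query_id, start_time, returned_rows, returned_bytes
FROM SYS_QUERY_HISTORY
ORDER BY start_time DESC
LIMIT 10;
```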
Considerations and limitations
The following are considerations and limitations that apply to views created in the Data Catalog.
- You can't create a Data Catalog view that is based on another view.
- A Data Catalog view can reference a maximum of 10 base tables.
- The definer of the view must have full SELECT GRANTABLE permissions on the base tables.
- Views can only contain Lake Formation objects and built-in functions. The following objects aren't permitted inside a view:
  - System tables
  - User-defined functions (UDFs)
  - Redshift tables, views, materialized views, and late-binding views that aren't in a Lake Formation-managed datashare
- Views can't contain nested Redshift Spectrum tables.
- The AWS Glue representations of the base objects of a view must be in the same AWS account and AWS Region as the view.