Centralized catalog - AWS Prescriptive Guidance

Centralized catalog

The following diagram shows how the centralized catalog connects data producers and data consumers in the data lake.


The centralized catalog stores and manages the shared data catalog for the data producer accounts. The centralized catalog also hosts the shared data's technical metadata (for example, table name and schema) and is the location where data consumers come to access data.

Data consumers can access data from multiple data producers in the centralized catalog and can then mix this data with their own data for further processing. Using a centralized catalog removes the need for data consumers to directly connect with different data producers and reduces operational overhead.

Because the centralized catalog has visibility into data sharing and data consumption by data producers and consumers, it can be an ideal location to apply your centralized data governance functions (for example, access auditing).

The following sections describe how the centralized catalog uses AWS Lake Formation and AWS Glue.

AWS Lake Formation

AWS Lake Formation helps you create databases in an AWS Glue Data Catalog that point to the locations of multiple data producers in your data lake. An AWS Identity and Access Management (IAM) role is created for Lake Formation in the centralized catalog. By using Lake Formation, the centralized catalog can selectively share data resources (for example, databases, tables, or columns) with data consumers. Resources that Lake Formation manages are shared with data consumers by using one of the following two methods:

  • Named resource method – This method shares managed resources across accounts. Databases, tables, or column names must be specified, and a resource can be shared with an organization, organizational unit (OU), or AWS account. To reduce sharing and management overhead, we recommend that you share resources at higher levels where possible (for example, with an organization or OU instead of an individual AWS account). However, you must make sure that this approach meets your organization's data security control requirements.

    • Note: This method works well for data consumers with an application type, where AWS services consume data from the data producer. The data access requirement from this type of data consumer is application-driven, prescriptive, and relatively static.

  • Lake Formation tag-based access control (LF-TBAC) method – LF-TBAC is particularly useful for data consumers with a data-serving type. However, Lake Formation tagged resources can currently only be shared at the AWS account level and not at the organization or OU level.
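As a minimal sketch of the named resource method, the following code builds the request for the `GrantPermissions` Lake Formation API call with the AWS SDK for Python (Boto3). The account ID, database, and table names are placeholders, and the permission set shown is an assumption; adjust them for your environment.

```python
def build_named_resource_grant(consumer_account_id, database_name, table_name):
    """Build the request for lakeformation.grant_permissions that shares one
    table with a data consumer account (named resource method)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {
            "Table": {
                "DatabaseName": database_name,
                "Name": table_name,
            }
        },
        "Permissions": ["SELECT", "DESCRIBE"],
        # Lets the consumer account re-grant access to principals in its account.
        "PermissionsWithGrantOption": ["SELECT", "DESCRIBE"],
    }


# Placeholder values; replace with your consumer account ID and resource names.
request = build_named_resource_grant("111122223333", "salesco-prod-crm", "orders")

# To execute (requires boto3 and Lake Formation admin permissions):
# import boto3
# boto3.client("lakeformation").grant_permissions(**request)
```

For the LF-TBAC method, the same API call is used but with an `LFTag` or `LFTagPolicy` entry in `Resource` instead of a named `Table`.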

AWS Glue

You must create databases in AWS Glue for each data producer in your centralized catalog. Because the centralized catalog uses AWS Glue to host databases from all data producers, you must make sure that the database name is unique across all data producers and that it reflects the data producer and their type of data. For example, you can use the following database naming structure: <Data_Producer>-<Environment>-<Data_Group>

  • <Data_Producer> – The data producer’s name.

  • <Environment> – The data lake environment, such as dev for a development environment, sit for a system integration test environment, or prod for a production environment.

  • <Data_Group> – The name of the data group that is used to separate data from a data producer into logical groups. You can use the source system name, ID, or abbreviation as the name. Adding a database description also helps convey the content and purpose of the database.
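The naming convention above can be sketched as a small helper, shown here with a hypothetical producer "SalesCo" and data group "CRM" (the names and the lowercase normalization are illustrative assumptions, not part of the guidance):

```python
def glue_database_name(data_producer, environment, data_group):
    """Compose a database name as <Data_Producer>-<Environment>-<Data_Group>.

    Names are normalized to lowercase here; some query engines that read the
    Glue Data Catalog are easiest to work with when names stay lowercase.
    """
    parts = (data_producer, environment, data_group)
    return "-".join(part.strip().lower() for part in parts)


# Example: the CRM data group of producer "SalesCo" in production.
params = {
    "DatabaseInput": {
        "Name": glue_database_name("SalesCo", "prod", "CRM"),  # salesco-prod-crm
        "Description": "CRM data shared by SalesCo (production).",
    }
}

# To create the database in the centralized catalog (requires boto3):
# import boto3
# boto3.client("glue").create_database(**params)
```

Keeping the name generation in one place helps enforce uniqueness of database names across all data producers.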

You can use an AWS Glue crawler on the data producer's data to maintain its schema in the centralized catalog's database. If a data producer creates data regularly at a single frequency, a single AWS Glue crawler is sufficient. In all other cases, use multiple AWS Glue crawlers to accommodate the different crawling frequencies. Depending on your business use case, a crawler can either run on a predefined schedule or be initiated by events.
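As an illustrative sketch (the crawler name, IAM role ARN, and S3 path are placeholders), the following builds the parameters for the Glue `CreateCrawler` API call. Passing a cron-style `Schedule` gives a crawler that runs at a fixed frequency; omitting it leaves the crawler to be started on demand, for example by an event:

```python
def build_crawler_config(name, role_arn, database_name, s3_path, schedule=None):
    """Build parameters for glue.create_crawler. With schedule=None the
    crawler only runs when started explicitly (for example, by an event)."""
    config = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database_name,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:
        config["Schedule"] = schedule  # e.g. "cron(0 2 * * ? *)" = daily, 02:00 UTC
    return config


# Placeholder values: a daily crawler for one data group's S3 location.
config = build_crawler_config(
    "salesco-prod-crm-crawler",
    "arn:aws:iam::111122223333:role/GlueCrawlerRole",
    "salesco-prod-crm",
    "s3://example-data-lake/salesco/crm/",
    schedule="cron(0 2 * * ? *)",
)

# To create the crawler (requires boto3 and appropriate permissions):
# import boto3
# boto3.client("glue").create_crawler(**config)
```

Creating one such configuration per data group makes it straightforward to give each group its own crawling frequency.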

You can also maintain table schema in AWS Glue by calling the AWS Glue API to create or update the schema. Although this can provide flexibility, additional effort is required for code development and maintenance. Make sure that you evaluate the use case and business value and then choose the option that meets your requirements and has the least overhead.
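A minimal sketch of the API-driven alternative follows. It builds a `TableInput` structure for the Glue `CreateTable` and `UpdateTable` API calls; the table name, S3 location, columns, and the choice of Parquet formats are all illustrative assumptions:

```python
def build_table_input(table_name, s3_location, columns):
    """Build a Glue TableInput for glue.create_table / glue.update_table.
    columns is a list of (name, type) pairs, e.g. [("order_id", "string")]."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": s3_location,
            # Parquet is assumed here; swap the formats/SerDe for other file types.
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


# Placeholder table for one data producer's data group.
table_input = build_table_input(
    "orders",
    "s3://example-data-lake/salesco/crm/orders/",
    [("order_id", "string"), ("order_total", "double")],
)

# To register or refresh the schema in the centralized catalog (requires boto3):
# import boto3
# glue = boto3.client("glue")
# glue.create_table(DatabaseName="salesco-prod-crm", TableInput=table_input)
# glue.update_table(DatabaseName="salesco-prod-crm", TableInput=table_input)  # if it exists
```

This is the code-development overhead the paragraph above refers to: the `TableInput` must be kept in sync with the producer's actual schema, whereas a crawler infers it automatically.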