This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.
Cataloging and search layer
A data lake typically hosts many datasets with evolving schemas and newly arriving data partitions. A central data catalog that manages metadata for all of these datasets is crucial to enabling self-service data discovery. Additionally, separating metadata from data into a central schema enables schema-on-read for the processing and consumption layer components.
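To make schema-on-read concrete, the following is a minimal sketch in plain Python. It is not tied to any AWS service: the `CATALOG` dictionary, the `sales.orders` table name, and the sample rows are hypothetical stand-ins for a central catalog and for raw files in the lake. The point is only that the raw data carries no types, and the schema is looked up and applied at the moment the data is read.

```python
import csv
import io
from datetime import date

# Hypothetical catalog entry: the schema is stored apart from the data files.
CATALOG = {
    "sales.orders": {
        "columns": [
            ("order_id", int),
            ("order_date", date.fromisoformat),
            ("amount", float),
        ],
    }
}

# Raw data as it might sit in the lake: untyped text with no embedded schema.
RAW_ORDERS = "1001,2021-03-01,19.99\n1002,2021-03-02,5.00\n"


def read_with_schema(table_name, raw_text):
    """Apply the cataloged schema to raw rows at read time (schema-on-read)."""
    columns = CATALOG[table_name]["columns"]
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        # Pair each raw string value with its cataloged column name and type.
        rows.append({name: cast(value) for (name, cast), value in zip(columns, record)})
    return rows


orders = read_with_schema("sales.orders", RAW_ORDERS)
```

Because the schema is held outside the data, a change to the catalog entry (for example, adding a column) would take effect on the next read without rewriting the stored files.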
In the presented architecture, Lake Formation provides the central catalog to store and manage metadata for all datasets hosted in the data lake. Organizations manage both technical metadata (such as versioned table schemas, partitioning information, physical data location, and update timestamps) and business attributes (such as data owner, data steward, column business definition, and column information sensitivity) of all their datasets in Lake Formation.
Services such as AWS Glue,
Amazon EMR
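As an illustration of how technical metadata and business attributes might sit side by side in one catalog entry, the sketch below builds a table definition in the shape accepted by the AWS Glue Data Catalog `CreateTable` API, on which the Lake Formation catalog is built. The helper names, the bucket path, and the choice to store business attributes in the `Parameters` maps (keys such as `data_steward` and `sensitivity`) are assumptions made for this example, not a convention prescribed by this whitepaper.

```python
def build_table_input(name, location, columns, partition_keys, business_attrs):
    """Build a TableInput-shaped dict for the Glue Data Catalog CreateTable API."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Owner": business_attrs.get("data_owner", ""),
        # Technical metadata: partitioning information.
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        # Technical metadata: physical location and column schema.
        "StorageDescriptor": {
            "Location": location,
            "Columns": [
                {
                    "Name": c["name"],
                    "Type": c["type"],
                    # Business attributes per column (hypothetical convention).
                    "Comment": c.get("definition", ""),
                    "Parameters": {"sensitivity": c.get("sensitivity", "unclassified")},
                }
                for c in columns
            ],
        },
        # Table-level business attributes (hypothetical key names).
        "Parameters": dict(business_attrs),
    }


def register_table(database_name, table_input):
    """Send the definition to the catalog; requires AWS credentials."""
    import boto3  # imported lazily so the builder above runs without AWS access

    boto3.client("glue").create_table(DatabaseName=database_name, TableInput=table_input)


# Hypothetical dataset used only for illustration.
table_input = build_table_input(
    name="orders",
    location="s3://example-data-lake/curated/orders/",
    columns=[
        {"name": "order_id", "type": "bigint", "definition": "Unique order identifier"},
        {"name": "customer_email", "type": "string", "sensitivity": "pii"},
    ],
    partition_keys=[("order_date", "date")],
    business_attrs={"data_owner": "sales-team", "data_steward": "jane.doe"},
)
```

Calling `register_table("sales", table_input)` would create the entry in a database named `sales`, assuming that database already exists and the caller has the necessary permissions.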
Lake Formation provides the data lake administrator with a central place to set up granular table-level and column-level permissions for databases and tables hosted in the data lake. After Lake Formation permissions are set up, users and groups can access only authorized tables and columns when using processing and consumption layer services such as Athena, Amazon EMR, AWS Glue, and Amazon Redshift Spectrum.
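A column-level grant of this kind could be sketched as follows, using the request shape of the Lake Formation `GrantPermissions` API as exposed by boto3. The helper names, the role ARN, and the database, table, and column names are hypothetical, and the example assumes the table is already registered in the catalog.

```python
def build_column_grant(principal_arn, database_name, table_name, column_names):
    """Build GrantPermissions arguments that allow SELECT on specific columns only."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database_name,
                "Name": table_name,
                # Only the listed columns become readable by the principal.
                "ColumnNames": list(column_names),
            }
        },
        "Permissions": ["SELECT"],
    }


def apply_grant(grant):
    """Submit the grant to Lake Formation; requires AWS credentials."""
    import boto3  # imported lazily so the builder above runs without AWS access

    boto3.client("lakeformation").grant_permissions(**grant)


# Hypothetical analyst role that may read order amounts but not customer emails.
grant = build_column_grant(
    principal_arn="arn:aws:iam::111122223333:role/ExampleAnalystRole",
    database_name="sales",
    table_name="orders",
    column_names=["order_id", "amount"],
)
```

After `apply_grant(grant)` succeeds, a query issued by that role through an integrated service would be expected to see only `order_id` and `amount`; the excluded columns remain inaccessible to it.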