Best practice 3.5 – Record data classifications into the Data Catalog so that analytics workloads can understand them
Allow your data-management processes to update the Data Catalog so that it remains a reliable record of where data is located and how it is classified. To protect data effectively, analytics systems must know the classification of the source data so that they can govern it according to business needs. For example, if the business requires that confidential data be encrypted with team-owned private keys, such as keys managed in AWS Key Management Service (AWS KMS), then the analytics workload should be able to determine which data is classified as confidential by referencing its data catalog.
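The lookup described above can be sketched as follows. This is a minimal illustration, not AWS API code: the table names, classification values, and key ARN are hypothetical, and in practice the metadata would come from the AWS Glue Data Catalog (for example, table parameters fetched with boto3) rather than a local dict.

```python
# Minimal sketch: a workload consults catalog metadata before processing
# a table. All entries below are hypothetical placeholders for metadata
# that would live in the Data Catalog.
from typing import Optional

CATALOG = {
    "sales.orders": {"classification": "confidential",
                     "kms_key_arn": "arn:aws:kms:us-east-1:111122223333:key/example"},
    "sales.products": {"classification": "public"},
}

def encryption_key_for(table_name: str) -> Optional[str]:
    """Return the team-owned KMS key ARN if the table holds confidential data."""
    meta = CATALOG.get(table_name, {})
    if meta.get("classification") == "confidential":
        # Business rule from the text: confidential data must be encrypted
        # with a team-owned private key, so the workload resolves it here.
        return meta.get("kms_key_arn")
    return None  # non-confidential data needs no team-owned key
```

The workload never inspects the data itself to decide how to protect it; the classification recorded in the catalog drives the decision.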
Suggestion 3.5.1 – Use tags to indicate the data classifications
Use a tagging ontology to designate the classification of sensitive data in data stores through a data catalog. A tagging ontology makes data sensitivity discoverable without directly exposing the underlying data. Tags can also be used to authorize access through tag-based access control (TBAC) schemes.
For more details, refer to the following information:
- AWS Lake Formation Developer Guide: What Is AWS Lake Formation?
- AWS Whitepaper: Tagging Best Practices
- AWS Lake Formation: Easily manage your data lake at scale using AWS Lake Formation Tag-based access control
Suggestion 3.5.2 – Record lineage of data to track changes in the Data Catalog
Data lineage describes the relationship between data and the systems that process it. For example, lineage records which source system the data came from, what changes were applied to it, and which downstream systems have access to it. Your organization should be able to discover, record, and visualize data lineage from source to target systems.
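One way to picture the source-to-target tracking described above is lineage as a directed graph of datasets and transformations. This is a minimal sketch under stated assumptions: the dataset names are hypothetical, and production systems typically store lineage in a metadata service alongside the Data Catalog rather than in memory.

```python
# Minimal sketch: lineage recorded as (source -> target) edges labeled
# with the transformation, plus a walk that answers "where did this
# dataset come from?".
from collections import defaultdict

class LineageGraph:
    def __init__(self):
        self._upstream = defaultdict(set)  # target -> {(source, transform)}

    def record(self, source: str, target: str, transform: str) -> None:
        """Record that `target` was produced from `source` by `transform`."""
        self._upstream[target].add((source, transform))

    def sources_of(self, dataset: str) -> set:
        """Return all transitive upstream datasets of `dataset`."""
        found, stack = set(), [dataset]
        while stack:
            for src, _ in self._upstream[stack.pop()]:
                if src not in found:
                    found.add(src)
                    stack.append(src)
        return found

lineage = LineageGraph()
lineage.record("crm.contacts", "staging.contacts", "cleanse")
lineage.record("staging.contacts", "mart.customers", "aggregate")
# lineage.sources_of("mart.customers") traces back through both hops.
```

Reversing the edge direction answers the complementary question, which downstream systems a change to a source dataset would affect.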
For more details, refer to the following information: