Build an enterprise data mesh with Amazon DataZone, AWS CDK, and AWS CloudFormation
Created by Dhrubajyoti Mukherjee (AWS), Adjoa Taylor (AWS), Ravi Kumar (AWS), and Weizhou Sun (AWS)
Code repository: Data mesh with Amazon DataZone | Environment: Production | Technologies: Analytics; Business productivity; Databases; Modernization; Multi account strategy |
AWS services: Amazon Athena; AWS CDK; AWS CloudFormation; Amazon DataZone; AWS Glue; AWS Identity and Access Management; Amazon QuickSight; Amazon S3 |
Summary
On Amazon Web Services (AWS), customers understand that data is the key to accelerating innovation and driving business value for their enterprise. To manage data at this scale, you can adopt a decentralized architecture such as a data mesh. A data mesh architecture facilitates product thinking, a mindset that takes customers, goals, and the market into account. A data mesh also helps you establish a federated governance model that provides fast, secure access to your data.
Strategies for building a data mesh-based enterprise solution on AWS discusses how you can use the Data Mesh Strategy Framework to formulate and implement a data mesh strategy for your organization. By using the Data Mesh Strategy Framework, you can optimize the organization of teams and their interactions to accelerate your data mesh journey.
This document provides guidance on how to build an enterprise data mesh with Amazon DataZone. Amazon DataZone is a data management service for cataloging, discovering, sharing, and governing data stored across AWS, on premises, and third-party sources. The pattern includes code artifacts that help you deploy the data mesh‒based data solution infrastructure by using AWS Cloud Development Kit (AWS CDK) and AWS CloudFormation. This pattern is intended for cloud architects and DevOps engineers.
For information about the objectives of this pattern and the solution scope, see the Additional information section.
Prerequisites and limitations
Prerequisites
A minimum of two active AWS accounts: one for the central governance account and another for the member account
AWS administrator credentials for the central governance account in your development environment
AWS Command Line Interface (AWS CLI) installed to manage your AWS services from the command line
Node.js and Node Package Manager (npm) installed to manage AWS CDK applications
AWS CDK Toolkit installed globally in your development environment by using npm, to synthesize and deploy AWS CDK applications:
npm install -g aws-cdk
Python version 3.12 installed in your development environment
TypeScript compiler installed in your development environment or installed globally by using npm:
npm install -g typescript
Docker installed in your development environment
A version control system such as Git to maintain the source code of the solution (recommended)
An integrated development environment (IDE) or text editor with support for Python and TypeScript (strongly recommended)
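Before you start, you can verify that the prerequisite tooling is available on your PATH. This is a minimal sketch; the binary names (for example, `tsc` for the TypeScript compiler) are assumptions based on default installations.

```shell
#!/usr/bin/env bash
# Report which of the prerequisite tools are available on PATH.
missing=0
for tool in aws node npm cdk python3 tsc docker git; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok:      $tool"
  else
    echo "missing: $tool"
    missing=1
  fi
done
```

If any tool is reported as missing, install it before you proceed with the deployment tasks.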
Limitations
The solution has been tested only on machines that are running Linux or macOS.
In the current version, the solution doesn’t support the integration of Amazon DataZone and AWS IAM Identity Center by default. However, you can configure it to support this integration.
Product versions
Python version 3.12
Architecture
The following diagram shows a data mesh reference architecture. The architecture is based on Amazon DataZone and uses Amazon Simple Storage Service (Amazon S3) and AWS Glue Data Catalog as data sources. The AWS services that you use with Amazon DataZone in your data mesh implementation might differ, based on your organization's requirements.
In the producer accounts, raw data is either fit for consumption in its current form or is transformed for consumption by using AWS Glue. The technical metadata for the data stored in Amazon S3 is extracted by using an AWS Glue crawler. The data quality is measured by using AWS Glue Data Quality. The source database in the Data Catalog is registered as an asset in the Amazon DataZone catalog, which is hosted in the central governance account, by using Amazon DataZone data source jobs.
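The producer-side flow can be sketched with the AWS CLI. The crawler, domain, and data source identifiers below are hypothetical placeholders, and the commands are printed rather than executed so that you can review them against your environment first.

```shell
#!/usr/bin/env bash
set -eu
# Hypothetical identifiers -- replace with values from your environment.
CRAWLER_NAME="producer-raw-data-crawler"
DOMAIN_ID="dzd_exampledomain"
DATA_SOURCE_ID="ds_exampleglue"

# Print (dry run) the command that crawls the S3 data into the Glue Data Catalog.
echo "aws glue start-crawler --name $CRAWLER_NAME"

# Print (dry run) the Amazon DataZone data source job run that publishes the
# catalog tables into the Amazon DataZone catalog in the central governance account.
echo "aws datazone start-data-source-run --domain-identifier $DOMAIN_ID --data-source-identifier $DATA_SOURCE_ID"
```

Verify the exact parameter names against the AWS CLI reference for your installed version before running the commands.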
The central governance account hosts the Amazon DataZone domain and the Amazon DataZone data portal. The AWS accounts of the data producers and consumers are associated with the Amazon DataZone domain. The Amazon DataZone projects of the data producers and consumers are organized under the corresponding Amazon DataZone domain units.
End users of the data assets sign in to the Amazon DataZone data portal by using their AWS Identity and Access Management (IAM) credentials or single sign-on (with integration through IAM Identity Center). They search, filter, and view asset information (for example, data quality information or business and technical metadata) in the Amazon DataZone data catalog.
After an end user finds the data asset that they want, they use the Amazon DataZone subscription feature to request access. The data owner on the producer team receives a notification and evaluates the subscription request in the Amazon DataZone data portal. The data owner approves or rejects the subscription request based on its validity.
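The subscription workflow can also be driven programmatically. The following sketch prints (rather than executes) the AWS CLI calls for both sides of the exchange; all identifiers and the shorthand parameter syntax are illustrative placeholders to verify against your environment.

```shell
#!/usr/bin/env bash
set -eu
# Hypothetical identifiers -- replace with values from the Amazon DataZone data portal.
DOMAIN_ID="dzd_exampledomain"
LISTING_ID="listing_example"
CONSUMER_PROJECT_ID="prj_consumer"
REQUEST_ID="req_example"

# Consumer side: request a subscription to a published asset listing (printed, not executed).
echo "aws datazone create-subscription-request \
  --domain-identifier $DOMAIN_ID \
  --subscribed-listings identifier=$LISTING_ID \
  --subscribed-principals project={identifier=$CONSUMER_PROJECT_ID} \
  --request-reason 'Sales analytics dashboard'"

# Producer side (data owner): approve the pending request after reviewing it.
echo "aws datazone accept-subscription-request \
  --domain-identifier $DOMAIN_ID \
  --identifier $REQUEST_ID"
```

In practice, most data owners perform the review and approval in the Amazon DataZone data portal, as described above; the CLI form is useful for automation.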
After the subscription request is granted and fulfilled, the asset is accessed in the consumer account for the following activities:
AI/ML model development by using Amazon SageMaker
Analytics and reporting by using Amazon Athena and Amazon QuickSight
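For example, after the subscription is fulfilled, a consumer can query the shared table from Athena. The database name, query, and output location below are hypothetical, and the command is printed rather than executed.

```shell
#!/usr/bin/env bash
set -eu
# Hypothetical names -- replace with the subscribed asset's database and table.
DATABASE="sales_analytics_sub_db"
QUERY="SELECT * FROM orders LIMIT 10"
OUTPUT="s3://amzn-s3-demo-bucket/athena-results/"

# Print (dry run) the Athena query against the subscribed Glue Data Catalog table.
echo "aws athena start-query-execution \
  --query-string \"$QUERY\" \
  --query-execution-context Database=$DATABASE \
  --result-configuration OutputLocation=$OUTPUT"
```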
Tools
AWS services
Amazon Athena is an interactive query service that helps you analyze data directly in Amazon Simple Storage Service (Amazon S3) by using standard SQL.
AWS Cloud Development Kit (AWS CDK) is a software development framework that helps you define and provision AWS Cloud infrastructure in code.
AWS CloudFormation helps you set up AWS resources, provision them quickly and consistently, and manage them throughout their lifecycle across AWS accounts and AWS Regions.
Amazon DataZone is a data management service that helps you catalog, discover, share, and govern data stored across AWS, on premises, and in third-party sources.
AWS Glue is a fully managed extract, transform, and load (ETL) service that helps you categorize, clean, enrich, and move data between data stores and data streams.
AWS Identity and Access Management (IAM) helps you securely manage access to your AWS resources by controlling who is authenticated and authorized to use them.
Amazon QuickSight is a cloud-scale business intelligence (BI) service that helps you visualize, analyze, and report your data in a single dashboard.
Amazon SageMaker is a managed machine learning (ML) service that helps you build and train ML models and then deploy them into a production-ready hosted environment.
Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
Amazon Simple Queue Service (Amazon SQS) provides a secure, durable, and available hosted queue that helps you integrate and decouple distributed software systems and components.
Code repository
The solution is available in the GitHub data-mesh-datazone-cdk-cloudformation repository.
Epics
Task | Description | Skills required |
---|---|---|
Clone the repository. | To clone the repository, run the following command in your local development environment (Linux or macOS): | Cloud architect, DevOps engineer |
Create the environment. | To create the Python virtual environment, run the following commands: | Cloud architect, DevOps engineer |
Bootstrap the account. | To bootstrap the central governance account by using AWS CDK, run the following command. Then sign in to the AWS Management Console, open the central governance account console, and get the Amazon Resource Name (ARN) of the AWS CDK execution role. | Cloud architect, DevOps engineer |
Construct the | To construct the | Cloud architect, DevOps engineer |
Confirm template creation. | Ensure that the AWS CloudFormation template file is created at the | Cloud architect, DevOps engineer |
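The tasks in the table above can be summarized as the following command sequence. The repository URL (including the `aws-samples` GitHub organization) and the virtual environment layout are assumptions; the commands are printed rather than executed here so that you can adapt them to your setup.

```shell
#!/usr/bin/env bash
set -eu
# Assumed repository location -- the GitHub organization is a guess.
REPO_URL="https://github.com/aws-samples/data-mesh-datazone-cdk-cloudformation.git"
# Replace with your central governance account ID and AWS Region.
ACCOUNT_ID="111122223333"
REGION="us-east-1"

# Print the end-to-end sequence: clone, create the Python virtual environment,
# install dependencies, and bootstrap the central governance account with AWS CDK.
cat <<EOF
git clone $REPO_URL
cd data-mesh-datazone-cdk-cloudformation
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cdk bootstrap aws://$ACCOUNT_ID/$REGION
EOF
```

The `aws://ACCOUNT/REGION` argument to `cdk bootstrap` targets the account and Region where the AWS CDK bootstrap stack is created.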
Task | Description | Skills required |
---|---|---|
Modify the configuration. | In the
Keep the remaining parameters empty. | Cloud architect, DevOps engineer |
Update the Amazon DataZone glossary configuration. | To update the Amazon DataZone glossary configuration in the | Cloud architect, DevOps engineer |
Update the Amazon DataZone metadata form configuration. | To update the Amazon DataZone metadata form configuration in the | Cloud architect, DevOps engineer |
Export the AWS credentials. | To export AWS credentials to your development environment for the IAM role with administrative permissions, use the following format: | Cloud architect, DevOps engineer |
Synthesize the template. | To synthesize the AWS CloudFormation template, run the following command: | Cloud architect, DevOps engineer |
Deploy the solution. | To deploy the solution, run the following command: | Cloud architect, DevOps engineer |
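The credential export format referenced above typically looks like the following; the key values are placeholders. `cdk synth` and `cdk deploy` are the standard AWS CDK Toolkit commands for synthesizing and deploying, printed here for review (whether this solution's stacks require `--all` is an assumption).

```shell
#!/usr/bin/env bash
# Placeholder credentials for an IAM role with administrative permissions.
export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
export AWS_SESSION_TOKEN="IQoJb3JpZ2luX2VjEXAMPLETOKEN"
export AWS_DEFAULT_REGION="us-east-1"

# Synthesize the CloudFormation template, then deploy the stacks (printed, not executed).
echo "cdk synth"
echo "cdk deploy --all"
```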
Task | Description | Skills required |
---|---|---|
Deploy the template. | Deploy the AWS CloudFormation template located at | Cloud architect, DevOps engineer |
Update the ARNs. | To update the list of AWS CloudFormation StackSet execution role ARNs for the member accounts, use the following code: | Cloud architect, DevOps engineer |
Synthesize and deploy. | To synthesize the AWS CloudFormation template and deploy the solution, run the following commands: | Cloud architect, DevOps engineer |
Associate the member account. | To associate the member account with the central governance account, do the following: | Cloud architect, DevOps engineer |
Update the parameters. | To update the member account‒specific parameters in the config file at | Cloud architect, DevOps engineer |
Synthesize and deploy the template. | To synthesize the AWS CloudFormation template and deploy the solution, run the following commands: | Cloud architect, DevOps engineer |
Add member accounts. | To create and configure additional member accounts in the data solution, repeat the previous steps for each member account. This solution doesn’t differentiate between data producers and consumers. | Cloud architect, DevOps engineer |
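As a rough sketch of the member-account side: the StackSet execution role ARNs referenced above follow the standard IAM role ARN format, and the member template can also be deployed with the AWS CLI as a console-free alternative. The account ID, role name, stack name, and template path below are hypothetical.

```shell
#!/usr/bin/env bash
set -eu
# Hypothetical member account ID and execution role name.
MEMBER_ACCOUNT_ID="444455556666"
ROLE_NAME="AWSCloudFormationStackSetExecutionRole"

# Standard IAM role ARN format used in the StackSet execution role ARN list.
ROLE_ARN="arn:aws:iam::${MEMBER_ACCOUNT_ID}:role/${ROLE_NAME}"
echo "$ROLE_ARN"

# Console-free alternative for deploying the member template (printed, not executed);
# the stack name and template path are placeholders.
echo "aws cloudformation deploy --stack-name datazone-member-setup \
  --template-file member-template.yaml --capabilities CAPABILITY_NAMED_IAM"
```

Note that IAM role ARNs have no Region component, which is why one ARN per member account is sufficient.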
Task | Description | Skills required |
---|---|---|
Disassociate the member accounts. | To disassociate the accounts, do the following: | Cloud architect, DevOps engineer |
Delete the stack instances. | To delete the AWS CloudFormation stack instances, do the following: | Cloud architect, DevOps engineer |
Destroy all resources. | To destroy resources, implement the following steps in your local development environment (Linux or macOS): | Cloud architect, DevOps engineer |
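The cleanup flow can be sketched as follows. The stack set name, member account ID, and Region are placeholders, and the commands are printed for review rather than executed, because deleting stack instances and destroying stacks are irreversible operations.

```shell
#!/usr/bin/env bash
set -eu
# Hypothetical stack set name, member account, and Region.
STACK_SET_NAME="datazone-member-stackset"
MEMBER_ACCOUNT_ID="444455556666"
REGION="us-east-1"

# Print (dry run) the stack instance deletion, then the CDK teardown.
echo "aws cloudformation delete-stack-instances --stack-set-name $STACK_SET_NAME --accounts $MEMBER_ACCOUNT_ID --regions $REGION --no-retain-stacks"
echo "cdk destroy --all"
```

`--no-retain-stacks` tells CloudFormation to delete the underlying stacks in the member accounts rather than leaving them orphaned.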
Related resources
Additional information
Objectives
Implementing this pattern achieves the following:
Decentralized ownership of data ‒ Shift data ownership from a central team to teams that represent the source systems, business units, or use cases of your organization.
Product thinking ‒ Introduce a product-based mindset that includes customers, the market, and other factors when considering the data assets in your organization.
Federated governance ‒ Improve security guardrails, controls, and compliance across your organization's data products.
Multi-account and multiple-project support ‒ Support efficient, secure data sharing and collaboration across the business units or projects of your organization.
Centralized monitoring and notifications ‒ Monitor the cloud resources of your data mesh by using Amazon CloudWatch, and notify users when a new member account is associated.
Scalability and extensibility ‒ Add new use cases into the data mesh as your organization evolves.
Solution scope
When you use this solution, you can start small and scale as you progress in your data mesh journey. Often, when a member account adopts the data solution, it contains account configurations specific to the organization, project, or business unit. This solution accommodates these diverse AWS account configurations by supporting the following features:
AWS Glue Data Catalog as the data source for Amazon DataZone
Management of the Amazon DataZone data domain and the related data portal
Management of adding member accounts in the data mesh‒based data solution
Management of Amazon DataZone projects and environments
Management of Amazon DataZone glossaries and metadata forms
Management of IAM roles that correspond to the data mesh‒based data solution users
Notification of data mesh‒based data solution users
Monitoring of the provisioned cloud infrastructure
This solution uses AWS CDK and AWS CloudFormation to deploy the cloud infrastructure. It uses AWS CloudFormation to do the following:
Define and deploy cloud resources at a lower level of abstraction.
Deploy cloud resources from the AWS Management Console. By using this approach, you can deploy infrastructure without a development environment.
The data mesh solution uses AWS CDK to define resources at a higher level of abstraction. As a result, the solution remains decoupled, modular, and scalable, and you can choose the tool that's most relevant for deploying each set of cloud resources.
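This division of labor shows up in a typical workflow: AWS CDK defines the resources at a high level, `cdk synth` lowers them to a CloudFormation template, and that template can then be deployed on its own, for example from the AWS Management Console, without a development environment. The stack name below is hypothetical, redirecting `cdk synth` output to a file assumes a single-stack app, and the commands are printed for review.

```shell
#!/usr/bin/env bash
set -eu
TEMPLATE="template.yaml"

# AWS CDK lowers high-level constructs to a CloudFormation template (single-stack apps
# can redirect the synth output directly to a file), which deploys standalone.
echo "cdk synth > $TEMPLATE"
echo "aws cloudformation deploy --stack-name data-mesh-example --template-file $TEMPLATE --capabilities CAPABILITY_NAMED_IAM"
```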
Next steps
You can reach out to AWS experts for guidance as you progress in your data mesh journey.
The modular nature of this solution supports building data management solutions with different architectures, such as data fabric and data lakes. In addition, based on the requirements of your organization, you can extend the solution to other Amazon DataZone data sources.