Set up a minimum viable data space to share data between organizations
Created by Ramy Hcini (Think-it), Ismail Abdellaoui (Think-it), Malte Gasseling (Think-it), Jorge Hernandez Suarez (AWS), and Michael Miller (AWS)
Environment: PoC or pilot | Technologies: Analytics; Containers & microservices; Data lakes; Databases; Infrastructure | Workload: Open-source
AWS services: Amazon Aurora; AWS Certificate Manager (ACM); AWS CloudFormation; Amazon EC2; Amazon EFS; Amazon EKS; Elastic Load Balancing (ELB); Amazon RDS; Amazon S3; AWS Systems Manager
Summary
Data spaces are federated networks for data exchange with trust and control over one's data as core principles. They enable organizations to share, exchange, and collaborate on data at scale by offering a cost-effective and technology-agnostic solution.
Data spaces have the potential to significantly drive efforts for a sustainable future by using data-driven problem solving with an end-to-end approach that involves all relevant stakeholders.
This pattern guides you through an example of how two companies can use data space technology on Amazon Web Services (AWS) to drive their carbon emissions‒reduction strategy forward. In this scenario, company X provides carbon-emissions data, which company Y consumes. See the Additional information section for the following data space specification details:
Participants
Business case
Data space authority
Data space components
Data space services
Data to be exchanged
Data model
Tractus-X EDC connector
The pattern includes steps for the following:
Deploying the infrastructure needed for a basic data space with two participants running on AWS.
Exchanging carbon emissions‒intensity data by using the connectors in a secure way.
This pattern deploys a Kubernetes cluster that will host data space connectors and their services through Amazon Elastic Kubernetes Service (Amazon EKS).
The connectors are based on the Eclipse Dataspace Components (EDC) framework.
In addition, the identity service is deployed on Amazon Elastic Compute Cloud (Amazon EC2) to replicate a real-life scenario of a minimum viable data space (MVDS).
Prerequisites and limitations
Prerequisites
An active AWS account to deploy the infrastructure in your chosen AWS Region
An AWS Identity and Access Management (IAM) user with access to Amazon S3, used temporarily as a technical user (The EDC connector currently doesn't support the use of roles. We recommend that you create one IAM user specifically for this demo and assign it limited permissions.)
AWS Command Line Interface (AWS CLI) installed and configured in your chosen AWS Region
eksctl on your workstation
Git on your workstation
An AWS Certificate Manager (ACM) SSL/TLS certificate
A DNS name that will point to an Application Load Balancer (the DNS name must be covered by the ACM certificate)
HashiCorp Vault (for information about using AWS Secrets Manager to manage secrets, see the Additional information section)
Product versions
Limitations
Connector selection ‒ This deployment uses an EDC-based connector. However, be sure to consider the strengths and functionalities of both the EDC and FIWARE True connectors to make an informed decision that aligns with the specific needs of the deployment.
EDC connector build ‒ The chosen deployment solution relies on the Tractus-X EDC Connector Helm chart, a well-established and extensively tested deployment option. The decision to use this chart is driven by its common usage and the inclusion of essential extensions in the provided build. Although PostgreSQL and HashiCorp Vault are default components, you have the flexibility to customize your own connector build if needed.
Private cluster access ‒ Access to the deployed EKS cluster is restricted to private channels. Interaction with the cluster is performed exclusively through kubectl and IAM. Public access to the cluster resources can be enabled by using load balancers and domain names, which must be implemented selectively to expose specific services to a broader network. However, we do not recommend providing public access.
Security focus ‒ Emphasis is placed on abstracting security configurations to default specifications so that you can concentrate on the steps involved in EDC connector data exchange. Although default security settings are maintained, it's imperative to enable secure communications before you expose the cluster to the public network. This precaution ensures a robust foundation for secure data handling.
Infrastructure cost ‒ For an estimate of the infrastructure's cost, use the AWS Pricing Calculator. A simple calculation shows that costs can be up to 162.92 USD per month for the deployed infrastructure.
Architecture
The MVDS architecture comprises two virtual private clouds (VPCs), one for the Dynamic Attribute Provisioning System (DAPS) identity service and one for Amazon EKS.
DAPS architecture
The following diagram shows DAPS running on EC2 instances controlled by an Auto Scaling group. An Application Load Balancer and route table expose the DAPS servers. Amazon Elastic File System (Amazon EFS) synchronizes the data among the DAPS instances.
Amazon EKS architecture
Data spaces are designed to be technology-agnostic solutions, and multiple implementations exist. This pattern uses an Amazon EKS cluster to deploy the data space technical components. The following diagram shows the deployment of the EKS cluster. Worker nodes are installed in private subnets. The Kubernetes pods access the Amazon Relational Database Service (Amazon RDS) for PostgreSQL instance that is also in the private subnets. The Kubernetes pods store shared data in Amazon S3.
Tools
AWS services
AWS CloudFormation helps you set up AWS resources, provision them quickly and consistently, and manage them throughout their lifecycle across AWS accounts and Regions.
Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity in the AWS Cloud. You can launch as many virtual servers as you need and quickly scale them up or down.
Amazon Elastic File System (Amazon EFS) helps you create and configure shared file systems in the AWS Cloud.
Amazon Elastic Kubernetes Service (Amazon EKS) helps you run Kubernetes on AWS without needing to install or maintain your own Kubernetes control plane or nodes.
Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
Elastic Load Balancing (ELB) distributes incoming application or network traffic across multiple targets. For example, you can distribute traffic across EC2 instances, containers, and IP addresses in one or more Availability Zones.
Other tools
eksctl is a command-line utility for creating and managing Kubernetes clusters on Amazon EKS.
Git is an open source, distributed version control system.
HashiCorp Vault provides secure storage with controlled access for credentials and other sensitive information.
Helm is an open source package manager for Kubernetes that helps you install and manage applications on your Kubernetes cluster.
kubectl is a command-line interface that helps you run commands against Kubernetes clusters.
Postman is an API platform.
Code repository
The Kubernetes configuration YAML files and Python scripts for this pattern are available in the GitHub aws-patterns-edc repository.
Best practices
Amazon EKS and isolation of participants’ infrastructures
In this pattern, Kubernetes namespaces separate the company X provider's infrastructure from the company Y consumer's infrastructure. For more information, see the EKS Best Practices Guides.
In a more realistic situation, each participant would have a separate Kubernetes cluster running within their own AWS account. The shared infrastructure (DAPS in this pattern) would be accessible by data space participants while being completely separated from participants' infrastructures.
Epics
Task | Description | Skills required |
---|---|---|
Clone the repository. | To clone the repository to your workstation, run a git clone command (a sketch appears after this table).
The workstation must have access to your AWS account. | DevOps engineer |
Provision the Kubernetes cluster and set up namespaces. | To deploy a simplified default EKS cluster in your account, run an eksctl cluster-creation command (a sketch appears after this table).
The command creates the VPC and private and public subnets that span three different Availability Zones. After the network layer is created, the command creates two worker nodes. For more information and example output, see the eksctl guide. After you provision the private cluster, add the new EKS cluster to your local Kubernetes configuration (see the sketch after this table).
To confirm that your EKS nodes are running and are in the ready state, run kubectl get nodes (see the sketch after this table).
| DevOps engineer |
Set up the namespaces. | To create namespaces for the provider and the consumer, run kubectl create namespace for each participant (see the sketch after this table).
In this pattern, it's important to use | DevOps engineer |
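A minimal sketch of the commands for this epic follows, assuming the aws-samples/aws-patterns-edc repository URL, an illustrative cluster name, Region, and node size, and provider and consumer as the namespace names (none of these values are mandated by the pattern):

```bash
# Clone the pattern repository to your workstation.
# The repository URL is an assumption; use the aws-patterns-edc repository
# referenced in the Code repository section.
git clone https://github.com/aws-samples/aws-patterns-edc.git
cd aws-patterns-edc

# Provision a simplified default EKS cluster. The cluster name, Region,
# node count, and node type are illustrative choices.
eksctl create cluster \
  --name mvds-cluster \
  --region eu-west-1 \
  --nodes 2 \
  --node-type m5.large

# Add the new cluster to your local kubeconfig and confirm the nodes are Ready.
aws eks update-kubeconfig --name mvds-cluster --region eu-west-1
kubectl get nodes

# Create namespaces that isolate the provider and consumer workloads.
# The namespace names are assumptions; keep them consistent in later steps.
kubectl create namespace provider
kubectl create namespace consumer
```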
Task | Description | Skills required |
---|---|---|
Deploy DAPS by using AWS CloudFormation. | For ease of managing DAPS operations, the DAPS server is installed on EC2 instances. To install DAPS, use the AWS CloudFormation template.
You can deploy the AWS CloudFormation template by signing in to the AWS Management Console and using the AWS CloudFormation console.
The environment name is your own choice; we recommend using a meaningful term. The template deploys the EC2 instances in private subnets. This means that the instances are not directly accessible through SSH (Secure Shell) from the internet. The instances are provisioned with the necessary IAM role and AWS Systems Manager Agent to enable access to the running instances through Session Manager, a capability of AWS Systems Manager. We recommend using Session Manager for access. Alternatively, you could provision a bastion host to allow SSH access from the internet. When using the bastion host approach, the EC2 instance might take a few more minutes to start running. After the AWS CloudFormation template is successfully deployed, point the DNS name to your Application Load Balancer DNS name. To confirm, run a DNS lookup against your DAPS DNS name (a sketch appears after this table).
The output should be similar to the following:
| DevOps engineer |
Register the participants’ connectors to the DAPS service. | From within any of the EC2 instances provisioned for DAPS, register participants:
The choice of the names doesn't impact the next steps. The registration commands will also automatically configure the DAPS service with the needed information fetched from the created certificates and keys. While you are logged in to a DAPS server, gather the information needed for later steps in the installation:
We recommend copying and pasting the text into similarly named files. You should have the client IDs for the provider and consumer and should have four files in your working directory on your workstation:
| DevOps engineer |
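For the Deploy DAPS step, the following sketch deploys the CloudFormation template from the CLI and confirms DNS resolution; the template file name, stack name, and DNS name are assumptions:

```bash
# Deploy the DAPS stack from the CLI as an alternative to the console.
# The template path and stack name are assumptions; use the template
# referenced by this pattern.
aws cloudformation deploy \
  --template-file daps-template.yaml \
  --stack-name daps-infrastructure \
  --capabilities CAPABILITY_NAMED_IAM

# After pointing your DNS name at the Application Load Balancer, confirm that
# the record resolves (replace daps.example.com with your own DNS name).
dig +short daps.example.com
```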
Task | Description | Skills required |
---|---|---|
Clone the Tractus-X EDC repository and use the 0.4.1 version. | The Tractus-X EDC connector’s build requires PostgreSQL (assets database) and HashiCorp Vault (secrets management) services to be deployed and available. There are many different versions of Tractus-X EDC Helm charts. This pattern specifies version 0.4.1 because it uses the DAPS server. The latest versions use Managed Identity Wallet (MIW) with a distributed implementation of the identity service. On the workstation where you created the two Kubernetes namespaces, clone the tractusx-edc repository and check out the 0.4.1 version (a clone sketch appears after this table).
| DevOps engineer |
Configure the Tractus-X EDC Helm chart. | Modify the Tractus-X Helm chart template configuration to enable both connectors to interact with each other. To do this, add the namespace to the DNS name of the service so that it can be resolved by other services in the cluster. These modifications should be made to the chart's template configuration. Make sure to comment out all DAPS dependencies in the chart.
| DevOps engineer |
Configure the connectors to use PostgreSQL on Amazon RDS. | (Optional) An Amazon Relational Database Service (Amazon RDS) instance is not required in this pattern. However, we highly recommend using Amazon RDS or Amazon Aurora because they provide features such as high availability, backup, and recovery. To replace PostgreSQL on Kubernetes with Amazon RDS, do the following:
| DevOps engineer |
Configure and deploy the provider connector and its services. | To configure the provider connector and its services, do the following (a Helm deployment sketch appears after this table):
| DevOps engineer |
Add the certificate and keys to the provider vault. | To avoid confusion, produce the following certificates and keys outside of the cloned repository directory. For example, change to your home directory first (a sketch appears after this table).
You now need to add the secrets that are needed by the provider into the vault. The names of the secrets within the vault are the values of the keys in the Helm chart values configuration.
An Advanced Encryption Standard (AES) key, private key, public key, and self-signed certificate are generated initially. These are subsequently added as secrets to the vault. Furthermore, this directory should contain the
You should now be able to access the vault through your browser or the CLI.
Browser ‒ Connect to the vault through the port forward that you configured (the vault listens on port 8200).
Vault CLI ‒ The CLI will also use the port forward that you configured.
| DevOps engineer |
Configure and deploy the consumer connector and its services. | The steps for configuring and deploying the consumer are similar to those you completed for the provider:
| DevOps engineer |
Add the certificate and keys to the consumer vault. | From a security standpoint, we recommend regenerating the certificates and keys for each data space participant. This pattern regenerates certificates and keys for the consumer. The steps are very similar to those for the provider. The names of the secrets within the vault are the values of the keys in the consumer's Helm chart values configuration.
The local port is 8201 this time so that you can have port forwards in place for both the provider and consumer.
Browser ‒ You can use your browser to connect to http://localhost:8201/. The secrets and files that contain the content are the following:
Vault CLI ‒ Using the Vault CLI, you can log in to the vault and create the secrets (similar to the provider sketch after this table).
| DevOps engineer |
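For the Clone the Tractus-X EDC repository step, a minimal sketch, assuming the eclipse-tractusx/tractusx-edc GitHub repository and a 0.4.1 release tag:

```bash
# Clone the Tractus-X EDC repository and switch to the 0.4.1 version.
# The tag name is an assumption; run `git tag` to list the available tags.
git clone https://github.com/eclipse-tractusx/tractusx-edc.git
cd tractusx-edc
git checkout 0.4.1
```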
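For the Configure and deploy the provider connector step, a sketch of the Helm deployment; the chart path, release name, and values file name are assumptions:

```bash
# Install the provider connector from the locally cloned Helm chart.
# The chart path, release name, and values file name are illustrative.
helm install provider-connector charts/tractusx-connector \
  --namespace provider \
  --values provider-values.yaml
```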
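For the Add the certificate and keys to the provider vault step, a sketch that generates the material with OpenSSL and stores it through a Vault port forward; the service name, file names, secret names, and secrets mount path are assumptions and must match the keys referenced in your Helm values:

```bash
# Forward the provider vault to your workstation (service name is an assumption).
kubectl port-forward -n provider svc/provider-vault 8200:8200 &

# Work outside of the repository directory, for example in your home directory.
cd ~

# Generate an AES key, an RSA private key, the matching public key, and a
# self-signed certificate for the provider (file names are illustrative).
openssl rand -base64 32 > provider.aes.key
openssl genrsa -out provider.key 2048
openssl rsa -in provider.key -pubout -out provider.pub
openssl req -new -x509 -key provider.key -out provider.cert -days 365 -subj "/CN=provider"

# Store the material in the provider vault through the port forward.
# Secret names and the kv mount path are assumptions; they must match the
# names referenced in your Helm values.
export VAULT_ADDR=http://localhost:8200
vault login   # paste the token of the provider vault
vault kv put secret/provider-private-key content=@provider.key
vault kv put secret/provider-public-key content=@provider.pub
vault kv put secret/provider-cert content=@provider.cert
vault kv put secret/provider-aes-key content=@provider.aes.key
```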
Task | Description | Skills required |
---|---|---|
Set up port forwarding. |
The cluster is private and is not accessible publicly. To interact with the connectors, use the Kubernetes port-forwarding feature to forward traffic generated by your machine to the connector control plane (a sketch appears after this table).
| DevOps engineer |
Create S3 buckets for the provider and the consumer. | The EDC connector currently doesn't use temporary AWS credentials, such as those provided by assuming a role. The EDC supports only the use of an IAM access key ID and secret access key combination. Two S3 buckets are required for later steps. One S3 bucket is used for storing data made available by the provider. The other S3 bucket is for data received by the consumer. The IAM user should have permission to read and write objects only in the two named buckets. An access key ID and secret access key pair needs to be created and kept safe. After this MVDS has been decommissioned, the IAM user should be deleted. An example IAM policy for the user appears in the sketch after this table.
| DevOps engineer |
Set up Postman to interact with the connector. | You can now interact with the connectors through your EC2 instance. Use Postman as an HTTP client, and use the Postman Collections provided for both the provider and the consumer connectors. Import the collections into Postman. This pattern uses Postman collection variables to provide input to your requests. | App developer, Data engineer |
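For the Set up port forwarding step, a minimal sketch; the service names are assumptions, and the management port is taken from the port table in the Additional information section:

```bash
# Forward the management API of each connector control plane to your workstation.
# Service names and ports are assumptions; check them with `kubectl get svc -A`.
kubectl port-forward -n provider svc/provider-tractusx-connector-controlplane 8081:8081 &
kubectl port-forward -n consumer svc/consumer-tractusx-connector-controlplane 8082:8081 &
```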
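For the Create S3 buckets step, a sketch that creates the two buckets and attaches a least-privilege inline policy to the technical IAM user; the bucket names, user name, and policy name are placeholders:

```bash
# Create one bucket for the provider's shared data and one for the consumer.
# Bucket names must be globally unique; these are placeholders.
aws s3 mb s3://company-x-provider-data --region eu-west-1
aws s3 mb s3://company-y-consumer-data --region eu-west-1

# Write a least-privilege policy that limits the technical user to the two buckets.
cat > edc-s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::company-x-provider-data",
        "arn:aws:s3:::company-x-provider-data/*",
        "arn:aws:s3:::company-y-consumer-data",
        "arn:aws:s3:::company-y-consumer-data/*"
      ]
    }
  ]
}
EOF

# Attach the policy to the technical IAM user created for this demo.
aws iam put-user-policy \
  --user-name edc-technical-user \
  --policy-name edc-s3-access \
  --policy-document file://edc-s3-policy.json
```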
Task | Description | Skills required |
---|---|---|
Prepare the carbon-emissions intensity data to be shared. | First, you need to decide on the data asset to be shared. The data of company X represents the carbon-emissions footprint of its vehicle fleet. Weight is Gross Vehicle Weight (GVW) in tonnes, and emissions are in grams of CO2 per tonne-kilometer (g CO2 e/t-km) according to the Well-to-Wheel (WTW) measurement:
The example data is provided as a JSON file that follows the data model in the Additional information section. Company X uses Amazon S3 to store objects. Create the S3 bucket and store the example data object there. The commands in the sketch after this table create an S3 bucket with default security settings. We highly recommend consulting Security best practices for Amazon S3.
The S3 bucket name should be globally unique. For more information about naming rules, see the AWS documentation. | Data engineer, App developer |
Register the data asset to the provider’s connector by using Postman. | An EDC connector data asset holds the name of the data and its location. In this case, the EDC connector data asset will point to the created object in the S3 bucket:
| App developer, Data engineer |
Define the usage policy of the asset. | An EDC data asset must be associated with clear usage policies. First, create the Policy Definition in the provider connector. The policy of company X is to allow participants of the data space to use the carbon-emissions footprint data.
| App developer, Data engineer |
Define an EDC Contract Offer for the asset and its usage policy. | To allow other participants to request access to your data, offer it in a contract that specifies the usage conditions and permissions:
| App developer, Data engineer |
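For the Prepare the carbon-emissions intensity data step, a sketch that writes an illustrative data file following the data model in the Additional information section and uploads it to the provider bucket; the sample values, bucket name, and object key are placeholders:

```bash
# Create an illustrative data file that follows the data model
# (values are placeholders, not real measurements).
cat > carbon_emissions_intensity.json <<'EOF'
{
  "region": "Europe",
  "vehicles": [
    {
      "type": "Truck",
      "gross_vehicle_weight": "26-40t",
      "emission_intensity": {
        "CO2": 55,
        "unit": "g CO2 e/t-km"
      }
    }
  ]
}
EOF

# Create the provider bucket if it doesn't exist yet (name must be globally
# unique; same placeholder as in the earlier sketch), then upload the object.
aws s3 mb s3://company-x-provider-data --region eu-west-1
aws s3 cp carbon_emissions_intensity.json s3://company-x-provider-data/carbon_emissions_intensity.json
```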
Task | Description | Skills required |
---|---|---|
Request the data catalog shared by company X. | As a data consumer in the data space, company Y first needs to discover the data that is being shared by other participants. In this basic setup, you can do this by asking the consumer connector to request the catalog of available assets from the provider connector directly.
| App developer, Data engineer |
Initiate a contract negotiation for the carbon-emissions intensity data from company X. | Now that you have identified the asset that you want to consume, initiate a contract negotiation process between the consumer and provider connectors.
The process might take some time before reaching the VERIFIED state. You can check the state of the Contract Negotiation and the corresponding Agreement ID by using the | App developer, Data engineer |
Task | Description | Skills required |
---|---|---|
Consume data from HTTP endpoints. | (Option 1) To use the HTTP data plane to consume data in the data space, you can use webhook.site
In this last step, you must send the request to the consumer data plane (forward ports properly), as stated in the payload.
Consume data from S3 buckets directly. | (Option 2) Use Amazon S3 integration with the EDC connector, and directly point to the S3 bucket in the consumer infrastructure as a destination:
| Data engineer, App developer |
Troubleshooting
Issue | Solution |
---|---|
The connector might raise an issue about the certificate PEM format. | Concatenate the contents of each file to a single line by adding |
Related resources
Building data spaces for sustainability use cases (AWS Prescriptive Guidance strategy by Think-it)
Enabling data sharing through data spaces and AWS (blog post)
Additional information
Data space specifications
Participants
Participant | Description of the company | Focus of the company |
Company X | Operates a fleet of vehicles across Europe and South America to transport various goods. | Aims to make data-driven decisions to reduce its carbon-emissions footprint intensity. |
Company Y | An environmental regulatory authority | Enforces environmental regulations and policies designed to monitor and mitigate the environmental impact of businesses and industries, including carbon-emissions intensity. |
Business case
Company X uses data space technology to share carbon footprint data with a compliance auditor, company Y, to evaluate and address the environmental impact of company X’s logistics operations.
Data space authority
The data space authority is a consortium of the organizations governing the data space. In this pattern, both company X and company Y form the governance body and represent a federated data space authority.
Data space components
Component | Chosen implementation | Additional information |
Dataset exchange protocol | Dataspace Protocol version 0.8 | |
Data space connector | Tractus-X EDC Connector version 0.4.1 | |
Data exchange policies | Default USE Policy |
Data space services
Service | Implementation | Additional information |
Identity service | "A Dynamic Attribute Provisioning System (DAPS) has the intent to ascertain certain attributes to organizations and connectors. Hence, third parties do not need to trust the latter provided they trust the DAPS assertions." — DAPS. To focus on the connector’s logic, the DAPS is deployed on an Amazon EC2 machine using Docker Compose. | |
Discovery service | "The Federated Catalogue constitutes an indexed repository of Gaia-X Self-Descriptions to enable the discovery and selection of Providers and their service offerings. The Self-Descriptions are the information given by Participants about themselves and about their services in the form of properties and claims." — Gaia-X Ecosystem Kickstarter |
Data to be exchanged
Data assets | Description | Format |
Carbon emissions data | Intensity values for different vehicle types in the specified region (Europe and South America) from the entire fleet of vehicles | JSON file |
Data model
{ "region": "string", "vehicles": [ // Each vehicle type has its Gross Vehicle Weight (GVW) category and its emission intensity in grams of CO2 per Tonne-Kilometer (g CO2 e/t-km) according to the "Well-to-Wheel" (WTW) measurement. { "type": "string", "gross_vehicle_weight": "string", "emission_intensity": { "CO2": "number", "unit": "string" } } ] }
Tractus-X EDC connector
For documentation of each Tractus-X EDC parameter, see the original values file
The following table lists all of the services, along with their corresponding exposed ports and endpoints, for reference.
Service name | Port and path |
Control plane | management ‒ Port: 8081 ● control ‒ Port: 8083 ● protocol ‒ Port: 8084 ● metrics ‒ Port: 9090 ● observability ‒ Port: 8085 |
Data plane | default ‒ Port: 8080 ● public ‒ Port: 8081 ● proxy ‒ Port: 8186 ● metrics ‒ Port: 9090 ● observability ‒ Port: 8085 |
Vault | Port: 8200 |
PostgreSQL | Port: 5432 |
Using AWS Secrets Manager
It's possible to use Secrets Manager instead of HashiCorp Vault as the secrets manager. To do so, you must use or build the AWS Secrets Manager EDC extension.
You will be responsible for creating and maintaining your own image, because Tractus-X doesn't provide support for Secrets Manager.
To accomplish that, you need to modify the Gradle build files of both the control plane and the data plane.
For more insights on refactoring the Tractus-X connector Docker image, see Refactor Tractus-X EDC Helm charts
For simplicity, we avoid rebuilding the connector image in this pattern and use HashiCorp Vault.