Data in AWS Data Exchange - AWS Data Exchange User Guide

Data in AWS Data Exchange

Data is organized in AWS Data Exchange using three building blocks:

  • Assets – a piece of data that can be stored as an Amazon S3 object.

  • Revisions – a container for one or more assets.

  • Datasets – a series of one or more revisions.

These three building blocks form the foundation of the product that you manage using the AWS Data Exchange console or the AWS Data Exchange APIs.
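As a rough illustration of how the three building blocks nest, the following sketch models them as plain Python dataclasses. The class names, field names, and sample data are hypothetical and exist only for this example; they are not the AWS Data Exchange API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Asset:
    """A piece of data, stored as a snapshot of an Amazon S3 object."""
    name: str
    size_bytes: int


@dataclass
class Revision:
    """A container for one or more assets."""
    comment: str
    assets: List[Asset] = field(default_factory=list)


@dataclass
class DataSet:
    """A series of one or more revisions."""
    name: str
    revisions: List[Revision] = field(default_factory=list)

    def total_assets(self) -> int:
        # Count the assets across every revision in the dataset.
        return sum(len(revision.assets) for revision in self.revisions)


# Hypothetical sample data: one dataset, two revisions, three assets.
daily_prices = DataSet(
    name="daily-prices",
    revisions=[
        Revision("initial load", [Asset("2021-01.csv", 1024), Asset("dictionary.csv", 256)]),
        Revision("february update", [Asset("2021-02.csv", 2048)]),
    ],
)
print(daily_prices.total_assets())
```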

You can use the AWS Data Exchange console, AWS CLI, your own REST client, or one of the AWS SDKs to create, view, update, or delete datasets. For more information about programmatically managing AWS Data Exchange datasets, see the AWS Data Exchange API Reference.

Assets

Assets are the data in AWS Data Exchange. Each asset is a snapshot of an Amazon S3 object, with a maximum size of 10 GB. You create or copy assets through jobs, using the console, the AWS CLI, your own REST application, or one of the AWS SDKs.

A dataset owner can both import and export, but someone with an entitlement to a dataset can only export.

Asset structure

Assets have the following parameters:

  • DataSetId – The ID of the dataset that contains this asset.

  • RevisionId – The ID of the revision that contains this asset.

  • Id – A unique ID generated when the asset is created.

  • Arn – The Amazon Resource Name (ARN) that uniquely identifies the asset.

  • CreatedAt and UpdatedAt – Date and timestamps for the creation and last update of the asset.

  • AssetDetails – Information about the asset, including its size.

  • AssetType – Currently, the only type of asset available is a snapshot of an Amazon S3 object.

Example Asset Resource

{
    "Name": "automation/cloudformation.yaml",
    "Arn": "arn:aws:dataexchange:us-east-1::data-sets/29EXAMPLE24b82c6858af3cEXAMPLEcf/revisions/bbEXAMPLE74c02f4745c660EXAMPLE20/assets/baEXAMPLE660c9fe7267966EXAMPLEf5",
    "Id": "baEXAMPLE660c9fe7267966EXAMPLEf5",
    "CreatedAt": "2019-10-17T21:31:29.833Z",
    "UpdatedAt": "2019-10-17T21:31:29.833Z",
    "AssetType": "S3_SNAPSHOT",
    "RevisionId": "bbEXAMPLE74c02f4745c660EXAMPLE20",
    "DataSetId": "29EXAMPLE24b82c6858af3cEXAMPLEcf",
    "AssetDetails": {
        "S3SnapshotAsset": {
            "Size": 9423
        }
    }
}
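The asset ARN encodes the same dataset, revision, and asset IDs that appear as separate fields in the resource. The following minimal sketch, using only the Python standard library, parses a trimmed copy of the example resource and cross-checks those IDs. The helper name `parse_asset_arn` is hypothetical, and reading "10 GB" as 10 * 1024**3 bytes is an assumption.

```python
import json

# Assumption: "10 GB" is read here as 10 * 1024**3 bytes; the guide does not
# say whether the limit is decimal or binary.
MAX_ASSET_SIZE_BYTES = 10 * 1024 ** 3

# Trimmed copy of the example asset resource (timestamps omitted).
ASSET_JSON = """
{
  "Name": "automation/cloudformation.yaml",
  "Arn": "arn:aws:dataexchange:us-east-1::data-sets/29EXAMPLE24b82c6858af3cEXAMPLEcf/revisions/bbEXAMPLE74c02f4745c660EXAMPLE20/assets/baEXAMPLE660c9fe7267966EXAMPLEf5",
  "Id": "baEXAMPLE660c9fe7267966EXAMPLEf5",
  "AssetType": "S3_SNAPSHOT",
  "RevisionId": "bbEXAMPLE74c02f4745c660EXAMPLE20",
  "DataSetId": "29EXAMPLE24b82c6858af3cEXAMPLEcf",
  "AssetDetails": {"S3SnapshotAsset": {"Size": 9423}}
}
"""


def parse_asset_arn(arn: str) -> dict:
    """Split an asset ARN into its region and resource IDs (hypothetical helper)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    segments = parts[5].split("/")
    # Expected shape: data-sets/<id>/revisions/<id>/assets/<id>
    if len(segments) != 6 or segments[0::2] != ["data-sets", "revisions", "assets"]:
        raise ValueError(f"not an asset ARN: {arn!r}")
    return {
        "region": parts[3],
        "data_set_id": segments[1],
        "revision_id": segments[3],
        "asset_id": segments[5],
    }


asset = json.loads(ASSET_JSON)
ids = parse_asset_arn(asset["Arn"])
size = asset["AssetDetails"]["S3SnapshotAsset"]["Size"]

ids_match = (
    ids["data_set_id"] == asset["DataSetId"]
    and ids["revision_id"] == asset["RevisionId"]
    and ids["asset_id"] == asset["Id"]
)
within_limit = size <= MAX_ASSET_SIZE_BYTES
print(ids["region"], ids_match, within_limit)
```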

Revisions

A revision is a container for one or more assets. For example, a collection of .csv files, or a single .csv file and a dictionary, are grouped to create a revision. As new data becomes available, you create revisions and add assets.

When you create and finalize a revision that belongs to a dataset in a published product, that revision is immediately available to subscribers.

You can create and finalize revisions using the AWS Data Exchange console. For more information, see Publishing a new product.

Important

Beginning July 22, 2021, new and existing providers can automatically publish revisions to datasets. All new products on AWS Data Exchange default to automatic revision publishing. If you created products on AWS Data Exchange before July 22, 2021, you need to migrate them to automatic revision publishing.

For more information, see Migrating an existing product to automatic revision publishing.

Note

If you are an existing provider and have not yet migrated all of your products to automatic revision publishing, you can create, add, and publish revisions using the AWS Data Exchange console or the AWS Marketplace Catalog API.

If you choose the API, use the StartChangeSet AWS Marketplace Catalog API action. Revisions are uniquely identified by their ARN. For more information, see Using AWS Data Exchange with the AWS Marketplace Catalog API.

Keep the following in mind:

  • To be finalized, a revision must contain at least one asset.

  • It is your responsibility to ensure that the assets are correct before you finalize your revision.

  • A finalized revision published to at least one product cannot be unfinalized or changed in any way.

  • After the revision is finalized, it is automatically published to your products.
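The finalization rules above can be sketched as a small in-memory model. This is an illustration only: the `DraftRevision` class and its methods are hypothetical, not part of any AWS SDK, and the model simplifies by treating every finalized revision as immutable, whereas the rule above applies to finalized revisions published to at least one product.

```python
class RevisionError(Exception):
    """Raised when an operation would break a finalization rule."""


class DraftRevision:
    """Hypothetical in-memory stand-in for a revision."""

    def __init__(self, comment: str = "") -> None:
        self.comment = comment
        self.assets = []
        self.finalized = False

    def add_asset(self, name: str) -> None:
        # Simplification: any finalized revision is treated as unchangeable.
        if self.finalized:
            raise RevisionError("a finalized revision cannot be changed")
        self.assets.append(name)

    def finalize(self) -> None:
        # To be finalized, a revision must contain at least one asset.
        if not self.assets:
            raise RevisionError("a revision must contain at least one asset")
        self.finalized = True


revision = DraftRevision("initial data revision")
revision.add_asset("prices.csv")
revision.finalize()
print(revision.finalized)
```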

Revision structure

Revisions have the following parameters:

  • DataSetId – The ID of the dataset that contains this revision.

  • Comment – A comment about the revision. This field can be up to 128 characters long.

  • Finalized – Either true or false. Used to indicate whether the revision is finalized.

  • Id – The unique identifier for the revision generated when it's created.

  • Arn – The Amazon Resource Name (ARN) that uniquely identifies the revision.

  • CreatedAt and UpdatedAt – Date and timestamps for the creation and last update of the revision. Entitled revisions are created at the time of publishing.

Example Revision Resource

{
    "UpdatedAt": "2019-10-11T14:13:31.749Z",
    "DataSetId": "1EXAMPLE404460dc9b005a0d9EXAMPLE2f",
    "Comment": "initial data revision",
    "Finalized": true,
    "Id": "e5EXAMPLE224f879066f9999EXAMPLE42",
    "Arn": "arn:aws:dataexchange:us-east-1:123456789012:data-sets/1EXAMPLE404460dc9b005a0d9EXAMPLE2f/revisions/e5EXAMPLE224f879066f9999EXAMPLE42",
    "CreatedAt": "2019-10-11T14:11:58.064Z"
}

Datasets

A dataset is a collection of data that can change over time. It contains a series of one or more revisions. When you access a dataset, you're typically accessing a specific revision in the dataset. This structure enables providers to change the data available in datasets over time without having to worry about changes to historical data.

You can use the AWS Data Exchange console, AWS CLI, your own REST client, or one of the AWS SDKs to create, view, update, or delete datasets. For more information about programmatically managing AWS Data Exchange datasets, see the AWS Data Exchange API Reference.

Owned datasets

A dataset is owned by the account that created it. Owned datasets can be identified using the origin parameter, which is set to OWNED.

Entitled datasets

Entitled datasets are a read-only view of a provider's owned datasets. Entitled datasets are created at the time of product publishing and are made available to subscribers who have an active subscription to the product. Entitled datasets can be identified using the origin parameter, which is set to ENTITLED.

As a data subscriber, you can view and interact with your entitled datasets using the AWS Data Exchange APIs or the AWS Data Exchange console.

As a data provider, you also have access to the entitled dataset view that your subscribers see. You can do so using the AWS Data Exchange APIs, or by choosing the dataset name in the product page in the AWS Data Exchange console.
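As a sketch of how the origin parameter might be used programmatically, the following helper pages through datasets of one origin. It assumes a boto3-style client whose `list_data_sets` call accepts `Origin` and `NextToken` and returns a `DataSets` list; check the AWS Data Exchange API Reference before relying on those names. The helper name `iter_data_sets` is hypothetical.

```python
from typing import Any, Dict, Iterator


def iter_data_sets(client: Any, origin: str) -> Iterator[Dict[str, Any]]:
    """Yield every dataset with the given origin, following pagination tokens.

    `client` is assumed to expose a boto3-style list_data_sets(**kwargs)
    method; this function itself does not import or require boto3.
    """
    if origin not in ("OWNED", "ENTITLED"):
        raise ValueError("origin must be 'OWNED' or 'ENTITLED'")
    kwargs = {"Origin": origin}
    while True:
        page = client.list_data_sets(**kwargs)
        yield from page.get("DataSets", [])
        token = page.get("NextToken")
        if not token:
            break
        # Ask for the next page on the following call.
        kwargs["NextToken"] = token
```

With boto3 installed and credentials configured, a client created with `boto3.client("dataexchange")` could be passed as `client`, for example `list(iter_data_sets(client, "ENTITLED"))`.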

AWS Regions and datasets

Your datasets can be in any supported AWS Region, but all datasets in a single product must be in the same AWS Region.
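Because the AWS Region is the fourth colon-separated field of a dataset ARN, the single-Region rule can be checked before datasets are grouped into a product. The following is a minimal sketch; the helper names and the sample ARNs are hypothetical.

```python
from typing import Iterable, Set


def region_of(arn: str) -> str:
    """Return the Region field of an ARN (arn:partition:service:region:account:resource)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return parts[3]


def regions_in_product(data_set_arns: Iterable[str]) -> Set[str]:
    """Collect the distinct Regions used by a group of dataset ARNs."""
    return {region_of(arn) for arn in data_set_arns}


def same_region(data_set_arns: Iterable[str]) -> bool:
    """True when all datasets share one Region (or the group is empty)."""
    return len(regions_in_product(data_set_arns)) <= 1


# Hypothetical dataset ARNs for illustration.
candidate = [
    "arn:aws:dataexchange:us-east-1:123456789012:data-sets/aaaEXAMPLE",
    "arn:aws:dataexchange:us-east-2:123456789012:data-sets/bbbEXAMPLE",
]
print(same_region(candidate), sorted(regions_in_product(candidate)))
```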

Tags

You can add tags to your owned datasets and their revisions. When you use tagging, you can also use tag-based access control in IAM policies to control access to these datasets and revisions.

Entitled datasets can't be tagged. Tags of owned datasets and their revisions are not propagated to their corresponding entitled versions. Specifically, subscribers, who have read-only access to entitled datasets and revisions, won't see the tags of the original owned dataset.

Note

Currently, assets and jobs don't support tagging.
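To make tag-based access control more concrete, the following sketch builds an IAM policy document that allows read actions only on resources carrying a given tag. It assumes the global `aws:ResourceTag/<key>` condition key applies to the `dataexchange:GetDataSet` and `dataexchange:GetRevision` actions; verify the action names and supported condition keys against the IAM documentation for AWS Data Exchange. The function name and the tag used are hypothetical.

```python
import json


def tag_scoped_read_policy(tag_key: str, tag_value: str) -> dict:
    """Build a policy that allows reading only resources tagged tag_key=tag_value."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumed action names; confirm them before use.
                "Action": ["dataexchange:GetDataSet", "dataexchange:GetRevision"],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {f"aws:ResourceTag/{tag_key}": tag_value}
                },
            }
        ],
    }


# Hypothetical tag: project=pricing
policy = tag_scoped_read_policy("project", "pricing")
print(json.dumps(policy, indent=2))
```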

Dataset structure

Datasets have the following parameters:

  • Name – The name of the dataset. This value can be up to 256 characters long.

  • Description – A description for the dataset. This value can be up to 16,348 characters long.

  • AssetType – Defines the type of assets the dataset contains. Currently, the only supported asset type is snapshots of Amazon S3 objects.

  • Origin – Defines whether the dataset is owned by the account (OWNED, for providers) or entitled to the account (ENTITLED, for subscribers).

  • Id – An ID that uniquely identifies the dataset. Dataset IDs are generated at dataset creation. Entitled datasets have a different ID than the original owned dataset.

  • Arn – The Amazon Resource Name (ARN) that uniquely identifies the dataset.

  • CreatedAt and UpdatedAt – Date and timestamps for the creation and last update of the dataset.

Note

As a provider, you can change some properties for owned datasets, like the Name or Description. Updating properties in an owned dataset won't update the properties in the corresponding entitled dataset.

Example Data Set Resource

{
    "Origin": "OWNED",
    "AssetType": "S3_SNAPSHOT",
    "Name": "MyDataSetName",
    "CreatedAt": "2019-09-09T19:31:49.704Z",
    "UpdatedAt": "2019-09-09T19:31:49.704Z",
    "Id": "fEXAMPLE1fd9a5c8b0d2e6fEXAMPLEe1",
    "Arn": "arn:aws:dataexchange:us-east-2:123456789109:data-sets/fEXAMPLE1fd9a5c8b0d2e6fEXAMPLEe1",
    "Description": "This is my dataset's description that describes the contents of the dataset."
}
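The length limits listed for Name and Description can be checked locally before a dataset is created or updated. The following is a minimal sketch using the limits stated above; the function name is hypothetical, and the service may enforce additional rules that are not modeled here.

```python
from typing import List

MAX_NAME_LENGTH = 256            # limit stated for Name
MAX_DESCRIPTION_LENGTH = 16_348  # limit stated for Description


def validate_data_set_fields(name: str, description: str) -> List[str]:
    """Return a list of problems; an empty list means both fields fit the limits."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append(
            f"Name is {len(name)} characters; the limit is {MAX_NAME_LENGTH}."
        )
    if len(description) > MAX_DESCRIPTION_LENGTH:
        problems.append(
            f"Description is {len(description)} characters; "
            f"the limit is {MAX_DESCRIPTION_LENGTH}."
        )
    return problems


issues = validate_data_set_fields(
    "MyDataSetName",
    "This is my dataset's description that describes the contents of the dataset.",
)
print(issues)
```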

Dataset best practices

As a provider, when you create and update datasets, keep the following best practices in mind:

  • The name of the dataset is visible in the product details in the catalog. We recommend that you choose a concise, descriptive name so customers easily understand the content of the dataset.

  • The description is visible to subscribers who have an active subscription to the product. We recommend that you include coverage information and the features and benefits of the dataset.