Data format guidelines

When you import data into Amazon Personalize datasets, you can choose to import records in bulk, individually, or both.

  • Bulk imports involve importing a large number of historical records at once. You can prepare and import your bulk data with SageMaker Data Wrangler and multiple data sources. Or you can prepare bulk data yourself, and import it directly into Amazon Personalize from a CSV file in Amazon S3.

  • With individual imports, you import individual records with the Amazon Personalize console and API operations. Or you can stream interactions data from live events in real time. For more information about individual imports, see Importing individual records.

Before you import your bulk data, make sure it's properly formatted. The following sections can help you format your bulk data. If you're not sure how to format your data, you can use Amazon SageMaker Data Wrangler (Data Wrangler) to prepare your data. For more information, see Preparing and importing data using Amazon SageMaker Data Wrangler.

Bulk data format guidelines and requirements

The following guidelines and requirements can help you make sure your bulk data is formatted correctly.

  • Your input data must be in a CSV (comma-separated values) file.

  • The first row of your CSV file must contain your column headers. Don't enclose headers in quotation marks (").

  • Make sure you have the required fields for your dataset type and make sure that their names align with Amazon Personalize requirements. For example, your Items data might have a column called ITEM_IDENTIFICATION_NUMBER with IDs for each of your items. To use this column as an ITEM_ID field, rename the column to ITEM_ID. If you use Data Wrangler to format your data, you can use the Map columns for Amazon Personalize Data Wrangler transform to make sure your columns are named correctly.

    For information about required fields, see Schemas. For information about using Data Wrangler to prepare your data, see Preparing and importing data using Amazon SageMaker Data Wrangler.

  • The column header names in your CSV file must map to your schema.

  • Each record in your CSV file must be on a single line.

  • The data types in each column must map to your schema. If you use Data Wrangler to format your data, you can use the Data Wrangler transform Parse Value as Type to convert the data types.

  • TIMESTAMP and CREATION_TIMESTAMP data must be in UNIX epoch time format. For more information, see Timestamp data.

  • Avoid including any " characters or special characters in item ID, user ID, and action ID data.

  • If your data includes any non-ASCII encoded characters, your CSV file must be encoded in UTF-8 format.

  • Make sure you format any textual data as described in Unstructured text metadata.

  • Make sure you format impression data and categorical data as described in Formatting explicit impressions and Formatting categorical data.
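Several of these checks are easy to automate before you upload. The following is a minimal sketch in Python; the helper name `check_interactions_csv` and the required-field set are illustrative assumptions, not part of Amazon Personalize itself (see the Schemas topic for the authoritative field requirements).

```python
import csv
import io

# Required columns for an Interactions dataset (see the Schemas topic).
REQUIRED_INTERACTIONS_FIELDS = {"USER_ID", "ITEM_ID", "TIMESTAMP"}

def check_interactions_csv(text: str) -> list:
    """Return a list of formatting problems found in an interactions CSV string."""
    problems = []
    lines = text.splitlines()
    # Headers must not be enclosed in quotation marks.
    if lines and '"' in lines[0]:
        problems.append('do not enclose column headers in quotation marks (")')
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    missing = REQUIRED_INTERACTIONS_FIELDS - set(header)
    if missing:
        problems.append("missing required columns: " + ", ".join(sorted(missing)))
    # Each record must be a single line with the same number of fields as the header.
    for line_no, row in enumerate(reader, start=2):
        if row and len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} fields, got {len(row)}")
    return problems

sample = "USER_ID,ITEM_ID,EVENT_TYPE,EVENT_VALUE,TIMESTAMP\n196,242,click,15,881250949\n"
print(check_interactions_csv(sample))  # []
```

A check like this catches header and column-count mistakes before an import job fails; it does not replace schema validation by Amazon Personalize.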

Interactions data example

The following interactions data represents historical user activity from a website that sells movie tickets. You might use the data to train a model that provides movie recommendations based on users' interaction data.

USER_ID,ITEM_ID,EVENT_TYPE,EVENT_VALUE,TIMESTAMP
196,242,click,15,881250949
186,302,click,13,891717742
22,377,click,10,878887116
244,51,click,20,880606923
166,346,click,10,886397596
298,474,click,40,884182806
115,265,click,20,881171488
253,465,click,50,891628467
305,451,click,30,886324817

Here's the associated Interactions schema:

{
  "type": "record",
  "name": "Interactions",
  "namespace": "com.amazonaws.personalize.schema",
  "fields": [
    { "name": "USER_ID", "type": "string" },
    { "name": "ITEM_ID", "type": "string" },
    { "name": "EVENT_TYPE", "type": "string" },
    { "name": "EVENT_VALUE", "type": "float" },
    { "name": "TIMESTAMP", "type": "long" }
  ],
  "version": "1.0"
}

Amazon Personalize requires the USER_ID, ITEM_ID, and TIMESTAMP fields. USER_ID is the identifier for a user of your application. ITEM_ID is the identifier for a movie. EVENT_TYPE and EVENT_VALUE are the identifiers for user activities. In the sample data, a click might represent a movie purchase event and 15 might be the purchase price of the movie. TIMESTAMP represents the Unix epoch time that the movie purchase took place.
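If you manage schemas programmatically, the same Avro definition can be built as a Python dict and serialized to the JSON string that the CreateSchema API expects. A sketch (the schema name "interactions-schema" is an illustrative choice):

```python
import json

# The Interactions schema shown above, built as a Python dict.
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "EVENT_TYPE", "type": "string"},
        {"name": "EVENT_VALUE", "type": "float"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

# CreateSchema takes the schema as a JSON string, not a dict.
schema_json = json.dumps(interactions_schema)

# With the AWS SDK for Python (boto3), the call would look like:
# personalize = boto3.client("personalize")
# response = personalize.create_schema(name="interactions-schema", schema=schema_json)
print(schema_json[:24])
```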

Timestamp data

Timestamp data, such as TIMESTAMP (for Item interactions datasets) or CREATION_TIMESTAMP (for Items datasets), must be in Unix epoch time format, in seconds. For example, the epoch timestamp in seconds for July 31, 2020 is 1596238243. To convert dates to Unix epoch timestamps, use an epoch converter (Unix timestamp converter).
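If you prepare your data in Python, the standard library can do this conversion without an external tool. A minimal sketch (the helper name `to_epoch_seconds` is illustrative):

```python
from datetime import datetime, timezone

def to_epoch_seconds(dt: datetime) -> int:
    """Convert a timezone-aware datetime to Unix epoch seconds for TIMESTAMP fields."""
    return int(dt.timestamp())

# Midnight UTC on July 31, 2020:
print(to_epoch_seconds(datetime(2020, 7, 31, tzinfo=timezone.utc)))  # 1596153600
```

Passing timezone-aware datetimes avoids silent shifts from the local timezone of the machine doing the conversion.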

Formatting explicit impressions

If you use the User-Personalization recipe, you can record and upload impressions data. Impressions are lists of items that were visible to a user when they interacted with a particular item (for example, clicked or watched it). To upload impressions data in a bulk data import, manually record each item ID and separate the values with a vertical bar ('|') character as part of your historical interactions data. The vertical bar characters count toward the 1,000-character limit for impressions data. For more information on impressions data, see Impressions data.

The following is a short excerpt from an Item interactions dataset that includes explicit impressions in the IMPRESSION column.

USER_ID,ITEM_ID,EVENT_TYPE,TIMESTAMP,IMPRESSION
USER_1,73,click,...,73|70|17|95|96
... ... ... ... ...

The application showed user USER_1 items 73, 70, 17, 95, and 96, and the user ultimately chose item 73. When you create a new solution version based on this data, items 70, 17, 95, and 96 will be recommended less frequently to user USER_1.
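Building the impression string for each interaction row is a one-line join, plus a guard for the character limit described above. A sketch (the helper name `format_impression` is illustrative):

```python
def format_impression(item_ids: list) -> str:
    """Join the item IDs shown to a user into the '|'-separated impression string."""
    impression = "|".join(item_ids)
    # The '|' separators count toward the 1,000-character limit for impressions data.
    if len(impression) > 1000:
        raise ValueError("impression data exceeds the 1,000-character limit")
    return impression

print(format_impression(["73", "70", "17", "95", "96"]))  # 73|70|17|95|96
```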

Formatting categorical data

To include multiple categories for a single item when you use categorical string data, separate the values using the vertical bar, '|', character. For example, for an item that has two categories, a data row would resemble the following:

ITEM_ID,GENRE
item_123,horror|comedy
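Because '|' is not a CSV delimiter, multi-category fields need no extra quoting when you write the file with a standard CSV writer. A minimal sketch in Python (the item IDs and genres are made-up sample values):

```python
import csv
import io

# Hypothetical Items rows; GENRE holds multiple categories joined with '|'.
items = [("item_123", ["horror", "comedy"]), ("item_124", ["drama"])]

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["ITEM_ID", "GENRE"])
for item_id, genres in items:
    writer.writerow([item_id, "|".join(genres)])

print(buf.getvalue())
# ITEM_ID,GENRE
# item_123,horror|comedy
# item_124,drama
```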

After you format your data, upload it to an Amazon S3 bucket so you can import it into Amazon Personalize. For more information, see Uploading to an Amazon S3 bucket.