Preparing training data for Amazon Personalize

After you choose a domain use case or recipe and note its data requirements, you are ready to start preparing your data. Amazon Personalize can use the following types of data:

  • Item interactions – In Amazon Personalize, an item interaction is a positive interaction event between a user and an item in your catalog. Examples include a user watching a movie, viewing a listing, or purchasing a pair of shoes.

  • Items – Item metadata might include information such as price, SKU type, description, or availability for each item in your catalog.

  • Users – User metadata might include information such as age, gender, loyalty membership, and interests for each of your users.

  • Actions – An action is an engagement activity that you might want to recommend to your customers. Actions might include installing your mobile app, completing a membership profile, joining your loyalty program, or signing up for promotional emails. For the Next-Best-Action recipe, the Actions dataset is required. No other custom recipe or domain use case uses Actions data.

  • Action interactions – An action interaction is an interaction event between a user and an action. The Next-Best-Action recipe uses this data and the data in your Actions dataset to recommend actions to your users. No other custom recipe or domain use case uses Action-interactions data.

Amazon Personalize stores data in datasets, one for each type of data. Each dataset has different requirements. When you import data into an Amazon Personalize dataset, you can choose to import records in bulk, individually, or both. Bulk imports involve importing a large number of historical records stored in one or more CSV files in an Amazon S3 bucket.

The following sections provide data requirements for each Amazon Personalize dataset type and guidelines for preparing bulk data. If you don't have bulk data, review the sections to understand the required and optional data you can import with individual import operations. If you need additional help formatting your data, you can use Amazon SageMaker Data Wrangler (Data Wrangler) to prepare your data. For more information, see Preparing and importing bulk data using Amazon SageMaker Data Wrangler.

After you finish preparing your data, you are ready to create a schema JSON file. This file tells Amazon Personalize about the structure of your data. For more information, see Creating schema JSON files for your data.
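As a rough illustration of what such a schema file can look like, the following Python sketch builds a minimal schema for an Item interactions dataset and serializes it to JSON. The field list (USER_ID, ITEM_ID, TIMESTAMP), the namespace, and the version string are assumptions based on the commonly documented Avro-style format and are not taken from this page. Confirm the exact required fields for your dataset type in Creating schema JSON files for your data.

```python
import json

# Assumed minimal schema for an Item interactions dataset.
# Field names, types, namespace, and version are illustrative assumptions;
# check them against the schema requirements for your dataset type.
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

# Serialize to the JSON text you would save as the schema file.
schema_json = json.dumps(interactions_schema, indent=2)
print(schema_json)
```

The field names in the schema are expected to match the column headers of the CSV file that you prepare for the same dataset.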

Bulk data format guidelines for all types of data

The following guidelines and requirements can help you make sure your bulk data is formatted correctly.

  • Your input data must be in a CSV (comma-separated values) file.

  • The first row of your CSV file must contain your column headers. Don't enclose headers in quotation marks (").

  • Columns must have unique alphanumeric names. Names that differ only in non-alphanumeric characters, such as underscores, are not unique. For example, you can't add both a GENRES_FIELD_1 field and a GENRESFIELD1 field.

  • Make sure you have the required fields for your dataset type and make sure that their names align with Amazon Personalize requirements. For example, your Items data might have a column called ITEM_IDENTIFICATION_NUMBER with IDs for each of your items. To use this column as an ITEM_ID field, rename the column to ITEM_ID. If you use Data Wrangler to format your data, you can use the Map columns for Amazon Personalize Data Wrangler transform to make sure your columns are named correctly.

    For information about using Data Wrangler to prepare your data, see Preparing and importing bulk data using Amazon SageMaker Data Wrangler.

  • Each record in your CSV file must be on a single line.

  • Amazon Personalize doesn't support complex data types such as arrays and maps.

  • To have Amazon Personalize use boolean data when training or filtering, use string values "True" and "False" or numeric values 1 for true and 0 for false.

  • If you use Data Wrangler to format your data, you can use the Data Wrangler transform Parse Value as Type to convert the data types.

  • TIMESTAMP and CREATION_TIMESTAMP data must be in UNIX epoch time format. For more information, see Timestamp data.

  • Avoid including any " characters or special characters in item ID, user ID, and action ID data.

  • If your data includes any non-ASCII encoded characters, your CSV file must be encoded in UTF-8 format.

  • Make sure you format any textual data as described in Unstructured text metadata.
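Several of the guidelines above can be applied with a short script. The following Python sketch is one possible approach, not a required workflow. It renames a source column to ITEM_ID, converts ISO 8601 dates to Unix epoch time, normalizes boolean values to the strings "True" and "False", strips " characters from IDs, and writes an unquoted header row. The source column names CREATED_AT and IN_STOCK, the helper function names, and the choice of epoch seconds (rather than milliseconds) are assumptions made for illustration.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical mapping from source column names to the names used in the output file.
COLUMN_MAP = {
    "ITEM_IDENTIFICATION_NUMBER": "ITEM_ID",
    "CREATED_AT": "CREATION_TIMESTAMP",
    "IN_STOCK": "IN_STOCK",
}

def to_epoch_seconds(iso_text):
    """Convert an ISO 8601 string to Unix epoch seconds (naive values are treated as UTC)."""
    dt = datetime.fromisoformat(iso_text)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

def to_bool_string(value):
    """Map common truthy spellings to "True" and everything else to "False"."""
    truthy = {"1", "true", "yes", "y"}
    return "True" if str(value).strip().lower() in truthy else "False"

def prepare_items(rows):
    """Return CSV text with a header row and one single-line record per item."""
    out = io.StringIO()
    # QUOTE_MINIMAL leaves plain alphanumeric headers and values unquoted.
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(COLUMN_MAP.values())
    for row in rows:
        # Remove " characters from the ID, as the guidelines recommend.
        item_id = str(row["ITEM_IDENTIFICATION_NUMBER"]).replace('"', "").strip()
        writer.writerow([
            item_id,
            to_epoch_seconds(row["CREATED_AT"]),
            to_bool_string(row["IN_STOCK"]),
        ])
    return out.getvalue()

sample_rows = [
    {"ITEM_IDENTIFICATION_NUMBER": "sku-1", "CREATED_AT": "2023-01-01T00:00:00", "IN_STOCK": "yes"},
]
csv_text = prepare_items(sample_rows)
print(csv_text)
```

To save the result, you might write csv_text to a file opened with encoding="utf-8" and newline="" so that any non-ASCII characters are encoded as UTF-8, and then upload the file to your Amazon S3 bucket.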