Preparing and importing bulk data - Amazon Personalize

Preparing and importing bulk data

After you create a dataset, you can import your bulk historical data into Amazon Personalize. You can import bulk records in two ways:

  • For Item interactions, Users, and Items datasets, you can use Amazon SageMaker Data Wrangler to import your data from 40+ sources, generate visualizations and Amazon Personalize specific insights, and transform it to meet Amazon Personalize requirements.

  • For all dataset types, you can import bulk data directly into datasets. When you import directly, you manually format your data to meet Amazon Personalize requirements and upload it to Amazon S3. Then you create a schema and a dataset, and import the data directly into the dataset with a dataset import job.
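As a minimal sketch of the direct import path, the following Python code assembles the parameters for a dataset import job and, optionally, submits it with the AWS SDK for Python (Boto3) `create_dataset_import_job` operation. The job name, ARNs, bucket name, and helper function names are hypothetical placeholders, and the sketch assumes that the IAM role already has permission to read the Amazon S3 location.

```python
from typing import Any, Dict


def build_import_job_request(job_name: str, dataset_arn: str,
                             s3_uri: str, role_arn: str) -> Dict[str, Any]:
    """Assemble keyword arguments for a CreateDatasetImportJob call."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("dataLocation must be an s3:// URI")
    return {
        "jobName": job_name,
        "datasetArn": dataset_arn,
        "dataSource": {"dataLocation": s3_uri},
        "roleArn": role_arn,
    }


def start_import_job(request: Dict[str, Any]) -> str:
    """Submit the job and return its ARN. Needs Boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without the SDK

    personalize = boto3.client("personalize")
    response = personalize.create_dataset_import_job(**request)
    return response["datasetImportJobArn"]


# All values below are hypothetical placeholders.
request = build_import_job_request(
    job_name="example-bulk-import",
    dataset_arn=("arn:aws:personalize:us-east-1:111122223333:"
                 "dataset/example-group/INTERACTIONS"),
    s3_uri="s3://amzn-s3-demo-bucket/interactions.csv",
    role_arn="arn:aws:iam::111122223333:role/ExamplePersonalizeRole",
)
# To submit the job: job_arn = start_import_job(request)
```

Keeping the request construction separate from the SDK call makes it possible to check the parameters before anything is sent to Amazon Personalize.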

The following guidelines can help you make sure your bulk data is formatted correctly.

  • Your input data must be in a CSV (comma-separated values) file.

  • The first row of your CSV file must contain your column headers. Don't enclose headers in quotation marks (").

  • Make sure your data includes the required fields for your dataset type, and that the field names match Amazon Personalize requirements. For example, your Items data might have a column called ITEM_IDENTIFICATION_NUMBER with IDs for each of your items. To use this column as an ITEM_ID field, rename the column to ITEM_ID. If you use Data Wrangler to format your data, you can use the Data Wrangler transform Map columns for Amazon Personalize to make sure your columns are named correctly.

    For information about required fields, see Schemas. For information about using Data Wrangler to prepare your data, see Preparing and importing data using Amazon SageMaker Data Wrangler.

  • The column header names in your CSV file must map to your schema.

  • Each record in your CSV file must be on a single line.

  • The data types in each column must map to your schema. If you use Data Wrangler to format your data, you can use the Data Wrangler transform Parse Value as Type to convert the data types.

  • TIMESTAMP and CREATION_TIMESTAMP data must be in UNIX epoch time format. For more information, see Timestamp data.

  • If your data includes any non-ASCII characters, your CSV file must be encoded in UTF-8 format.

  • Make sure you format any textual data as described in Unstructured text metadata.

  • Make sure you format impression data and categorical data as described in Formatting explicit impressions and Formatting categorical data.

For more information about bulk data formatting requirements for Amazon Personalize, see Data format guidelines.
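Several of the guidelines above can be applied with a short script. The following Python sketch renames a column to match the schema, converts ISO 8601 timestamps to UNIX epoch time, keeps each record on a single line, and writes unquoted headers. It is an illustration under stated assumptions, not a complete validator: the required columns shown (USER_ID, ITEM_ID, TIMESTAMP) assume an Item interactions dataset, the input is assumed to use ISO 8601 timestamps, epoch values are assumed to be in seconds, and the function and constant names are hypothetical.

```python
import csv
import io
from datetime import datetime, timezone

# Assumed required fields for an Item interactions dataset.
REQUIRED = ("USER_ID", "ITEM_ID", "TIMESTAMP")
# Source column names mapped to the names the schema expects.
RENAMES = {"ITEM_IDENTIFICATION_NUMBER": "ITEM_ID"}


def to_epoch_seconds(value: str) -> int:
    """Convert an ISO 8601 string to epoch seconds (naive values treated as UTC)."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def prepare_interactions(raw_csv: str) -> str:
    """Return CSV text with renamed headers and epoch timestamps."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    headers = [RENAMES.get(h, h) for h in (reader.fieldnames or [])]
    missing = [c for c in REQUIRED if c not in headers]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    out = io.StringIO()
    # QUOTE_MINIMAL leaves plain header names unquoted.
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(headers)
    for row in reader:
        record = {RENAMES.get(k, k): v for k, v in row.items()}
        record["TIMESTAMP"] = str(to_epoch_seconds(record["TIMESTAMP"]))
        # Collapse whitespace, including newlines, so each record stays on one line.
        writer.writerow([" ".join((record[h] or "").split()) for h in headers])
    return out.getvalue()


raw = (
    "USER_ID,ITEM_IDENTIFICATION_NUMBER,TIMESTAMP\n"
    "u1,i9,2024-01-01T00:00:00\n"
)
prepared = prepare_interactions(raw)
```

To satisfy the UTF-8 guideline, write the result with an explicit encoding, for example `open("interactions.csv", "w", encoding="utf-8", newline="")`, before you upload the file to Amazon S3.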

After you import data into an Amazon Personalize dataset, you can analyze it, export it to an Amazon S3 bucket, update it, or delete it by deleting the dataset. For more information, see Managing data.

If you already created a recommender or deployed a custom solution version with a campaign, how new bulk records influence recommendations depends on the domain use case or recipe that you use. For more information, see How new data influences real-time recommendations.

Filter updates for bulk records

Within 20 minutes of completing a bulk import, Amazon Personalize updates any filters you created in the dataset group with your new item and user data. This update allows Amazon Personalize to use the most recent data when filtering recommendations for your users.