Amazon Personalize
Developer Guide

Datasets and Schemas

Amazon Personalize recognizes three types of historical datasets. Each type has an associated schema with a name key whose value matches the dataset type. The three types are:

  • Users – This dataset is intended to provide metadata about your users. This might include information such as age, gender, or loyalty membership, which can be important signals in personalization systems.

  • Items – This dataset is intended to provide metadata about your items. This might include information such as price, SKU type, or availability.

  • Interactions – This dataset is intended to provide historical interaction data between users and items.

The Users and Items dataset types are known as metadata types and are only used by certain recipes. For more information, see Using Predefined Recipes. For metadata datasets, all strings, except for USER_ID and ITEM_ID, must be marked as categorical in the schema, as shown in the following examples.

Note

A dataset group can contain only one of each type of dataset.

Each dataset has a set of required fields, reserved keywords, and their required datatypes, as shown in the following table.

Dataset Type    Required Fields       Reserved Keywords

Users           USER_ID (string)
                1 metadata field

Items           ITEM_ID (string)
                1 metadata field

Interactions    USER_ID (string)      EVENT_TYPE (string)
                ITEM_ID (string)      EVENT_VALUE (float)
                TIMESTAMP (long)

Before you add a dataset to Amazon Personalize, you must define a schema for that dataset. Each dataset type has specific requirements. Schemas in Amazon Personalize are defined in the Avro format. For more information, see Apache Avro.

When you create a schema, you must follow these guidelines:

  • You can list the schema fields in any order, as long as that order matches the order of the corresponding column headers in the data file.

  • Each dataset type requires specific fields in its schema. You must define the required fields with their required data types.

  • Some schemas have reserved keywords for field names. If you use a reserved keyword for a field name in your schema, you must define it as its required datatype.

  • The required fields and reserved keywords are not considered "metadata fields."

  • Any field you add that is neither a required field nor a reserved keyword is a metadata field. Metadata fields can be of string or non-string type.

  • The users and items schemas require at least one metadata field.

  • If you add your own metadata field of type string, it must include the "categorical" attribute. Otherwise, you can’t use it to train a model.

  • A schema, and its related dataset, can contain up to five metadata fields.
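
The guidelines above can be sketched as a local pre-check that runs before you submit a schema. This is a minimal illustration, not part of Amazon Personalize: the check_schema helper and its lookup tables are hypothetical names, the rules are transcribed from this section only, and the service may enforce additional constraints.

```python
import json

# Required fields and reserved keywords with their required Avro types,
# transcribed from the table and guidelines in this section (assumption:
# the service may enforce more rules than these).
REQUIRED = {
    "Users": {"USER_ID": "string"},
    "Items": {"ITEM_ID": "string"},
    "Interactions": {"USER_ID": "string", "ITEM_ID": "string", "TIMESTAMP": "long"},
}
RESERVED = {
    "Users": {},
    "Items": {},
    "Interactions": {"EVENT_TYPE": "string", "EVENT_VALUE": "float"},
}
MAX_METADATA_FIELDS = 5


def check_schema(schema_json):
    """Return a list of guideline violations (hypothetical local helper)."""
    schema = json.loads(schema_json)
    dataset_type = schema.get("name")
    if dataset_type not in REQUIRED:
        return ["unknown dataset type: %r" % dataset_type]

    problems = []
    fields = {f["name"]: f for f in schema.get("fields", [])}
    # Fields whose type is fixed: required fields plus reserved keywords.
    typed = dict(REQUIRED[dataset_type], **RESERVED[dataset_type])

    for name in REQUIRED[dataset_type]:
        if name not in fields:
            problems.append("missing required field " + name)

    metadata = []
    for name, field in fields.items():
        if name in typed:
            if field["type"] != typed[name]:
                problems.append("%s must be of type %s" % (name, typed[name]))
        else:
            # Anything that is neither required nor reserved is metadata.
            metadata.append(field)

    for field in metadata:
        if field["type"] == "string" and field.get("categorical") is not True:
            problems.append(field["name"] + " is a string metadata field and must be categorical")

    if dataset_type in ("Users", "Items") and not metadata:
        problems.append(dataset_type + " schema needs at least one metadata field")
    if len(metadata) > MAX_METADATA_FIELDS:
        problems.append("more than %d metadata fields" % MAX_METADATA_FIELDS)
    return problems
```

An empty list means the schema passed these local checks only. Amazon Personalize still performs its own validation when you create the schema.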

The following example shows an Interactions schema. The EVENT_TYPE and EVENT_VALUE fields are optional, and are reserved keywords recognized by Amazon Personalize.

    {
        "type": "record",
        "name": "Interactions",
        "namespace": "com.amazonaws.personalize.schema",
        "fields": [
            { "name": "USER_ID", "type": "string" },
            { "name": "ITEM_ID", "type": "string" },
            { "name": "EVENT_TYPE", "type": "string" },
            { "name": "EVENT_VALUE", "type": "float" },
            { "name": "TIMESTAMP", "type": "long" }
        ],
        "version": "1.0"
    }

The following example shows a Users schema in Avro format. Only the USER_ID field is required. The AGE and GENDER fields are metadata.

    {
        "type": "record",
        "name": "Users",
        "namespace": "com.amazonaws.personalize.schema",
        "fields": [
            { "name": "USER_ID", "type": "string" },
            { "name": "AGE", "type": "int" },
            { "name": "GENDER", "type": "string", "categorical": true }
        ],
        "version": "1.0"
    }

The following example shows an Items schema. Only the ITEM_ID field is required. The GENRE field is metadata.

    {
        "type": "record",
        "name": "Items",
        "namespace": "com.amazonaws.personalize.schema",
        "fields": [
            { "name": "ITEM_ID", "type": "string" },
            { "name": "GENRE", "type": "string", "categorical": true }
        ],
        "version": "1.0"
    }
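
If you prefer to generate metadata schemas rather than write them by hand, the following sketch builds a Users or Items schema as a Python dictionary and marks every string metadata field as categorical. The build_metadata_schema function is a hypothetical helper introduced here for illustration. It assumes the namespace and version values used in the examples in this section.

```python
import json


def build_metadata_schema(dataset_type, id_field, metadata_types):
    """Build a Users or Items schema dict (hypothetical helper).

    metadata_types maps each metadata field name to its Avro type,
    for example {"AGE": "int", "GENDER": "string"}.
    """
    fields = [{"name": id_field, "type": "string"}]
    for name, avro_type in metadata_types.items():
        field = {"name": name, "type": avro_type}
        # String metadata fields must carry the categorical attribute.
        if avro_type == "string":
            field["categorical"] = True
        fields.append(field)
    return {
        "type": "record",
        "name": dataset_type,
        "namespace": "com.amazonaws.personalize.schema",
        "fields": fields,
        "version": "1.0",
    }


users_schema = build_metadata_schema(
    "Users", "USER_ID", {"AGE": "int", "GENDER": "string"}
)
# Serialize to the JSON text that a schema file would contain.
print(json.dumps(users_schema, indent=2))
```

The field order in the generated schema follows the order of the dictionary you pass in, so keep it aligned with the column order of your data file.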

If you are using the AWS console, you create a new schema when you create a dataset for your input data. You can also choose an existing schema. For more information, see Step 1: Import Training Data.

If you are using the AWS CLI, see Step 1: Import Training Data for an example.

Create a Schema Using the AWS Python SDK

  1. Define the Avro format schema that you want to use.

  2. Save the schema in a JSON file (for example, schema.json) in the directory from which you run your Python code.

  3. Create the schema using the following code.

    import boto3

    personalize = boto3.client('personalize')

    # Read the Avro schema definition and register it with Amazon Personalize.
    with open('schema.json') as f:
        create_schema_response = personalize.create_schema(
            name='YourSchema',
            schema=f.read()
        )

    # The response contains the Amazon Resource Name (ARN) of the new schema.
    schema_arn = create_schema_response['schemaArn']
    print('Schema ARN: ' + schema_arn)

  4. Amazon Personalize returns the ARN of the new schema. Store it for later use.

Amazon Personalize provides operations for managing schemas. For example, you can use the ListSchemas API to get a list of the available schemas.
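
As a rough illustration of ListSchemas from the AWS Python SDK, the following sketch collects all schema summaries by following the nextToken value in each response. The list_all_schemas function is a hypothetical helper name. It takes the client as a parameter and assumes it is a boto3 Personalize client, whose list_schemas call returns a schemas list and, when more results remain, a nextToken.

```python
def list_all_schemas(personalize):
    """Collect every schema summary, following nextToken pagination.

    `personalize` is assumed to be a boto3 Personalize client, for example
    boto3.client('personalize').
    """
    schemas = []
    kwargs = {}
    while True:
        response = personalize.list_schemas(**kwargs)
        schemas.extend(response.get("schemas", []))
        token = response.get("nextToken")
        if not token:
            return schemas
        # Request the next page on the following iteration.
        kwargs["nextToken"] = token


# Example usage (requires boto3 and AWS credentials):
#
#   import boto3
#   for summary in list_all_schemas(boto3.client('personalize')):
#       print(summary['name'], summary['schemaArn'])
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to exercise with a stub object in tests.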

After you create a schema, use it with datasets that match the schema. For more information, see Formatting Your Input Data.
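
One way to confirm that a data file matches a schema before you import it is to compare the CSV header row with the schema's field names, in order. The following is a minimal sketch: header_matches_schema is a hypothetical helper, and the sample rows are made-up values used only for illustration.

```python
import csv
import io
import json


def header_matches_schema(schema_json, csv_text):
    """Return True if the CSV header lists the schema fields in the same order."""
    field_names = [f["name"] for f in json.loads(schema_json)["fields"]]
    # The first row of the data file is expected to be the column headers.
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return header == field_names


interactions_schema = json.dumps({
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
})

# Made-up sample rows; only the header row is inspected.
matching = "USER_ID,ITEM_ID,TIMESTAMP\nu1,i1,1546300800\n"
reordered = "ITEM_ID,USER_ID,TIMESTAMP\ni1,u1,1546300800\n"

print(header_matches_schema(interactions_schema, matching))
print(header_matches_schema(interactions_schema, reordered))
```

This check covers only column names and order. It does not verify that the values in each column conform to the declared types.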