Create event dataset

An event dataset is the historical fraud data for your company. You provide this data to Amazon Fraud Detector to create fraud detection models.

Amazon Fraud Detector uses machine learning models to generate fraud predictions. Each model is trained using a model type. The model type specifies the algorithms and transformations that are used for training the model. Model training is the process of using a dataset that you provide to create a model that can predict fraudulent events. For more information, see How Amazon Fraud Detector works.

The dataset used for creating a fraud detection model provides details of an event. An event is a business activity that is evaluated for fraud risk. For example, an account registration can be an event. The data associated with the account registration event can be an event dataset. Amazon Fraud Detector uses this dataset to evaluate account registration fraud.

Before you provide your dataset to Amazon Fraud Detector for creating a model, define your goal for creating the model. You also need to determine how you want to use the model and define the metrics that you will use to evaluate whether the model meets your specific requirements.

For example, your goals for creating a fraud detection model that evaluates account registration fraud can be the following:

  • To auto-approve legitimate registrations.

  • To capture fraudulent registrations for later investigation.

After you determine your goal, the next step is to decide how you want to use the model. Some examples of using a fraud detection model to evaluate registration fraud are the following:

  • For real-time fraud detection for each account registration.

  • For offline evaluation of all account registrations every hour.

Some examples of metrics that can be used to measure the performance of the model include the following:

  • Performs consistently better than the current baseline in production.

  • Captures X% of fraudulent registrations with a Y% false positive rate.

  • Accepts that up to 5% of auto-approved registrations are fraudulent.
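The last two example metrics can be computed from a labeled evaluation set. The following is a minimal sketch, assuming a hypothetical list of true labels and a parallel list of model decisions. The function name and the "fraud" and "legit" label values are illustrative and are not part of Amazon Fraud Detector.

```python
def registration_metrics(labels, flagged):
    """Return (capture_rate, false_positive_rate, fraud_share_of_approved).

    labels  -- true labels, "fraud" or "legit" (hypothetical label values)
    flagged -- True if the model flagged the registration, False if it was auto-approved
    """
    fraud_total = sum(1 for label in labels if label == "fraud")
    legit_total = sum(1 for label in labels if label == "legit")
    fraud_flagged = sum(1 for label, f in zip(labels, flagged) if label == "fraud" and f)
    legit_flagged = sum(1 for label, f in zip(labels, flagged) if label == "legit" and f)
    approved = sum(1 for f in flagged if not f)
    fraud_approved = fraud_total - fraud_flagged

    # Share of fraudulent registrations that the model caught (the "X%").
    capture_rate = fraud_flagged / fraud_total if fraud_total else 0.0
    # Share of legitimate registrations that were wrongly flagged (the "Y%").
    false_positive_rate = legit_flagged / legit_total if legit_total else 0.0
    # Share of auto-approved registrations that were actually fraudulent.
    fraud_share_of_approved = fraud_approved / approved if approved else 0.0
    return capture_rate, false_positive_rate, fraud_share_of_approved


labels = ["fraud", "fraud", "legit", "legit", "legit", "legit"]
flagged = [True, False, True, False, False, False]
print(registration_metrics(labels, flagged))  # (0.5, 0.25, 0.25)
```

Agreeing on these numbers before training makes it easier to compare a new model against the current production baseline.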

Prepare event dataset

Amazon Fraud Detector requires that you provide your event dataset as a text file in comma-separated values (CSV) format with UTF-8 encoding. The first line of your CSV dataset file must contain the file header. The file header consists of event metadata and event variables that describe each data element that’s associated with the event. The header is followed by event data. Each line consists of data elements from a single event.

Event metadata

We recommend that you provide the following metadata in your CSV file header. The event metadata must be in uppercase letters.

  • EVENT_ID - A unique identifier for the event.

  • ENTITY_TYPE - The entity that performs the event, such as a merchant or a customer.

  • ENTITY_ID - An identifier for the entity performing the event.

  • EVENT_TIMESTAMP - The timestamp when the event occurred. The timestamp must be in ISO 8601 standard in UTC.

  • EVENT_LABEL - A label that classifies the event as fraudulent or legitimate. You can use any labels, such as "fraud", "legit", "1", or "0".

  • LABEL_TIMESTAMP - The timestamp when the label was last updated.

Note

Event metadata requirements can change depending on the type of model you want to create. For example, if you’re creating an Online Fraud Insights model or evaluating events in real time to make fraud predictions, then you don’t need to provide ENTITY_ID, ENTITY_TYPE, and LABEL_TIMESTAMP in the CSV file header and in the dataset. For more information, see Choose a model type and Get fraud predictions.

Event variable

Depending on your event and the model that you create, you might want to include data elements such as email_address, billing_address, ip_address, and phone_number in your dataset. These data elements are called event variables and are provided as part of the file header in addition to the event metadata. The event variables must be in lowercase letters.

Because every business situation is different, we recommend that you add as many variables as are typically available and collected at the time of the event. When you start to create your model, you can eliminate event variables that might not be helpful or that could produce a model that does not generalize well. For more information about event variables that can be included in your model for predicting fraud, see Create a variable.

Event data

The file header in your CSV file must be followed by event data with each row consisting of data elements collected from a single event. The event data values must correspond with the file header.

The following is an example of a CSV file containing data from an account registration event. The header row contains both event metadata in uppercase and event variables in lowercase, followed by the event data. Each row in the dataset contains data elements associated with a single account registration, with each data element corresponding to a column in the header.
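A file in this layout can be sketched as follows. Every row is invented for illustration, the two event variables are an arbitrary choice, and the helper names `build_csv` and `header_case_ok` are hypothetical. The check covers only the uppercase and lowercase convention described above, not full dataset validation.

```python
import csv
import io

# Recommended event metadata (uppercase) and two illustrative event variables (lowercase).
METADATA = ["EVENT_ID", "ENTITY_TYPE", "ENTITY_ID", "EVENT_TIMESTAMP",
            "EVENT_LABEL", "LABEL_TIMESTAMP"]
VARIABLES = ["email_address", "ip_address"]

# Made-up account registration events, one row per event.
rows = [
    ["evt_0001", "customer", "cust_1001", "2019-11-30T13:01:01Z",
     "legit", "2019-12-15T08:00:00Z", "user1@example.com", "192.0.2.10"],
    ["evt_0002", "customer", "cust_1002", "2019-11-30T13:05:42Z",
     "fraud", "2019-12-20T09:30:00Z", "user2@example.com", "198.51.100.7"],
]


def build_csv(rows):
    """Return CSV text with the header row first, then one line per event."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(METADATA + VARIABLES)
    writer.writerows(rows)
    return buffer.getvalue()


def header_case_ok(header):
    """Check that metadata names are uppercase and all other names are lowercase."""
    for name in header:
        if name.upper() in METADATA:
            if name != name.upper():
                return False
        elif name != name.lower():
            return False
    return True


csv_text = build_csv(rows)
print(csv_text)
print(header_case_ok(csv_text.splitlines()[0].split(",")))
```

To save the text as a UTF-8 file, one option is `open(path, "w", encoding="utf-8", newline="")`.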

Gather event data

Gathering your event data is an important step in creating your model. This is because the performance of your model in predicting fraud is dependent on the quality of your dataset. As you start to gather your event data, keep in mind the goals that you defined earlier for creating the model and how you intend to use it. This helps you decide what event variables to include. It’s also important to decide the format of each event variable you intend to include and the total size of your dataset.

Event dataset quality

To gather a high-quality dataset for your model, we recommend the following:

  • Collect mature data - Using the most recent data helps to identify the most recent fraud patterns. However, fraud is often confirmed only some time after an event occurs, so allow the data to mature before you use it. The maturity period depends on your business and can take anywhere from two weeks to three months. For example, if your event includes a credit card transaction, then the maturity of the data might be determined by the chargeback period of the credit card or by the time an investigator takes to make a determination.

    Ensure that the dataset used to train the model has had sufficient time to mature for your business.

  • Make sure the data distribution doesn’t drift significantly - The Amazon Fraud Detector model training process samples and partitions your dataset based on EVENT_TIMESTAMP. For example, if your dataset consists of fraud events pulled from the last six months, but only the last month of legitimate events is included, the data distribution is considered to be drifting and unstable. An unstable dataset might lead to biases in model performance evaluation. If you find the data distribution to be drifting significantly, consider balancing your dataset by collecting data similar to the current data distribution.

  • Make sure the dataset is representative of the use case where the model is implemented or tested - Otherwise, the estimated performance could be biased. For example, suppose that you use a model to automatically decline all in-door applicants, but the model is trained with a dataset that contains only historical data and labels for applicants who were previously approved. Then your model's evaluation might be inaccurate, because it is based on a dataset that does not represent declined applicants.
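One rough way to spot the imbalance described in the drift recommendation is to count labels per calendar month and look for months where one class is absent. This is only a sketch under stated assumptions: timestamps use the ISO 8601 UTC form, the labels are "fraud" and "legit", and the function names are hypothetical. It is not how Amazon Fraud Detector itself measures drift.

```python
from collections import Counter
from datetime import datetime


def label_counts_by_month(events):
    """Count events per (month, label).

    events -- iterable of (EVENT_TIMESTAMP, EVENT_LABEL) pairs, with timestamps
              such as "2019-11-30T13:01:01Z"
    """
    counts = Counter()
    for timestamp, label in events:
        month = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").strftime("%Y-%m")
        counts[(month, label)] += 1
    return counts


def months_missing_a_label(events, labels=("fraud", "legit")):
    """Return the months in which at least one of the labels never appears."""
    counts = label_counts_by_month(events)
    months = sorted({month for month, _ in counts})
    return [m for m in months if any(counts[(m, label)] == 0 for label in labels)]


events = [
    ("2019-10-05T10:00:00Z", "fraud"),
    ("2019-11-02T09:30:00Z", "fraud"),
    ("2019-11-03T14:15:00Z", "legit"),
]
print(months_missing_a_label(events))  # ['2019-10']
```

A month that appears in the result contains only one class, which suggests that the two classes were collected over different time ranges.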

Event data format

Amazon Fraud Detector transforms most of your data to the required format as part of its model training process. However, there are some standard formats you can easily use for providing your data that can help avoid issues later when Amazon Fraud Detector validates your dataset. The following table provides guidance on the formats for providing the recommended metadata.

Each entry below gives the metadata name, followed by its format and whether it is required.

EVENT_ID

Format: If provided, it must meet the following requirements:

  • It is unique for that event.

  • It represents information that’s meaningful to your business.

  • It follows the regular expression pattern (for example, ^[0-9a-z_-]+$).

  • In addition to the above requirements, we recommend that you do not append a timestamp to the EVENT_ID. Doing so might cause issues when you update the event. This is because you must provide the exact same EVENT_ID when you update the event.

Required: Depends on the model type

EVENT_TIMESTAMP

Format:

  • It must be specified in one of the following formats:

    • %yyyy-%mm-%ddT%hh:%mm:%ssZ (ISO 8601 standard in UTC only with no milliseconds)

      Example: 2019-11-30T13:01:01Z

    • %yyyy/%mm/%dd %hh:%mm:%ss (AM/PM)

      Examples: 2019/11/30 1:01:01 PM, or 2019/11/30 13:01:01

    • %mm/%dd/%yyyy %hh:%mm:%ss

      Examples: 11/30/2019 1:01:01 PM, 11/30/2019 13:01:01

    • %mm/%dd/%yy %hh:%mm:%ss

      Examples: 11/30/19 1:01:01 PM, 11/30/19 13:01:01

  • Amazon Fraud Detector makes the following assumptions when parsing date/timestamp formats for event timestamps:

    • If you are using the ISO 8601 standard, it must be an exact match of the preceding specification.

    • If you are using one of the other formats, there is additional flexibility:

      • For months and days, you can provide single or double digits. For example, 1/12/2019 is a valid date.

      • You do not need to include hh:mm:ss if you do not have them (that is, you can simply provide a date). You can also provide a subset of just the hour and minutes (for example, hh:mm). Providing only the hour is not supported. Milliseconds are also not supported.

      • If you provide AM/PM labels, a 12-hour clock is assumed. If there is no AM/PM information, a 24-hour clock is assumed.

      • You can use “/” or “-” as delimiters for the date elements. “:” is assumed for the timestamp elements.

Required: Yes

ENTITY_ID

Format:

  • It must follow the regular expression pattern: ^[0-9A-Za-z_.@+-]+$.

  • If the entity ID isn’t available at the time of evaluation, specify the entity ID as unknown.

Required: Depends on the model type

ENTITY_TYPE

Format: You can use any string.

Required: Depends on the model type

EVENT_LABEL

Format: You can use any labels, such as "fraud", "legit", "1", or "0".

Required: Required if LABEL_TIMESTAMP is included

LABEL_TIMESTAMP

Format: It must follow the timestamp format.

Required: Required if EVENT_LABEL is included

For information about the supported formats for event variables, see Create a variable.
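The metadata formats above can be pre-checked locally before you upload a dataset. The sketch below approximates the listed rules with Python's `re` and `datetime.strptime`. It is an assumption-laden approximation, not the parser that Amazon Fraud Detector uses, and the names `parse_event_timestamp`, `DATE_FORMATS`, and `TIME_FORMATS` are hypothetical. AM/PM parsing with `%p` also depends on the process locale.

```python
import re
from datetime import datetime

# Patterns quoted from the metadata format list above.
EVENT_ID_PATTERN = re.compile(r"^[0-9a-z_-]+$")
ENTITY_ID_PATTERN = re.compile(r"^[0-9A-Za-z_.@+-]+$")
# ISO 8601 in UTC must match exactly: two-digit fields, no milliseconds.
ISO_UTC_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")

DATE_FORMATS = ["%Y/%m/%d", "%m/%d/%Y", "%m/%d/%y"]
# Date only, hour and minute, or full time, on a 24-hour clock or with AM/PM.
TIME_FORMATS = ["", " %H:%M", " %H:%M:%S", " %I:%M %p", " %I:%M:%S %p"]


def parse_event_timestamp(text):
    """Return a datetime for a supported timestamp, or None if nothing matches."""
    text = text.strip()
    if ISO_UTC_PATTERN.fullmatch(text):
        candidates = ["%Y-%m-%dT%H:%M:%SZ"]
    else:
        candidates = [
            date_format.replace("/", delimiter) + time_format
            for date_format in DATE_FORMATS
            for delimiter in ("/", "-")
            for time_format in TIME_FORMATS
        ]
    for candidate in candidates:
        try:
            return datetime.strptime(text, candidate)
        except ValueError:
            continue
    return None


print(parse_event_timestamp("2019/11/30 1:01:01 PM"))      # 2019-11-30 13:01:01
print(parse_event_timestamp("2019-11-30T13:01:01.500Z"))   # None (milliseconds)
print(bool(EVENT_ID_PATTERN.fullmatch("evt_0001")))        # True
```

The code uses `fullmatch` rather than `match` so that a trailing newline does not slip past the `$` anchor.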

Null or missing values

The EVENT_TIMESTAMP and EVENT_LABEL variables must not contain any null or missing values. You can have null or missing values for other variables. However, we recommend that you keep the number of nulls for those variables small. If Amazon Fraud Detector determines that there are too many null or missing values for an event variable, it automatically omits that variable from your model.
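To see which columns are at risk, you can measure the share of null or missing values in each column. The following is a minimal sketch. The function name is hypothetical, and treating empty strings and the literal text "null" as missing is an assumption, because the source does not define how nulls are encoded in the CSV file.

```python
import csv
import io


def null_fraction_by_column(csv_text, null_tokens=("", "null", "NULL")):
    """Return {column name: fraction of rows whose value is null or missing}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    null_counts = {name: 0 for name in (reader.fieldnames or [])}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in null_counts:
            value = row.get(name)
            # A short row yields None for its missing trailing columns.
            if value is None or value.strip() in null_tokens:
                null_counts[name] += 1
    return {name: (count / row_count if row_count else 0.0)
            for name, count in null_counts.items()}


sample = (
    "EVENT_TIMESTAMP,EVENT_LABEL,email_address,ip_address\n"
    "2019-11-30T13:01:01Z,legit,a@example.com,\n"
    "2019-11-30T13:02:01Z,fraud,,\n"
)
print(null_fraction_by_column(sample))
```

In this sample, EVENT_TIMESTAMP and EVENT_LABEL have a fraction of 0.0, email_address 0.5, and ip_address 1.0.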

Minimum variables

When you create your model, the dataset must include at least two event variables in addition to the required event metadata. The two event variables must pass the validation check.

Event dataset size

Required

  • A minimum of 10,000 total events.

  • The dataset must include at least 400 events (rows) classified as fraudulent and 400 events (rows) classified as legitimate.

  • If your model type requires ENTITY_ID, your dataset must have more than 100 unique entities.

Recommended

  • A minimum of three weeks of historical data, but ideally six months of data.
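The required size limits above can be checked with a few counts. The following is a sketch that assumes the labels are "fraud" and "legit". The function name and the message strings are hypothetical.

```python
def dataset_size_problems(labels, entity_ids=None,
                          fraud_label="fraud", legit_label="legit"):
    """Return a list of size requirements that the dataset does not meet.

    labels     -- one EVENT_LABEL value per event (row)
    entity_ids -- one ENTITY_ID per event, or None if the model type
                  does not require ENTITY_ID
    """
    problems = []
    if len(labels) < 10_000:
        problems.append("fewer than 10,000 total events")
    if sum(1 for label in labels if label == fraud_label) < 400:
        problems.append("fewer than 400 fraudulent events")
    if sum(1 for label in labels if label == legit_label) < 400:
        problems.append("fewer than 400 legitimate events")
    # The requirement is "more than 100", so exactly 100 unique entities is not enough.
    if entity_ids is not None and len(set(entity_ids)) <= 100:
        problems.append("100 or fewer unique entities")
    return problems


labels = ["legit"] * 9_700 + ["fraud"] * 300
print(dataset_size_problems(labels))  # ['fewer than 400 fraudulent events']
```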

Dataset validation

Before Amazon Fraud Detector starts to create your model, it checks whether the variables included in the training dataset meet the size, format, and other requirements. If the dataset doesn’t pass validation, the model isn’t created. You must first fix the variables that didn’t pass validation before you create the model. Amazon Fraud Detector provides a data profiler that you can use to identify and fix issues with your dataset before you start to train your model.

Data profiler

Amazon Fraud Detector provides an open-source tool for profiling and preparing your data for model training. This automated data profiler helps you avoid common data preparation errors and identify potential issues like mis-mapped variable types that would negatively impact model performance. The profiler generates an intuitive and comprehensive report of your dataset, including variable statistics, label distribution, categorical and numeric analysis, and variable and label correlations. It provides guidance on variable types as well as an option to transform the dataset into a format that Amazon Fraud Detector requires.

Using data profiler

The automated data profiler is built with an AWS CloudFormation stack, which you can launch with a few clicks. All of the code is available on GitHub. For information on how to use the data profiler, follow the directions in our blog Train models faster with an automated data profiler for Amazon Fraud Detector.

Common event dataset errors

The following are some of the common issues Amazon Fraud Detector comes across when validating an event dataset. After you run the data profiler, use this list to check your dataset for errors before creating your model.

  • The CSV file isn’t in the UTF-8 format.

  • The number of events in the dataset is fewer than 10,000.

  • The number of events identified as fraudulent or as legitimate is fewer than 400.

  • The number of unique entities associated with a fraud event is fewer than 100.

  • More than 0.1% of the values in EVENT_TIMESTAMP contain nulls or values other than the supported date/timestamp formats.

  • More than 1% of the values in EVENT_LABEL contain nulls or values other than those defined in the event type.

  • Fewer than two variables are available for model training.
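Two of the checks in this list are easy to run locally: the UTF-8 check and the invalid-value thresholds (0.1% for EVENT_TIMESTAMP and 1% for EVENT_LABEL). The following sketch uses hypothetical function names, and the caller supplies the validity rule. It is not the validation code that Amazon Fraud Detector runs.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the raw file bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def exceeds_invalid_rate(values, is_valid, max_invalid_fraction):
    """Return True if more than max_invalid_fraction of the values are invalid."""
    values = list(values)
    if not values:
        return False
    invalid = sum(1 for value in values if not is_valid(value))
    return invalid / len(values) > max_invalid_fraction


# Illustrative data: 2 of 100 labels are empty, which is above the 1% threshold.
allowed_labels = {"fraud", "legit"}
event_labels = ["legit"] * 90 + ["fraud"] * 8 + ["", ""]
print(exceeds_invalid_rate(event_labels, lambda v: v in allowed_labels, 0.01))  # True

print(is_utf8("café".encode("utf-8")))    # True
print(is_utf8("café".encode("latin-1")))  # False
```

For the EVENT_TIMESTAMP column, pass a timestamp validity function as `is_valid` and `0.001` as the threshold.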

Dataset storage

After you gather your dataset, you store it either internally using Amazon Fraud Detector or externally with Amazon Simple Storage Service (Amazon S3). We recommend that you choose where to store your dataset based on the model you use for generating fraud predictions. For more information, see Choose a model type. The following is a detailed breakdown of these two storage options.

  • Internal storage - Your dataset is stored with Amazon Fraud Detector. All event data associated with an event is stored together. You can upload the event dataset that’s stored with Amazon Fraud Detector at any time. You can either stream events one at a time to an Amazon Fraud Detector API, or import large datasets (up to 1 GB) using the batch import feature. When you train a model using the dataset stored with Amazon Fraud Detector, you can specify a time range to limit the size of your dataset.

  • External storage - Your dataset is stored in an external data source other than Amazon Fraud Detector. Currently, Amazon Fraud Detector supports using Amazon Simple Storage Service (Amazon S3) for this purpose. If your model is trained on a file that’s uploaded to Amazon S3, that file can’t be more than 5 GB of uncompressed data. If it’s more than that, make sure to shorten the time range of your dataset.
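Before you choose a storage option, it can help to compare the size of the uncompressed CSV file with the two limits mentioned above. The following is a small sketch with hypothetical names. It assumes that 1 GB means 1024³ bytes, which the source does not specify.

```python
import os

GB = 1024 ** 3  # assumption: binary gigabytes; the source does not say which definition applies
BATCH_IMPORT_LIMIT = 1 * GB       # batch import into internal storage
S3_TRAINING_FILE_LIMIT = 5 * GB   # uncompressed training file in Amazon S3


def size_limit_checks(size_bytes):
    """Report whether a dataset of the given size fits each documented limit."""
    return {
        "fits_batch_import": size_bytes <= BATCH_IMPORT_LIMIT,
        "fits_s3_training_file": size_bytes <= S3_TRAINING_FILE_LIMIT,
    }


def file_limit_checks(path):
    """Apply the same checks to a local, uncompressed CSV file."""
    return size_limit_checks(os.path.getsize(path))


print(size_limit_checks(2 * GB))
# {'fits_batch_import': False, 'fits_s3_training_file': True}
```

If the file exceeds the Amazon S3 limit, the guidance above is to shorten the time range of the dataset.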