Online fraud insights - Amazon Fraud Detector

Online fraud insights

Amazon Fraud Detector Online Fraud Insights is a supervised machine learning model designed to detect a variety of online fraud and risks. Because the model is supervised, it requires historical examples of both fraudulent and legitimate events for training.

The Online Fraud Insights model uses an ensemble of machine learning algorithms for data enrichment, transformation, and fraud classification. As part of the model training process, Online Fraud Insights enriches raw data elements such as IP address and BIN number with third-party data such as the geolocation of the IP address or the issuing bank for a credit card. In addition to third-party data, Online Fraud Insights uses deep learning algorithms that leverage fraud patterns seen at Amazon and AWS. These fraud patterns become input features to your model, which is trained using a gradient tree boosting algorithm.

To increase performance, Online Fraud Insights optimizes the hyperparameters of the gradient tree boosting algorithm through a Bayesian optimization process. It sequentially trains dozens of different models with varying model parameters (such as number of trees, depth of trees, and number of samples per leaf) and with different optimization strategies, such as upweighting the minority fraud population to handle very low fraud rates.

By adapting the variables included in the model, Online Fraud Insights can be used to detect a variety of online fraud including:

  • New account fraud: Accurately distinguish between legitimate and high-risk customer account registrations so you can selectively introduce additional steps or checks based on risk.

  • Online payment fraud: Reduce online payment fraud by flagging suspicious online payment transactions before processing payments and fulfilling orders.

  • Guest checkout fraud: Spot potential fraudsters among customers without transaction histories. You can send as few as two pieces of data from a guest checkout order (for example, email and IP address) to assess its potential fraud risk.

  • Fake reviews abuse: Detect potentially fraudulent or fake reviews so that you can review them before they are posted.

Preparing training data

Amazon Fraud Detector imports data only from files in the comma-separated values (CSV) format. The first row of your CSV file must contain column headers, and those headers must map to the variables defined in the event type. For an example dataset, see Get and upload example training data.

To train a model, Online Fraud Insights requires a training dataset that has at least two variables, 10,000 total examples, and 400 examples of fraud. Amazon Fraud Detector automatically rebalances your dataset if the percentage of events associated with fraud is less than 5%. This improves model performance and results in faster and more stable model tuning.
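The minimums above can be checked locally before uploading a file. The following is a minimal sketch, not part of Amazon Fraud Detector: the function name `check_minimums` and its return shape are hypothetical, and the thresholds are simply the figures quoted in the preceding paragraph.

```python
MIN_VARIABLES = 2      # minimum number of event variables
MIN_ROWS = 10_000      # minimum total examples
MIN_FRAUD = 400        # minimum examples labeled as fraud
REBALANCE_RATE = 0.05  # below this fraud rate, the service rebalances the data


def check_minimums(labels, fraud_label, n_variables):
    """Summarize whether a list of EVENT_LABEL values meets the stated minimums.

    This is a hypothetical local helper; it does not call any AWS API.
    """
    total = len(labels)
    fraud = sum(1 for label in labels if label == fraud_label)
    problems = []
    if n_variables < MIN_VARIABLES:
        problems.append("fewer than 2 variables")
    if total < MIN_ROWS:
        problems.append("fewer than 10,000 examples")
    if fraud < MIN_FRAUD:
        problems.append("fewer than 400 fraud examples")
    fraud_rate = fraud / total if total else 0.0
    return {
        "problems": problems,
        "fraud_rate": fraud_rate,
        "will_be_rebalanced": fraud_rate < REBALANCE_RATE,
    }


# Illustrative data only: 400 fraud events among 10,000 rows.
summary = check_minimums(["fraud"] * 400 + ["legit"] * 9_600, "fraud", n_variables=2)
```

An empty `problems` list means the dataset meets the three stated minimums. The `will_be_rebalanced` flag only reports that the fraud rate is under 5%.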

In addition to the event variables, the training dataset must contain the following headers:

  • EVENT_TIMESTAMP: Defines when the event occurred. For more information, see Timestamp formats.

  • EVENT_LABEL: Classifies the event as fraudulent or legitimate. The values in the column must correspond to the values defined in the event type.

For example, the following sample CSV data maps to the event type shown in Create event types. This data represents historical registration events from an online merchant:

EVENT_TIMESTAMP,EVENT_LABEL,ip_address,email_address
4/10/2019 11:05,fraud,209.146.137.48,fake_burtonlinda@example.net
12/20/2018 20:04,legit,203.0.112.189,fake_davidbutler@example.org
3/14/2019 10:56,legit,169.255.33.54,fake_shelby76@example.net
1/3/2019 8:38,legit,192.119.44.26,fake_curtis40@example.com
9/25/2019 3:12,legit,192.169.85.29,fake_rmiranda@example.org

A simplified version of the corresponding event type is represented below. The event variables correspond to the headers in the CSV file and the values in EVENT_LABEL correspond to the values in the labels list.

(
    name = 'sample_registration',
    eventVariables = ['ip_address', 'email_address'],
    labels = ['legit', 'fraud'],
    entityTypes = ['sample_customer']
)
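The mapping between CSV headers and event variables can be verified with a few lines of Python before upload. This is a minimal sketch that uses only the standard library. The helper name `check_headers` is hypothetical, and the sample text reuses the first row of the data shown above.

```python
import csv
import io

# Headers that every training file must contain, in addition to the event variables.
REQUIRED_HEADERS = {"EVENT_TIMESTAMP", "EVENT_LABEL"}


def check_headers(csv_text, event_variables):
    """Return (missing, unexpected) header sets for CSV content given as text."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = set(next(reader))  # the first row holds the column headers
    expected = REQUIRED_HEADERS | set(event_variables)
    return expected - headers, headers - expected


sample = (
    "EVENT_TIMESTAMP,EVENT_LABEL,ip_address,email_address\n"
    "4/10/2019 11:05,fraud,209.146.137.48,fake_burtonlinda@example.net\n"
)
missing, unexpected = check_headers(sample, ["ip_address", "email_address"])
```

Both sets are empty when the file's headers line up with the event type definition.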

Data set guidance

The following guidance for training data will help you get the most out of your Online Fraud Insights model.

Gathering data

We recommend that you collect a minimum of six weeks of historic data, though three to six months of data is preferable. The maximum file size is 5 GB. If the recommended six weeks of historic data exceeds this limit, shorten the time range of your extraction.

Model variables

The Online Fraud Insights model requires at least two variables for model training. Generally, the more variables you provide the better the model can differentiate between fraud and legitimate events.

EVENT_TIMESTAMP

You must include the header EVENT_TIMESTAMP. Ensure your event timestamp is in the required format. For more information, see Timestamp formats. As part of the model build process, the Online Fraud Insights model type orders your data based on the event timestamp, and splits your data for training and testing purposes. To get a fair estimate of performance, the model first trains on the training dataset, then tests this model on the test dataset.
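To illustrate what a time-ordered split means, here is a minimal sketch. It is not the service's implementation: the function name `time_ordered_split` is hypothetical, and the 80/20 split fraction is an assumption, because this page does not state the ratio that Online Fraud Insights uses.

```python
from datetime import datetime


def time_ordered_split(events, train_fraction=0.8):
    """Sort events by EVENT_TIMESTAMP and split them into (train, test).

    train_fraction=0.8 is an assumed value for illustration only.
    """
    ordered = sorted(events, key=lambda event: event["EVENT_TIMESTAMP"])
    cut = int(len(ordered) * train_fraction)
    # Earlier events train the model; the most recent events are held out.
    return ordered[:cut], ordered[cut:]


events = [
    {"EVENT_TIMESTAMP": datetime(2019, 4, 10, 11, 5), "EVENT_LABEL": "fraud"},
    {"EVENT_TIMESTAMP": datetime(2018, 12, 20, 20, 4), "EVENT_LABEL": "legit"},
    {"EVENT_TIMESTAMP": datetime(2019, 3, 14, 10, 56), "EVENT_LABEL": "legit"},
    {"EVENT_TIMESTAMP": datetime(2019, 1, 3, 8, 38), "EVENT_LABEL": "legit"},
    {"EVENT_TIMESTAMP": datetime(2019, 9, 25, 3, 12), "EVENT_LABEL": "legit"},
]
train, test = time_ordered_split(events)
```

Because the held-out events are all later than the training events, the test set resembles the future data the model will score, which is why an ordered split gives a fairer performance estimate than a random one.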

EVENT_LABEL

You must include the header EVENT_LABEL. The Online Fraud Insights model requires a minimum of 400 observations that are identified and labeled as “fraud”.

Data and label maturity

Ensure that records used to train the model have had sufficient time to “mature”, that is, enough time has passed to ensure “legitimate” and “fraud” records have been correctly identified. The maturity period depends on your business, and can range from two weeks to 90 days. For example, for chargeback fraud, it often takes 30 to 60 days (or more) to correctly identify fraudulent events. For the best model performance, ensure that all records in your training dataset are mature. For example, if your maturity period is 30 days, ensure that the latest records in your dataset are at least 30 days old.
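One way to apply a maturity period when extracting data is to drop every record newer than a cutoff date. The sketch below assumes events are dictionaries with a parsed `EVENT_TIMESTAMP`. The helper name `drop_immature` and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def drop_immature(events, maturity_days, as_of):
    """Keep only events that are at least maturity_days old as of the given time."""
    cutoff = as_of - timedelta(days=maturity_days)
    return [event for event in events if event["EVENT_TIMESTAMP"] <= cutoff]


events = [
    {"EVENT_TIMESTAMP": datetime(2019, 11, 30, 13, 1, 1), "EVENT_LABEL": "legit"},
    {"EVENT_TIMESTAMP": datetime(2019, 12, 15, 9, 0, 0), "EVENT_LABEL": "legit"},
]
# With a 30-day maturity period and an extraction date of 2019-12-31,
# the cutoff is 2019-12-01, so only the first event is kept.
mature = drop_immature(events, maturity_days=30, as_of=datetime(2019, 12, 31))
```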

Sampling

The Online Fraud Insights training process samples and partitions historic data based on EVENT_TIMESTAMP. There is no need to manually sample the data, and doing so may negatively impact your model results.

Nulls and missing values

Online Fraud Insights handles null and missing values. However, the percentage of nulls for variables should be limited. EVENT_TIMESTAMP and EVENT_LABEL columns should not contain any missing values.
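To see how many values are missing in each column before training, you can compute a per-column null fraction. This is a minimal sketch using the standard library. The helper name `null_fractions` is hypothetical, and treating an empty or whitespace-only field as null is an assumption, since this page does not define how nulls are encoded.

```python
import csv
import io


def null_fractions(csv_text):
    """Return {column: fraction of rows where the value is missing or blank}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    total = 0
    for row in reader:
        total += 1
        for name in counts:
            value = row.get(name)
            # Assumption: a missing, empty, or whitespace-only field counts as null.
            if value is None or value.strip() == "":
                counts[name] += 1
    return {name: (count / total if total else 0.0) for name, count in counts.items()}


sample = (
    "EVENT_TIMESTAMP,EVENT_LABEL,ip_address\n"
    "1/3/2019 8:38,legit,\n"
    "9/25/2019 3:12,legit,192.169.85.29\n"
)
fractions = null_fractions(sample)
```

For the two required columns the fraction should be 0.0. For other variables, a high fraction is a signal to review how that field is collected.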

Timestamp formats

Amazon Fraud Detector supports the following date/timestamp formats for the values in EVENT_TIMESTAMP during model training:

  • %yyyy-%mm-%ddT%hh:%mm:%ssZ (ISO 8601 standard in UTC only with no milliseconds)

    Example: 2019-11-30T13:01:01Z

  • %yyyy/%mm/%dd %hh:%mm:%ss (AM/PM)

    Examples: 2019/11/30 1:01:01 PM, or 2019/11/30 13:01:01

  • %mm/%dd/%yyyy %hh:%mm:%ss

    Examples: 11/30/2019 1:01:01 PM, 11/30/2019 13:01:01

  • %mm/%dd/%yy %hh:%mm:%ss

    Examples: 11/30/19 1:01:01 PM, 11/30/19 13:01:01

Amazon Fraud Detector makes the following assumptions when parsing date/timestamp formats for event timestamps:

  • If you are using the ISO 8601 standard, the value must exactly match the specification above.

  • If you are using one of the other formats, there is additional flexibility:

    • For months and days, you can provide single or double digits. For example, 1/12/2019 is a valid date.

    • You do not need to include hh:mm:ss if you do not have them (e.g. you can simply provide a date). You can also provide a subset of just the hour and minutes (e.g. hh:mm). Just providing hour is not supported. Milliseconds are also not supported.

    • If you provide AM/PM labels, a 12-hour clock is assumed. If there is no AM/PM information, a 24-hour clock is assumed.

    • You can use “/” or “-” as delimiters for the date elements. “:” is assumed for the timestamp elements.
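The parsing rules above can be approximated locally to catch bad timestamps before upload. The following sketch is an illustration built on Python's `datetime.strptime`, not the parser that Amazon Fraud Detector uses, so edge cases may differ. The function name `parse_event_timestamp` is hypothetical, and AM/PM matching relies on the default C/English locale.

```python
import re
from datetime import datetime

# Strict ISO 8601 shape: UTC only, no milliseconds, two digits per field.
ISO_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")
DATE_FORMATS = ["%Y/%m/%d", "%m/%d/%Y", "%m/%d/%y"]
TIME_FORMATS_24H = ["", " %H:%M:%S", " %H:%M"]    # date only, or a 24-hour time
TIME_FORMATS_12H = [" %I:%M:%S %p", " %I:%M %p"]  # 12-hour time with AM/PM


def parse_event_timestamp(value):
    """Parse one EVENT_TIMESTAMP value or raise ValueError if unsupported."""
    value = value.strip()
    if "T" in value:
        if not ISO_PATTERN.fullmatch(value):
            raise ValueError(f"not an exact ISO 8601 match: {value!r}")
        return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    # "/" and "-" are both accepted as date delimiters.
    normalized = value.replace("-", "/")
    has_meridiem = normalized.upper().endswith(("AM", "PM"))
    time_formats = TIME_FORMATS_12H if has_meridiem else TIME_FORMATS_24H
    for date_format in DATE_FORMATS:
        for time_format in time_formats:
            try:
                return datetime.strptime(normalized, date_format + time_format)
            except ValueError:
                continue
    raise ValueError(f"unsupported timestamp: {value!r}")


parsed = parse_event_timestamp("11/30/2019 1:01:01 PM")
```

Hour-only values such as `11/30/2019 13` and values with milliseconds match none of the candidate formats, so they raise `ValueError`, which mirrors the restrictions listed above.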

Training data validations

As part of the training process, Online Fraud Insights will validate the training dataset for data quality issues that may impact model training. After validating the data, Amazon Fraud Detector will take appropriate action to build the best possible model. This includes issuing warnings for potential data quality issues, automatically removing variables that have data quality issues, or issuing an error and stopping the model training process.

Amazon Fraud Detector will fail to train a model if any of the following conditions are triggered:

  • The CSV file cannot be parsed.

  • The data type for a column is incorrect.

  • The number of rows is fewer than 10,000.

  • The number of rows identified as fraud or legitimate is fewer than 400.

  • More than 0.1% of the values in EVENT_TIMESTAMP are null or are not in a supported date/timestamp format.

  • More than 1% of the values in EVENT_LABEL are null or are not among the values defined in the event type.

  • Fewer than two variables are available for model training.
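As an example of checking one of these conditions ahead of time, the sketch below computes the share of EVENT_LABEL values that are null or not defined in the event type and compares it with the 1% limit. The helper names are hypothetical, and representing a null as `None` or an empty string is an assumption.

```python
MAX_LABEL_ERROR_RATE = 0.01  # training fails if more than 1% of labels are invalid


def label_error_rate(labels, allowed_labels):
    """Return the fraction of labels that are null or not in allowed_labels."""
    if not labels:
        return 0.0
    invalid = sum(
        1 for label in labels
        # Assumption: None and the empty string both represent a null label.
        if label is None or label == "" or label not in allowed_labels
    )
    return invalid / len(labels)


def labels_pass_validation(labels, allowed_labels):
    """True when the invalid-label share does not exceed the 1% limit."""
    return label_error_rate(labels, allowed_labels) <= MAX_LABEL_ERROR_RATE


allowed = {"legit", "fraud"}
# Illustrative data only: 1 invalid label among 200 rows (0.5%).
labels = ["legit"] * 190 + ["fraud"] * 9 + [""]
ok = labels_pass_validation(labels, allowed)
```

The same pattern applies to EVENT_TIMESTAMP with a 0.1% limit, given a function that decides whether each value is in a supported format.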