Store your event data with Amazon S3 - Amazon Fraud Detector

If you are training an Online Fraud Insights model, you can choose to store your event data externally with Amazon S3. To store your event data in Amazon S3, you must first create a text file in CSV format, add your event data, and then upload the CSV file to an Amazon S3 bucket.

Note

The Transaction Fraud Insights model type does not support datasets stored externally with Amazon S3.
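
As a minimal sketch of the create-and-upload steps described above, the following Python code writes events to a CSV file and then uploads that file to a bucket. The column names EVENT_TIMESTAMP and EVENT_LABEL come from this guide. Everything else is an assumption made for illustration: the helper names write_events_csv and upload_to_s3, the ip_address column, the label value, the bucket name, and the object key. The upload helper relies on the boto3 SDK, which must be installed and configured with credentials separately.

```python
import csv


def write_events_csv(path, events, fieldnames):
    """Write a list of event dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(events)


def upload_to_s3(path, bucket, key):
    """Upload a local file to an S3 bucket (hypothetical helper)."""
    import boto3  # third-party AWS SDK for Python, imported only when needed

    boto3.client("s3").upload_file(path, bucket, key)


# Example usage (the bucket and key are placeholders, not real resources):
#
# events = [
#     {"EVENT_TIMESTAMP": "2019-11-30T13:01:01Z",
#      "EVENT_LABEL": "fraud",            # label value is an assumption
#      "ip_address": "192.0.2.1"},        # illustrative variable column
# ]
# write_events_csv("events.csv", events,
#                  ["EVENT_TIMESTAMP", "EVENT_LABEL", "ip_address"])
# upload_to_s3("events.csv", "amzn-s3-demo-bucket", "training/events.csv")
```

Writing the file with the csv module rather than joining strings by hand ensures that values containing commas or quotation marks are escaped correctly.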

Event timestamp formats

Ensure that your event timestamp is in the required format. As part of the model build process, the Online Fraud Insights model type orders your data based on the event timestamp, and splits your data for training and testing purposes. To get a fair estimate of performance, Amazon Fraud Detector first trains the model on the training dataset, and then tests the trained model on the test dataset.

Amazon Fraud Detector supports the following date/timestamp formats for the values in EVENT_TIMESTAMP during model training:

  • %yyyy-%mm-%ddT%hh:%mm:%ssZ (ISO 8601 standard in UTC only with no milliseconds)

    Example: 2019-11-30T13:01:01Z

  • %yyyy/%mm/%dd %hh:%mm:%ss (AM/PM)

    Examples: 2019/11/30 1:01:01 PM, 2019/11/30 13:01:01

  • %mm/%dd/%yyyy %hh:%mm:%ss (AM/PM)

    Examples: 11/30/2019 1:01:01 PM, 11/30/2019 13:01:01

  • %mm/%dd/%yy %hh:%mm:%ss (AM/PM)

    Examples: 11/30/19 1:01:01 PM, 11/30/19 13:01:01

Amazon Fraud Detector makes the following assumptions when parsing date/timestamp formats for event timestamps:

  • If you are using the ISO 8601 standard, it must be an exact match of the preceding specification.

  • If you are using one of the other formats, there is additional flexibility:

    • For months and days, you can provide single or double digits. For example, 1/12/2019 is a valid date.

    • You do not need to include hh:mm:ss if you do not have them (that is, you can simply provide a date). You can also provide only the hour and minutes (for example, hh:mm). Providing only the hour is not supported. Milliseconds are also not supported.

    • If you provide AM/PM labels, a 12-hour clock is assumed. If there is no AM/PM information, a 24-hour clock is assumed.

    • You can use “/” or “-” as delimiters for the date elements. “:” is assumed for the timestamp elements.
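
To catch badly formatted timestamps before you upload a file, you can approximate the parsing rules above locally. The following Python sketch is not the parser that Amazon Fraud Detector uses. The function name parse_event_timestamp is hypothetical, and several behaviors are assumptions: AM/PM labels are matched in uppercase only, mixed date delimiters are accepted, and two-digit years are expanded with Python's %y convention.

```python
import re
from datetime import datetime
from typing import Optional

# Strict ISO 8601 form: UTC only, no milliseconds.
ISO_UTC = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z", re.ASCII)

# Flexible forms: yyyy/mm/dd, mm/dd/yyyy, or mm/dd/yy with "/" or "-",
# optionally followed by hh:mm or hh:mm:ss and an AM/PM label.
FLEXIBLE = re.compile(
    r"(?:(?P<y1>\d{4})[/-](?P<m1>\d{1,2})[/-](?P<d1>\d{1,2})"
    r"|(?P<m2>\d{1,2})[/-](?P<d2>\d{1,2})[/-](?P<y2>\d{4}|\d{2}))"
    r"(?:\s+(?P<hour>\d{1,2}):(?P<minute>\d{2})(?::(?P<second>\d{2}))?"
    r"(?:\s*(?P<ampm>AM|PM))?)?",
    re.ASCII,
)


def parse_event_timestamp(value: str) -> Optional[datetime]:
    """Return a datetime if value fits one of the formats, else None."""
    text = value.strip()
    try:
        if "T" in text:
            # The ISO form must match exactly.
            if not ISO_UTC.fullmatch(text):
                return None
            return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")

        match = FLEXIBLE.fullmatch(text)
        if match is None:
            return None

        year_text = match["y1"] or match["y2"]
        if len(year_text) == 4:
            year = int(year_text)
        else:
            # Assumption: expand two-digit years the way %y does.
            year = datetime.strptime(year_text, "%y").year
        month = int(match["m1"] or match["m2"])
        day = int(match["d1"] or match["d2"])
        hour = int(match["hour"] or 0)
        minute = int(match["minute"] or 0)
        second = int(match["second"] or 0)

        if match["ampm"]:
            # An AM/PM label implies a 12-hour clock.
            if not 1 <= hour <= 12:
                return None
            hour = hour % 12 + (12 if match["ampm"] == "PM" else 0)

        return datetime(year, month, day, hour, minute, second)
    except ValueError:
        # Out-of-range fields, such as month 13 or hour 25.
        return None
```

A value such as 11/30/2019 13 returns None because providing only the hour is not supported, and a value with milliseconds returns None for the same reason.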

Sampling your dataset across time

We recommend that you provide examples of fraud and legitimate samples from the same time range. For example, if you provide fraud events from the past 6 months, you should also provide legitimate events that evenly span the same time period. If your dataset contains a highly uneven distribution of fraud and legitimate events, you might receive the following error: "The fraud distribution across time is unacceptably fluctuant. Cannot split dataset properly." Typically, the easiest fix for this error is to ensure that the fraud events and legitimate events are sampled evenly across the same timeframe. You also might need to remove data if you experienced a large spike in fraud within a short time period.

If you cannot generate enough data to create an evenly distributed dataset, one approach is to randomize the EVENT_TIMESTAMP of your events such that they are evenly distributed. However, this often results in performance metrics being unrealistic because Amazon Fraud Detector uses EVENT_TIMESTAMP to evaluate models on the appropriate subset of events in your dataset.
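
Before training, you can check whether fraud and legitimate events cover the same time range by counting labels for each calendar month. The following Python sketch illustrates one way to do this. It does not reproduce the check that Amazon Fraud Detector performs. The function names and the label values fraud and legit are assumptions, and each event is assumed to be a pair of an already parsed datetime and its label.

```python
from collections import Counter, defaultdict
from datetime import datetime


def label_counts_by_month(events):
    """Count labels per calendar month.

    events is an iterable of (datetime, label) pairs.
    Returns a dict that maps "YYYY-MM" to a Counter of labels,
    ordered by month.
    """
    counts = defaultdict(Counter)
    for timestamp, label in events:
        counts[timestamp.strftime("%Y-%m")][label] += 1
    return dict(sorted(counts.items()))


def months_missing_label(counts, labels=("fraud", "legit")):
    """List the months in which at least one expected label never occurs."""
    return [
        month
        for month, counter in counts.items()
        if any(counter[label] == 0 for label in labels)
    ]
```

A month that appears in the result contains only one kind of event, which suggests that the two classes were not sampled from the same time period. Months with no events at all do not appear in the counts, so check gaps in the month sequence separately.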

Null and missing values

Amazon Fraud Detector handles null and missing values. However, the percentage of nulls for variables should be limited. EVENT_TIMESTAMP and EVENT_LABEL columns should not contain any missing values.
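
To see how many values are missing in each column before you upload a file, you can compute the fraction of empty values per column. The following Python sketch assumes that a missing value appears as an empty field. The function names and the null_tokens parameter are hypothetical, and this guide does not state a specific limit for the percentage of nulls, so the sketch only reports the fractions.

```python
import csv
import io

# Columns that this guide says should not contain missing values.
REQUIRED_COLUMNS = ("EVENT_TIMESTAMP", "EVENT_LABEL")


def null_fractions(csv_text, null_tokens=("",)):
    """Return the fraction of missing values for each column.

    A value counts as missing if the field is absent from the row or if,
    after stripping whitespace, it equals one of null_tokens (assumption).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fractions = {}
    for column in reader.fieldnames or []:
        missing = sum(
            1
            for row in rows
            if row[column] is None or row[column].strip() in null_tokens
        )
        fractions[column] = missing / len(rows) if rows else 0.0
    return fractions


def required_columns_with_gaps(fractions):
    """List required columns that are absent or contain missing values."""
    return [
        column
        for column in REQUIRED_COLUMNS
        if fractions.get(column, 1.0) > 0
    ]
```

If your data encodes nulls with a marker such as NULL, pass that marker in null_tokens along with the empty string.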

File validation

Amazon Fraud Detector fails to train a model if any of the following conditions occur:

  • The CSV file cannot be parsed.

  • The data type for a column is incorrect.
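
You can run a rough local check for both conditions before you upload a file. The following Python sketch is not the validation that Amazon Fraud Detector performs. The function name validate_csv and the column_types mapping, which pairs a column name with a Python conversion function such as int or float, are assumptions made for illustration.

```python
import csv
import io


def validate_csv(csv_text, column_types):
    """Return a list of problems; an empty list means none were found.

    column_types maps a column name to a callable such as int or float
    that raises ValueError for an invalid value (hypothetical schema).
    """
    try:
        # strict=True makes the reader raise csv.Error on malformed input.
        rows = list(csv.reader(io.StringIO(csv_text), strict=True))
    except csv.Error as exc:
        return [f"CSV could not be parsed: {exc}"]
    if not rows:
        return ["CSV is empty"]

    problems = []
    header = rows[0]
    # The header is row 1, so data rows are numbered from 2.
    for row_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_number}: expected {len(header)} fields, got {len(row)}"
            )
            continue
        for name, value in zip(header, row):
            convert = column_types.get(name)
            if convert is None or value == "":
                continue  # untyped column or missing value: nothing to check
            try:
                convert(value)
            except ValueError:
                problems.append(
                    f"row {row_number}: {name}={value!r} is not a valid "
                    f"{convert.__name__}"
                )
    return problems
```

For example, calling validate_csv(text, {"amount": float}) reports every row whose amount value cannot be read as a number, where amount is an illustrative column name.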