Solution Components - Fraud Detection Using Machine Learning

Solution Components

Amazon SageMaker

Fraud Detection Using Machine Learning uses an Amazon SageMaker notebook instance, which is a fully managed Amazon Elastic Compute Cloud (Amazon EC2) instance for machine learning (ML) that runs the solution’s Jupyter notebook. The notebook trains and deploys the solution’s ML model. For more information on notebook instances, see Use Notebook Instances in the Amazon SageMaker Developer Guide.

Algorithm

SageMaker provides several built-in ML algorithms that you can use for a variety of problem types. This solution uses the built-in Random Cut Forest algorithm for unsupervised learning and the built-in XGBoost algorithm for supervised learning. For more information, see How Random Cut Forest Works and How XGBoost Works in the SageMaker Developer Guide.
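As a rough illustration of how the two built-in algorithms are configured, the following Python sketch builds a hyperparameter dictionary for each. The hyperparameter names (feature_dim, num_trees, num_samples_per_tree, objective, eval_metric, num_round) are documented for the SageMaker built-in algorithms, but the helper functions and every value shown are assumptions chosen for illustration. They are not the settings used by the solution’s notebook.

```python
def rcf_hyperparameters(feature_dim, num_trees=50, num_samples_per_tree=512):
    """Hyperparameters for Random Cut Forest (unsupervised).

    feature_dim is the number of input columns; the tree settings are
    illustrative values, not the solution's actual configuration.
    """
    return {
        "feature_dim": feature_dim,
        "num_trees": num_trees,
        "num_samples_per_tree": num_samples_per_tree,
    }


def xgboost_hyperparameters(num_round=100):
    """Hyperparameters for XGBoost (supervised, binary fraud / not-fraud label).

    num_round is an illustrative value, not the solution's actual setting.
    """
    return {
        "objective": "binary:logistic",  # predict a fraud probability
        "eval_metric": "auc",            # ranking metric suited to rare positives
        "num_round": num_round,
    }


if __name__ == "__main__":
    # 30 = 28 PCA features + time + amount (see the Dataset section).
    print(rcf_hyperparameters(feature_dim=30))
    print(xgboost_hyperparameters())
```

Random Cut Forest needs no labels and only the input dimension, whereas XGBoost is trained against the known fraud label, which is why only the XGBoost dictionary carries an objective and an evaluation metric.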

Dataset

Fraud Detection Using Machine Learning includes a publicly available, anonymized credit card transaction dataset that is used to train the solution’s ML model. The dataset was collected and analyzed during a research collaboration between Worldline and the Machine Learning Group of Université Libre de Bruxelles on big data mining and fraud detection. It consists of credit card transactions made by European cardholders over a two-day period in 2013. To preserve cardholder anonymity, the original features were transformed using Principal Component Analysis (PCA), resulting in 28 continuous PCA features plus two additional features representing time and amount. Because the dataset is derived from real data, it is highly imbalanced: fraudulent transactions make up only 0.172% of the total transactions. For more information, see Appendix B.
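The effect of the 0.172% fraud rate can be made concrete with a little arithmetic. The following Python sketch uses only the figures stated above (28 PCA features, two additional features, and the fraud rate). The function names are hypothetical, and the suggestion of feeding the class ratio into XGBoost’s scale_pos_weight parameter is a common practice offered as an assumption, not something this guide confirms the solution does.

```python
FRAUD_RATE = 0.172 / 100   # fraction of transactions that are fraudulent
N_PCA_FEATURES = 28        # anonymized PCA components
N_OTHER_FEATURES = 2       # time and amount


def feature_count():
    """Total number of input features per transaction."""
    return N_PCA_FEATURES + N_OTHER_FEATURES


def majority_baseline_accuracy(fraud_rate):
    """Accuracy of a trivial model that labels every transaction legitimate."""
    return 1.0 - fraud_rate


def legit_to_fraud_ratio(fraud_rate):
    """Legitimate transactions per fraudulent one.

    A ratio like this is often used as a starting value for XGBoost's
    scale_pos_weight (an assumption here, not the solution's documented choice).
    """
    return (1.0 - fraud_rate) / fraud_rate


if __name__ == "__main__":
    print("features per transaction:", feature_count())
    print("all-legitimate baseline accuracy:", majority_baseline_accuracy(FRAUD_RATE))
    print("legitimate-to-fraud ratio:", round(legit_to_fraud_ratio(FRAUD_RATE), 1))
```

A model that never flags fraud would still be correct on more than 99.8% of transactions, so plain accuracy is a poor measure for this dataset. Metrics that focus on the rare class, such as precision, recall, or area under the curve, are usually more informative.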