Every year there are millions of credit card fraud victims, and the costs for the credit card issuer can be huge. A fast and reliable fraud detection system can drastically change the financial performance of the credit card business, and such a system relies heavily on historical data to understand how fraud works and to prevent it.
This dataset (from the Kaggle dataset repository) contains historical data on 594,643 transactions for 4,112 different users. The target variable (fraud) flags fraudulent payments, and the 7 other columns contain a time step identifier, personal information about the payer (an identifier, age and gender), and specifics about the transaction (merchant, category and amount).
The objective is to train an ML model that outputs a fraud probability for each transaction in a subsample of the dataset. The prediction system has to strike a balance: it should catch as much fraud as possible without being so strict that non-fraudulent transactions are blocked too often, which can hurt customer satisfaction. ROC AUC is therefore a good metric, as it summarizes the trade-off between the true positive rate and the false positive rate across all decision thresholds.
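To make the metric concrete, here is a minimal sketch of ROC AUC computed from its pairwise definition: the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one. The function name `roc_auc` and the toy labels and scores are made up for illustration and are not taken from the dataset; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (fraud, legitimate) pairs ranked correctly.

    labels: 1 for fraud, 0 for legitimate (hypothetical toy encoding).
    scores: predicted fraud probabilities, one per transaction.
    """
    fraud_scores = [s for y, s in zip(labels, scores) if y == 1]
    legit_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not fraud_scores or not legit_scores:
        raise ValueError("ROC AUC needs at least one example of each class")

    correctly_ranked = 0.0
    for f in fraud_scores:
        for g in legit_scores:
            if f > g:
                correctly_ranked += 1.0   # fraud scored above legitimate
            elif f == g:
                correctly_ranked += 0.5   # ties count as half
    return correctly_ranked / (len(fraud_scores) * len(legit_scores))


# Toy example: 2 fraudulent and 4 legitimate transactions (invented values).
labels = [0, 0, 1, 0, 1, 0]
scores = [0.05, 0.20, 0.90, 0.60, 0.40, 0.10]
auc = roc_auc(labels, scores)
print(auc)  # 7 of the 8 fraud/legitimate pairs are ranked correctly: 0.875
```

Because the value depends only on how transactions are ranked, it does not require choosing a blocking threshold up front. The pairwise loop is quadratic, so this form is only suitable for small illustrative inputs.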
Given the temporal dimension of this dataset, it can also be treated as a time-series problem to exploit the temporal relationship between samples.
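One way the temporal dimension might be respected during evaluation is to train on earlier time steps and test on later ones, rather than shuffling transactions at random. The sketch below assumes each transaction is a dictionary with a `step` key holding the time step identifier; the key names, the helper `split_by_step` and the toy rows are hypothetical and not taken from the dataset.

```python
def split_by_step(rows, cutoff_step, step_key="step"):
    """Split transactions so the test set lies strictly after the training set.

    rows: list of dicts, one per transaction (hypothetical layout).
    cutoff_step: last time step that is still part of the training set.
    """
    train = [row for row in rows if row[step_key] <= cutoff_step]
    test = [row for row in rows if row[step_key] > cutoff_step]
    return train, test


# Invented toy transactions; values are for illustration only.
rows = [
    {"step": 0, "amount": 4.55, "fraud": 0},
    {"step": 0, "amount": 39.68, "fraud": 0},
    {"step": 1, "amount": 26.89, "fraud": 0},
    {"step": 2, "amount": 544.10, "fraud": 1},
    {"step": 3, "amount": 17.25, "fraud": 0},
]

train_rows, test_rows = split_by_step(rows, cutoff_step=1)
print(len(train_rows), len(test_rows))  # 3 2
```

Splitting on the time step keeps information from later transactions out of the training set, which a random split would not guarantee.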
Although this dataset can make a huge difference to the credit card issuer's performance, it has some problems that complicate its use. Luckily, Synthesized can solve these problems in a fast and intuitive way.
This dataset is publicly available in the "Synthetic data from a financial payment system" Kaggle dataset repository.