Transcript
Hello.
Today we're going to be diving into an example of how Synthesized can be used to improve the performance of your fraud detection models.
In this example, we have a transactions dataset with 8 columns. You can see fraud is the binary target on the left.
Let’s see how good a simple model is on the existing dataset. We'll use age, gender, category, and amount as explanatory variables to try to predict fraudulent transactions.
So we get an ROC AUC of about 88%. Not bad! But we can do better by adding in some synthetic data.
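For anyone less familiar with the metric quoted here: ROC AUC can be read as the probability that a randomly chosen fraudulent transaction receives a higher model score than a randomly chosen legitimate one. The following is a minimal sketch of that definition in plain Python. The function name `roc_auc` and the toy labels and scores are assumptions made up for illustration; they are not the dataset or the code used in the demo.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a randomly chosen positive (fraud)
    row gets a higher score than a randomly chosen negative row.
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores, made up for illustration (not the demo's data).
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
auc = roc_auc(labels, scores)
print(f"ROC AUC: {auc:.3f}")
```

The pairwise loop is quadratic in the number of rows, so it is only meant to show what the number means; a library implementation would be the practical choice on a real dataset.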
First, we'll need to extract some metadata, build a generative model and train it. But this is easy with Synthesized. The training process itself doesn't take long either. Here we have 8 columns and about 20 thousand rows, and on a 4-core CPU it's going to take about 3 to 5 minutes.
Once the generative model is trained we can use it to *upsample* the number of fraudulent transactions in our training dataset and thereby amplify the signal of fraud in the dataset. Fraud datasets are typically very imbalanced with a weak signal. Synthesized can be used to highlight this signal and improve model performance.
It's finished training now. Let's use a Conditional Sampler to generate a dataset with the amount of fraud rebalanced to be 50:50.
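To make the 50:50 target concrete, here is a rough sketch of the arithmetic behind rebalancing. It does not use the Synthesized Conditional Sampler: as a plainly named stand-in it pads the data with random copies of existing fraud rows, whereas a generative model would produce new rows. The helper names `fraud_rows_needed` and `rebalance_by_duplication`, and the toy rows, are hypothetical.

```python
import random


def fraud_rows_needed(n_legit, n_fraud, target_fraction=0.5):
    """Extra fraud rows required so fraud makes up target_fraction of the data."""
    if not 0 < target_fraction < 1:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    wanted = target_fraction * n_legit / (1 - target_fraction)
    return max(0, round(wanted) - n_fraud)


def rebalance_by_duplication(rows, label_key="fraud", target_fraction=0.5, seed=0):
    """Stand-in for a generative sampler: pads the data with random copies of
    existing fraud rows. A generative model would create new rows instead."""
    fraud = [r for r in rows if r[label_key] == 1]
    legit = [r for r in rows if r[label_key] == 0]
    if not fraud:
        raise ValueError("no fraud rows to copy")
    extra = fraud_rows_needed(len(legit), len(fraud), target_fraction)
    rng = random.Random(seed)
    return rows + [dict(r) for r in rng.choices(fraud, k=extra)]


# Toy data, made up for illustration: 20 rows, 2 of them fraudulent.
rows = [{"amount": 10.0 * i, "fraud": 1 if i % 10 == 0 else 0} for i in range(1, 21)]
balanced = rebalance_by_duplication(rows)
fraud_share = sum(r["fraud"] for r in balanced) / len(balanced)
print(len(rows), len(balanced), fraud_share)
```

Duplication adds no new information about what fraud looks like, which is the gap a trained generative model is meant to fill.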
Now that we've created the new dataset, let's validate what it looks like compared to the original. We can do that with the Assessor class. Let’s save that figure and have a look.
As you can see, the fraud in the new dataset has been upsampled to a 50:50 split.
Now we can re-evaluate our model, comparing its performance when trained on the synthetic dataset with its performance when trained on the original dataset. In both cases we evaluate on some held-out original data.
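A comparison like this is generally only fair if the held-out rows come from the original data and are set aside before any rebalancing, so that both models are scored on the same untouched test set. Below is a minimal sketch of such a split in plain Python, assuming a list of row dictionaries with a binary `fraud` key. The helper `stratified_holdout` and the toy rows are hypothetical and are not part of the demo.

```python
import random


def stratified_holdout(rows, label_key="fraud", test_fraction=0.2, seed=0):
    """Split rows into (train, test), keeping the class ratio similar in both."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted({r[label_key] for r in rows}):
        group = [r for r in rows if r[label_key] == label]
        rng.shuffle(group)
        # Keep at least one row of each class in the held-out set.
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy data, made up for illustration: 100 rows, 10 of them fraudulent.
rows = [{"amount": float(i), "fraud": 1 if i % 10 == 0 else 0} for i in range(1, 101)]
train, test = stratified_holdout(rows)
print(len(train), len(test), sum(r["fraud"] for r in test))
```

Only the `train` portion would then be rebalanced or used to fit a generative model; `test` keeps the original fraud rate and is used to score both the baseline model and the model trained on rebalanced data.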
We've improved the ROC AUC here from 88% to 95%, an absolute difference of 7 percentage points. And it only took about 5 minutes to do!
This has been a walkthrough of just one of the ways Synthesized can help you extract the most out of your data. Thank you for listening.