April 30, 2021

Solving Data Imbalance with Synthetic Data

Will a customer purchase this product? Is this transaction fraudulent? Is this a picture of a cat or a dog? These are high-value business problems - well, maybe not the last one - that can be solved with the appropriate machine learning techniques. However, as always, data is king, and without access to high-quality, balanced data the efforts to answer these questions will be fruitless.

Most real-world datasets are highly skewed and show bias towards a particular outcome, category or segment - especially those related to the detection of rare events. 

For example, consider the problem of predicting whether a credit card transaction is fraudulent. Fortunately for the lenders, the overwhelming majority of purchases are legitimate. Unfortunately for the data scientist, their dataset of transactions will contain only a faint signal of fraudulent activity; predicting fraud is a highly imbalanced classification task. 

Common Pitfalls in Applying Machine Learning Techniques for Imbalanced Classification

When applying machine learning techniques for imbalanced classification, one may encounter a number of pitfalls: some models are unsuitable, model explainability may suffer and unwanted biases may be propagated.

When large dataset imbalances are present, models can achieve apparently stellar performance just by predicting everything as the majority outcome. For example, with a dataset that contains 99% legitimate transactions, such a model would have an accuracy of 99%. Great, right? Unfortunately not!

Once the imbalance is taken into account, this is far less impressive: the real value lies in correctly identifying the transactions that were fraudulent, and such a model finds none of them. (For the curious, more appropriate metrics to look at in this case are precision and recall.)
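To make the point concrete, here is a minimal sketch using a made-up toy dataset of 990 legitimate and 10 fraudulent transactions (the labels and sizes are illustrative assumptions, not the data used later in this post). A "model" that always predicts the majority class scores 99% accuracy while catching no fraud at all.

```python
# Toy labels (assumed for illustration): 1 = fraud, 0 = legitimate.
y_true = [0] * 990 + [1] * 10
# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged as fraud
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud that was missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate correctly passed

accuracy = (tp + tn) / len(y_true)
# Recall: share of real fraud that was found.
recall = tp / (tp + fn) if (tp + fn) else 0.0
# Precision is undefined when nothing is flagged; it is reported as 0.0 here.
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Accuracy rewards the 990 easy cases, whereas recall exposes that every one of the 10 fraudulent transactions was missed.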

Traditional Dataset Rebalancing Techniques

There is a range of well-studied and utilised techniques that aim to solve the problem of class imbalance, and these fall into two categories: 

  1. Sampling-based methods, which aim to augment and reshape the underlying data; 
  2. Model-based methods, which directly constrain how a model can learn from the data.

Sampling-based techniques aim to ‘rebalance’ the data, ensuring there is an equal representation of each outcome. The simplest approach is to randomly undersample the majority outcome, or oversample the minority. The drawbacks here are clear: there is either a reduction in the training size or a duplication of records, leading to a reduction in data variability that can result in model overfitting.
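The two simple strategies can be sketched in a few lines of standard-library Python. The helper names `random_undersample` and `random_oversample`, and the toy rows, are hypothetical and exist only for this illustration.

```python
import random


def random_undersample(majority, minority, rng):
    """Keep every minority row and an equally sized random subset of majority rows."""
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, rng):
    """Keep every row and add randomly chosen minority duplicates until sizes match."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


rng = random.Random(0)  # fixed seed so the sketch is repeatable
majority = [("legit", i) for i in range(990)]
minority = [("fraud", i) for i in range(10)]

under_major, under_minor = random_undersample(majority, minority, rng)
over_major, over_minor = random_oversample(majority, minority, rng)

print(len(under_major), len(under_minor))  # training size shrinks to 10 + 10
print(len(over_major), len(over_minor))    # sizes match at 990 + 990
print(len(set(over_minor)))                # but only 10 distinct fraud rows exist
```

Both drawbacks are visible in the sizes: undersampling throws away 980 legitimate rows, while oversampling yields 990 fraud rows that are copies of the same 10 records.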

More advanced techniques rely on creating new data points for the minority outcome to achieve a balanced distribution of classes. SMOTE (Synthetic Minority Oversampling Technique) is one such method, available in open-source projects such as imbalanced-learn.

However, it is not based on a statistical understanding of the data, and is problematic with non-continuous variables and high-dimensional datasets. For complex datasets, SMOTE often does not provide an advantage over random sampling of the original data.
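The core idea behind SMOTE is to create a new minority point somewhere on the line segment between an existing minority point and one of its nearest minority neighbours. The following is a simplified sketch of that interpolation step only, not the imbalanced-learn implementation; the function name `smote_sketch` and the toy points are assumptions made for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between a minority point
    and one of its k nearest minority neighbours (numeric features only)."""
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest neighbours of `base` among the other minority points.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6, k=2, rng=random.Random(0))
for point in new_points:
    print(point)
```

Because every new point is a blend of two existing rows, the approach assumes that "halfway between" is meaningful for each feature, which is why categorical columns are awkward. In practice one would use the `SMOTE` class from `imblearn.over_sampling` and its `fit_resample(X, y)` method rather than a hand-rolled version.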

Scalable Rebalancing Solutions

Alternatively, synthetic data produced with the Synthesized platform can be used to rebalance, and we believe this is the most scalable and powerful technique. With a deep understanding of the data, our approach can go beyond simple rebalancing of individual classes, and enables tweaking and reshaping of arbitrary groups within a dataset. This allows users to generate a range of custom scenarios for testing and development purposes.

Additionally, the Synthesized platform provides a powerful all-in-one solution for common data science tasks, e.g. data-augmentation and missing value imputation, all whilst being privacy preserving by design.

The Test: Rebalancing with Synthetic Data versus Original Dataset

To demonstrate rebalancing with synthetic data, we apply this method to the Kaggle credit card fraud detection dataset. It contains anonymised credit card transaction details, of which approximately 99.8% are legitimate and the remaining 0.2% are fraudulent; an extreme, but realistic data imbalance for this type of problem.

To predict the fraudulent transactions, we train separate logistic regression classifiers on:

  • the original dataset,
  • a balanced dataset created using SMOTE, and
  • a balanced synthetic dataset created with the Synthesized platform.

These trained models are then evaluated and compared on an unseen sample of the original imbalanced dataset.
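One detail worth spelling out is the order of operations: the evaluation sample is set aside first and keeps its original imbalance, and only the training rows are rebalanced. The sketch below illustrates that ordering with a hypothetical `stratified_split` helper and toy rows; it is not the code used for the experiment, and naive duplication stands in for whichever rebalancing method is applied.

```python
import random


def stratified_split(rows, label_of, test_fraction, rng):
    """Split rows into train/test so each class keeps roughly the same share in both."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's list is not shuffled
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


rng = random.Random(0)
# Toy rows of (id, label), with label 1 = fraud; sizes are assumed for illustration.
rows = [(i, 0) for i in range(990)] + [(i, 1) for i in range(990, 1000)]
train, test = stratified_split(rows, label_of=lambda r: r[1], test_fraction=0.2, rng=rng)

# Rebalance ONLY the training rows; the test rows are left untouched.
train_legit = [r for r in train if r[1] == 0]
train_fraud = [r for r in train if r[1] == 1]
balanced_train = train_legit + [rng.choice(train_fraud) for _ in range(len(train_legit))]

test_fraud_share = sum(r[1] for r in test) / len(test)
print(len(balanced_train), len(test), test_fraud_share)
```

Splitting before rebalancing ensures that no copy or derivative of a test row leaks into training, so the reported scores reflect performance on realistically imbalanced data.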

Before getting to the results, it is interesting to understand what our synthetic fraud examples look like. 

One method to achieve this is to visualise a 2-dimensional representation of the Synthesized dataset using a UMAP embedding. With this we can identify how distinct fraudulent and non-fraudulent transactions are, and whether there is any significant clustering.

Visualisation of a 2-dimensional representation of the Synthesized dataset using a UMAP embedding

Each red point is a completely new synthetic example of fraud that the Synthesized platform has been able to generate from only a small sample of real fraud examples. The fact that the two clusters are separate indicates that there is a distinct difference between them. The smaller clusters indicate that there is a variety of synthetic examples, and they aren't all duplicated data points with the same characteristics.

So, onto the results...

How Well Can We Predict Fraud with Our Three Datasets? 

Unfortunately, there is no obvious metric to use, as the ideal choice depends on the costs to the business of missing fraud or incorrectly flagging legitimate transactions. However, we can look at the area under the ROC curve (ROC-AUC), together with a confusion matrix, to understand how well the model can find and correctly predict fraudulent activity, shown below.
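For readers unfamiliar with these two metrics, here is a small self-contained sketch of how they are computed. The labels and scores are invented for illustration and are unrelated to the results in this post; the ROC-AUC is calculated through its rank interpretation, the probability that a randomly chosen fraud case receives a higher score than a randomly chosen legitimate one.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = fraud."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: six legitimate transactions followed by two fraudulent ones.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.60, 0.15, 0.40, 0.90]  # model's fraud scores
y_pred = [int(s >= 0.5) for s in scores]  # flag as fraud at a 0.5 threshold

matrix = confusion_matrix(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(matrix)         # (5, 1, 1, 1)
print(round(auc, 3))  # 0.917
```

The confusion matrix depends on the chosen threshold, whereas ROC-AUC summarises ranking quality across all thresholds, which is why the two are read together.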

Fraud Prediction Comparison in 3 Datasets

Key Takeaways of the Analysis

  • It is clear that on the original imbalanced dataset, the model can only correctly find approx. 60% of the fraudulent transactions; it is biased towards classifying everything as non-fraudulent and has a ROC-AUC score of 0.93. 
  • With SMOTE we see a significant improvement, with the model able to identify almost 80% of the fraud cases, and a ROC-AUC of 0.96. 
  • Looking at our balanced Synthesized dataset, we obtain even better results -- all cases of fraud in the test data have been successfully identified and the ROC-AUC increases to 0.99! One possible reason for this is the larger variety of fraud cases that can be generated using synthetic data. 

However, you may notice that this comes at the cost of an increased false positive rate (legitimate transactions incorrectly classified as fraud). This is an inherent trade-off between the precision and recall of a classifier, and is a well-understood phenomenon that occurs with resampling techniques.
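The trade-off can be illustrated with a toy example in which the decision threshold is lowered step by step, used here as a stand-in for a model that has become more willing to flag fraud. The scores are invented and the helper `precision_recall` is hypothetical; this is a sketch of the general effect rather than a reproduction of the results above.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is flagged as fraud."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented example: six legitimate transactions followed by four fraudulent ones.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.25, 0.35, 0.55, 0.70, 0.30, 0.60, 0.80, 0.95]

results = {}
for threshold in (0.75, 0.5, 0.2):
    results[threshold] = precision_recall(y_true, scores, threshold)
    precision, recall = results[threshold]
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
# threshold=0.75: precision=1.00 recall=0.50
# threshold=0.5: precision=0.60 recall=0.75
# threshold=0.2: precision=0.50 recall=1.00
```

As more fraud is caught, more legitimate transactions are flagged alongside it, so the right operating point depends on the relative business cost of each kind of error.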


To summarise, data imbalance is a problem that affects most real-world datasets, and must be handled correctly when training predictive models. Synthesized offers a powerful solution for data scientists to rebalance their datasets with high-quality synthetic data that may produce significantly better results than conventional techniques.

In addition, the Synthesized platform can solve a number of common problems for data scientists, all in the same solution, with privacy-preservation by design.