Automated large-scale privacy-preserving data rebalancing for banking and insurance.

A recent survey from the Bank of England and Financial Conduct Authority found that over two-thirds of UK financial services organisations have live machine learning applications, and that usage is expected to double within the next three years. Financial organisations are well placed to derive business value, as they have access to the large and complex datasets needed to build a variety of predictive models. Example use cases include predicting the probability of default on loans, fraudulent transaction detection, customer churn prediction, anti-money laundering (AML) measures, purchasing intention and personalisation.

As businesses race to derive insight from their data, this highly regulated sector faces its own challenges: the need to protect sensitive personally identifiable information (PII) hinders collaboration, and the time needed to clean this data and check for compliance adds months to project timescales. Even when prepared, the dataset itself may be unrepresentative, making it ill-suited to developing and training accurate AI solutions. These data limitations are the most frequently cited barriers preventing finance organisations from utilising their data assets, with over 60% of company data remaining unused for analytics.

In the last three years, synthesized data technologies (also known as synthetic data) have increasingly been adopted by leading insurers and consumer-facing financial services companies to solve a number of data provisioning and data preparation challenges. When implemented accurately, models and analyses built on synthesized datasets can produce the same results as those built on the original data; the benefits include full data privacy compliance and a major reduction in the time needed for product development and testing. Synthesizing high-quality data can take as little as 10 minutes for large-scale datasets.

But this is just the tip of the iceberg. In this post we demonstrate how data rebalancing with the Synthesized core platform can quickly improve the stability and performance of crucial machine learning models in banking and insurance.

Data Rebalancing with Applications in Personalised Marketing, Customer Segmentation and Fraud Detection.

Customers always prefer to get personalized financial services which match their needs and lifestyle. Businesses offering customer-facing financial services face the challenge of ensuring that digital communication with their customers meets these demands. Highly personalized experiences are assured with the help of machine learning and advanced data science, extracting the insights from data which encapsulates consumers' preferences, interaction, behavior, lifestyle details and interests. The successful personalization of offers, policies and pricing makes a large contribution to the revenues of the business.

Marketing departments apply various techniques to increase the number of customers and to support targeted marketing strategies. Customer segmentation plays a pivotal role in this process. Algorithms segment customers according to their financial sophistication, age, location, etc., classifying them into groups by spotting similarities in their attitudes, preferences, behavior, or personal information. As a result, targeted cross-selling policies may be developed and personal services may be tailored for each particular segment.

A major obstacle to building and validating marketing strategies is getting access to representative data about customer segments. It is very common for the information most valuable to the business to be hidden in an under-represented customer category. For example, the online shoppers purchasing intention dataset contains 12,330 sessions, of which only 1,908 (15.47%) ended in a purchase, and the credit loan default dataset contains 663 defaulters (6.6%) out of 10,000. If a predictive model is trained on a biased dataset, the results will be biased. In this scenario, any ensuing wrong decisions lead to a higher customer acquisition cost and a poor experience for those targeted with inappropriate offers, making them less likely to purchase at all.

How to overcome imbalances in data?

One way to overcome this issue is to generate new samples for the under-represented category, thereby rebalancing the dataset.

We further provide a numerical illustration. To evaluate the performance of rebalanced datasets we use the so-called AUC score, as it is a widely used metric for imbalanced datasets.
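For readers less familiar with the metric, the AUC score can be read as the probability that a randomly chosen minority-class example receives a higher model score than a randomly chosen majority-class example. The following is a minimal pure-Python sketch of that definition; the function name `auc_score` and the toy labels and scores are illustrative assumptions, not data from the experiments in this post.

```python
def auc_score(labels, scores):
    """Area under the ROC curve computed by pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the
    positive example is scored higher; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 2 positives among 6 samples.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.9]
toy_auc = auc_score(labels, scores)  # 7 of the 8 pairs are ranked correctly

# A model that gives every sample the same score ranks nothing,
# so its AUC is 0.5 even though "always predict majority" looks accurate.
constant_auc = auc_score(labels, [0.0] * len(labels))
```

Because the score depends only on how positives are ranked against negatives, it is not inflated by simply predicting the majority class, which is why it suits imbalanced datasets.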

To check how the minority class proportion affects the final results, the following procedure is carried out.

  • We split the dataset into the training and test sets with ratio 4:1.
  • The training set is resampled from the original proportion to 1:1, so that both classes have the same number of samples.
  • We compute the evaluation metrics on the test set that remains unseen.
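The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `stratified_split` and `oversample_to_balance` are hypothetical, the toy rows are made up, and random duplication of minority rows stands in for the Synthesized platform, which generates new synthetic samples rather than copying existing ones.

```python
import random


def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split rows into train/test (4:1 by default), keeping class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's data is not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def oversample_to_balance(train, label_of, seed=0):
    """Randomly duplicate minority rows until all classes are the same size.

    Assumption: plain duplication is a stand-in for a synthetic data generator.
    """
    rng = random.Random(seed)
    by_class = {}
    for row in train:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 1,000 rows of (id, label) with 66 positives (6.6%).
rows = [(i, 1 if i < 66 else 0) for i in range(1000)]
label_of = lambda row: row[1]

train, test = stratified_split(rows, label_of)          # 800 / 200 rows
balanced_train = oversample_to_balance(train, label_of)  # 1:1 class ratio
```

Only the training set is rebalanced. The test set keeps the original class proportion, so the metrics computed on it reflect the conditions the model would face in production.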

We compute the AUC score as the training data is progressively resampled from its original class proportion until both classes have the same number of samples. The results of 10 Monte Carlo simulations are shown in Figure 1. We can clearly observe an upward trend in the AUC score as the datasets are resampled.

Figure 1. AUC score as the target variable is resampled for the bank churn dataset.

Figure 2 shows that a model trained on the resampled dataset outperforms one trained on the original dataset. The privacy of the data is also protected: the data scientist can still inspect and manipulate the data without coming into contact with any sensitive information about the users, as the technology Synthesized uses to generate data ensures full compliance with data privacy regulations.

Figure 2. AUC and PR curves for the credit scoring dataset, before and after resampling the dataset using different techniques.

How data rebalancing improves the performance of models.

Furthermore, it is often critical to detect where precisely the algorithm is making wrong decisions, as the cost of a false negative can be huge compared to a false positive. Both credit scoring and online shoppers purchasing datasets exemplify this matter, as giving credit to a defaulter is much more costly than not giving credit to a non-defaulter, and similarly targeting a non-buyer is usually less expensive than losing a buyer.

Figure 3 sheds light on this matter, showing the confusion matrices for both datasets. A Random Forest model is trained on the original training set (left) and on the training set resampled with Synthesized (right). In the first case, the majority of errors are false negatives rather than false positives, while in the resampled case the number of false negatives is drastically reduced.

Figure 3. Confusion matrices for the credit scoring (top) and online shoppers purchasing (bottom) datasets. On the left, the Random Forest model has been trained on the original data; on the right, Synthesized has been used to resample the training set.
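To make the asymmetry between false negatives and false positives concrete, here is a small sketch that counts the four cells of a binary confusion matrix and weights the two error types by a business cost. The labels, predictions and the 10:1 cost ratio are hypothetical values chosen for illustration; they are not taken from the experiments above.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, where 1 is the minority class."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1  # e.g. a defaulter who was granted credit
        elif t == 0 and p == 1:
            fp += 1  # e.g. a good customer who was refused credit
        else:
            tn += 1
    return tn, fp, fn, tp


def total_cost(y_true, y_pred, cost_fn, cost_fp):
    """Sum of error costs, with separate prices for each kind of mistake."""
    _, fp, fn, _ = confusion_counts(y_true, y_pred)
    return fn * cost_fn + fp * cost_fp


# Hypothetical ground truth: 4 positives (minority) and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # 3 false negatives, 0 false positives
pred_b = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]  # 1 false negative, 2 false positives

# Assumed costs: a false negative is ten times as expensive as a false positive.
cost_a = total_cost(y_true, pred_a, cost_fn=10, cost_fp=1)
cost_b = total_cost(y_true, pred_b, cost_fn=10, cost_fp=1)
```

Both sets of predictions make three errors, so their accuracy is identical, yet under the assumed costs the second is far cheaper because it shifts errors away from false negatives.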

In summary, we have presented how resampling an imbalanced dataset can heavily affect the performance of the machine learning model. Synthesized’s data rebalancing feature is simple to use and is now part of the core product offering, giving users the ability to easily manipulate the distribution of the variables to rebalance the dataset and increase the model’s performance, making the project easier and more successful for the team in charge.

Learn More

Explore the performance of data rebalancing in other common industry-specific scenarios and review the data produced by the Synthesized data provisioning platform in detail by contacting our data experts at


  1. Machine Learning in UK Financial Services:
  2. Why digital businesses should understand their data better:
  3. Artificial data give the same results as real data - without compromising privacy:
