A Synthesized case study

Introduction

In this mini case study, we explore a smaller bank and a large bank that faced contrasting data-related problems within a similar use case, and how the Synthesized SDK was implemented to solve both challenges and create positive business outcomes.

South American digital bank

Synthesized worked with a neo-bank headquartered in South America with a digital banking platform that aims to improve the lives of unbanked and underbanked individuals.

The digital bank’s challenge

The bank was planning an expanded launch of its consumer banking services in Brazil. Because it was new to the market, it had only a limited number of banking customers. Its fraud detection machine learning models underperformed because data was scarce, both on normal behavior and on known fraudsters. As a result, the bank could not reach 100% coverage with its models by relying solely on raw data.

The solution

The Synthesized Scientific Data Kit (SDK) was evaluated and quickly implemented so the bank could generate additional data to augment its existing dataset. In the initial use case, the bank used the SDK to grow the dataset from a few hundred rows to a few thousand rows and to represent a much higher incidence of fraud. This allowed it to better train its fraud ML models and achieve wider coverage than before.

Business outcome

As a result of the model performance improvements, the bank significantly reduced the costs associated with fraudulent transactions, which in turn increased its profitability.

Large American bank

This large American bank had a slightly different problem from the smaller bank. While the smaller bank had too little data of any kind, on normal behavior and on fraud alike, to train its ML models, the larger bank had plenty of data overall but too few fraud cases to train its models accurately.

The challenge

The large bank had 100k rows of raw data, of which only 0.3% (about 300 rows) represented fraud. Such a heavily imbalanced dataset makes it hard for fraud machine learning models to learn what fraud looks like. Left unaddressed, the imbalance would make the models less accurate at detecting fraudulent transactions, which could be costly given the number of fraudulent transactions that would be missed.
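To make the imbalance concrete, here is a minimal sketch in plain Python using only the figures quoted above (100k rows, 0.3% fraud). The always-normal baseline and the function name `accuracy_and_recall` are illustrative assumptions, not part of the SDK or the bank's actual models. It shows why a model can look accurate on such data while catching no fraud at all.

```python
def accuracy_and_recall(labels, predictions):
    """Return (accuracy, fraud recall) for binary labels where 1 means fraud."""
    pairs = list(zip(labels, predictions))
    correct = sum(1 for y, p in pairs if y == p)
    fraud_total = sum(labels)
    fraud_caught = sum(1 for y, p in pairs if y == 1 and p == 1)
    accuracy = correct / len(labels)
    recall = fraud_caught / fraud_total if fraud_total else 0.0
    return accuracy, recall


# Figures from the text: 100k rows, 0.3% of them fraud.
TOTAL_ROWS = 100_000
FRAUD_ROWS = TOTAL_ROWS * 3 // 1000  # 300 fraud rows

labels = [1] * FRAUD_ROWS + [0] * (TOTAL_ROWS - FRAUD_ROWS)

# Hypothetical baseline: a "model" that labels every transaction as normal.
always_normal = [0] * TOTAL_ROWS

accuracy, recall = accuracy_and_recall(labels, always_normal)
print(f"accuracy={accuracy:.3f}, fraud recall={recall:.3f}")
```

The baseline is right on 99.7% of rows yet detects none of the 300 fraud cases, which is why overall accuracy alone is a poor guide on a dataset this skewed.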

The solution

To solve this problem, the bank used the Synthesized SDK to generate 100k rows of synthetic tabular data, upsampling the fraud cases so that the synthetic dataset had a 70/30 split between normal transactions and fraudulent ones. The bank analyzed the synthetic data to confirm its quality and suitability for its machine learning models. It then trained its ML model on the rebalanced dataset and saw significant improvements in performance.
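The target class counts behind a 70/30 split can be sketched as follows. This is only an illustration under stated assumptions: it uses naive random oversampling with replacement, which duplicates existing rows, whereas the bank used the SDK to generate new synthetic rows. The function `rebalance` and the toy rows are hypothetical and do not reflect the SDK's API.

```python
import random


def rebalance(normal_rows, fraud_rows, total, fraud_share, seed=0):
    """Build `total` rows in which `fraud_share` of the rows are fraud.

    Rows are drawn with replacement, so minority rows are duplicated.
    This is naive oversampling, used here only to show the class counts.
    """
    rng = random.Random(seed)
    n_fraud = round(total * fraud_share)
    n_normal = total - n_fraud
    sample = rng.choices(fraud_rows, k=n_fraud) + rng.choices(normal_rows, k=n_normal)
    rng.shuffle(sample)
    return sample


# Toy stand-in for the raw data: 99,700 normal rows and 300 fraud rows.
# Each row is (row_id, is_fraud); real rows would carry transaction features.
normal = [(i, 0) for i in range(99_700)]
fraud = [(i, 1) for i in range(99_700, 100_000)]

balanced = rebalance(normal, fraud, total=100_000, fraud_share=0.30)
fraud_count = sum(is_fraud for _, is_fraud in balanced)
print(fraud_count, len(balanced) - fraud_count)
```

With these inputs the rebalanced set holds 30,000 fraud rows and 70,000 normal rows, compared with 300 and 99,700 in the raw data. Generating synthetic rows instead of duplicating real ones avoids training the model on many exact copies of the same few fraud cases.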

Business outcome

After integrating synthetic data into several of its machine learning models, the bank saw a 1-2% improvement in accuracy in about 60% of the models, and in some cases an improvement of up to 17%. The bank's models can now detect fraudulent transactions with a higher success rate, reducing the risk of financial loss from fraud.

Conclusion

The two mini case studies show how the SDK served both a small and a large bank facing distinct data challenges. By using synthetic data generated by the SDK, both banks improved the performance of their fraud machine learning models, which ultimately led to significant cost savings and increased profitability. The smaller bank augmented its existing datasets with synthetic data, while the larger bank used synthetic data to rebalance its imbalanced datasets. Together, the two cases illustrate the potential benefits of synthetic data for data-related problems in the financial industry and the effectiveness of the SDK in realizing those benefits.