Data Science projects present a unique array of challenges. Synthesized can help to tackle many of them, including:
Over the last three years, synthetic data technologies have increasingly been adopted by leading insurance companies and businesses providing consumer-facing financial services. When implemented accurately, Synthesized datasets yield the same results as the original data, with the added benefits of full data privacy compliance and faster product development and testing. Generating high-quality data can take as little as 10 minutes for complex datasets.
But this is just the tip of the iceberg, and there are many other ways to exploit Synthesized data in data science and machine learning. In this white paper, we explore powerful applications of synthesized data in data science, compare different techniques and scenarios, and show how the data produced by the Synthesized platform can help in these situations.
To showcase the business benefits of Synthesized data in data science and machine learning, we focus on six realistic datasets, see Table 1.
Customers prefer personalized financial services that match their needs and lifestyle. Businesses offering customer-facing financial services meet this demand with the help of artificial intelligence and advanced data science, extracting insights from data that captures consumers' preferences, interactions, behavior, lifestyle details and interests. The personalisation of offers, policies and pricing contributes significantly to the rates of the business.
Marketing departments apply various techniques to grow their customer base and target their marketing strategies. Customer segmentation plays a pivotal role in this process. Algorithms segment customers according to their financial sophistication, age, location, etc., classifying them into groups by spotting similarities in their attitudes, preferences, behavior, or personal information. As a result, targeted cross-selling policies can be developed and personal services tailored to each segment.
A major obstacle to building and validating marketing strategies is accessing representative data about customer segments. The most valuable information for the business is commonly hidden in an under-represented customer category. For example, the online shoppers purchasing intention dataset contains 12,330 sessions, of which only 1,908 (15.47%) ended in shopping, and the credit loan default dataset contains 663 defaulters (6.6%) out of 10,000.
A way to overcome this issue is to generate new samples for the under-represented category, thereby rebalancing the dataset. Here, we compare three different techniques:
We provide further evidence and compare the three methods. To evaluate models trained on the rebalanced datasets, we use the AUC score (the area under the ROC curve), as it is a widely used metric for imbalanced datasets.
To check how the minority class proportion affects the final results, the following procedure is applied:
We compute the AUC metric as we resample data from the original dataset until we have the same number of samples for both classes. The results of 10 Monte Carlo simulations are shown in Figure 1. We can clearly observe an uptrend in the AUC score as the datasets are resampled. Of the three techniques compared in this study, the Synthesized data manipulation toolbox shows the best performance.
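The resampling loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Synthesized toolbox: it uses plain random oversampling (duplicating minority rows) as a stand-in technique, a toy dataset from scikit-learn's make_classification, and a hypothetical helper called oversample_minority. The test set is left untouched so that the AUC values at different minority proportions are comparable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy imbalanced data (roughly 7% positives) standing in for a real dataset.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.93, 0.07], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def oversample_minority(X, y, minority_fraction, seed=0):
    """Duplicate random minority rows until they form `minority_fraction` of the data.

    Hypothetical helper: simple random oversampling, not the Synthesized method.
    """
    rng = np.random.default_rng(seed)
    idx_min = np.flatnonzero(y == 1)
    idx_maj = np.flatnonzero(y == 0)
    # Solve n_min / (n_min + n_maj) = minority_fraction for n_min.
    n_min = int(round(minority_fraction * len(idx_maj) / (1.0 - minority_fraction)))
    n_extra = max(n_min - len(idx_min), 0)
    extra = rng.choice(idx_min, size=n_extra, replace=True)
    idx = np.concatenate([idx_maj, idx_min, extra])
    return X[idx], y[idx]


original_fraction = float(y_train.mean())
auc_by_fraction = {}
for fraction in [original_fraction, 0.2, 0.35, 0.5]:
    X_res, y_res = oversample_minority(X_train, y_train, fraction)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_res, y_res)
    scores = model.predict_proba(X_test)[:, 1]
    # The test set is never resampled, so the AUC values are comparable.
    auc_by_fraction[round(fraction, 3)] = roc_auc_score(y_test, scores)

for fraction, auc in auc_by_fraction.items():
    print(f"minority fraction {fraction:.3f}: AUC = {auc:.3f}")
```

Repeating the loop with different seeds gives the Monte Carlo runs. Whether the AUC rises on a given dataset depends on the data and the model, so the toy output should not be read as a reproduction of Figure 1.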
In addition, Figure 2 shows the ROC and PR curves of the different techniques for the credit loan default dataset, comparing the original with the resampled datasets. Again, models trained on resampled data outperform the model trained on the original dataset. In this case, all techniques exhibit similar behaviour, but Synthesized shows one key difference: unlike the other techniques, it protects the privacy of the data. The data scientist can still look at the data and manipulate it without viewing any sensitive user information, as the technology used in Synthesized to generate data ensures full compliance with data privacy regulations.
Previously in this section, the machine learning algorithms have been evaluated at a high level, but it’s critical to detect where precisely the algorithm is making wrong decisions, as the cost of a false negative can be huge compared to a false positive. Both credit scoring and online shopper purchasing datasets are good examples of this, as giving credit to a defaulter is much more costly than not giving credit to a non-defaulter, and similarly targeting a non-buyer is usually less expensive than losing a buyer.
Figure 3 sheds light on this matter, showing the confusion matrices for both datasets. A Random Forest is trained on the original set (left) and on the set resampled with Synthesized (right). In the first case, the majority of errors are false negatives rather than false positives, while in the resampled case the number of false negatives is drastically reduced.
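To make the cost asymmetry concrete, the sketch below counts the four confusion-matrix cells and weights the two error types differently. The labels are hand-made for illustration, and the cost values and the function names confusion_counts and misclassification_cost are assumptions, not figures or code from the study.

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Count true/false negatives and positives for binary labels (1 = positive)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {
        "tn": int(np.sum((y_true == 0) & (y_pred == 0))),
        "fp": int(np.sum((y_true == 0) & (y_pred == 1))),
        "fn": int(np.sum((y_true == 1) & (y_pred == 0))),
        "tp": int(np.sum((y_true == 1) & (y_pred == 1))),
    }


def misclassification_cost(counts, cost_fn, cost_fp):
    """Total cost when each error type carries its own (hypothetical) unit cost."""
    return counts["fn"] * cost_fn + counts["fp"] * cost_fp


# Hand-made labels for illustration only (1 = defaulter or buyer).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'tn': 5, 'fp': 1, 'fn': 2, 'tp': 2}

# Hypothetical costs: a false negative is five times as costly as a false positive.
print(misclassification_cost(counts, cost_fn=5.0, cost_fp=1.0))  # 11.0
```

With costs like these, a model that trades a few extra false positives for fewer false negatives can have a lower total cost even if its overall accuracy is unchanged.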
In summary, the experiments presented in this section showed how resampling an imbalanced dataset can heavily affect the performance of the classification model. Synthesized is simple and fast to use, making it possible to manipulate the distribution of target variables to rebalance the dataset and increase the model’s performance. Its ease of use and speed are a boon to data scientists managing a large-scale project.
Although the amount of data generated is increasing drastically, users are currently more concerned about how third parties collect, share and use their data. Ethical concerns take on greater salience, legal compliance requirements become more strict, and customers can even lose trust in companies that collect more of their data than needed.
In other words, although there is more data available, the data collection and preparation process can become arduous and expensive. Additionally, businesses might be facing situations where data projects must move forward with small datasets. Simulated datasets, if generated accurately, can provide meaningful insights and help overcome problems such as shortage of data, whilst preserving customer privacy.
One of the main drawbacks of small datasets is that it becomes more difficult to avoid over-fitting and therefore fitting complex models such as Neural Networks becomes even more laborious. Models trained on smaller datasets have higher variance as they are highly affected by small perturbations such as outliers or noise.
Figure 4 illustrates this issue. N points are sampled from a ground truth line with additive noise. Then, a ninth-degree polynomial is fitted by minimizing the sum of the squared errors on these samples. For the N=15 case (left), the regression overfits and is not able to learn the ground truth as accurately as in the N=100 case (right).
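The polynomial experiment is easy to reproduce in outline. The sketch below assumes a ground truth line y = 2x + 1, Gaussian noise with standard deviation 0.5, and evenly spaced inputs on [-1, 1]; none of these values come from the original figure. It fits a ninth-degree polynomial by least squares with numpy.polyfit and measures how far the fitted curve is from the true line, averaged over repeated noise draws.

```python
import numpy as np


def mean_fit_error(n_points, degree=9, n_trials=50, noise_std=0.5, seed=0):
    """Average squared distance between a fitted polynomial and the true line.

    All settings (line, noise level, input range) are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    grid = np.linspace(-1.0, 1.0, 200)
    truth = 2.0 * grid + 1.0  # assumed ground truth line
    errors = []
    for _ in range(n_trials):
        x = np.linspace(-1.0, 1.0, n_points)
        y = 2.0 * x + 1.0 + rng.normal(0.0, noise_std, size=n_points)
        coeffs = np.polyfit(x, y, degree)  # least-squares polynomial fit
        fitted = np.polyval(coeffs, grid)
        errors.append(np.mean((fitted - truth) ** 2))
    return float(np.mean(errors))


err_small = mean_fit_error(n_points=15)
err_large = mean_fit_error(n_points=100)
print(f"N=15:  mean squared error vs ground truth = {err_small:.4f}")
print(f"N=100: mean squared error vs ground truth = {err_large:.4f}")
```

The error is measured against the noiseless line rather than the noisy samples, so it reflects how much of the noise the polynomial has absorbed. With only 15 points the ten coefficients have little data to constrain them, which is the overfitting described above.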
To explore this problem and demonstrate how it can be overcome with Synthesized data, we set up an experiment as follows:
Figure 5 shows 10 Monte Carlo simulations of this procedure for the Absenteeism dataset, which contains only 740 rows and 21 columns. For the top two plots, each line corresponds to a different experiment, and the bottom two show the standard deviation at each step.
As can be observed in the top two plots, the averages of both F1 and AUC are quite similar for all sampling proportions, so augmenting the dataset does not improve average performance. The bottom two plots show the standard deviation across the simulations, which trends clearly downward as we add Synthesized data to the training set; in other words, the metrics become more stable. This is especially visible in the AUC score (left): the individual experiments converge towards each other (top) and the standard deviation is drastically reduced as the sample size increases (bottom).
The source of this instability is overfitting. Adding Synthesized data to the training set reduces variance, improving the model's stability and reducing its sensitivity to outliers. In other words, adding Synthesized data to the training set can be thought of as another form of regularization.
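The per-step mean and standard deviation across Monte Carlo runs can be computed as in the sketch below. It is an assumption-laden stand-in for the experiment: the training set is enlarged by drawing more real rows from a toy pool rather than by adding Synthesized data, and the dataset, model, and sizes are all hypothetical. It only shows how the quantities plotted in the top and bottom panels are obtained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Toy data: the first 2000 rows form the training pool, the rest a fixed test set.
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_pool, y_pool = X[:2000], y[:2000]
X_test, y_test = X[2000:], y[2000:]

train_sizes = [50, 100, 200, 400, 800]  # hypothetical training-set sizes
n_runs = 10  # number of Monte Carlo simulations
rng = np.random.default_rng(0)

auc = np.zeros((n_runs, len(train_sizes)))
for run in range(n_runs):
    for j, size in enumerate(train_sizes):
        # Each run draws its own random training subset of the requested size.
        idx = rng.choice(len(X_pool), size=size, replace=False)
        model = LogisticRegression(max_iter=1000)
        model.fit(X_pool[idx], y_pool[idx])
        auc[run, j] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

auc_mean = auc.mean(axis=0)  # one value per training size (top panels)
auc_std = auc.std(axis=0)    # spread across runs per training size (bottom panels)

for size, m, s in zip(train_sizes, auc_mean, auc_std):
    print(f"train size {size:4d}: AUC mean = {m:.3f}, std = {s:.3f}")
```

Each row of the auc array corresponds to one line in a top plot, and auc_std corresponds to the curve in the matching bottom plot.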
In the case described above, the regularization effect of Synthesized data made model learning more stable, but in other cases it also improves the performance metrics. The Japanese credit application dataset has been sampled from 70% to 300% of its original size to train a Gradient Boosting Machine, and the metrics are again computed on an unseen test set.
The results are displayed in Figure 6.
For a credit dataset with ten thousand rows, the results still show an improvement in the evaluation metrics. In this case, a neural network has again been fitted on the resampled data, and the outcomes are presented in Figure 7. On the top left, the AUC shows a clear increase as the sampling proportion grows, as can also be observed in the ROC curve on the right, where the original results are compared to three different samples. On the bottom left, the F1 score also shows a slight increase in average performance.
In sum, this section has demonstrated the benefits of using Synthesized for Data Augmentation, showing that it is another form of regularization. It ensures model stability for small datasets where one can easily overfit, but it can also improve model performance.
Customer behaviour predictive models, such as credit scoring or targeting, work under the assumption that past behaviour will be repeated in the future. But as the world changes, human behaviour can change, and an improperly validated model can lead to costly consequences, such as:
With proper model validation, the effects of these shifts can be minimized. In this section, we compare the stability of two models (a Random Forest and a Neural Network) and analyse how they behave if the test population is altered. To do so, we use Synthesized to generate three different scenarios, while keeping the training set the same.
The results for both models trained and tested on the raw credit scoring dataset are the following:
Looking at these metrics, one may think that the Neural Network is the best model to use, but proper model validation should be performed before deploying it. Here, Synthesized is used to generate two scenarios: one where the default rate is decreased to 5%, and another where it is increased to 20%.
Figure 8 shows the ROC curves for these scenarios and both classifiers. The Random Forest behaves much more stably than the Neural Network, whose performance degrades sharply when the population changes. The confusion matrices are shown in Figure 9. The number of false negatives (usually the most expensive error) increases for both models when the default rate goes up, but whilst the Random Forest makes only 121 errors of this kind, the Neural Network makes 432.
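A shifted test population of this kind can be approximated with plain resampling, as sketched below. This is not the Synthesized data manipulation toolbox: it simply draws existing rows with replacement until the requested default rate is reached, and the helper name resample_to_rate and the scenario size of 2,000 rows are assumptions. The label vector uses the class balance quoted earlier for the credit dataset (663 defaulters out of 10,000).

```python
import numpy as np


def resample_to_rate(y, target_rate, n_samples, seed=0):
    """Return row indices whose positive rate matches `target_rate`.

    Hypothetical helper: rows are drawn with replacement within each class.
    """
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_pos = int(round(target_rate * n_samples))
    idx = np.concatenate([
        rng.choice(pos, size=n_pos, replace=True),
        rng.choice(neg, size=n_samples - n_pos, replace=True),
    ])
    rng.shuffle(idx)  # mix the two classes so row order carries no information
    return idx


# Labels with the quoted class balance: 663 defaulters out of 10,000.
labels = np.zeros(10_000, dtype=int)
labels[:663] = 1

scenarios = {
    rate: resample_to_rate(labels, rate, n_samples=2000) for rate in (0.05, 0.20)
}
for rate, idx in scenarios.items():
    print(f"target {rate:.0%}: realised default rate {labels[idx].mean():.2%}")
```

The returned indices would be used to select the matching rows of the test features, so each trained model can be scored on every scenario while the training set stays the same.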
A data scientist has to properly validate a model before deploying it in production. In this section, we have shown how Synthesized can also be used to simulate hypothetical scenarios with the data manipulation toolbox, and ensure the model’s stability under unexpected population shifts.
Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P., 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, pp.321-357.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. New York: Springer.