DETAILED ASSESSMENT OF THE QUALITY OF SYNTHETIC DATA AGAINST ORIGINAL CREDIT DATA

Unlock Data Utility with the Synthesized DataOps Platform

TECHNICAL SUMMARY

Across a wide range of statistical metrics, the Synthesized dataset closely resembles the original data whilst preserving the privacy of the original dataset.

Key insights discovered through analysis of the Synthesized data apply to the original dataset.

The performance of Machine Learning models trained on the Synthesized data closely matches the performance of models trained on the original dataset.

The Synthesized data platform makes data generation configurable: more data can be added when needed, and the proportion of scarce segments of the original dataset can be increased.

The Synthesized platform can correct any imbalances in the dataset. Many machine learning methods perform better when this imbalance is addressed.

Introduction

The credit approval and assignment process can drastically affect financial outcomes for banks and their customers. To properly assess the risk of opening a credit line for a given applicant, a bank must rely on personal and behavioural data such as previous lending operations, debts, marital status and the applicant’s financial behaviour. Credit scoring algorithms, which estimate the probability of default, are the techniques banks use to determine whether or not a loan should be granted.

The use of AI in credit scoring, if executed properly, can benefit both the banks and their potential customers. The banks can mitigate risks and speed up decision-making processes, reduce the number of possibly non-performing loans and boost revenues. For customers in need of credit the use of AI technology enables faster and broader access to better lending products.

The dataset

The initial credit dataset, taken from Kaggle, consists of rows describing potential borrowers from a bank. Originally, this dataset was used to produce a Machine Learning model that would help banks predict whether a customer may soon become delinquent, so that they can mitigate loan losses by limiting the credit they extend to that customer. However, it may not be possible for a bank to share a customer dataset externally, or even internally, due to data privacy concerns.

Using a synthetic dataset generated by the Synthesized platform is therefore a valuable alternative that preserves the privacy of customer data.

Synthesized data

Before we dive into the details, let’s start with what Synthesized data is. Unlike typical data masking and data anonymization tools, which can be easily attacked by modern techniques and drastically reduce the data quality, our DataOps platform automatically models complex interactions and hidden features in datasets to generate high-quality data products at any scale while maintaining data utility. This data looks and behaves exactly like original data, but consists of entirely new data points, leading to faster, more accurate training of models. Our unique approach learns the complex statistical relationships in the data, enabling automatic generation of new realistic samples at any volume while preserving the full quality and performance of the original set.

| Column | Type | Description |
|---|---|---|
| SeriousDlqin2yrs | Binary | An indicator as to whether the person has experienced serious delinquency in a 2 year window. |
| NumberOfTime30-59DaysPastDueNotWorse | Integer | Number of times the borrower has been 30-59 days past due. |
| NumberOfTime60-89DaysPastDueNotWorse | Integer | Number of times the borrower has been 60-89 days past due. |
| NumberOfTimes90DaysLate | Integer | Number of times the borrower has been 90 days or more past due. |
| NumberRealEstateLoansOrLines | Integer | Number of mortgage and real estate loans. |
| NumberOfDependents | Integer | Number of dependents in family. |
| RevolvingUtilizationOfUnsecuredLines | Percentage | Total outstanding credit as a proportion of total credit limits. |
| age | Integer | Age of borrower in years. |
| DebtRatio | Percentage | Debt payments divided by monthly income. |
| MonthlyIncome | Real | Monthly income. |
| NumberOfOpenCreditLinesAndLoans | Integer | Number of open loans and lines of credit. |

TABLE 1: Schema for the credit dataset

In this scenario, the Synthesized dataset preserves many of the statistical properties of the original data whilst protecting privacy of customers. Important insights discovered through analysis of the Synthesized data apply to the original dataset and the machine learning models trained on the Synthesized data will perform just as well as models trained on the original.

Utility

We apply a range of statistical metrics to demonstrate that the Synthesized data captures the important features of the original dataset. In addition, we train several machine learning models on the synthetic data to predict specific features, and show that it retains the predictive capability of the original data.

Metrics

Below we present a summary of the metrics we used to compare the original and synthesized data and measure the quality of our dataset.

Univariate Metrics

Differences between corresponding columns in the original and Synthesized datasets, for both continuous and categorical attributes, are measured using two metrics:

  • Earth-Mover’s Distance (EMD) is used for comparing categorical columns. This discrete metric makes few assumptions and takes into account the distance between values if they lie on an ordinal scale.
  • Kolmogorov-Smirnov Distance (KSD) is used for comparing continuous columns. It makes few assumptions about the shape of the true data distribution and is part of a common non-parametric statistical test.
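
As a rough illustration of the two univariate metrics, the sketch below computes them with the Python standard library only. The function names and toy values are hypothetical, and the report does not state how the platform normalises EMD, so the raw distances here may be scaled differently from the values reported later. For nominal categories the sketch assumes every pair of distinct categories is one unit apart, in which case EMD reduces to the total variation distance.

```python
from bisect import bisect_right
from collections import Counter


def _ecdf(sorted_values, x):
    """Fraction of the sample that is <= x."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def ks_distance(original, synthetic):
    """Kolmogorov-Smirnov distance: largest gap between the two empirical CDFs."""
    a, b = sorted(original), sorted(synthetic)
    return max(abs(_ecdf(a, x) - _ecdf(b, x)) for x in set(a) | set(b))


def emd_ordinal(original, synthetic):
    """Earth mover's distance for numeric or ordinal codes (area between the CDFs)."""
    a, b = sorted(original), sorted(synthetic)
    points = sorted(set(a) | set(b))
    return sum(
        abs(_ecdf(a, left) - _ecdf(b, left)) * (right - left)
        for left, right in zip(points, points[1:])
    )


def emd_nominal(original, synthetic):
    """Earth mover's distance when all distinct categories are one unit apart."""
    p, q = Counter(original), Counter(synthetic)
    return 0.5 * sum(
        abs(p[k] / len(original) - q[k] / len(synthetic)) for k in set(p) | set(q)
    )


# Toy columns (made-up values, not taken from the credit dataset).
original_col = [0, 0, 0, 1, 2]
synthetic_col = [0, 0, 1, 1, 2]
print("KSD:", ks_distance(original_col, synthetic_col))
print("EMD (ordinal):", emd_ordinal(original_col, synthetic_col))
print("EMD (nominal):", emd_nominal(original_col, synthetic_col))
```

For larger datasets, SciPy offers `scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance`, which compute the same quantities for numeric samples.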

Interaction Metrics

The relationships between columns are compared to ensure that the full distribution of the original data has been learnt. Three main metrics are used:

  • The Kendall-Tau Rank Correlation Coefficient is used to measure the correlation between continuous attributes in each dataset.
  • Cramér’s V is applied to measure the association between categorical attributes.
  • McFadden’s pseudo-R2 is used to measure the association between categorical and continuous attributes. In practice, this is obtained by training a logistic regression model to predict the categorical attribute and assessing the difference in performance when the continuous attribute is present and when it is absent. We refer to this as the Categorical Logistic Regression Correlation.
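
The three interaction metrics can be sketched in plain Python as follows. This is an illustrative approximation, not the platform's implementation: the function names are hypothetical, Cramér's V is computed without bias correction, and the pseudo-R2 helper assumes a binary (0/1) categorical attribute and a single continuous predictor, fitted with a few Newton steps.

```python
from collections import Counter
from itertools import combinations
from math import exp, log, sqrt


def kendall_tau(x, y):
    """Kendall's tau-b: rank correlation with a correction for ties."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both columns: counted in neither total
        if dx == 0:
            tied_x += 1
        elif dy == 0:
            tied_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    denominator = sqrt((pairs + tied_x) * (pairs + tied_y))
    return (concordant - discordant) / denominator if denominator else 0.0


def cramers_v(a, b):
    """Cramér's V from the chi-squared statistic of the contingency table."""
    n = len(a)
    joint, rows, cols = Counter(zip(a, b)), Counter(a), Counter(b)
    chi2 = sum(
        (joint[(r, c)] - rows[r] * cols[c] / n) ** 2 / (rows[r] * cols[c] / n)
        for r in rows for c in cols
    )
    k = min(len(rows), len(cols)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


def _sigmoid(t):
    """Numerically stable logistic function."""
    return 1.0 / (1.0 + exp(-t)) if t >= 0 else exp(t) / (1.0 + exp(t))


def mcfadden_r2(x, y, iterations=25):
    """1 - LL(model with x) / LL(intercept only), for a binary target y."""
    n = len(y)
    base = sum(y) / n
    if not 0 < base < 1:
        raise ValueError("y must contain both classes")
    mean = sum(x) / n
    scale = sqrt(sum((v - mean) ** 2 for v in x) / n) or 1.0
    z = [(v - mean) / scale for v in x]  # standardise for stable Newton steps
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [_sigmoid(b0 + b1 * zi) for zi in z]
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * zi for yi, pi, zi in zip(y, p, z))
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * zi for wi, zi in zip(w, z))
        h11 = sum(wi * zi * zi for wi, zi in zip(w, z))
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break  # near-singular Hessian, e.g. perfectly separable data
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    p = [min(max(_sigmoid(b0 + b1 * zi), 1e-12), 1 - 1e-12) for zi in z]
    ll_full = sum(log(pi) if yi else log(1 - pi) for yi, pi in zip(y, p))
    ll_null = n * (base * log(base) + (1 - base) * log(1 - base))
    return 1 - ll_full / ll_null


# Toy values (made up for illustration, not taken from the credit dataset).
print("Kendall tau:", kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))
print("Cramér's V:", cramers_v([0, 0, 1, 1], ["x", "x", "y", "y"]))
print("pseudo-R2:", mcfadden_r2([1, 2, 3, 4, 5, 6, 7, 8], [0, 0, 0, 1, 0, 1, 1, 1]))
```

Computing each metric on the original data and again on the Synthesized data, then comparing the two results pair by pair, gives the kind of side-by-side view that the heatmaps later in this report present.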

Synthesized data

We plot the different columns of both the original and Synthesized data, with the original columns in black and the Synthesized columns in green.

FIGURE 1: The categorical data

FIGURE 2: Continuous Data

Univariate Metrics

We provide the distance metrics for each column below: values below 0.1 are usually indicative of a column that closely resembles the original dataset.

| Column | Earth-Mover’s Distance | Kolmogorov-Smirnov Distance |
|---|---|---|
| SeriousDlqin2yrs | 0.007 | - |
| NumberOfTime30-59DaysPastDueNotWorse | 0.015 | - |
| NumberOfTimes90DaysLate | 0.014 | - |
| NumberRealEstateLoansOrLines | 0.026 | - |
| NumberOfTime60-89DaysPastDueNotWorse | 0.007 | - |
| NumberOfDependents | 0.044 | - |
| RevolvingUtilizationOfUnsecuredLines | - | 0.046 |
| age | - | 0.018 |
| DebtRatio | - | 0.030 |
| MonthlyIncome | - | 0.026 |
| NumberOfOpenCreditLinesAndLoans | - | 0.006 |

TABLE 2: Single column metrics of the Synthesized data

FIGURE 1: Single column metrics plotted

Interaction Metrics

Below we present the interaction metrics that measure correlations and associations between attributes. Darker squares in the heatmap indicate that columns are more strongly associated with one another. In all cases, we find that interactions for both the original and Synthesized data are very similar:

FIGURE 2: Cramér’s V values for both original and Synthesized data

FIGURE 3: Kendall-Tau Correlation values for both original and Synthesized data

FIGURE 4: Categorical Logistic Regression Correlation values for both original and Synthesized data

Machine learning modelling

Finally, to show that Machine Learning models trained on the Synthesized data retain the predictive capability of models trained on the original data, we plot the R2 score of a range of different classifiers. Each classifier is trained to predict the chance of a serious delinquency in the next 2 years, once on each dataset, and is then evaluated on a separate test set of real data.

Figure 5: R2 Scores for multiple classifiers on Synthesized and original data

Additionally, as the number of customers experiencing such a delinquency is low in the dataset, we can improve the accuracy of these models by rebalancing the Synthesized data so that there are more examples of delinquency in the dataset.
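
To make the idea of rebalancing concrete, the sketch below raises the share of delinquent rows to a chosen target. It is only a stand-in: it duplicates existing minority rows at random (simple oversampling), whereas the platform is described as generating new synthetic rows for scarce segments. The function name, the `target_share` parameter and the toy rows are all hypothetical.

```python
import math
import random


def oversample_minority(rows, label_key, target_share, seed=0):
    """Duplicate random positive rows until they reach target_share of the data."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be strictly between 0 and 1")
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    if not positives:
        raise ValueError("no positive rows to oversample")
    rng = random.Random(seed)
    # Solve P / (P + N) = target_share for the required number of positives P.
    wanted = math.ceil(target_share * len(negatives) / (1 - target_share))
    extra = [rng.choice(positives) for _ in range(max(0, wanted - len(positives)))]
    balanced = list(rows) + extra
    rng.shuffle(balanced)
    return balanced


# Toy data: 10 delinquent rows out of 100 (made-up values).
toy_rows = [{"SeriousDlqin2yrs": 1, "age": 30 + i} for i in range(10)]
toy_rows += [{"SeriousDlqin2yrs": 0, "age": 40 + i % 30} for i in range(90)]

balanced_rows = oversample_minority(toy_rows, "SeriousDlqin2yrs", target_share=0.5)
share = sum(r["SeriousDlqin2yrs"] for r in balanced_rows) / len(balanced_rows)
print(len(balanced_rows), share)
```

Rebalancing should be applied to the training data only; the test set should keep the original class proportions so that the reported scores reflect real-world performance.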

Figure 6: R2 Scores for multiple classifiers on Synthesized and original data

References

The credit dataset is publicly available from the “GiveMeSomeCredit!” Kaggle competition.