Generative modelling for unstructured data has received enormous attention over the last five years, with hundreds of papers published on different architectures. Its applications have also drawn strong interest from developer communities, with GANs being a notable example.
Yet generative modelling for structured data has received significantly less attention, despite the business need being arguably much greater.
The business need is primarily driven by two core problems: data quality and data privacy.
In this blog post we share three valuable applications of generative modelling for structured data in machine learning.
Datasets commonly contain fewer data points in some regions of their domain, and this can degrade model performance if not treated carefully. Generative models can be used to reshape datasets and upsample regions where the density is low. This is especially useful for imbalanced datasets and so-called simulated-data scenarios.
There are a few different techniques for overcoming this problem. One of the most common is to oversample the dataset to obtain a new dataset with a balanced marginal distribution for the target class; a classifier trained on this rebalanced dataset typically produces far fewer false negatives. With a generative model, it is possible to generate a new dataset with exactly the desired distribution.
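As a minimal sketch of this workflow, assuming a generic tabular generative model (the `tabular_gen` package, the `TabularGenerativeModel` class, and its `fit`/`sample` interface are hypothetical stand-ins for illustration, not the Synthesized SDK API):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

from tabular_gen import TabularGenerativeModel  # hypothetical package

# Imbalanced dataset with a rare positive class (assumes numeric features).
df = pd.read_csv("transactions.csv")

# Learn the joint distribution of the data.
model = TabularGenerativeModel()
model.fit(df)

# Draw enough synthetic minority-class rows to balance the target.
n_needed = int((df["is_fraud"] == 0).sum() - (df["is_fraud"] == 1).sum())
synthetic_minority = model.sample(n_needed, conditions={"is_fraud": 1})

# Train the classifier on the rebalanced dataset.
balanced = pd.concat([df, synthetic_minority], ignore_index=True)
clf = RandomForestClassifier(random_state=0)
clf.fit(balanced.drop(columns="is_fraud"), balanced["is_fraud"])
```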
The confusion matrices below compare the performance of a Random Forest classifier trained on the original (imbalanced) dataset and on a conditionally resampled dataset, for two different source datasets. As can be observed, false negatives decreased and true positives increased.
Another application of data reshaping is to modify distributions in the validation set, and quantify how robust a model is to population shifts and how it performs in different data segments. The goal is to answer questions such as “how does my model behave for younger and older users?” or “if due to unexpected circumstances the fraud rate is doubled next year, will my model still be able to detect this?”
In this case, a classifier is trained on the original data, and its performance is evaluated on different datasets created from a generative model. The image below shows the performance of a Random Forest and a Neural Network trained on the same dataset, but evaluated in three different scenarios. Although the Neural Network has better performance on the original set, the Random Forest is much more robust under population shifts.
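A sketch of this evaluation loop, under the same hypothetical `TabularGenerativeModel` interface as above (the `conditions` argument and the column names are illustrative assumptions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

from tabular_gen import TabularGenerativeModel  # hypothetical package

df = pd.read_csv("transactions.csv")
train, val = df.iloc[: len(df) // 2], df.iloc[len(df) // 2 :]

# Train once on the original (unshifted) training data.
clf = RandomForestClassifier(random_state=0)
clf.fit(train.drop(columns="is_fraud"), train["is_fraud"])

# Learn the validation distribution, then sample shifted variants of it.
model = TabularGenerativeModel()
model.fit(val)
scenarios = {
    "original": val,
    # Hypothetical conditioning: double the marginal fraud rate.
    "fraud rate doubled": model.sample(
        len(val), conditions={"is_fraud": 2 * val["is_fraud"].mean()}
    ),
    # Hypothetical conditioning: restrict to younger users.
    "younger users": model.sample(len(val), conditions={"age": "< 30"}),
}

# Score the fixed model on every scenario.
for name, data in scenarios.items():
    scores = clf.predict_proba(data.drop(columns="is_fraud"))[:, 1]
    print(f"{name}: ROC AUC = {roc_auc_score(data['is_fraud'], scores):.3f}")
```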
Data acquisition processes are not perfect, which results in erroneous data points and missing values, among other problems. General-purpose generative models can address this by imputing the affected values with synthetic ones.
To do so, the distribution previously learned by the generative model is used to replace specific values with synthetic ones. The new values preserve the statistical properties of the original dataset, while the remaining values are returned as they are. After imputation, the user can use the whole dataset directly, without having to drop rows containing outliers, missing values, or erroneous data.
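A minimal sketch, again using the hypothetical `TabularGenerativeModel` interface (the `impute` method and file name are assumptions for illustration):

```python
import pandas as pd

from tabular_gen import TabularGenerativeModel  # hypothetical package

df = pd.read_csv("customers.csv")  # contains missing values (NaNs)

# Learn the joint distribution from the complete rows.
model = TabularGenerativeModel()
model.fit(df.dropna())

# Hypothetical: fill only the missing cells with synthetic values;
# all observed values are passed through unchanged.
imputed = model.impute(df)
assert imputed.notna().all().all()
```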
There are different techniques to impute missing values. The figure below compares three of the most popular data imputation techniques (KNN, MICE and Simple Imputer; see references for more information) with imputation by a general-purpose generative model.
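For reference, the three classical baselines can be run with scikit-learn as follows (a small self-contained example; `IterativeImputer` is scikit-learn's MICE-style imputer and still requires the experimental enable import):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Toy matrix with missing entries.
X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 4.0]])

imputers = {
    "SimpleImputer (mean)": SimpleImputer(strategy="mean"),
    "KNNImputer": KNNImputer(n_neighbors=2),
    "IterativeImputer (MICE-style)": IterativeImputer(random_state=0),
}
for name, imputer in imputers.items():
    print(name)
    print(imputer.fit_transform(X))
```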
Generative modelling enables powerful anonymisation of sensitive datasets whilst maintaining the utility that is often lost when applying classical anonymisation methods such as masking and grouping.
This approach to data anonymisation has a clear advantage compared to classical methods: the anonymised dataset no longer maintains a one-to-one mapping back to the original data, as every row is generated by a generative model. This simple yet powerful fact can significantly reduce the risk of data leakage when utilising sensitive datasets.
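A minimal sketch of this end-to-end flow, under the same hypothetical interface as in the earlier examples (file names are illustrative):

```python
import pandas as pd

from tabular_gen import TabularGenerativeModel  # hypothetical package

sensitive = pd.read_csv("patients.csv")

# Fit on the sensitive data, then release only generated rows:
# no synthetic row maps one-to-one back to a real individual.
model = TabularGenerativeModel()
model.fit(sensitive)
anonymised = model.sample(len(sensitive))
anonymised.to_csv("patients_synthetic.csv", index=False)
```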
The visualisations and examples mentioned within the blog post were created using the Synthesized SDK in Google Colab.