Synthetic Data in Machine Learning: What, Why, How?

In this episode of the Mind the Data Gap podcast, we have an extraordinary guest from the data science and machine learning community, Vincent Granville. Vincent co-founded Data Science Central, which is a popular portal that covers data science and machine learning.

He currently leads the popular ML Techniques forum on machine learning and data science. We're looking forward to discussing the topic of synthetic data for machine learning: what synthetic data is, the problem it solves, the benefits and value it delivers, and some historical context as well.

Vincent spent about 20 years in the corporate world at eBay, Wells Fargo, Visa, Microsoft, NBC, and others. He holds a Ph.D. in Mathematics and Statistics, and he's a former postdoc at the University of Cambridge in the stats lab.

Introductions

Simon: I’m Simon, the Machine Learning Lead at Synthesized. I look after the development of all sorts of generative modeling frameworks and figure out how best to apply them to the finance, insurance, and healthcare industries. Prior to that, I worked as a Machine Learning Engineer in the legal and healthcare industries. Before that, I studied Physics and Natural Language Processing at Cambridge.

Vincent: I earned my Ph.D. in Mathematics and Statistics in Belgium, in 1993, working on computational statistics, and then moved to the Stats Lab at Cambridge to work on a postdoctorate, which I finished in North Carolina. Afterward, I moved into the internet industry and worked for a number of companies such as Wells Fargo, Visa, eBay, and Microsoft. Eventually, I created Data Science Central, which is known as a popular community for machine learning practitioners. In 2020, it was acquired by TechTarget, a publicly traded company. And right now I'm trying to restart something similar, which is MLTechniques.com.

Nicolai: We started working on synthetic data with the team and have been working a lot on the modeling, and different techniques to generate synthetic data. We published an article about applications of synthetic data for machine learning. That's how we started the conversation with you online and that's what led to this podcast.

What generated your interest in synthetic data?

Vincent: I've been working on synthetic data for a very long time. In the beginning, it was simply creating random numbers, but choosing some original techniques that were based on irrational numbers rather than on a traditional pseudorandom number generator. Over time, while working independently for Data Science Central, I found that I could create a lot of data, and that I have a significant interest in number theory: not theoretical number theory, but experimental mathematics or probabilistic number theory. This entails a lot of experiments, much like machine learning.
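
One well-known way to build number sequences from an irrational number is to take the fractional parts of k times the square root of 2, which are equidistributed in [0, 1). The sketch below only illustrates that general idea. The choice of the square root of 2 and the function name `irrational_sequence` are assumptions, and this is not necessarily the technique Vincent used.

```python
import math

def irrational_sequence(n, alpha=math.sqrt(2)):
    """Return the fractional parts of k * alpha for k = 1..n.

    For irrational alpha these values are equidistributed in [0, 1),
    but they are deterministic and strongly structured, not independent.
    """
    return [(k * alpha) % 1.0 for k in range(1, n + 1)]

values = irrational_sequence(1000)
mean = sum(values) / len(values)  # close to 0.5 for an equidistributed sequence
```

Such a sequence fills the unit interval evenly, but consecutive terms are far from independent, so it would need further processing before it could stand in for random draws.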

Have you interacted with synthetic data in the corporate world?

Vincent: That was something of interest. For instance, when I was working at Visa around 2005, the problem was to identify fraud, particularly credit-card transaction fraud in real time. And 4 transactions out of 10,000 were fraudulent. So the data set of fraudulent transactions was small, despite having 50 million transactions. It's not that small, but small compared to the whole data set. Back then we didn't use synthetic data for machine learning; we worked to rebalance the data. That's probably why synthetic data has suddenly become more interesting and valued.
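
To put numbers on that imbalance, the sketch below works out the fraud count implied by the figures Vincent quotes and then shows the simplest form of rebalancing, random oversampling of the minority class. The toy rows, the `is_fraud` field, and the function name are hypothetical; this is not a description of what was done at Visa.

```python
import random

total_transactions = 50_000_000
# 4 fraudulent transactions per 10,000, using integer arithmetic.
fraud_count = total_transactions * 4 // 10_000

def oversample_minority(rows, label_key="is_fraud", seed=0):
    """Naive rebalancing: resample minority rows with replacement
    until both classes have the same number of rows.
    Assumes both classes are present in rows."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key]]
    majority = [r for r in rows if not r[label_key]]
    if len(minority) > len(majority):
        minority, majority = majority, minority
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

# Toy data set with the same 4-in-10,000 fraud rate (hypothetical rows).
rows = [{"amount": 10.0 + i % 90, "is_fraud": i < 4} for i in range(10_000)]
balanced = oversample_minority(rows)
```

Oversampling only repeats the few fraud rows that already exist. A synthetic data generator aims instead to produce new, plausible fraud rows, which is the difference the speakers go on to discuss.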

Simon, do you see the same issues when you’re working with the customers?

Simon: Definitely. One of the main use cases we're seeing at the moment at Synthesized is dealing with these heavily imbalanced data sets, where there's only a very small number of actual fraudulent transactions. ML Engineers need access to more data to try and produce more robust and reliable models. It’s interesting that it’s only starting to be used now, given this problem's been around for such a long time.

Nicolai: And what's interesting is that the fraud model use case is different from the privacy use case, which synthetic data is mostly used for. For example, we want to create a sample which is privacy-preserving and can be shared between teams. But rebalancing is not related to privacy. It's more about how we can improve the efficiency and the prediction performance of a given model, reduce uncertainty in an underrepresented class, and quantify the risk.

And Vincent, in your opinion, what problem does synthetic data solve?

Vincent: One example I’ve seen is digit recognition. There’s a famous data set with around 60,000 images representing digits. If you incorporate synthetic data, in the end, these are going to be images of handwritten digits, but written by software, and eventually, there will be a clustering problem. But I’m referring to one of the simplest examples whereby clustering in synthetic data has been successful. You can add a lot of digits that nobody would handwrite. It’ll enrich your data set considerably.

If synthetic data fits and is well designed, it's supposed to be much richer, and that helps. And one thing that people are talking about a lot is augmented data, which involves blending actual data (the observations in the training set) with synthetic data. And that seems to generate some success these days.

Simon, do you see common patterns in terms of the problems that synthetic data solves?

Simon: Definitely. In scenarios where the privacy of the original data isn't an issue, we're seeing what synthetic data can add to original data. There’s no need to solely work within a synthetic framework. Instead, we figure out what else we can do with the training data to enrich models. And it raises questions within the machine learning community. People often get confused: if you've got some original data that contains some original information, what is adding synthetic data derived from that original data going to do? Where's the new information being added to this system, and how can you use synthetic data for machine learning to improve the performance of a model?

It's an interesting set of questions when you try and think about where the original information source is. But synthetic data enables you to reshape and augment your original data. And it allows you to amplify the signals of information that your machine learning models are using for their classification tasks etc. If you're not doing transfer learning when you're creating the synthetic data for machine learning, there is the question of what extra information are you supplying to your model.

Vincent: Yes. Regarding the value of synthetic data, particularly in my case, where I do a lot of testing and benchmarking of new algorithms: when you have a training set, you're going to split it into two parts. One part is the validation set, and you use it as if you didn't know the answer. Then you can check whether your predictions are good.

And Nicolai, you talked about a recent paper on the case of regression, where you try to create synthetic data for machine learning with certain features so that the pre-specified metrics are the same as in the training set. There were different kinds of regression techniques that I tried. One of the techniques is more like kriging, which is similar to an interpolation technique. On the training set, it can provide exact predictions, and as you would expect, on the validation set it works much less well. But the problem I found with the classical type of regression is that on the validation set, the performance was almost the same as on the training set. I was surprised because I expected a drop in performance. I found that there was a flaw in my synthetic data. And the way to make things better and richer in the synthetic data is to use a lot of parameters that allow you to create various types of distribution.

A quick fix to get a more realistic situation in the case of my synthetic data is to increase the amount of noise in the validation set compared to the training set.

Some of the issues I see with my synthetic data in that particular case are that the observations are independent and the variance of the noise is constant. In order to create good synthetic data, you should create data that has the same flaws as the real data: the variance changes based on the observation or a group of observations, and the assumption of independence is violated. In my case, I used the logistic distribution and some other distributions, but I was always using the same distributions.
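
The two flaws mentioned above, non-constant variance and dependent observations, can be built into synthetic regression data directly. The following is a minimal sketch under stated assumptions: a linear trend, a noise standard deviation that grows with x, and AR(1) noise. All parameter values and names are illustrative and are not taken from Vincent's experiments.

```python
import math
import random

def synthetic_regression(n=2000, slope=2.0, intercept=1.0, rho=0.5, seed=42):
    """Generate (xs, ys, noise) where the noise is neither independent
    nor constant in variance.

    - eps follows an AR(1) process with lag-1 correlation rho.
    - the noise standard deviation grows linearly with x.
    """
    rng = random.Random(seed)
    xs, ys, noise = [], [], []
    eps = rng.gauss(0.0, 1.0)
    for i in range(n):
        x = i / (n - 1)
        # AR(1) update keeps eps at unit variance but correlates neighbours.
        eps = rho * eps + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        sigma = 0.1 + 2.0 * x  # heteroscedastic: variance depends on x
        e = sigma * eps
        xs.append(x)
        noise.append(e)
        ys.append(intercept + slope * x + e)
    return xs, ys, noise

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

xs, ys, noise = synthetic_regression()
low_var = variance(noise[:500])    # noise variance where x is small
high_var = variance(noise[-500:])  # noise variance where x is large
```

A regression fitted to data like this should show a more realistic gap between training and validation performance than one fitted to independent, constant-variance noise.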

Are you going to get richer synthetic data for machine learning by trying tons of logistic or uniform distributions, or other types of distribution? If so, you have to be careful about model identifiability. In essence, you might create data from two different models, but statistically, the data are indistinguishable. In statistical terms, it's called the identifiability issue of a model. So to make your synthetic data richer, you cannot just say you’re going to have a million parameters in your model, or that you’re going to use tons of different distributions. In one example, rather than using a mixture of distributions, I use a superposition of point processes.

It's not that different, and there is an underlying density that creates the ellipse. I used uniform distributions in two, three, or four dimensions, and I'm simplifying here, but there were essentially three types of distribution that produced three different types of data:

Cauchy Distribution
Uniform Distribution
Logistic Distribution

I could change Logistic to Gaussian and it would barely make any changes. That's one of the issues with synthetic data for machine learning.
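
A quick numerical check makes the Logistic versus Gaussian point concrete. The sketch below compares the two cumulative distribution functions after matching their variances; the grid and the variance-matching choice are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def logistic_cdf(x, scale):
    """CDF of a logistic distribution centred at 0."""
    return 1.0 / (1.0 + math.exp(-x / scale))

# Choose the logistic scale so that its variance (scale^2 * pi^2 / 3)
# equals 1, matching the standard Gaussian.
scale = math.sqrt(3.0) / math.pi
normal = NormalDist(mu=0.0, sigma=1.0)

# Largest gap between the two CDFs on a grid from -6 to 6.
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(logistic_cdf(x, scale) - normal.cdf(x)) for x in grid)
```

The largest gap stays below 0.03 across the whole grid, which is why swapping one distribution for the other adds little variety to the synthetic data.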

Nicolai: You mentioned this element of bootstrapping data to create tighter confidence sets. Historically, bootstrapping would be used to create very tight confidence intervals, and we would want the samples to be independent, but follow the same kind of correlation metrics. That becomes a challenge when the dataset becomes very high-dimensional. It’s tough to create synthetic data for machine learning automatically when the dimension of your data is high, especially when you don't have too many data points that describe that phenomenon. As a result, we need to think deeply about how we understand those correlations when the sample size is too small, and how we create synthetic data to build confidence intervals when the sample size is very small. We start to think about sparsity and effective dimensions, and it's a big topic.

Drawing on your corporate experience, what are the 5 most valuable or common data sets you have seen in the past where synthetic data can help?

Nicolai: Fraud detection is a problem that we see in the banking and insurance industries. With one of our customers right now, we’re helping them improve the performance of their fraud detection system by 15% on some of the metrics they use. They’re also using some deep learning methods, but it is interesting, given the 15% uplift of the model with synthetic data.

Vincent: Recently I spoke with one of the venture capitalists (VCs) who funded my past company. Right now he’s involved with a financial company that collects data about accidents and health issues in order to recover money: the lawyers take a case to court and get some amount of money for the victim. The VC pays the victim upfront on the expectation that he is going to recover the money. There are a lot of cases where the dollar amount is small, and some are bigger but rarer. That's the type of situation where having the ability to create a variety of different cases helps. There are a lot of possible injuries, and it's not a straightforward problem, so that would be a situation where synthetic data for machine learning could help.

Simon: Yes, and the issue of privacy is the other side of synthetic data. With sensitive data, you either don't touch it, or you mask it and have to go through compliance to get approval before you can get access to it. Interestingly, even after masking, that type of data is still susceptible to various linkage and inference attacks.

We've seen that synthetic data can be used to assist with data privacy because we're removing this one-to-one mapping of original data to the masked data. If it's synthetic data for machine learning instead, no particular row corresponds to an original person in a dataset. There’s more consideration for use in the healthcare industry and sectors where legal data is used.

Vincent: Also, Nicolai mentioned leakage as a topic in one of the emails we exchanged. I was looking at Kaggle competitions, and there was a competition where you were supposed to predict cancer based on a bunch of observations. There was a feature, which was the hospital ID, and it was encoded. Even though this was encoded, the people who won were successful because they used that ID to make their predictions. The reason they won is that a bunch of patients were sent to specific hospitals, especially when the situation was bad for them. So that hospital ID attached to their name was an excellent predictor of someone having cancer. So this is an example of leakage that synthetic data can fix easily.

Nicolai: 100%, and in fact anonymization is not safe. Companies think they can add some noise to a data point, and then it’ll be difficult to recover. But it depends on the background information that I have about this data set. It depends on many other parameters and the other data points in this data set. And in your case, you can look at the dataset and link it to the original data point. There still hasn't been a solution in the anonymization world, and synthetic data solves that by creating a completely new environment.

As opposed to masking specific data points, we create a completely new world through a simulated environment and then tweak it to create completely new patterns such as fraud patterns, churn patterns, or patterns for underrepresented classes, which is extremely important.

This problem cannot be answered by anonymization at all. It's not secure. But privacy is one of the important topics which synthetic data addresses well, other than rebalancing. We also see a lot of demand for algorithms to generate synthetic data, and we released a version of the product on our website. Everybody can use it for free to generate synthetic data, and to rebalance data, anonymize data, bootstrap data, check fairness, and mitigate biases in data sets. We believe that should be free. We use deep generative models to create and understand correlations in data, and then encapsulate that in the synthetic data set to create confidence sets, bootstrap data, and improve the performance of your models.

We want this to be used by the community. We see companies looking at this from various fields such as finance, healthcare, government, and startups as well. Some companies are looking into privacy; for others, the data is not sufficient. For example, you have underrepresented groups and you want to bootstrap or understand how to make the data set more balanced, because nothing impacts your model as badly as an imbalanced data set and poorly organized training sets. It’s important to be able to rebalance efficiently.

What do you think is driving interest in synthetic data for machine learning right now?

Vincent: It may not be that difficult to generate. You can create a lot of additional observations, it enriches the model, and it applies to many different fields. One other example is operational research and supply chains. Some of the inventions in AI that you see now originated 50 years ago or so. For instance, looking at shape recognition, which is now a topic of interest, I created a lot of synthetic data related to the problem of identifying digits. But I could also train the algorithms to identify shapes or Chinese characters.

Synthetic data not only enriches your data set; it also allows you to check your clustering algorithm not just on 10 digits but also on 60,000 Chinese characters, which is good. It allows you to detect very weak signals, and that’s the strength of synthetic data.

And Vincent, do you think people fully understand what synthetic data is and the value it brings?

Vincent: They don’t, but it’s become a hot topic.

And how would you explain what synthetic data is in simple terms?

Vincent: It’s data that is created by an algorithm as opposed to being collected from observations and is designed to emulate real data.

Simon: As a physicist, I resonate with this idea of synthetic data being data that hasn't been directly measured. You said “observed”, and it's that same idea. It's powerful once you consider moving away from only using direct observations because you can start to do anything with the data that you have at hand. You can start to reshape and manipulate it. People often forget that as soon as you start to do that, that is synthetic data. As a concept, it’s just modified data.

Nicolai: There is a difference between data and information that data conveys. Data is a container for the information that is out there. I can think about something and I can write it down on paper and then it becomes data. You can transfer that information. But in fact, it's information I created and put on paper using data. But you can create unorganized data. If you write with bad handwriting, and you don’t follow the structure then it’s bad data.

Vincent: Another topic I’d like to mention concerning information is the entropy metric, which is important.

Nicolai: 100%, and it's a huge topic in statistics. It’s about how I can encapsulate that information well. We have several studies in statistical theory and mathematical statistics about it. We should know how to best incorporate it into this space so that we are fully representative. Data becomes a medium for the information which it conveys. And if it is a medium it should have parameters, and then you can create an optimal configuration, which becomes your container, and subsequently becomes synthetic data.

Depending on how good the algorithm is, you can optimally incorporate the information, to optimize those parameters, whether that's quality, privacy, or compliance requirements. And then you can share that piece of paper better. You end up with an elegant interpretation of the information and can share or distribute it having met those requirements.

That connects to entropy because you can ask what the entropy of this information is. If it has some parameters, what's the minimal possible shape that this information can be contained in? I want to know this so I can share it without losing any information, because you don't want to have a redundant data set. If you think of a huge piece of information, you don't want to encapsulate it in a badly organized package.
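
For a single discrete column, the "minimal possible shape" idea has a standard counterpart in Shannon entropy, the average number of bits needed per value. The sketch below computes it for the empirical distribution of a list of values. It is offered as one common reading of "entropy" here, not as the specific measure the speakers have in mind.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

coin = shannon_entropy(["heads", "tails"])   # two equally likely values
constant = shannon_entropy(["a", "a", "a"])  # a column that never varies
```

A column that never varies carries no information, while a column with many equally likely values needs more bits per row. Comparing the entropy of a synthetic column with that of the original is one rough check that the synthetic data is neither collapsed nor padded with redundancy.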

Simon: It touches on another issue with data that synthetic data for machine learning can help to solve, and that's data representation and data storage. We have this increasing amount of data that's growing in the world, and it's getting harder to run all of your tests or train all of your models with huge amounts of data. There's also this move of looking to create optimal data sets or “golden” data sets where you've got everything that you need captured in a smaller container. That's much easier to manage and look after than a huge data lake of information that’s scattered throughout many stores or is barred by compliance. The smaller you can get the data the better.

Vincent: Another topic is wide data, which is growing in popularity. You might have around 200 observations and 1000 features. Synthetic data for machine learning can add a lot more observations with a large number of features, so if you do some type of regression, you're no longer in the situation where you have many more features than observations, which in some algorithms is going to create problems.

Nicolai: 100%, and we also see many financial institutions moving away from sharing data to sharing models. This is an interesting shift, especially for application development tasks in an enterprise, where you can work with a model. If I give you a generative model of this data set, you can query it. You have an API, query it, get a data set and start developing against it.

You don't need to have access to the original data. You need access to the information and the correlations. Once you have the access to be able to reproduce data, following the same correlations, if your model is good enough you can use that. That's what we see with many companies right now, they often try to use models and share models for development and experimentation purposes.

What's your view on the challenges of creating and sharing synthetic data for machine learning?

Vincent: One of the challenges is to make sure that your synthetic data is rich enough, and not to be misled by the fact that you might have 10,000 parameters to fine-tune your data set. You can still end up with poor synthetic data, and that's one of the main issues with what I have been doing to generate synthetic data. If the data consists of images or videos, how do I produce synthetic data? This has been done and I’ve done so myself. But this is data that is not tabular. It's not an array. There are ways to do it by using some models. That could be the next step. Not just generating simple arrays of data but much more.

Nicolai: Yes, and there’s an interesting prediction from Gartner saying that 60% of the data used for AI and analytics projects is going to be synthetic by 2024.

Simon: The idea of synthetic text generation is much more difficult, but it's come a long way. I've seen automatic image captioning and automatic image generation based on image captions, but these are much more difficult to use because of the privacy issues associated with them.

How would you measure the accuracy of synthetic data, and what other parameters would you consider when measuring synthetic data for machine learning?

Vincent: There are different ways to do that. Firstly, your synthetic data for machine learning is based on your training set data, but you test by making predictions on the actual observations of your real data. The model is trained purely on the synthetic data and you see how it performs. You can also blend some of the training set with synthetic data to see how it performs. This might be the most obvious way to see how your synthetic data is performing. You also want a metric of quality to see if it's rich enough. A metric such as entropy would be useful as well.
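
The evaluation described above, training purely on synthetic data and testing on real observations, can be sketched in a few lines. Everything in this example is a stand-in chosen for brevity: the "real" data is simulated, the generator is one Gaussian fitted per class with equal class shares, and the model is a nearest-centroid classifier. None of it reflects a specific product or Vincent's own setup.

```python
import random
from statistics import mean, stdev

rng = random.Random(7)

def sample_real(n):
    """Hypothetical 'real' data: one feature, two classes."""
    data = []
    for _ in range(n):
        label = rng.random() < 0.5
        data.append((rng.gauss(3.0 if label else 0.0, 1.0), label))
    return data

def fit_generator(data):
    """Toy generator: fit one Gaussian (mean, std) per class."""
    params = {}
    for label in (False, True):
        values = [v for v, lab in data if lab == label]
        params[label] = (mean(values), stdev(values))
    return params

def sample_synthetic(params, n):
    """Draw purely synthetic rows from the fitted per-class Gaussians."""
    rows = []
    for _ in range(n):
        label = rng.random() < 0.5  # assumes equal class shares
        mu, sigma = params[label]
        rows.append((rng.gauss(mu, sigma), label))
    return rows

def train_centroids(data):
    """Nearest-centroid classifier: one mean per class."""
    return {label: mean(v for v, lab in data if lab == label)
            for label in (False, True)}

def accuracy(centroids, data):
    hits = sum(
        (abs(v - centroids[True]) < abs(v - centroids[False])) == label
        for v, label in data
    )
    return hits / len(data)

real_train = sample_real(1000)
real_test = sample_real(500)
synthetic_train = sample_synthetic(fit_generator(real_train), 1000)

acc_synthetic = accuracy(train_centroids(synthetic_train), real_test)
acc_real = accuracy(train_centroids(real_train), real_test)
```

If `acc_synthetic` is close to `acc_real`, the synthetic data has preserved what the model needs. A large gap suggests the generator has lost signal. The blended variant mentioned above would train on `real_train + synthetic_train`.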

Simon: That is definitely the most logical way to assess the quality of your synthetic data for machine learning. We could use statistical measurements and compare the distributions to see how great the distances between the distributions are. But the reality is that synthetic data is created for a purpose. Does it help with that purpose, whether that's improving the performance of a model or whatever else it may be used for? That would be the best measure of its quality.
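
For the statistical comparison of distributions, one standard distance for a single numeric column is the two-sample Kolmogorov-Smirnov statistic. The sketch below is a plain implementation for illustration. It is one of several possible distances, and the source does not say which ones Synthesized uses.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the two empirical CDFs (0 = identical, 1 = fully separated)."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for v in a + b:
        cdf_a = bisect_right(a, v) / len(a)  # share of a that is <= v
        cdf_b = bisect_right(b, v) / len(b)  # share of b that is <= v
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

same = ks_statistic([1, 2, 3], [1, 2, 3])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_statistic([1, 2, 3], [10, 11, 12])
```

Applied column by column to the original and synthetic tables, a statistic near 0 indicates similar marginal distributions. It does not capture correlations between columns, which is one reason a task-based measure is still needed.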

Nicolai: Yes, and with companies, you see that they want to know if they can use a data set for their specific purpose, whether it's training an ML model or creating some data sets for a data hub or space for innovation. It’s about making sure that we can generate those. When you work with clients, there are also those additional criteria such as time. In our case, we can take 10 minutes to create a very high-quality representative data set. It could have 1 million rows and 100 columns already in one dataset in the cloud, but it depends on the infrastructure. We can change it slightly, and there are enterprise components that come into play when you move this to production.

The aim is to make sure that data can be used for a specific task such as the uplift of a fraud detection model. We recently helped two or three companies to uplift their fraud detection models very quickly and the performance was incredible. Ultimately it's important how quick and how robust the algorithms and the software are.

Also, a huge topic we are focused on right now is tables and tabular data generation, which we’ve done automatically at speed and scale. Think of Postgres and Oracle databases, because we know that tables are ultimately stored in a database.

And then it’s a question of whether you can generate the entire database. You have to think not only about correlations within one table but also correlations between different tables: foreign keys, primary keys, and making sure referential integrity is preserved. We've made a lot of progress and released several components where we can generate the entire database at speed while meeting quality requirements in terms of referential integrity, and making sure foreign keys and primary keys are preserved. But it’s not easy, so kudos to our ML team for making that breakthrough. Now we can generate huge databases at speed and scale with hundreds of tables, complex dependencies, and referential integrity constraints as well.
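
The referential-integrity requirement can be made concrete with a toy two-table example. The table names, columns, and helper functions below are hypothetical, and the generator is deliberately trivial. The sketch only shows what "foreign keys and primary keys are preserved" means as a check; it does not show how Synthesized generates databases.

```python
import random

def generate_database(n_customers=100, n_orders=500, seed=1):
    """Generate two toy tables: customers (parent) and orders (child).

    Every orders.customer_id is drawn from existing customers.id values,
    so the foreign key holds by construction.
    """
    rng = random.Random(seed)
    customers = [{"id": i, "segment": rng.choice(["retail", "business"])}
                 for i in range(1, n_customers + 1)]
    customer_ids = [c["id"] for c in customers]
    orders = [{"id": j,
               "customer_id": rng.choice(customer_ids),
               "amount": round(rng.uniform(5.0, 500.0), 2)}
              for j in range(1, n_orders + 1)]
    return customers, orders

def check_integrity(customers, orders):
    """Return True if primary keys are unique and every foreign key resolves."""
    customer_ids = [c["id"] for c in customers]
    order_ids = [o["id"] for o in orders]
    known = set(customer_ids)
    unique_pks = (len(known) == len(customer_ids)
                  and len(set(order_ids)) == len(order_ids))
    fks_resolve = all(o["customer_id"] in known for o in orders)
    return unique_pks and fks_resolve

customers, orders = generate_database()
ok = check_integrity(customers, orders)
```

Generating parent tables first and drawing child keys from them guarantees integrity, but it says nothing about correlations between tables, such as business customers placing larger orders. Preserving those is the harder part described above.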

Vincent: Now that you're talking about correlations, one topic we didn't discuss is time series and auto-correlated time series with synthetic data. What you can do is amazing. You would not even imagine that some Brownian motion would generate a strong cluster structure. Imagine time series that are smooth: you use metrics like the Hurst exponent, which tells you how smooth the series is, and you can incorporate very long-range autocorrelation in your time series, among other things. The potential is great.
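
As a small illustration of the Hurst exponent, the sketch below simulates ordinary Brownian motion and estimates H from how the variance of increments grows with the lag. The estimator and the lag of 16 are assumptions chosen for simplicity; other estimators exist and this is not presented as Vincent's method.

```python
import math
import random

def brownian_path(n=20000, seed=3):
    """Discrete Brownian motion: cumulative sum of independent Gaussian steps."""
    rng = random.Random(seed)
    path, position = [], 0.0
    for _ in range(n):
        position += rng.gauss(0.0, 1.0)
        path.append(position)
    return path

def hurst_estimate(path, lag=16):
    """Estimate the Hurst exponent H from how increment variance scales:
    Var(X[t + k] - X[t]) grows like k ** (2 * H)."""
    def increment_variance(k):
        diffs = [path[i + k] - path[i] for i in range(len(path) - k)]
        m = sum(diffs) / len(diffs)
        return sum((d - m) ** 2 for d in diffs) / len(diffs)
    ratio = increment_variance(lag) / increment_variance(1)
    return 0.5 * math.log(ratio) / math.log(lag)

h = hurst_estimate(brownian_path())
```

For ordinary Brownian motion the theoretical value is H = 0.5. Values above 0.5 correspond to smoother, persistent series with long-range dependence, so a synthetic time series generator can be checked by comparing its estimated H with that of the real series.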

Simon: Yes, and it extends into text generation as another form of a time series, but looking at it in a different light. Once you move into synthetic time series, there are many different options for what you can do. In some sense, forecasting is synthesizing but in a new period, and you can backcast or synthesize data in the same period as your original data, but with different subjects or samples.

Vincent: This brings me back to when you asked me which customers could use the synthetic data. I'm thinking about time series, and Wall Street is a potential client.

Nicolai: Definitely. We recently incorporated it into the software, so we currently support time series data generation at scale. Another thing we look at is data coverage. It's about making sure that we can cover all dimensionalities with synthetic data. Even if it's not covered with the original data, we can extrapolate and understand the criteria of the data.

It’s related to what we see in images and self-driving cars. For instance, with Tesla, we wouldn't necessarily be testing a self-driving car on the roads of Seattle; that would not be ideal. We would create a simulated environment, and then test the different behaviors of your ML engine in that simulated environment. Once you've set up the simulated environment, you can create different behavior. This is to cover all parameters and situations so that you can test against them better before the car is put on the roads of Seattle. This is what we see in structured data as well, so we can understand elements such as the coverage of different scenarios and how to make sure that we are covered against them.

Synthetic data becomes better than production data in terms of covering those scenarios and being able to test against them before they happen in production, where they may ultimately cost the business incredible amounts and cause significant damage as well.

Simon: One use case for synthetic data for machine learning that we haven't touched on is data bias. I maintain a repository called FairLens, which is open-source software that looks at different statistical ways to measure and assess bias in different forms of data.

In terms of using synthetic data for machine learning, what’s your opinion on how synthetic data can be used to combat data bias?

Vincent: That's not a topic I've worked on, but it's well known and important. I've heard more about bias in the algorithm that processes the data, with some rules that might favor some people over others. Eventually, it shows up in the data, which explains why some people get loans more easily than others. You see this bias in the automated decisions and the resulting data set. In the end, these algorithms have been written by human beings, so even if it's automated, you see the same bias. You would think that you could generate synthetic data that would avoid those biases. In an ideal world, it would show as little bias as possible.

Nicolai: Definitely. That's a good point because there is a slight difference between algorithmic bias and data bias. Algorithmic biases are more about deterministic algorithms and rule-based algorithms, which can inevitably introduce some different biases in the way predictions are made. There is also a data bias, meaning that there is a discrepancy between the distributions for different underrepresented groups and how those distributions are different when we think of the entire data sample. We don’t want that. Think of this as unsupervised learning.

We have high dimensional distribution, but when we project that distribution into a class, which may be underrepresented, the distribution is different and we should never allow that. The problem is that data does not only go to Machine Learning Engineers. We develop some algorithms and test for biases. We are doing a lot of work with the team to encourage people to test for biases, but at the same time, they know how to test for that. Think of those data sets which are heavily imbalanced against underrepresented groups, going to test engineers, marketing teams, and sales teams.

Those are not ML engineers. They don't necessarily have algorithms to test for biases. And they use those data sets to potentially make decisions, which is not great. We work hard to identify and mitigate biases in data by means of synthetic data, using the world’s first open-source project. You can use synthetic data to balance those underrepresented groups and make sure that you treat everybody equally. Synthetic data is a great application for rebalancing these groups and making sure the data set is representing all groups, even ones in the original data set that are underrepresented so they can be represented well in the synthetic data set.

Simon: We’re certainly trying to improve the state of bias and fairness within data. It’s a difficult goal to try and resolve, and we don’t know how possible it is yet. But you can certainly look at ways you can use synthetic data to change the original biased data that you're working with.

Vincent: In the future, you could have potential investigations where you have to compare your actual observations with synthetic data, which is supposed to be perfect or unbiased, so good decisions could be made based on how much bias is found compared to the synthetic data.

What's the most interesting topic you want to cover in your blog on ML techniques?

Vincent: Synthetic data is one part of it, and then there is explainable AI, which includes interpretable machine learning. It looks similar to deep neural networks but different in the sense that you get answers that are easier to interpret.

What are your three favorite machine learning books?

Vincent: Recently I published a new list of more than 10 books that are modern and popular. One is called Probabilistic Machine Learning. There are many books that are self-published now, that are much less academic in style. I will share the rest.

And what are the most interesting releases or events you’re looking forward to this year in ML?

Vincent: There are books that I'm writing. I have plenty of time to summarize all the research I've been doing for decades. There will be a new book around Intuitive Machine Learning. It'll discuss new approaches in machine learning, focusing on simplicity and efficiency. And a lot of the things I'm doing are related to video and images. One exciting project that I have in mind is to create a synthetic video. I would like to make a mathematically generated soundtrack that is related to the data from the video. I don’t think anyone has ever done that.