February 26, 2021

Author:

Before we dive into how we measure the value of the data, let’s start with what Synthesized data is. At Synthesized we pride ourselves on enabling enterprises to work securely with sensitive data. Typical data masking and data anonymization tools can be easily attacked by modern techniques and drastically reduce data quality. Our DataOps platform instead automatically models complex interactions and hidden features in datasets to generate high-quality data products at any scale while maintaining data utility. This data looks and behaves like the original data, but consists of entirely new data points, leading to faster, more accurate training of models. Our approach learns the complex statistical relationships in the data, enabling automatic generation of new realistic samples at any volume while preserving the quality and performance of the original set.

This post will expand on our philosophy of how we evaluate Synthesized data and why that makes us confident in the strength of our AI system.

At the heart of this problem lies the question: given two datasets with the same schema (one real, one Synthesized) - how can I tell if they match each other well? At Synthesized we have a key philosophy that underscores how we approach this question:

If your sensitive data is complex and multi-faceted enough to require a machine learning solution like ours, then you can’t simplify your measurement step down just to make your life easier. The complexity of the evaluations needs to match the complexity of the data.

This means we:

- Never rely on a single metric.
- Use metrics that are as general as possible.
- Provide great visualisations of our data for human inspection and, importantly, reassurance.
- Measure ‘use case’ performance where possible.

Let's look at the credit dataset taken from Kaggle, a simple dataset that still provides enough complexity for us to elaborate on these details.

Here’s our dataset: 11 columns of financial information, including recent delinquencies, monthly income and the number of open credit lines. We’ve put this through the Synthesized platform.

You can see the original data in pink and the Synthesized data in blue:

So far the data looks good! But let’s take a cautious approach. There’s likely some complexity hiding underneath that these plots don’t show, which is why we’ve invested a lot of energy in comprehensive measurements that can surface any issues.

Here’s a set of metrics that measure the distance of each column in isolation:

- **Earth Mover's Distance (EMD)** is used for columns with a small number of unique values. This discrete metric makes few assumptions and takes into account the distance between values, which some other discrete measures of distance (like Histogram Similarity) may not do.
- **Kolmogorov-Smirnov Distance (KSD)** tends to be used for continuous inputs. It is the statistic behind a widely used non-parametric statistical test.
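To make the two per-column distances concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the platform's implementation: the function names and the toy column values are hypothetical, and both metrics are computed directly from the empirical CDFs of a real column and its Synthesized counterpart. Libraries such as SciPy offer equivalents (`scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance`).

```python
from bisect import bisect_right


def ks_distance(real, synth):
    """Largest vertical gap between the two empirical CDFs (0 means identical)."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    gap = 0.0
    for v in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, v) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, v) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def earth_movers_distance(real, synth):
    """1-D EMD: the area between the two empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    points = sorted(set(real_sorted) | set(synth_sorted))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_real = bisect_right(real_sorted, left) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, left) / len(synth_sorted)
        # The CDF gap is constant on [left, right), so it contributes gap * width.
        total += abs(cdf_real - cdf_synth) * (right - left)
    return total


# Hypothetical toy values for a "number of open credit lines" column.
real_lines = [0, 1, 1, 2, 3]
synth_lines = [0, 1, 2, 2, 3]
print(ks_distance(real_lines, synth_lines))
print(earth_movers_distance(real_lines, synth_lines))
```

Both functions return 0 for identical samples. Unlike a bin-by-bin histogram comparison, the EMD grows with how far the mass has to move, which is the property described above.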

It’s also really important that the interactions between the distributions of each column are preserved. Below we plot a measure of this for continuous values, the **Kendall Tau Correlation**:

Common metrics we use to investigate interactions are:

- The **Kendall-Tau Rank Correlation Coefficient** is a non-parametric measurement of ordered data that detects when columns exhibit associations.
- **Cramér's V** is a measurement between 0 and 1 that similarly measures the association of columns. We tend to use it in cases where columns exhibit few unique values.
- The McFadden’s pseudo-R2 metric gives the change in the performance of a Logistic Regressor for a continuous variable against a categorical variable. This ranges from 0 to 1 where a high value indicates a large change in performance. We refer to this as the **Categorical Logistic Regression Correlation**.
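The first two interaction metrics are simple enough to sketch in plain Python. This is a hedged illustration rather than the platform's code: the function names are hypothetical, `kendall_tau` is the tau-a variant (ties count as neither concordant nor discordant, whereas SciPy's `scipy.stats.kendalltau` defaults to tau-b), and `cramers_v` is the uncorrected form derived from the chi-squared statistic.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def cramers_v(a, b):
    """Uncorrected Cramér's V from the chi-squared statistic of the contingency table."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    chi2 = 0.0
    for va, na in count_a.items():
        for vb, nb in count_b.items():
            expected = na * nb / n
            chi2 += (joint[(va, vb)] - expected) ** 2 / expected
    k = min(len(count_a), len(count_b)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


def association_gap(metric, real_pair, synth_pair):
    """How far the Synthesized association is from the real one for one column pair."""
    return abs(metric(*real_pair) - metric(*synth_pair))
```

In practice the quantity of interest is the gap between the real and Synthesized values for every column pair, for example `association_gap(kendall_tau, (real_income, real_lines), (synth_income, synth_lines))` with hypothetical column lists. A gap near 0 means the association has been preserved.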

All ‘non-parametric’ methods here demonstrate the similarity of the structure of the interactions in our Synthesized data. Testing for dependence between columns is difficult, so using a wide range of metrics like these gives us a great idea about the similarity. These are only a subset of the methods we use to evaluate data. Machine learning blurs the lines between many fields of study, and we pick through each of these fields for important ways to detect differences in datasets.

The results for this dataset look great, but is this enough to say that our Synthesized data will work for any use case?

After Synthesizing multiple datasets across a wide range of domains to validate our approach, we have the experience to say: Yes! Our metrics give us a great insight into performance across many use cases, which is why we’re so excited for Synthesized to empower businesses.

Depending on particular use cases, we can construct more information about the quality of the Synthesized data. Let’s explore machine learning modelling.

A common use case for data is using it to train a machine learning model. Can we use this to generate more information about the Synthesized data? Of course! Having a concrete use case gives another target for us to hit. Let’s model the chance of *Delinquency* using the rest of the data. Are we able to maintain the same performance just using the Synthetic Data?
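One common way to frame that question is a "train on synthetic, test on real" comparison: fit the same model once on real data and once on Synthesized data, then score both on held-out real data. The sketch below is a toy illustration under loud assumptions: the two-feature data is randomly generated stand-in data (not the Kaggle credit dataset, and not output from the Synthesized platform), the model is a small hand-written logistic regression, and every name is hypothetical.

```python
import math
import random


def predict(weights, row):
    """Logistic model; the last weight is the bias term."""
    z = weights[-1] + sum(w * v for w, v in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.1, epochs=300):
    """Plain batch gradient descent on the logistic log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for row, y in zip(rows, labels):
            err = predict(weights, row) - y
            for j, value in enumerate(row):
                grad[j] += err * value
            grad[-1] += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def roc_auc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def toy_sample(rng, n):
    """Stand-in data: 10% positives whose two features have shifted means."""
    labels = [1 if i % 10 == 0 else 0 for i in range(n)]
    rows = [[rng.gauss(2.0 * y, 1.0), rng.gauss(1.0 * y, 1.0)] for y in labels]
    return rows, labels


def auc_on_real(train, real_test):
    """Train on the given data, then evaluate on held-out real data."""
    weights = fit_logistic(*train)
    rows, labels = real_test
    return roc_auc([predict(weights, r) for r in rows], labels)


rng = random.Random(0)
real_train = toy_sample(rng, 500)
synth_train = toy_sample(rng, 500)  # assumption: stands in for generated data
real_test = toy_sample(rng, 500)

auc_real = auc_on_real(real_train, real_test)
auc_synth = auc_on_real(synth_train, real_test)
print(f"trained on real: {auc_real:.3f}, trained on synthetic: {auc_synth:.3f}")
```

The two scores being close is the signal to look for. A large drop when training on the generated data would suggest that some relationship the model relies on was not preserved.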

Information like this is great. Our solution is meant to be applied to real scenarios, not just to perform well in some abstract statistical sense! One of the great benefits of the Synthesized platform is that we can correct any imbalances in the dataset. For example, we don’t have many examples of *Delinquent* customers in the original data (they make up less than 10% of the data). However, many machine learning methods perform better when this imbalance is addressed. Using the Synthesized platform we can reweight the dataset so 30% of customers have experienced delinquency:

A machine learning practitioner is going to be very happy to use the Synthesized data here!
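The arithmetic behind a 30% target is easy to check by hand. The sketch below uses naive oversampling with replacement purely as a stand-in, which is an assumption made for illustration: the Synthesized platform generates new data points rather than duplicating existing rows, and the function and the `delinquent` field name are hypothetical.

```python
import random


def oversample_to_fraction(rows, is_positive, target_fraction, seed=0):
    """Resample positives with replacement until they reach target_fraction of all rows."""
    positives = [r for r in rows if is_positive(r)]
    negatives = [r for r in rows if not is_positive(r)]
    # Solve n_pos / (n_pos + n_neg) = target_fraction for n_pos.
    n_pos = round(target_fraction * len(negatives) / (1.0 - target_fraction))
    rng = random.Random(seed)
    extra = [rng.choice(positives) for _ in range(max(0, n_pos - len(positives)))]
    return negatives + positives + extra


# Hypothetical toy table: 7 delinquent customers out of 100.
customers = [{"delinquent": i < 7} for i in range(100)]
balanced = oversample_to_fraction(customers, lambda r: r["delinquent"], 0.30)
share = sum(r["delinquent"] for r in balanced) / len(balanced)
print(len(balanced), round(share, 3))
```

With 93 non-delinquent rows, reaching a 30% share requires about 40 delinquent rows, so the minority class has to grow almost sixfold. Generating new, realistic minority examples matters here because repeating the same handful of rows adds no new information.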

Data is difficult: we need complex tools to understand it, never mind generate it! In many cases, errors can occur in your pipeline without you even realising. At Synthesized we treat these problems very seriously and work hard to understand them. We want to help our partners solve bigger problems, not waste time worrying about data accuracy! That makes us passionate about thoroughly evaluating our product.

This post is just scratching the surface in terms of how we approach data evaluation on our platform. Our philosophy gives us confidence in our product and we’re just getting started with helping businesses solve big problems.
