TON BADAL
SOLUTION ARCHITECT, SYNTHESIZED

A Guide to Testing Data and Optimizing Data Coverage

EXECUTIVE SUMMARY
QA and test engineers often face difficulty accessing and understanding data, and ensuring optimal data coverage.

Data silos and a poor signal-to-noise ratio are common causes, but what else should be considered to avoid these issues?

In this guide, you’ll learn about:
  • how current data-centric applications are failing agile environments
  • which techniques you can deploy to improve data quality and access, and
  • why ensuring optimal coverage for testing can save you many hours of manual labor and unnecessary expenses.

Why data silos and traditional data-centric applications are problematic

You’re probably familiar with this classic schema and may even use it at your organization:

Data, which can come from a database, third-party data sources, web applications, etc., is the core of the application. Once you’ve gathered all the data, you put it through your ETL tool, which extracts, transforms, and loads it into your data warehouse, where it becomes accessible to your test engineers, QA engineers, data scientists, or anyone else who needs it.

The problem with this infrastructure?

It’s a “data silo”.

A data silo is a collection of data held in a data warehouse or data lake that isn’t fully accessible to the engineers, scientists, or other people who need it.

What causes a data silo?

Sometimes, it’s knowledge.

Data engineers are usually put in charge of the entire process, from getting the data out of the source and into the warehouse to putting it in its place. The knowledge and understanding needed to work with that data can require significant training, which makes data access complex.

Other times, it’s privacy.

When you need to be GDPR and/or HIPAA compliant, you can’t just grab data and use it freely, as it often contains sensitive user information. At the start of a data project, you must ensure compliance and follow the prescribed measures.

For something that would take one week to develop, it can take three to six months just to get access to the data. A data silo situation emerges, and it’s an antagonist of rapid development and agile methods. It slows everything down.

Some issues arising from data silos can be remedied with approaches such as a Data Mesh, a new and emerging infrastructure pattern, but for this discussion we’ll keep focusing on the problems data access and quality present, and how to avoid them.

Is your data adequate and how do you know when it’s not?

Suppose you deployed relevant measures and put your data into environments that require different data for development, staging or testing.

How do you make sure the data you have is adequate for your purpose? For example, for a testing environment, does it cover all the edge cases, optimize coverage, and is it big enough for your performance testing?

Without first understanding why you need this data and how you’re going to use it, it’ll be difficult to answer this question.

EXERCISE

Answer these questions together with your team:

  • Why do you need data?
  • How are you going to use it?
  • What are the most important features you care about?
  • What does your data look like?
  • What team is going to use it?
  • How often do you need to access your data?

Why do you need that data? How are you going to use it?

If you plan to use your data for machine learning, data science, or to build a business intelligence dashboard, you’ll care about high data quality and need something that looks, feels, and possibly tastes like the original data; statistical properties will be very important.

If you care about scalability because you plan on doing performance or integration testing, your data will look different: the statistical properties may not matter as much, but you do care about the amount of data you can get.

In both cases, you’ll also need to make sure that high-level information, structure, and referential integrity remain preserved.

DATA QUALITY
  • High data quality that looks, feels, and tastes like original data
  • Statistical properties and utility are preserved
Examples:
  • Modelling
  • Market analysis
  • Business Intelligence
SCALABILITY
  • Access large amounts of data in a short amount of time
  • High-level information and structure are preserved
Examples:
  • Performance testing
  • Integration testing

Fundamental considerations when looking at and analyzing your data

Getting the data into the test environment

Depending on the type of data you have, there are various ways of getting it into the test environment; here are the four most popular methods.

1. Copy production data into the test environment

It’s simple and quick.

The advantage of doing this is that it’s production data, so the quality is high. It’ll behave exactly as production and you’ll cover many of the cases that would happen in production.

The disadvantage of using production data is that it’s risky: testing environments are often not as secure as production, and the chances of a data breach are higher. You could end up with a fine and lose clients and trust because of a breach.

The process itself is also a large undertaking. Getting data into your test environment daily, processing it, and running tests take hours, and this option may not be viable for everyone.

WHAT IS IT?
  • Copying production data into the testing environment
TOOLS
  • N/A
ADVANTAGES
  • High-quality data
  • Data that behaves like production
DISADVANTAGES
  • Increased chances of data breaches
  • Huge amounts of data

2. Use an obfuscated subset of production data

Another way to get data into your test environment is to make it useless to malicious actors by replacing sensitive information with data that looks like real production information.

What is a subset of production data?
You can subset production data with random sampling. Take a basic database like the one shown below, with three tables (table0, table1, and table2). Each table has a primary key (pk), and table1 and table2 also have foreign keys (fk): fk10 refers to pk0, and fk21 refers to pk1. In this example, there are only five rows in each table:

table0
pk0 | x1  | y1
 0  | 0.4 | a
 1  | 2   | c
 2  | 3.5 | b
 3  | 6.1 | b
 4  | 0   | c

table1
pk1 | fk10 | x1  | y1
 0  |  2   | 5.3 | c
 1  |  4   | 0.1 | q
 2  |  3   | 2.5 | b
 3  |  0   | 3.8 | a
 4  |  1   | 2.7 | d

table2
pk2 | fk21 | x1  | y1
 0  |  2   | 0.4 | a
 1  |  0   | 2   | d
 2  |  3   | 4.3 | b
 3  |  4   | 6.1 | c
 4  |  1   | 2.1 | d

If you then randomly under-sample this data, as we did, you’ll see that referential integrity is broken:

[The randomly under-sampled table0, table1, and table2: some of the sampled rows now contain foreign keys that point to primary keys that were not sampled.]

Table 2 now has an fk21 value that doesn’t match any primary key in Table 1.

Every time you query this new database, the broken references will break something else further down the line.

Intelligent subsetting avoids random samples; instead, you get a sample that’s consistent across all the tables:

[The same three tables as above, this time with the rows selected by intelligent subsetting highlighted.]

You take a pk0 from table0, then in table1 you go to the row whose foreign key refers to this sample, and do the same for table2.

table0
pk0 | x1  | y1
 0  | 0.4 | a

table1
pk1 | fk10 | x1  | y1
 3  |  0   | 3.8 | a

table2
pk2 | fk21 | x1  | y1
 2  |  3   | 4.3 | b

This way, the referential integrity is kept.
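To make the walk over the keys concrete, here’s a minimal pandas sketch of the idea (an illustration only, not how any particular subsetting tool is implemented; the column names mirror the toy schema above):

    import pandas as pd

    # Toy tables mirroring the example schema above.
    table0 = pd.DataFrame({"pk0": [0, 1, 2, 3, 4],
                           "x1": [0.4, 2, 3.5, 6.1, 0],
                           "y1": ["a", "c", "b", "b", "c"]})
    table1 = pd.DataFrame({"pk1": [0, 1, 2, 3, 4],
                           "fk10": [2, 4, 3, 0, 1],
                           "x1": [5.3, 0.1, 2.5, 3.8, 2.7],
                           "y1": ["c", "q", "b", "a", "d"]})
    table2 = pd.DataFrame({"pk2": [0, 1, 2, 3, 4],  # table2's own primary key
                           "fk21": [2, 0, 3, 4, 1],
                           "x1": [0.4, 2, 4.3, 6.1, 2.1],
                           "y1": ["a", "d", "b", "c", "d"]})

    def subset_with_integrity(pk0_values):
        """Start from a set of table0 primary keys and walk the foreign
        keys downwards, so no reference in the subset is left dangling."""
        t0 = table0[table0["pk0"].isin(pk0_values)]
        t1 = table1[table1["fk10"].isin(t0["pk0"])]
        t2 = table2[table2["fk21"].isin(t1["pk1"])]
        return t0, t1, t2

    # Selecting pk0 = 0 pulls in table1 row 3 (fk10 = 0) and table2 row 2
    # (fk21 = 3), which is exactly the subset shown above.
    t0, t1, t2 = subset_with_integrity([0])

A real subsetter also has to handle composite keys, nullable foreign keys, many-to-many links, and cycles, which is part of why subsetting is listed below as a complex operation.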

What is data obfuscation?
It’s a traditional method of ensuring data privacy where you apply transformations to your data so that sensitive information cannot be found, or at least isn’t obvious.

Original Table
id | name          | email                | age | income  | ssn
0  | Jason Packman | jasonp@gmail.com     | 34  | $2,081  | 183-9127-931
1  | Emily Smith   | emily123@example.com | 59  | $4,281  | 368-8719-921
2  | Anna Johanson | a.johanson@.com      | 18  | -       | 076-0957-942
3  | Elton Dusk    | edusk83@tesla.com    | 43  | $10,817 | 427-9425-532
4  | Tom Black     | black@black.ru       | 32  | $1,323  | 500-0137-132

Note the columns with sensitive information such as name, email, age, income, and Social Security Number.

After applying transformations, random emails and names were generated:

Obfuscated Table
id | name          | email              | age     | income      | ssn
0  | John Doe      | fam1i0@jchnai.cu   | (30,40] | ($2k,$5k]   | xxx-xxxx-x21
1  | Jane White    | ckqifid@caoqj.kdn  | (50,60] | ($2k,$5k]   | xxx-xxxx-x21
2  | Alan Doug     | mcuiqp@cjopcgth.cs | (10,20] | -           | xxx-xxxx-x42
3  | Michael Rahm  | fmq3ekc@tdiqbn.es  | (40,50] | ($10k,$25k] | xxx-xxxx-x32
4  | Albert Taylor | cinqiqp@ckwoq.mn   | (30,40] | ($1k,$2k]   | xxx-xxxx-x32

*For the names, we used a fake name generator; for the emails, a random regular-expression generator; for age and income, K-anonymity (normalizing and putting the data into buckets); and for the SSN, masking.
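For illustration, the same four transformations can be sketched in a few lines of Python. This is only a toy version of what the obfuscation tools below do; it assumes the open-source Faker package and pandas, and the bucket edges are arbitrary:

    import pandas as pd
    from faker import Faker  # assumed dependency: pip install faker

    fake = Faker()
    df = pd.DataFrame({
        "name":   ["Jason Packman", "Emily Smith", "Anna Johanson"],
        "email":  ["jasonp@gmail.com", "emily123@example.com", "a.johanson@.com"],
        "age":    [34, 59, 18],
        "income": [2081, 4281, None],
        "ssn":    ["183-9127-931", "368-8719-921", "076-0957-942"],
    })

    # Fake name and random email generation.
    df["name"] = [fake.name() for _ in range(len(df))]
    df["email"] = [fake.email() for _ in range(len(df))]

    # K-anonymity-style generalisation: put age and income into buckets.
    df["age"] = pd.cut(df["age"], bins=[10, 20, 30, 40, 50, 60])
    df["income"] = pd.cut(df["income"], bins=[0, 1_000, 2_000, 5_000, 10_000, 25_000])

    # Masking: hide every digit of the SSN except the last two.
    df["ssn"] = df["ssn"].str[:-2].str.replace(r"\d", "x", regex=True) + df["ssn"].str[-2:]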

The problem with these traditional obfuscation techniques is that they don’t always work.

Suppose you know nothing about a chap called “Elton Dusk” from the original table, only that he’s 43 years old.

id | name       | email             | age | income  | ssn
3  | Elton Dusk | edusk83@tesla.com | 43  | $10,817 | 427-9425-532

If you get access to the obfuscated table and look at the age column, you’ll find that there’s only one person in that age bucket, so you would instantly know the income of that person and part of their SSN.

Even if you can’t see the exact numbers, it’s still more than what you should know, and this is an example of where traditional obfuscation techniques fail drastically.

It can happen to multi-billion-dollar brands, too. Perhaps you recall the Netflix linkage attack, when Netflix published an obfuscated table and observant users correlated patterns with the IMDb database, de-anonymizing values and revealing names in the Netflix data set.

The advantages of using an obfuscated subset of production data are that, if there’s a one-to-one mapping to production, it behaves almost like production and it’s easy to configure and use.

The disadvantages are that the resulting data is still mid-quality and subject to data leakage.

There’s still a lot of manual labor involved, often going column by column during the configuration which is a complex operation.

Examples of tools you can use to obfuscate and subset your data are Tonic.ai, Delphix, and DATPROF.

WHAT IS IT?
  • Using a smaller portion of the production environment and obfuscating it
  • Obfuscation techniques include:
    - K-anonymity
    - Masking
    - Random string generation
    - Data shuffling
ADVANTAGES
  • Data that behaves almost like production
  • Easy to configure
DISADVANTAGES
  • Medium data quality
  • Not necessarily free of data leakage
  • Subsetting is a complex operation
  • Obfuscation requires manual labour and is difficult to maintain

Summary of using an obfuscated subset of production data

3. Mock data generators

Mock data generators generate random values — without looking too much at the data.

To illustrate, let’s create some columns (ID, first name, last name, email address, gender, etc.) with Mockaroo, a freely available mock data generator (mockaroo.com).

The new data is generated from this configuration:

It looks realistic, but as the generator didn’t really look at the original data, it may be completely disconnected from the original.

How would you know that?

If you look at the gender column, you’ll see only one female person while all the others are gender fluid, non-binary, or agender; this could indicate a disconnect from the original data.

Other generators, similar to Mockaroo, provide very low-quality data that may look realistic, but as they don’t look at the original data, the generators may break things and the data may not behave like production.

The advantage of using mock data generators is that the risk of privacy leakage is low because the generated data is completely fake. They’re also easy to use — just plug in the generator.

The disadvantages are that the data quality is low and they’re not scalable to databases, or the scalability is highly complex. Plus, they still require substantial manual effort because you have to select each column you want to generate mock data for.
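Under the hood, a Mockaroo-style generator is essentially a set of independent per-column value generators. A rough sketch of the idea in Python (the schema and the Faker package are our assumptions, not Mockaroo’s implementation):

    import random
    from faker import Faker  # assumed dependency: pip install faker

    fake = Faker()

    # Each column maps to an independent generator; nothing looks at real data.
    schema = {
        "id":         lambda i: i,
        "first_name": lambda i: fake.first_name(),
        "last_name":  lambda i: fake.last_name(),
        "email":      lambda i: fake.email(),
        "gender":     lambda i: random.choice(["Female", "Male", "Non-binary", "Agender"]),
    }

    rows = [{col: gen(i) for col, gen in schema.items()} for i in range(1_000)]

Because no generator ever looks at the original data, the marginal distributions (the gender mix, for example) can drift arbitrarily far from production, which is exactly the weakness described above.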

4. Synthetic Data

Synthetic data is a novel concept and not yet widely used. It’s produced by a generative model that learns the underlying data distribution and creates new data points.

First, you plug the synthesizer system into your original data. The system goes back and forth over the data, learning its distribution, statistical measures, and as much of the information as it can capture. It then generates new data which is synthetic: it’s not real, so it’s free from privacy restrictions.

The advantage of using synthetic data is that the data quality is typically the best* (*it greatly depends on the provider, so choose your provider carefully for the best outcome).

The risk of privacy leakage is low because the data is synthetic — it’s fake. It’s disconnected from the individual users but it behaves like the original data, only it’s not the original data.

While the risk of privacy leakage is low, it’s important to highlight that the IP and some aggregate measures may still be present in the new dataset. Your synthetic data behaves like production data and when you generate transactions and amounts, the total amount might be the same as in the original data. But it’s a different type of problem you’ll need to take care of if/when you come to that.

Once the model learns the data, it’s easy to sample large amounts of data. It’s also highly scalable and you can configure personalized results.

The disadvantage of synthetic data is that it’s challenging to preserve referential integrity because, at some point, you may generate thousands of tables at once and mappings and external references or strict rules may be lost.

For example, if you have a column that refers to an external database and you don’t tell that to your generator, it might understand that it’s a numerical column and generate data not related to the external key — you may lose that reference.

Generating and learning the data is complex, and it’s important that your generator can understand the complex relationships. It’s not a technique that is instantly transparent, and for some processes it can be difficult to understand what’s happening and why.

Examples of tools for generating synthesized data are Synthesized.io, SDV, Synthea, and gretel.ai.

Synthetic data for non-structured data

So far, we’ve been talking about synthetic data for structured data, such as tables or JSON files.

What about examples of synthetic data for non-structured data such as video or text? Let’s take a look at a few cases:

1. DeepFake is a technology that uses AI to learn people’s faces, voices, and expressions, and can generate videos of those people saying things they never actually said. A well-known example is a video of someone faking Morgan Freeman; the quality is good enough that you could easily be confused about whether it’s real or not.

This technology can be used with malicious intent to propagate fake news and misleading information, but also to help people who are deprived of speech or movement to communicate more naturally.

2. GPT-3 is a natural-language model trained on a very large portion of the text freely available on the internet, and it’s able to generate realistic text that follows complex reasoning. In the example linked below, GPT-3 writes a response to a philosophical essay about itself; notice the dazzling way it writes.

twitter.com/raphamilliere/status/1289129723310886912

3. NVIDIA DRIVE Sim is a simulated environment for training and validating driverless cars. Testers and developers need a safe environment to work on a driverless car, and what better than a synthetic world where one can fine-tune different scenarios at will?

www.youtube.com/watch?v=UoPXzzK_g1Q

Synthetic data quality

At Synthesized, we aim for the overall correlation of data sets to remain as close to the original as possible.

Some columns follow categorical distributions and some continuous ones. You can have different data types, but if your synthetic data generator is powerful enough, you can generate very high-quality data across all of them.

Ultimately, the synthetic and original data should be similar and the marginal distributions remain close.
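One simple way to check this is to compare the marginal distributions column by column. The sketch below (our illustration, assuming pandas and SciPy) uses the Kolmogorov-Smirnov statistic for numeric columns and total variation distance for categorical ones; both are 0 when the marginals match and grow towards 1 as they diverge:

    import pandas as pd
    from scipy.stats import ks_2samp

    def marginal_distances(original: pd.DataFrame, synthetic: pd.DataFrame) -> dict:
        """Per-column distance between the original and synthetic marginals."""
        distances = {}
        for col in original.columns:
            if pd.api.types.is_numeric_dtype(original[col]):
                # Kolmogorov-Smirnov statistic for continuous columns.
                distances[col] = ks_2samp(original[col].dropna(),
                                          synthetic[col].dropna()).statistic
            else:
                # Total variation distance between category frequencies.
                p = original[col].value_counts(normalize=True)
                q = synthetic[col].value_counts(normalize=True)
                distances[col] = 0.5 * p.subtract(q, fill_value=0).abs().sum()
        return distances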

Synthetic data-generating tools

Synthea is a synthetic patient generator for sensitive medical and healthcare data; it generates fake data about patients.

It’s healthcare-oriented, so you can’t plug it into any application, but it’s powerful for those in healthcare. It’s open-source with support from a substantial community.

SDV (Synthetic Data Vault) is an MIT project with multiple data generators; it learns the data with powerful techniques such as GANs and other deep learning models. It’s open-source and there’s a community you can talk to. It’s a solid introduction and a great way to get into the synthetic data world and understand how these generators work.

As it’s an open-source tool under development, it still has some limitations, such as scalability. For commercial options, other synthetic generators may be more useful.
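To give a feel for how such a generator is driven, here is a minimal single-table sketch with SDV. The file name and columns are placeholders, and SDV’s API has changed between releases, so treat the exact class and method names as indicative rather than definitive:

    import pandas as pd
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import GaussianCopulaSynthesizer

    real = pd.read_csv("customers.csv")  # hypothetical input table

    # Describe the table, fit a generative model, then sample as many rows as needed.
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real)

    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.fit(real)

    synthetic = synthesizer.sample(num_rows=10_000)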

Mixing techniques and tools

In certain scenarios, you could use a mix of these techniques and tools.

When you have a table that’s important to you and needs to be preserved with high quality, you could generate synthetic data for it. Another table, where the data isn’t all that relevant, you can put through a mock generator.

Sometimes it’s viable to mix the techniques and tools and the advantages of doing so can give you high-quality data that behaves like production with a low risk of data leakage.

The disadvantage of a pick-and-mix approach is that maintenance can become hard and require fine-tuning and configuration because, if not done properly, you can end up with the worst-case scenario for each table.

The Synthesized difference

The core technology at Synthesized is a deep learning model that uses state-of-the-art techniques to generate very high-quality data. Once you apply that to your data sources, you can configure data sources, generate data projects and safely collaborate while using different flexible data types and data generation.

The bottom line

Here’s a simplified overview of your data options:

                                 | Production | Obfuscated Subsetting | Mock Data | Synthesized*
Risk of Privacy Leakage          | High       | Medium                | Low       | Low
Data Quality                     | High       | Medium                | Low       | High
Testing Coverage                 | High       | Medium                | Low       | High
Risk of Affecting Live Processes | High       | Low                   | Low       | Low
Efficiency and Scalability       | Medium     | Low                   | High      | High

*Depending on your data provider’s capabilities

Production data is top quality and provides excellent test coverage, but the risk of leakage and time to production is high because of compliance problems.

Obfuscated subsetting is halfway there — it solves some problems with using production data, but the quality is not as good as it should be. Test coverage is also often reduced and time to production is less than ideal. It’s not scalable, and it’s difficult to maintain.

Mock data is fast and the risk of privacy leakage is low, but the quality and test coverage may be low.

Synthetic data is the future: it satisfies modern-day requirements, and competent generators can produce data that looks, feels, and tastes like production data. However, it’s not production data, so you can avoid the privacy problems and scale (as long as you address the capacity and size constraints of your database).

*The efficiency and scalability of your synthetic data will be dependent on your data provider’s capabilities.

The challenges of ensuring data coverage

Data coverage is understanding how many of the test cases and edge cases are covered by the data.

To compute data coverage, you need to enumerate all possible test cases and then check how many of them are covered by the data. Compare this with code coverage: once you’ve run your unit, integration, and other tests, code coverage tooling automatically tells you which lines of code are run, or not, by your tests.

What is code coverage?

Code coverage is a measure of source code testing, used to describe the degree to which the source code of a program is executed when a particular test suite runs.

In this test, the overall coverage is 79.5%, which is good, but it can be much better:

                     | On new code | Overall
Coverage             | 77.0%       | 79.5%
Lines to Cover       | 2,442       | 10,419
Uncovered Lines      | 408         | 1,696
Line Coverage        | 83.3%       | 83.7%
Conditions to Cover  | 926         | 3,522
Uncovered Conditions | 365         | 1,163
Condition Coverage   | 60.6%       | 67.0%

And here is an example of a function that computes a statistical distance between two columns.

The red and green markings in the coverage view tell you how much of each line is covered by the unit tests. The lines marked green were covered.

But where the “mode” argument is checked against an array of allowed values, the red markings indicate that the line was only partially covered (probably because the test cases only exercised a few values, or because the ValueError path was never checked).
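The original screenshot isn’t reproduced here, but a function of that shape might look something like the sketch below (an illustration of ours, not Synthesized’s actual code; the mode names and the ValueError branch are assumptions). A test suite that only ever calls it with mode="ks" leaves the other branches red in the coverage report:

    import pandas as pd
    from scipy.stats import ks_2samp, wasserstein_distance

    def statistical_distance(col_a: pd.Series, col_b: pd.Series, mode: str = "ks") -> float:
        """Compute a statistical distance between two columns.

        The branch on `mode` is exactly where condition coverage matters:
        untested modes and the error path show up as uncovered lines.
        """
        if mode == "ks":
            return ks_2samp(col_a, col_b).statistic
        elif mode == "emd":
            return wasserstein_distance(col_a, col_b)
        else:
            raise ValueError(f"Unknown mode: {mode!r}")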

How do the database and the application interact?

Let’s look at a familiar schema like this:

Data goes from the warehouse to the application (or a data consumer), and it’s difficult to get feedback in the other direction: how would the application tell the data warehouse whether the data is good or bad, or whether it’s what it needs?

In the traditional schema, there’s no connection between the two because they’re isolated and the process is one-directional. What you need is a setup where the application reports back to the data warehouse, so that you can look at the application logs.

From the application logs, you’ll know that the application is interacting with the warehouse. You’ll also see how it interacts, because you can look at the queries and calls it makes to the data warehouse. This way, you understand what your data looks like and what it should look like.

Remember how you split your code into different areas to look at which lines were run and which were not by the unit tests?

You should do the same for your data coverage:

By looking at your application logs, you then divide the data into different rules, making sure the data covers all the rules.

Imagine a two-dimensional space where each rule splits the data domains into different buckets with the aim of ensuring there's at least one data point in each of these buckets.

Schematic map explaining the rule coverage

The green ticks represent an area that would be covered by the data, and the red crosses represent data buckets that won’t be covered.
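In code, rule coverage can be as simple as counting how many of the extracted rules match at least one row of the test data. The rules and column names below are hypothetical; a real set would come from your application logs:

    import pandas as pd

    # Hypothetical rules extracted from the application's query logs.
    rules = {
        "small_payment":    lambda df: df["amount"].between(0, 100),
        "large_payment":    lambda df: df["amount"] > 100,
        "retail_account":   lambda df: df["account_type"] == "retail",
        "negative_balance": lambda df: df["balance"] < 0,
    }

    def rule_coverage(df: pd.DataFrame, rules: dict) -> float:
        """Fraction of rules matched by at least one row of the test data."""
        covered = sum(rule(df).any() for rule in rules.values())
        return covered / len(rules)

    test_data = pd.DataFrame({"amount": [25, 430],
                              "account_type": ["retail", "sme"],
                              "balance": [120.0, 88.5]})
    print(f"{rule_coverage(test_data, rules):.0%} of rules covered")  # 75%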

Mini case study

How a multinational bank increased their data coverage

Splitting data and looking at the logs can make an enormous difference in your data coverage, as one of our clients found out.

With an application that interacted with the data, it was uncertain exactly what was happening between the application and the data. Manual tests were used to understand the coverage, which hovered around a disappointing 50% mark. With the Synthesized model and engine, we segmented the space into 175 rules (or buckets), and in the process also reduced the number of samples from 201 to 99.

Not only was the coverage improved, but the size of the database was also reduced. Production data can be huge and it’s a good idea to shrink it, and this client ended up benefiting from a much smaller database.

Tools for improving data coverage

Great Expectations is a fantastic open-source tool for setting expectations for your data. It’s well documented, easy to use and has a solid community.

Examples of setting expectations

When you’ve extracted these rules, you can plug in Great Expectations and ensure your data coverage is as good as you need it to be. In this example, you’d require the data to fulfil a rule you previously set at least 95% of the time:
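Expressed as a sketch with Great Expectations’ "mostly" argument, which tolerates a configurable fraction of violations, that could look like the following. The file and column names are hypothetical, and this uses the older pandas-based interface, so the entry point may differ in newer releases:

    import pandas as pd
    import great_expectations as ge

    # Wrap a pandas DataFrame so expectations can be run against it.
    df = ge.from_pandas(pd.read_csv("payments.csv"))

    # The rule must hold for at least 95% of rows.
    result = df.expect_column_values_to_be_between(
        "payment_amount", min_value=0, mostly=0.95
    )
    print(result.success)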

DBT is normally used as a transformation tool. You plug it into your database and it will transform your data into data that looks slightly different.

But it’s also a data validation tool and works similarly to Great Expectations.

In this example, you code up a script that looks at your data to ensure the “payment amount is positive” and that’s what you would expect from the rules.

Once you set up your rules, you can plug DBT in and validate that your data looks as you want it to.

As a bonus, you can easily integrate DBT into your CI process to ensure that every time you commit anything, your data looks exactly as you want it to look:

Integration with CI

How do you extract the rules?

There’s currently a great deal of exploration around data and application logs and, most commonly, people use open-source projects that parse SQL queries and logs between databases and the application.

First, you get the query logs and you parse them into rules. Then, you plug in DBT or Great Expectations and transform these rules into expectations. Finally, you validate your data under those expectations.
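A deliberately simplified sketch of that first step, using nothing but regular expressions (real projects use a proper SQL parser, and the queries here are made up):

    import re

    query_log = [
        "SELECT * FROM payments WHERE amount > 100 AND currency = 'EUR'",
        "SELECT id FROM accounts WHERE balance < 0",
    ]

    # Pull individual predicates out of the WHERE clauses; each predicate
    # becomes a candidate rule/expectation for the data.
    predicates = set()
    for query in query_log:
        match = re.search(r"WHERE\s+(.*)", query, flags=re.IGNORECASE)
        if match:
            for pred in re.split(r"\s+(?:AND|OR)\s+", match.group(1), flags=re.IGNORECASE):
                predicates.add(pred.strip())

    print(predicates)  # {"amount > 100", "currency = 'EUR'", "balance < 0"}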

What if your data doesn’t follow your expectations?

It happens.

GenRocket is a great tool to generate data from rules.

There are four rules here, each covering a different bucket of account balance, and depending on where the customer’s balance falls, they qualify for a bronze or a platinum account.

GenRocket can generate data based on your segments, such as the reward level for these particular balances and account numbers:
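If you want a feel for what rule-driven generation looks like without any particular tool, a toy version is easy to write (this is our illustration, not GenRocket’s API; the segment boundaries and field names are assumptions):

    import random

    # Hypothetical segments: balance ranges and the reward level they map to.
    segments = [
        {"reward": "bronze",   "balance_range": (0, 1_000)},
        {"reward": "silver",   "balance_range": (1_000, 10_000)},
        {"reward": "gold",     "balance_range": (10_000, 50_000)},
        {"reward": "platinum", "balance_range": (50_000, 250_000)},
    ]

    def generate_rows(rows_per_segment: int = 5):
        """Emit rows for every segment, so each rule/bucket is guaranteed
        to be covered by the generated data."""
        rows = []
        for seg in segments:
            low, high = seg["balance_range"]
            for _ in range(rows_per_segment):
                rows.append({
                    "account_number": random.randint(10_000_000, 99_999_999),
                    "balance": round(random.uniform(low, high), 2),
                    "reward_level": seg["reward"],
                })
        return rows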

Pitfalls to watch out for

The biggest problem you may face when working with databases, despite standards, is that people build databases in flexible ways, which often results in complex referential integrity constraints.

ML-powered engines go into the database automatically to explore it and then extract the necessary information. But there’s a lot of complex information there — understanding SQL and other languages, queries, rules, and keeping the integrity of the original data structure — and extracting it can be challenging.

All this for data coverage?

We never said it was going to be quick or easy.

It takes thousands of working hours, ML and AI expertise, and a deep understanding of what the end-user will use the data for to come up with a method that covers and balances all the previously covered moving elements.

We’ve been working hard to make it easy not only to choose one, but also to use a combination of the most optimal methods for your specific dataset. You can effortlessly get the data, verify it, set rules and end up with adequate data coverage.

The Synthesized platform takes care of safe access to high-quality data and adequate data coverage, so all you have to do is focus on your business.

How do we do that?

We split the original database into segments and rules. We run the data through the rules and look at how many of these segments are covered by the data. We also add an engine that, given the data and the analyzed rules, can generate more data to cover all the edge cases.

Plug this in, and you can improve your data coverage end to end.

ABOUT THE AUTHOR

Ton Badal is a Machine Learning Engineer turned Solution Architect. How he got to what he’s doing today is worthy of an indie movie: a fascinating journey from Barcelona to Edinburgh and London before landing at Synthesized.

Ton’s background is more science than engineering, but that only added to his curiosity when he kept facing the same problems engineers face around data quality and analysis, and he’s been hard at work at Synthesized to mesh the two worlds together.