Will your data pass the test, or will your test pass the data?


Testing is a crucial stage in the software development life cycle. If set up properly, it can speed up the development process dramatically by increasing efficiency and automation, and most importantly it can drastically increase the number of defects detected before new code is pushed to production. Undetected defects can lead to unexpected behaviour and seriously damage the customer experience.


The key element of testing is data. 

The system to be tested is fed with data and its outputs are analyzed, ensuring proper behaviour. Depending on the type of testing, the input data will have a certain shape, and each company will be concerned with specific characteristics. For example, a healthcare provider will likely prioritize data privacy and quality, while a big online retailer may be more interested in performance and large volumes of data, and a regulatory financial organization requires maximum data coverage. 


So how can you obtain data that fits each situation?

In this blog post, we explore three different ways to obtain and use test data and compare their advantages and disadvantages. We then dive into how the Synthesized DataOps platform can support you with its Test Data Management (TDM) capabilities to deliver the best outcomes and mitigate the drawbacks of traditional approaches.


Production Data

A typical approach is to test directly against production data. When production and testing environments are put together, their schemas are designed to match. Test data is then as close as possible to live data and captures all previous user behaviour, so you can expect good data quality and high test coverage.

However, granting access to production data from test environments comes with immense risks. Access to test environments is usually less restricted, so the chances of a data breach are much higher, as happened to Shutterfly. Customer trust and brand reputation are put at high risk, and the business bottom line is affected as well. Furthermore, accessing production data can interfere with live processes and degrade the performance of the production environment. Finally, production databases are usually vast, which makes test runs long and slows down development, translating into unnecessary delays.


Obfuscated Subset of Production

A commonly used technique is subsetting: taking a smaller portion of the production database and obfuscating it. This technique addresses some of the risks highlighted above, but not all of them. Although it speeds up test execution, obfuscating the data can take a lot of manual effort, and the risk of data leakage in obfuscated non-production environments is still high, as demonstrated in an Uber case in 2018. Furthermore, obfuscation may reduce data quality, potentially changing data types and schemas, and when combined with sub-optimal subsetting it can lead to reduced data coverage.
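To make the idea concrete, here is a minimal sketch of subsetting followed by obfuscation. It assumes rows are plain Python dictionaries, uses random sampling as the subsetting strategy and a salted hash as the obfuscation step; the function names, the sample table and the salt are all hypothetical and are not taken from any particular tool.

```python
import hashlib
import random


def pseudonymize(value: str, salt: str) -> str:
    """Replace a value with a truncated, salted SHA-256 digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def obfuscated_subset(rows, fraction, sensitive_fields, salt, seed=0):
    """Randomly sample a fraction of the rows and hash the sensitive fields."""
    rng = random.Random(seed)
    sample_size = max(1, int(len(rows) * fraction))
    sample = rng.sample(rows, sample_size)
    return [
        {
            key: pseudonymize(str(value), salt) if key in sensitive_fields else value
            for key, value in row.items()
        }
        for row in sample
    ]


# Hypothetical "production" table, for illustration only.
customers = [
    {"id": i, "email": f"user{i}@example.com", "country": "UK" if i % 2 else "DE"}
    for i in range(100)
]

subset = obfuscated_subset(
    customers, fraction=0.1, sensitive_fields={"email"}, salt="hypothetical-salt"
)
```

Even this small sketch shows the drawbacks described above: the hashed email no longer looks like an email address, so format validation in the system under test may behave differently, the untouched columns can still carry identifying information, and naive random sampling can easily miss rare cases.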


Mock Data

Another approach is to use a mock random data generator. There are free-to-use tools that can generate random data given the data types, and they are fast and easy to set up. But because data types, schemas, and other information need to be provided up front, these tools do not scale well and are quite difficult to maintain. As data quality tends to be quite low, the behaviour of generated data can differ from that of production data, leading to low test coverage. Although this approach can be useful in certain situations such as performance testing, it doesn’t offer as much flexibility as one would hope.
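As an illustration of what such a generator does, the sketch below produces random rows from a hand-written schema. The schema, the supported type names and the value ranges are assumptions made for this example and do not correspond to any specific tool.

```python
import random
import string

# Hypothetical schema that has to be written and maintained by hand.
SCHEMA = {"id": "int", "name": "str", "balance": "float", "active": "bool"}


def random_value(col_type, rng):
    """Return a random value for one of the supported column types."""
    if col_type == "int":
        return rng.randint(0, 10_000)
    if col_type == "float":
        return round(rng.uniform(0, 1_000), 2)
    if col_type == "bool":
        return rng.random() < 0.5
    if col_type == "str":
        return "".join(rng.choices(string.ascii_lowercase, k=8))
    raise ValueError(f"unsupported column type: {col_type}")


def mock_rows(schema, n, seed=0):
    """Generate n rows whose columns are filled independently at random."""
    rng = random.Random(seed)
    return [
        {column: random_value(col_type, rng) for column, col_type in schema.items()}
        for _ in range(n)
    ]


rows = mock_rows(SCHEMA, 5)
```

Every column is drawn independently and uniformly, so correlations, realistic value distributions and relationships between tables are all lost, and every schema change has to be mirrored in the hand-written definition.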

So what would the ideal choice be? The optimal choice for testing provides a combination of flexibility and ease of use while still maintaining scalability, privacy and complete coverage.


Synthesized Intelligent Test Data

At Synthesized, we take a different approach to addressing these requirements. Our technology learns the schemas, data types and statistical properties of the production data and is then able to generate unlimited new datasets. When applied to databases, the new data captures complex intra-table behaviour and preserves referential integrity. In addition, the user can enforce rules to ensure coherent data generation, and even generate artificial but realistic values for PII fields such as names, addresses and identifiers.

Our intelligent test data is privacy compliant. It has been tested in a variety of domains, and the generated data is able to prevent privacy leaks even under complex attacks such as linkage or attribute disclosure attacks.

In terms of data quality, the Synthesized approach outperforms the traditional methods. As the new data behaves just like production data, test coverage is similar to that of production. And once the tool has learned the behaviour of production data, it can generate large amounts of data in minutes, making it scalable to non-functional tests such as performance tests.

The table below compares the four techniques presented above.

Comparison of techniques to obtain and use test data


But the Synthesized testing capabilities don’t end here. As we will show in future posts, the Synthesized testing toolbox can generate an optimized test set that increases data coverage while keeping the set as small as possible. In a recent customer engagement with a leading European bank, we validated this approach, increasing test coverage from 45% to 100% and reducing time and costs by 90% when using the Synthesized platform to intelligently generate test data.

Synthesized testing capabilities outperform traditional TDM platforms on many fronts such as flexibility, scalability, privacy compliance, data coverage and optimization. In upcoming posts, we will explore the Synthesized testing capabilities in more depth.
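To give a feel for the general idea of a small test set with high coverage, here is a toy sketch based on a greedy set-cover heuristic. This is a textbook technique used purely for illustration, not a description of how the Synthesized platform works; the row identifiers and case labels are hypothetical.

```python
def greedy_small_test_set(candidates):
    """Pick rows until every case covered by any candidate is covered.

    candidates maps a row id to the set of test cases that row exercises.
    The greedy heuristic yields a small set, not necessarily the smallest.
    """
    uncovered = set().union(*candidates.values()) if candidates else set()
    chosen = []
    while uncovered:
        # Take the row that covers the most still-uncovered cases.
        best = max(candidates, key=lambda row_id: len(candidates[row_id] & uncovered))
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen


# Hypothetical candidate rows and the cases each one exercises.
candidates = {
    "row_a": {"new_customer", "uk"},
    "row_b": {"new_customer", "overdraft"},
    "row_c": {"uk"},
    "row_d": {"overdraft", "closed_account"},
}

selected = greedy_small_test_set(candidates)
print(selected)  # ['row_a', 'row_d']
```

Here two of the four rows are enough to exercise all four cases, which is the kind of trade-off between coverage and test set size described above.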

