SYNTHESIZED DATA QUALITY REPORT

Practical Guide to Data-Driven Testing

EXECUTIVE SUMMARY
Testing is a crucial stage in the software development life cycle. If set up properly, it can speed up the development process dramatically by increasing efficiency and automation, and most importantly it can drastically increase the number of detected defects before pushing new code into production. Undetected defects can lead to unexpected behaviour and seriously damage the customer experience.

Data plays a key role in testing, so how can we obtain data that adjusts to each situation? The optimal choice for testing provides a combination of flexibility and ease of use while still maintaining scalability, privacy and complete coverage.

In this guide, we introduce you to the fast-growing concept of data products (Data as a Product). Practitioners can apply product techniques to data and gain access to high-quality, fully compliant test data in large volumes, on demand, with the same schemas, data types and statistical properties as the production data.

We deep dive into our all-in-one DataOps platform enabling intelligent test data and collaboration across internal teams and external partners. We explore the platform’s capabilities and its advantages in comparison with traditional approaches of data-driven testing.

Data — the essential element in testing

The system to be tested is fed with data and its outputs are analyzed, ensuring proper behaviour. Depending on the type of testing, the input data will have a certain shape, and each organisation will prioritise specific characteristics. For example, a healthcare provider will likely prioritize data privacy and quality, while a large online retailer may be more interested in performance and large volumes of data, and a regulatory financial organization requires maximum data coverage.

So how can we obtain data that adjusts to each situation?
There are three types of test data that can be obtained: obfuscated production data, mock (fake) data, and intelligent test data. Each test data type has its advantages and disadvantages.

Production data

A traditional approach is to use production data directly for testing. Production and testing environments are designed with matching schemas, so test data is as close as possible to live data and captures all previous user behaviour. You can therefore expect good data quality and high test coverage.

Yet, granting permissions to access production data from test environments comes with immense risks. Access to test environments is usually less restricted, and therefore the chances of having data breaches are much higher, as happened to Shutterfly recently. Customer trust and brand reputation are put at high risk, in addition to the business bottom line being affected. Furthermore, accessing production data can affect live processes and worsen production environment performance. Finally, production databases are usually vast, making the testing process long and slowing down the development process, leading to unnecessary delays.

Obfuscated subset of production

Another commonly used technique is subsetting — using a smaller portion of the production database and obfuscating it. Although it speeds up testing execution, obfuscating the data can take a lot of manpower, and the risk of data leakage in non-production obfuscated environments is still high, as demonstrated in an Uber case in 2018.

Furthermore, such techniques may reduce data quality, potentially changing data types and schemas, and combined with sub-optimal subsetting, can lead to reduced data coverage.

Mock (aka fake) data

Another approach is to use a mock (fake) random data generator. There are free-to-use tools that can generate random data given the data types and are fast and easy to set up. But as data types, schemas, and other information need to be provided, they are not scalable and quite difficult to maintain. As data quality tends to be quite low, generated data behaviour can differ from production data, thus leading to low testing coverage. Although this approach can be useful for certain situations such as performance testing, it doesn’t offer as much flexibility as one would hope.
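
The limitation described above can be sketched in a few lines. In this illustrative stand-in for a mock-data tool (all names are hypothetical), every column type must be declared by hand, which is exactly why such generators scale poorly as schemas grow or change:

```python
import random
import string

# Every column type must be specified manually; nothing is learned
# from production data, so maintenance cost grows with the schema.
SCHEMA = {"customer_id": "int", "name": "str", "balance": "float"}

def random_value(col_type):
    if col_type == "int":
        return random.randint(1, 10_000)
    if col_type == "float":
        return round(random.uniform(0.0, 5_000.0), 2)
    return "".join(random.choices(string.ascii_lowercase, k=8))

def generate_rows(schema, n):
    return [{col: random_value(t) for col, t in schema.items()} for _ in range(n)]

rows = generate_rows(SCHEMA, 100)
```

Because values are drawn independently and uniformly, the generated data shares none of the statistical structure of production data, which is what drives the low test coverage noted above.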

The optimal choice for testing provides a combination of flexibility and ease of use while still maintaining scalability, privacy, and complete coverage.

Intelligent test data

In an ideal world, we want access to high-quality, fully compliant, large volumes of test data on demand, with the same schemas, data types, and statistical properties as the production data. When applied to databases, the new data should be capable of capturing complex inter-table behaviour and preserving referential integrity. In addition, the user should be able to enforce rules to ensure coherent data generation, and even generate artificial but realistic values for PII fields such as names, addresses, and personal identifiers, among others.

Intelligent test data should be privacy compliant. It should be tested in a variety of domains, and the generated data should be able to prevent privacy leaks even under complex attacks such as linkage or attribute disclosure attacks.
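
A linkage attack tries to re-identify individuals by matching records on quasi-identifiers. As a minimal sketch (not the platform's actual privacy test), one naive check flags synthetic rows that exactly match a real row on a set of quasi-identifier columns; the column names here are illustrative assumptions:

```python
# Naive linkage-risk check: a synthetic row matching a real row on all
# quasi-identifiers could let an attacker link it back to a real person.
QUASI_IDENTIFIERS = ("zip_code", "birth_year", "gender")

def linkage_matches(real_rows, synthetic_rows, keys=QUASI_IDENTIFIERS):
    real_keys = {tuple(row[k] for k in keys) for row in real_rows}
    return [row for row in synthetic_rows
            if tuple(row[k] for k in keys) in real_keys]

real = [{"zip_code": "10001", "birth_year": 1980, "gender": "F"}]
synthetic = [{"zip_code": "10001", "birth_year": 1980, "gender": "F"},
             {"zip_code": "94105", "birth_year": 1975, "gender": "M"}]
matches = linkage_matches(real, synthetic)  # first synthetic row is at risk
```

Real privacy testing goes well beyond exact matching (for example, measuring attribute disclosure under partial knowledge), but this captures the kind of leak the generated data must prevent.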

Success business metrics for data-driven testing

There are three core metrics by which businesses should measure the success of data-driven testing, whether functional (regression, UAT, etc) or non-functional (performance, usability, reliability, etc).

AUTOMATION, TIME & COST
There should be an automated, scientific way of measuring the test coverage of a test dataset, and the time required to set up test cases and create test data should be minimal.
TEST COVERAGE
The data should have optimal coverage for all test cases and test rules.
COMPLIANCE
The process of data-driven testing should be compliant with internal and external data governance and privacy policies.

The Synthesized DataOps platform for intelligent test data

Synthesized is an all-in-one DataOps platform enabling high-quality fully compliant intelligent test data, and collaboration across internal teams and external partners.

The platform enables the creation of new data products in under 10 minutes

Our platform uses machine learning and statistics to learn the schemas, data types, and statistical properties of the production data and is then able to generate unlimited new data products.

The Synthesized approach outperforms traditional methods in terms of data quality. As the new data behaves just like production data, test coverage is similar to production. And once the tool has learned production data behaviour, it can generate large amounts of data in minutes, making it scalable to non-functional tests such as performance tests.
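
As a greatly simplified stand-in for the platform's models, the "learn then generate" idea can be sketched by fitting each column's empirical distribution and sampling new rows from it (all column names and data here are illustrative):

```python
import random
from collections import Counter

# Fit the empirical distribution of each column from production rows.
def fit_columns(rows):
    model = {}
    for col in rows[0]:
        counts = Counter(row[col] for row in rows)
        values, weights = zip(*counts.items())
        model[col] = (values, weights)
    return model

# Sample as many new rows as needed from the fitted distributions.
def sample_rows(model, n):
    return [{col: random.choices(vals, weights=w)[0]
             for col, (vals, w) in model.items()}
            for _ in range(n)]

production = [{"plan": "basic", "region": "EU"},
              {"plan": "basic", "region": "US"},
              {"plan": "pro", "region": "EU"}]
model = fit_columns(production)
synthetic = sample_rows(model, 1000)
```

Note that sampling columns independently discards correlations between them; modelling that joint behaviour (along with schemas and referential integrity) is precisely what the machine-learning approach adds over this sketch.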

Comparison of four techniques to obtain and use test data:

                                     Production   Obfuscated Subsetting   Mock Data   Synthesized
  Risk of Privacy Leakage            High         Medium                  Low         Low
  Data Quality                       High         Medium                  Low         High
  Testing Coverage                   High         Medium                  Low         High
  Risk of Affecting Live Processes   High         Low                     Low         Low
  Efficiency and Scalability         Medium       Low                     High        High

Synthesized testing capabilities outperform traditional TDM platforms on many fronts such as flexibility, scalability, privacy compliance, data coverage, and optimization.

Synthesized intelligent test data is privacy compliant. It has been tested in a variety of domains, and the generated data can prevent privacy leaks even under complex attacks such as linkage or attribute disclosure attacks.

Synthesized testing capabilities go even further: the platform can generate an optimized test set that increases data coverage while keeping its size as small as possible.

The approach was validated in a recent customer engagement with a leading multinational bank, increasing the test coverage from 45% to 100% and reducing time & costs by 90%. Furthermore, the platform delivers on the core success metrics:

PROCESS AUTOMATION
Synthesized provides an automated, scientific way of measuring the test coverage of a test dataset; without the platform, there is no automatic way of measuring that coverage.
TEST COVERAGE
Synthesized generates new test datasets with 96-100% coverage of all test cases and specified test rules within the provided framework, compared with typical existing coverage of 40-50%.
TIME & COST SAVINGS
Synthesized reduces by approximately 90% the manual time and effort required to set up test cases and create test data.

Synthesized platform capabilities


CORE CAPABILITIES

  • Full database generation for testing including data subsetting, maintaining complete referential integrity
  • Data coverage assessment for digital testing
  • Test Data as a Service (DaaS) & self-service model
  • Data migration from database to database (SAP HANA to SAP S/4HANA, Hadoop to Oracle, SAS to Oracle, etc), including any relational file format
  • Changes to existing files or tables
  • Data integrity or data sync up across databases

UNIQUE CAPABILITIES

  • Intelligent data scenarios for data science, analytics, and digital testing
  • Support for incremental refreshes

ENTERPRISE CAPABILITIES

  • Enterprise-grade security
  • Role-based access control
  • Single sign-on
  • Deployment & integrations (private cloud, on-premise)
  • Field-level encryption/decryption with HashiCorp Vault, with rate-limit support
  • Data formats support
  • Infosec requirements

Synthesized core capabilities

Full database generation for data-driven testing

The Synthesized Database Generation capability allows the user to generate a privacy-compliant version of a database for digital testing in minutes.

All meta-information (such as table names, columns, primary and foreign keys, data types, column distributions among others) can be preserved, but the actual data contained is synthetic. Once Synthesized learns the structure of the database, it can generate as many rows per table as needed.

The following attributes of the original database are preserved in the Synthesized copy:

  • Table and column names
  • Column data types
  • Primary and foreign keys
  • Foreign key distributions (approximated)
  • Column probability density distributions (approximated)
  • Textual data (character encoding and formatting)

There is no limit to the number of tables in the database. Synthetic data produced by Synthesized as part of the full database generation preserves the referential integrity of relevant fields and values linking multiple tables, including where these fields constitute synthetic PII appearing in more than one table.
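
The referential-integrity guarantee can be sketched as follows: child-table foreign keys are drawn only from the synthetic parent table's primary keys, so every generated child row still points at an existing parent. Table and column names are illustrative:

```python
import random

# Generate a synthetic parent table with fresh primary keys.
def generate_parent(n):
    return [{"customer_id": i} for i in range(1, n + 1)]

# Child rows draw foreign keys only from the synthetic parents,
# so referential integrity holds by construction.
def generate_children(parents, n):
    parent_ids = [p["customer_id"] for p in parents]
    return [{"order_id": i, "customer_id": random.choice(parent_ids)}
            for i in range(1, n + 1)]

customers = generate_parent(100)
orders = generate_children(customers, 500)
valid_ids = {c["customer_id"] for c in customers}
```

The platform additionally approximates the foreign-key frequency distribution of the original data (how many orders a typical customer has), rather than sampling parents uniformly as this sketch does.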

Data coverage assessment for digital testing

During the assessment of original data, the platform measures the completeness of test data and its suitability for testing purposes. Test data must be optimized and complete.

The Synthesized platform provides a scientific framework for measuring the test coverage of functional mappings associated with test data. During database generation, the platform understands what the synthetic data should look like and optimises the coverage of the test database, making the data complete for digital testing.

Test Data as a Service (DaaS) & self-service model

To build, deploy, monitor, and access a test data product, the Synthesized platform provides the infrastructure for testing teams to autonomously own their test data products through a high-level abstraction of the infrastructure. It removes the complexity and friction of provisioning and managing the lifecycle of test data products, in a self-service manner that enables domain autonomy.

The intelligent test data product is a fast-growing concept in the data world that applies product techniques to data.

Data product = Synthesized data* + Ownership + Business Purpose
(functional or non-functional)

*Synthesized data consists of either: original data, augmented test data (reshaped original data), intelligent synthetic data, hybrid data (mixture of augmented original data and fully synthetic data) or differentially-private data, depending on the business requirements.

Data migration from one datasource to another

The Synthesized DataOps platform enables testing teams to migrate production workloads to the cloud faster and with greater confidence. Using the SDK and the REST API, engineers can sync production data to the cloud. The user can then provision, clean up, integrate, and version data to drive testing, cutover rehearsal, and production support.

The complete list of supported databases for testing migration of production workloads to the cloud is available in the section “Deployment & Integrations” below.

Changes to existing files or tables

Typically some attributes in a production dataset may have underrepresented groups and categories. Scenarios like these can lead to unexpected outcomes, such as reduced testing coverage.

The data rebalancing feature of the platform allows the user to alter distributions as desired, and rebalance datasets by generating realistic samples for the underrepresented groups. This feature facilitates rapid prototyping of data-driven solutions for tightly defined problems in software engineering and also testing.
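
A simplified view of rebalancing (not the platform's implementation) is oversampling an underrepresented group until it reaches a target share of the dataset; here plain resampling with replacement stands in for generating new realistic samples, and the column and label names are illustrative:

```python
import random

# Oversample rows of an underrepresented group until that group makes up
# `target_share` of the result. (The platform generates new realistic
# samples; resampling with replacement stands in for that here.)
def rebalance(rows, column, group, target_share):
    group_rows = [r for r in rows if r[column] == group]
    others = [r for r in rows if r[column] != group]
    # Solve n_group / (n_group + len(others)) >= target_share for n_group.
    needed = int(target_share * len(others) / (1 - target_share) + 0.999)
    resampled = [random.choice(group_rows) for _ in range(needed)]
    return others + resampled

data = [{"label": "fraud"}] * 2 + [{"label": "ok"}] * 98
balanced = rebalance(data, "label", "fraud", 0.3)
share = sum(r["label"] == "fraud" for r in balanced) / len(balanced)
```

Raising a rare class from 2% to 30% in this way is typical for use cases such as fraud detection, where models and tests otherwise see too few positive examples.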

Data integrity across databases

Synthesized DataOps platform automatically monitors data source systems to capture ever-changing original raw data. As data changes, the system ensures that development, test, and non-production environments have optimised, fresh data where required. The system does this without installing agents on production systems and without impacting production performance.

Synthesized unique capabilities

Intelligent data scenarios

Unlike alternative solutions to data provisioning, the Synthesized platform can incorporate intelligent data scenarios into the data products, making them complete and improving the coverage of data. It enables businesses to evaluate the performance of a system or a model in a much broader range of scenarios compared to the range of scenarios contained in the original data.

Intelligent data scenarios are created with the Synthesized platform by defining custom subsets or groups of the original data that have user-specified properties and distributions. Each group can be labeled to define a specific scenario, and an arbitrary number of realisations of these scenarios can be generated. Intelligent data scenarios can be used to amplify the value of an existing dataset.

For example, a range of scenarios can be generated to evaluate the behaviour of a model under significant population shifts in the data. In fraud detection, for instance, solutions can suffer significantly from not having seen enough fraudulent examples in the original data. The Synthesized platform can scientifically amplify groups in the data to ensure solutions built on synthesized data products achieve optimal performance.
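
The scenario mechanism described above can be sketched as labeled predicates over the original data, with realisations drawn from each matching subset so that rare regimes can be amplified on demand (scenario names, predicates, and fields are illustrative assumptions):

```python
import random

# Each scenario is a labeled predicate over the original data.
SCENARIOS = {
    "market_crash": lambda row: row["daily_return"] < -0.05,
    "normal_day": lambda row: abs(row["daily_return"]) <= 0.05,
}

# Draw n realisations of a scenario (with replacement) from the matching
# subset, tagging each generated row with its scenario label.
def realise(rows, scenario, n, scenarios=SCENARIOS):
    subset = [r for r in rows if scenarios[scenario](r)]
    return [dict(random.choice(subset), scenario=scenario) for _ in range(n)]

history = [{"daily_return": -0.08}, {"daily_return": 0.01},
           {"daily_return": 0.02}]
crash_cases = realise(history, "market_crash", 200)
```

Where this sketch merely resamples matching rows, the platform generates new realistic realisations of each scenario, which is what lets it evaluate a system far beyond the range of situations present in the original data.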

Intelligent data scenarios are captured in the snapshot below.

Support for incremental refreshes

The Synthesized Database Generation can be configured to incrementally refresh if the source data changes. Instead of reloading all the data from a database, the data from individual tables is incrementally synthesized where possible to avoid unnecessary queries and data transfers.
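
One common way to implement change detection of this kind (a sketch under assumed semantics, not the platform's mechanism) is to fingerprint each source table and re-synthesize only the tables whose fingerprint changed since the last run:

```python
import hashlib
import json

# Fingerprint a table's contents deterministically.
def fingerprint(table_rows):
    payload = json.dumps(table_rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

# Compare current fingerprints with the previous snapshot and return
# only the tables that actually need re-synthesizing.
def tables_to_refresh(source, previous_fingerprints):
    return [name for name, rows in source.items()
            if fingerprint(rows) != previous_fingerprints.get(name)]

source = {"customers": [{"id": 1}], "orders": [{"id": 10}]}
snapshot = {name: fingerprint(rows) for name, rows in source.items()}
source["orders"].append({"id": 11})  # only this table changed
stale = tables_to_refresh(source, snapshot)
```

Only the stale tables are reloaded and re-synthesized, which avoids the unnecessary queries and data transfers a full reload would incur.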

Synthesized enterprise capabilities

Enterprise-grade security

Synthesized supports industry-standard security protocols and primitives including Secure Sockets Layer (SSL), JWT tokens and bcrypt password hashing. Synthesized has no access to any data, including usage data, processed or generated by the platform. Synthesized output data is designed to be compliant with standard industry regulatory data governance and privacy frameworks including GDPR, HIPAA, and CCPA.

Role-based access control

Synthesized supports separation of privileges, admin features, LDAP and AD integrations.

Single Sign-On

Synthesized integrates with SAML 2.0, OpenID, and Active Directory.

Deployment & integrations

Synthesized supports datacenter / on-premise installation, private cloud (Google Cloud, AWS, Azure, OpenShift) and public cloud upon request. It can connect to any relational database and integrates with existing ETL engines if needed.

Source and target database servers and OS platform include:

• Oracle

  • Oracle 9.2.0.8, 10.2, 11.1, 11.2, 12.1
  • Solaris (SPARC/X86), RHEL, OEL, SUSE, AIX, HPUX

• SQL Server

  • SQL Server 2005, 2008, 2008R2, 2012, 2014
  • Windows Server 2003SP2, 2003R2, 2008, 2008R2, 2012, 2012 R2

• PostgreSQL Server

  • PostgreSQL/EnterpriseDB Postgres Plus Advanced Server 9.2
  • RHEL 5, 6

• SAP ASE Server

  • SAP ASE 12.5, 15.0.3, 15.5, 15.7
  • RHEL 6.2, 6.3, 6.4 | Solaris 10 (SPARC/X86)

• MYSQL Server

  • MySQL Community 5.5, 5.6, 5.7.7 | Enterprise 5.6 | Maria > 10.0.10
  • RHEL 6.2, 6.3, 6.4 | Solaris 11 (SPARC/X86)

• DB2 Server

  • DB2 LUW 10.1+
  • RHEL 6.5+, AIX 6.1+

It is compatible with big data storage systems including Google BigQuery and Amazon Redshift.

If this list does not include the integration options you need, reach out by email at contact@synthesized.io

Data formats support

Synthesized supports any structured data formats including CSV, TSV, XLS, Parquet, JSON and XML.