This is some text inside of a div block.

SYNTHESIZED DATA QUALITY REPORT

Generating Data and Maintaining Referential Integrity with the DB Synthesizer

You can also download this report

SUMMARY

The DB Synthesizer is Synthesized’s tool for synthesizing data from one database to another. This is a deep dive where we’ll look into the results of the synthesization including the quality of the generated data and how referential integrity is maintained.

Getting started

So let’s dive in. Let’s start with some questions around how the DB Synthesizer works:

What information is preserved in the DB Synthesizer?

Tables and column names. The schema of all tables and columns will be copied from the original source
Data types. All columns in the destination database will have the same data type as in the source database

DDL. The DDL for both databases (including constraints, procedures, views, sequences, etc.) will be present in the Synthesized database as they were in the source database

Referential Integrity. Primary and Foreign keys will be copied, ensuring referential integrity is preserved so that the user can query data with join statements and obtain similar results. A configuration file with additional relationships can be provided

Key Cardinality. Foreign key distributions are generated as close as possible to the source database to ensure similar cardinality

Column marginal distributions. Column marginal distributions are approximated and sampled from probability distributions to be similar to those in the original database

‍

What information is the DB Synthesizer extracting from the original database? What information is put into the destination database?

All the structural data and data values will be almost exactly as they are in the original database. Foreign key distributions, which refer to points such as referential integrity and cardinality, will be approximated from the original database to serve data to their destination database.

This means that the destination database is going to have a similar to, but not exact distribution of referential integrity and cardinality as the original database.

The column probability density function, this density distribution is also going to be approximated. So the distributions are going to look quite similar to the original data, but they are not going to be exactly the same data. This results in generated data that is private.

‍

Using the DB Synthesizer to generate new data

Now that you have an understanding of how the DB Synthesizer works, let’s look into an example.

The original database used in this report is from a credit system, and contains historical information about users, and what credit lines, products and campaigns they participated in, among some other information. This database has seven tables along with several primary and foreign keys.

Table sizes in this database go from 10 to 1M rows, and while for the purpose of this report a database with only 7 tables was chosen, the DB Synthesizer is scalable and is able to work with databases of virtually any size (the software has been tested and validated against production size databases with thousands of tables).

The DB Synthesizer can be run using a command line interface, which is going to run the java application this simple command:

$ java -jar target/db-synthesizer.jar \
-u “jdbc:postgresql://39.176.361.124:1234” \
-d credit \
-a synthesize \
-U system \
-W password123 \
-l “jdbc:postgresql://39.176.361.1:1234” \
-s credit_synth

And what you can see here is the origin URL, the name of the database of the origin, and down here, the destination connection, URL and destination database name, and the other parameters that can be actioned.

Here we’ll perform an action such as copying or synthesizing the data. We’ll configure the DB Synthesizer here by adding information about the source database along with user and password information. You can configure these and other parameters.

More information can be found here: docs.synthesized.io

You can see all the other configuration elements that you can, that you can run. You can also run the DB Synthesizer from an API. So you could see it. You can find more information about how to run it here. There are other parameters that can be added to the DB Synthesizer, more detailed information can be found in the configuration section in the documentation.

All of these parameters can be added for extra and more personalized output. So all the information such as primary keys column types, et cetera, will be extracted from the database. But there are some cases where some information needs to be provided manually to the Synthesizer. And with this configuration file, we are able to add that information to the database.

Now that the DB Synthesizer has been run for this database, let’s look at the results. To do so, we’ll connect the original and destination databases and compare the results from both databases. Firstly, let’s confirm the tables in the synthetic database (destination) are the same as the original database.

So we have seven tables, which is the same number as in the original database.

Table Name

Original Shape

Synthesized Shape

products

(22, 4)

customerregistration

(32593, 4)

creditline

(206, 5)

documents

(6364, 3)

customercreditline

(173912, 5)

customerdocuments

(1000000, 5)

customerinfo

(28785, 10)

select * from credit.customerinfo limit 5;

id_
customer

gender

region

highest_
education

imd_band

age_band

num-of-prev_
credit_lines

opened_
lines

id_
product

11391

East Anglian Region

HE Qualification

90-100%

55<==

28400

Scotland

HE Qualification

20-30%

35-55

30268

Noth Western Region

A Level or Equivalent

30-40%

35-55

31604

South East Region

A Level or Equivalent

50-60%

35-55

32885

West Midlands Region

Lower Than A Level

50-60%

0-55

select * from credit_synth.customerinfo limit 5;

id_
customer

gender

region

highest_
education

imd_band

age_band

num-of-prev_
credit_lines

opened_
lines

id_
product

South Region

A Level or Equivalent

0-10%

0-35

East Midlands Region

HE Qualification

0-10%

35-55

South Region

Lower Than A Level

80-90%

0-35

North Western Region

Lower Than A Level

80-90%

0-35

West Midlands Region

A Level or Equivalent

10-20%

0-35

The next thing we’ll look at is referential integrity. Based on the first five rows of this new table, we’re able to see that the data looks very similar to the original database. And then if we look at the shapes of the data in the tables, you’ll see that the shapes and the number of rows and columns of the original and Synthesized tables are the same.

SELECT * FROM credit.customerinfo
WHERE id_customer = 11391

id_credit_line

id_customer

date_start

is_default

min_credit_score

1752

11391

78.0

501

1753

11391

85.0

842

1754

11391

115

80.0

1032

1755

11391

164

85.0

1457

1756

11391

212

82.0

Now, if we do the same exercise and pick one random customer in the synthetic data, for example customer ID 265, this is how the row looks.

SELECT * FROM credit_synth.customerinfo
WHERE id_customer = 265

000

id_
customer

gender

region

highest_
education

imd_band

age_band

num-of-prev_
credit_lines

opened_
lines

dis-ability

id_
product

265

265

South Region

A Level or Equivalent

20-30%

0-35

And then we can go to the customer credit line for the same user in the synthetic database and we’ll see that this one has five rows as well.

SELECT * FROM credit_synth.customerinfo
WHERE id_customer = 265

id_credit_line

id_customer

date_start

is_default

min_credit_score

68624

265

106

89.371582

501

130

265

62.724920

842

168

265

62.297032

1032

265

64.593113

This is a quick example of how referential integrity is preserved. Furthermore, if we join these tables on the customer ID parameter in both the original and synthetic databases, we’ll see that there are some distributions that are preserved even after joining both tables.

For example, if we look at the default rate, we can see that it’s the same for the original database. This also applies to other fields like the minimum credit score. It’s been approximated quite nicely.

Original counts

Synthesizer counts

0.92989

0.92985

0.07011

0.07015

Let’s do the same exercise for the other tables. If we look at customer information, we’ll look at this customer to see how it looks and this customer only has one ID product 15.

Which means it only exists once.

SELECT * FROM credit.customerinfo
WHERE id_customer = 11391

00000

id_
customer

gender

region

highest_
education

imd_band

age_band

num-of-prev_
credit_lines

opened_
lines

dis-ability

id_
product

19542

35860

South West Region

A Level or Equivalent

10-20

0-35

• Original products for id_customer = 11391, that has id_product = 15

SELECT * FROM credit.products
WHERE id_product = 15

product_code

campaign_code

campaign_length

id_product

FFF

2013B

240

While we thought we were looking at a one to many relationship, in this case, we are actually looking at a one-to-one relationship. And you can see that in the synthetic database, it’s also a one-to-one relationship. There’s only one product with ID 14, so the database was able to keep that structural relationship.

• Synthesized customer info for id_customer = 265

SELECT * FROM credit_synth.customerinfo
WHERE id_customer = 265

0000 00

id_
customer

gender

region

highest_
education

imd_band

age_band

num-of-prev_
credit_lines

opened_
lines

disability

id_
product

1739

1739

South West Region

A Level or Equivalent

10-20

0-35

• Synthesized customercreditline for id_customer = 265

SELECT * FROM credit_synth.products
WHERE id_product = 14

product_code

campaign_code

campaign_length

id_product

BBB

2013J

250

And if we join these tables again, you can see in the first five rows there are no duplicates.

SELECT * FROM credit.customerinfo ci
JOIN credit.products p ON p.id_product = ci.id_product

id_
customer

gender

region

highest_
education

imd_band

age_band

num-of-prev_
credit_lines

opened_
lines

disability

id_
product

product_code

campaign_
code

campaign_
length

11391

East Anglian Region

HE
Qualification

90-100%

55<==

AAA

2013J

268

28400

Scotland

HE
Qualification

20-30%

35-55

AAA

2013J

268

30268

North Western Region

A Level or Equivalent

30-40%

35-35

AAA

2013J

268

31604

South East Region

A Level or Equivalent

50-60%

35-55

AAA

2013J

268

32885

West Midlands Region

Lower Than A Level

50-60%

0-35

AAA

2013J

268

And again, here, there are no duplicates. So the referential integrity of this structure was kept.

SELECT * FROM credit_synth.customerinfo ci
JOIN credit_synth.products p ON p.id_product = ci.id_product

id_
customer

gender

region

highest_
education

imd_band

age_band

num-of-prev_
credit_lines

opened_
lines

disability

id_
product

product_code

campaign_
code

campaign_
length

South Region

A Level or Equivalent

0-10%

0-35

EEE

2014J

269

West Midlands Region

Lower Than A Level

20-30%

0-35

EEE

2014J

269

South East Region

Lower Than A Level

30-40%

0-35

EEE

2014J

269

Wales

A Level or Equivalent

80-90%

0-35

EEE

2014J

269

West Midlands Region

Lower Than A Level

60-70%

0-35

EEE

2014J

269

Let’s continue and plot the heat map for these two columns, the age band and campaign code in the original and synthesized tables. After joining these tables, it’s important to highlight that the age band is in one of the tables while the campaign code is in the other table. So age band is in the customer information table, while the campaign code is in the products table. You can see that even after joining both tables, the heat maps are quite similar.

‍

We can plot other things like the campaign length against age bands, and the distributions are very similar on the campaign length.

The distributions are also quite similar when looking at campaign codes and product codes. So you can see how the distributions are kept.

categorical_distribution_plot(df_mdl_orig[‘campaign_code’], df_mdl_synth[‘campaign_code’], title=’campaign_code’)

categorical_distribution_plot(df_mdl_orig[‘product_code’], df_mdl_synth[‘product_code’], title=’product_code’)

‍

Let’s go a step further and join the three tables. So we’ll join credit line with customer credit and customer credit line via this ID credit line. And then we also join credit products on ID product, and we then apply the same query to the synthetic database and plot the distribution.

SELECT * FROM credit.creditline cl
JOIN credit.customercreditline ccl on ccl.id_credit_line = cl.id_credit_line
JOIN credit.products p on p.id_product = ccl.id_product)

SELECT * FROM credit_synth.creditline cl
JOIN credit_synth.customercreditline ccl on ccl.id_credit_line = cl.id_credit_line
JOIN credit_synth.products p on p.id_product = ccl.id_product

categorical_distribution_plot(df_assmnt_orig[‘campaign_code’], df_assmnt_synth[‘campaign_code’], title=’campaign_code’)

categorical_distribution_plot(df_assmnt_orig[‘product_code’], df_assmnt_synth[‘product_code’], title=’product_code’)

‍

‍

You can see that even after joining three tables, the distributions are kept quite nicely. So you can see it for the campaign code and on the product code. And if we look at customers per product, we’ll see the same results.

df_vle_orig = dfs_orig[‘customerinfo’].merge(dfs_orig[‘customerdocuments’], on=’id_customer’, how=’left’)
df_vle_orig.head(5)

id_
customer

gender

region

highest_
education

imd_band

age_band

num-of-prev_
credit_lines

opened_
lines

disability

id_
product_x

id_document

date

sum_click

id_
product_y

11391

East Anglian Region

HE Qualification

90-100%

55<==

546669.0

-5.0

16.0

0.0

11391

East Anglian Region

HE Qualification

90-100%

55<==

546662.0

-5.0

44.0

0.0

11391

East Anglian Region

HE Qualification

90-100%

55<==

546652.0

-5.0

1.0

0.0

11391

East Anglian Region

HE Qualification

90-100%

55<==

546668.0

-5.0

2.0

0.0

11391

East Anglian Region

HE Qualification

90-100%

55<==

546652.0

-5.0

1.0

0.0

Now let’s take a look at the foreign keys. We want to make sure that we preserve the value counts when we’re talking about foreign keys. For example, in the case where we have the customer and customer transactions we would want to make sure that the number of transactions per customer is similar to ensure referential integrity is preserved.

This is known as cardinality which is plotted above. For example, here, all values are unique in the both original and synthetic databases. You can see that this distribution is quite accurate. Most of them only have one occurrence when ID customer number is one. There are 8,000 that only have one occurrence. So the majority are biased for one occurrence, but most of them have multiple occurrences. And you can see a similar situation where referential integrity is approximated quite nicely.

Last but not least, we also need to look at column distribution. If we look at one column, particularly will the distribution be preserved. And here you can see the submissions of the single column for the table customer information.

You can see that continuous and categorical values, are all preserved quite nicely. Obviously the primary key is not preserved because we generated new primary keys to avoid privacy leakage. So the primary keys are generated independently from the, from the original values. And you can see it.

The rest of the values are fairly similar. And similarly, if we look at the customer credit line table again and specifically the ID customer line, and ID customer are not approximated nicely because we generated new values for this, because otherwise these can be identifiers and the database could be attacked with these values, but for the rest of our values you can observe the distributions.

‍

Conclusions

In this report we demonstrated how the DB Synthesizer is able to explore a given database, extract and learn all the information contained within it, and generate an entirely new database whilst maintaining the referential integrity necessary to ensure the proper linkage between multiple tables within the database.

Please feel free to reach out with any questions or if you’d like to learn more to: letschat@syntheszed.io

Generating Data and Maintaining Referential Integrity with the DB Synthesizer

Getting started

What information is preserved in the DB Synthesizer?

What information is the DB Synthesizer extracting from the original database? What information is put into the destination database?

Using the DB Synthesizer to generate new data

$ java -jar target/db-synthesizer.jar \ -u “jdbc:postgresql://39.176.361.124:1234” \ -d credit \ -a synthesize \ -U system \ -W password123 \ -l “jdbc:postgresql://39.176.361.1:1234” \ -s credit_synth

select * from credit.customerinfo limit 5;

select * from credit_synth.customerinfo limit 5;

SELECT * FROM credit.customerinfoWHERE id_customer = 11391

SELECT * FROM credit_synth.customerinfoWHERE id_customer = 265

SELECT * FROM credit_synth.customerinfoWHERE id_customer = 265

SELECT * FROM credit.customerinfoWHERE id_customer = 11391

SELECT * FROM credit.productsWHERE id_product = 15

SELECT * FROM credit_synth.customerinfoWHERE id_customer = 265

SELECT * FROM credit_synth.productsWHERE id_product = 14

SELECT * FROM credit.customerinfo ciJOIN credit.products p ON p.id_product = ci.id_product

SELECT * FROM credit_synth.customerinfo ciJOIN credit_synth.products p ON p.id_product = ci.id_product

categorical_distribution_plot(df_mdl_orig[‘campaign_code’], df_mdl_synth[‘campaign_code’], title=’campaign_code’)

categorical_distribution_plot(df_mdl_orig[‘product_code’], df_mdl_synth[‘product_code’], title=’product_code’)

SELECT * FROM credit.creditline clJOIN credit.customercreditline ccl on ccl.id_credit_line = cl.id_credit_lineJOIN credit.products p on p.id_product = ccl.id_product)

SELECT * FROM credit_synth.creditline clJOIN credit_synth.customercreditline ccl on ccl.id_credit_line = cl.id_credit_lineJOIN credit_synth.products p on p.id_product = ccl.id_product

categorical_distribution_plot(df_assmnt_orig[‘campaign_code’], df_assmnt_synth[‘campaign_code’], title=’campaign_code’)

categorical_distribution_plot(df_assmnt_orig[‘product_code’], df_assmnt_synth[‘product_code’], title=’product_code’)

df_vle_orig = dfs_orig[‘customerinfo’].merge(dfs_orig[‘customerdocuments’], on=’id_customer’, how=’left’)df_vle_orig.head(5)

Conclusions

‍

$ java -jar target/db-synthesizer.jar \
-u “jdbc:postgresql://39.176.361.124:1234” \
-d credit \
-a synthesize \
-U system \
-W password123 \
-l “jdbc:postgresql://39.176.361.1:1234” \
-s credit_synth

SELECT * FROM credit.customerinfo
WHERE id_customer = 11391

SELECT * FROM credit_synth.customerinfo
WHERE id_customer = 265

SELECT * FROM credit_synth.customerinfo
WHERE id_customer = 265

SELECT * FROM credit.customerinfo
WHERE id_customer = 11391

SELECT * FROM credit.products
WHERE id_product = 15

SELECT * FROM credit_synth.customerinfo
WHERE id_customer = 265

SELECT * FROM credit_synth.products
WHERE id_product = 14

SELECT * FROM credit.customerinfo ci
JOIN credit.products p ON p.id_product = ci.id_product

SELECT * FROM credit_synth.customerinfo ci
JOIN credit_synth.products p ON p.id_product = ci.id_product

SELECT * FROM credit.creditline cl
JOIN credit.customercreditline ccl on ccl.id_credit_line = cl.id_credit_line
JOIN credit.products p on p.id_product = ccl.id_product)

SELECT * FROM credit_synth.creditline cl
JOIN credit_synth.customercreditline ccl on ccl.id_credit_line = cl.id_credit_line
JOIN credit_synth.products p on p.id_product = ccl.id_product

df_vle_orig = dfs_orig[‘customerinfo’].merge(dfs_orig[‘customerdocuments’], on=’id_customer’, how=’left’)
df_vle_orig.head(5)