SYNTHESIZED DATA QUALITY REPORT

Generating Data and Maintaining Referential Integrity with the DB Synthesizer

You can also download this report
Download as PDF
SUMMARY
The DB Synthesizer is Synthesized’s tool for synthesizing data from one database to another. This is a deep dive where we’ll look into the results of the synthesization including the quality of the generated data and how referential integrity is maintained.

Getting started

So let’s dive in. Let’s start with some questions around how the DB Synthesizer works:

What information is preserved in the DB Synthesizer?

  • Tables and column names. The schema of all tables and columns will be copied from the original source
  • Data types. All columns in the destination database will have the same data type as in the source database
  • DDL. The DDL for both databases (including constraints, procedures, views, sequences, etc.) will be present in the Synthesized database as they were in the source database
  • Referential Integrity. Primary and Foreign keys will be copied, ensuring referential integrity is preserved so that the user can query data with join statements and obtain similar results. A configuration file with additional relationships can be provided
  • Key Cardinality. Foreign key distributions are generated as close as possible to the source database to ensure similar cardinality
  • Column marginal distributions. Column marginal distributions are approximated and sampled from probability distributions to be similar to those in the original database

What information is the DB Synthesizer extracting from the original database? What information is put into the destination database?

All the structural data and data values will be almost exactly as they are in the original database. Foreign key distributions, which refer to points such as referential integrity and cardinality, will be approximated from the original database to serve data to their destination database.

This means that the destination database is going to have a similar to, but not exact distribution of referential integrity and cardinality as the original database.

The column probability density function, this density distribution is also going to be approximated. So the distributions are going to look quite similar to the original data, but they are not going to be exactly the same data. This results in generated data that is private.

Using the DB Synthesizer to generate new data

Now that you have an understanding of how the DB Synthesizer works, let’s look into an example.

The original database used in this report is from a credit system, and contains historical information about users, and what credit lines, products and campaigns they participated in, among some other information. This database has seven tables along with several primary and foreign keys.

Table sizes in this database go from 10 to 1M rows, and while for the purpose of this report a database with only 7 tables was chosen, the DB Synthesizer is scalable and is able to work with databases of virtually any size (the software has been tested and validated against production size databases with thousands of tables).

The DB Synthesizer can be run using a command line interface, which is going to run the java application this simple command:

$ java -jar target/db-synthesizer.jar \
   -u “jdbc:postgresql://39.176.361.124:1234” \
   -d credit \
   -a synthesize \
   -U system \
   -W password123 \
   -l “jdbc:postgresql://39.176.361.1:1234” \
   -s credit_synth

And what you can see here is the origin URL, the name of the database of the origin, and down here, the destination connection, URL and destination database name, and the other parameters that can be actioned.

Here we’ll perform an action such as copying or synthesizing the data. We’ll configure the DB Synthesizer here by adding information about the source database along with user and password information. You can configure these and other parameters.

More information can be found here: docs.synthesized.io

You can see all the other configuration elements that you can, that you can run. You can also run the DB Synthesizer from an API. So you could see it. You can find more information about how to run it here. There are other parameters that can be added to the DB Synthesizer, more detailed information can be found in the configuration section in the documentation.

All of these parameters can be added for extra and more personalized output. So all the information such as primary keys column types, et cetera, will be extracted from the database. But there are some cases where some information needs to be provided manually to the Synthesizer. And with this configuration file, we are able to add that information to the database.

Now that the DB Synthesizer has been run for this database, let’s look at the results. To do so, we’ll connect the original and destination databases and compare the results from both databases. Firstly, let’s confirm the tables in the synthetic database (destination) are the same as the original database.

So we have seven tables, which is the same number as in the original database.

Table Name
Original Shape
Synthesized Shape
products
(22, 4)
(22, 4)
customerregistration
(32593, 4)
(32593, 4)
creditline
(206, 5)
(206, 5)
documents
(6364, 3)
(6364, 3)
customercreditline
(173912, 5)
(173912, 5)
customerdocuments
(1000000, 5)
(1000000, 5)
customerinfo
(28785, 10)
(28785, 10)
select * from credit.customerinfo limit 5;
0
id_
customer
gender
region
highest_
education
imd_band
age_band
num-of-prev_
credit_lines
opened_
lines
opened_
lines
id_
product
0
11391
M
East Anglian Region
HE Qualification
90-100%
55<==
0
2
N
0
1
28400
F
Scotland
HE Qualification
20-30%
35-55
0
0
N
0
2
30268
F
Noth Western Region
A Level or Equivalent
30-40%
35-55
0
0
Y
0
3
31604
F
South East Region
A Level or Equivalent
50-60%
35-55
0
0
N
0
4
32885
F
West Midlands Region
Lower Than A Level
50-60%
0-55
0
0
N
0
select * from credit_synth.customerinfo limit 5;
0
id_
customer
gender
region
highest_
education
imd_band
age_band
num-of-prev_
credit_lines
opened_
lines
opened_
lines
id_
product
0
0
F
South Region
A Level or Equivalent
0-10%
0-35
0
0
N
14
1
1
M
East Midlands Region
HE Qualification
0-10%
35-55
0
1
N
7
2
2
M
South Region
Lower Than A Level
80-90%
0-35
0
0
N
7
3
3
F
North Western Region
Lower Than A Level
80-90%
0-35
0
0
N
8
4
4
M
West Midlands Region
A Level or Equivalent
10-20%
0-35
0
1
Y
4

The next thing we’ll look at is referential integrity. Based on the first five rows of this new table, we’re able to see that the data looks very similar to the original database. And then if we look at the shapes of the data in the tables, you’ll see that the shapes and the number of rows and columns of the original and Synthesized tables are the same.

SELECT * FROM credit.customerinfo
WHERE id_customer = 11391
0
id_credit_line
id_customer
date_start
is_default
min_credit_score
0
1752
11391
18
N
78.0
501
1753
11391
53
N
85.0
842
1754
11391
115
N
80.0
1032
1755
11391
164
N
85.0
1457
1756
11391
212
N
82.0

Now, if we do the same exercise and pick one random customer in the synthetic data, for example customer ID 265, this is how the row looks.

SELECT * FROM credit_synth.customerinfo
WHERE id_customer = 265
000
id_
customer
gender
region
highest_
education
imd_band
age_band
num-of-prev_
credit_lines
opened_
lines
dis-ability
id_
product
265
265
F
South Region
A Level or Equivalent
20-30%
0-35
0
0
N
17

And then we can go to the customer credit line for the same user in the synthetic database and we’ll see that this one has five rows as well.

SELECT * FROM credit_synth.customerinfo
WHERE id_customer = 265
0
id_credit_line
id_customer
date_start
is_default
min_credit_score
68624
96
265
106
N
89.371582
501
130
265
42
N
62.724920
842
168
265
5
N
62.297032
1032
16
265
70
N
64.593113

This is a quick example of how referential integrity is preserved. Furthermore, if we join these tables on the customer ID parameter in both the original and synthetic databases, we’ll see that there are some distributions that are preserved even after joining both tables.

For example, if we look at the default rate, we can see that it’s the same for the original database. This also applies to other fields like the minimum credit score. It’s been approximated quite nicely.

0
Original counts
Synthesizer counts
N
0.92989
0.92985
Y
0.07011
0.07015

Let’s do the same exercise for the other tables. If we look at customer information, we’ll look at this customer to see how it looks and this customer only has one ID product 15.

Which means it only exists once.

SELECT * FROM credit.customerinfo
WHERE id_customer = 11391
00000
id_
customer
gender
region
highest_
education
imd_band
age_band
num-of-prev_
credit_lines
opened_
lines
dis-ability
id_
product
19542
35860
M
South West Region
A Level or Equivalent
10-20
0-35
0
0
N
15

• Original products for id_customer = 11391, that has id_product = 15

SELECT * FROM credit.products
WHERE id_product = 15
0
product_code
campaign_code
campaign_length
id_product
17
FFF
2013B
240
15

While we thought we were looking at a one to many relationship, in this case, we are actually looking at a one-to-one relationship. And you can see that in the synthetic database, it’s also a one-to-one relationship. There’s only one product with ID 14, so the database was able to keep that structural relationship.

• Synthesized customer info for id_customer = 265

SELECT * FROM credit_synth.customerinfo
WHERE id_customer = 265
0000 00
id_
customer
gender
region
highest_
education
imd_band
age_band
num-of-prev_
credit_lines
opened_
lines
disability
id_
product
1739
1739
M
South West Region
A Level or Equivalent
10-20
0-35
0
0
N
14

• Synthesized customercreditline for id_customer = 265

SELECT * FROM credit_synth.products
WHERE id_product = 14
0
product_code
campaign_code
campaign_length
id_product
14
BBB
2013J
250
14

And if we join these tables again, you can see in the first five rows there are no duplicates.

SELECT * FROM credit.customerinfo ci
JOIN credit.products p ON p.id_product = ci.id_product
0
id_
customer
gender
region
highest_
education
imd_band
age_band
num-of-prev_
credit_lines
opened_
lines
disability
id_
product
product_code
campaign_
code
campaign_
length
0
11391
M
East Anglian Region
HE
Qualification
90-100%
55<==
0
2
N
0
AAA
2013J
268
1
28400
F
Scotland
HE
Qualification
20-30%
35-55
0
0
N
0
AAA
2013J
268
2
30268
F
North Western Region
A Level or Equivalent
30-40%
35-35
0
0
Y
0
AAA
2013J
268
3
31604
F
South East Region
A Level or Equivalent
50-60%
35-55
0
0
N
0
AAA
2013J
268
4
32885
F
West Midlands Region
Lower Than A Level
50-60%
0-35
0
0
N
0
AAA
2013J
268

And again, here, there are no duplicates. So the referential integrity of this structure was kept.

SELECT * FROM credit_synth.customerinfo ci
JOIN credit_synth.products p ON p.id_product = ci.id_product
0
id_
customer
gender
region
highest_
education
imd_band
age_band
num-of-prev_
credit_lines
opened_
lines
disability
id_
product
product_code
campaign_
code
campaign_
length
0
0
M
South Region
A Level or Equivalent
0-10%
0-35
0
0
N
14
EEE
2014J
269
1
54
F
West Midlands Region
Lower Than A Level
20-30%
0-35
0
1
N
14
EEE
2014J
269
2
64
M
South East Region
Lower Than A Level
30-40%
0-35
0
0
N
14
EEE
2014J
269
3
74
F
Wales
A Level or Equivalent
80-90%
0-35
0
0
N
14
EEE
2014J
269
4
76
M
West Midlands Region
Lower Than A Level
60-70%
0-35
0
0
N
14
EEE
2014J
269

Let’s continue and plot the heat map for these two columns, the age band and campaign code in the original and synthesized tables. After joining these tables, it’s important to highlight that the age band is in one of the tables while the campaign code is in the other table. So age band is in the customer information table, while the campaign code is in the products table. You can see that even after joining both tables, the heat maps are quite similar.

We can plot other things like the campaign length against age bands, and the distributions are very similar on the campaign length.

The distributions are also quite similar when looking at campaign codes and product codes. So you can see how the distributions are kept.

categorical_distribution_plot(df_mdl_orig[‘campaign_code’], df_mdl_synth[‘campaign_code’], title=’campaign_code’)
categorical_distribution_plot(df_mdl_orig[‘product_code’], df_mdl_synth[‘product_code’], title=’product_code’)

Let’s go a step further and join the three tables. So we’ll join credit line with customer credit and customer credit line via this ID credit line. And then we also join credit products on ID product, and we then apply the same query to the synthetic database and plot the distribution.

SELECT * FROM credit.creditline cl
JOIN credit.customercreditline ccl on ccl.id_credit_line = cl.id_credit_line
JOIN credit.products p on p.id_product = ccl.id_product)
SELECT * FROM credit_synth.creditline cl
JOIN credit_synth.customercreditline ccl on ccl.id_credit_line = cl.id_credit_line
JOIN credit_synth.products p on p.id_product = ccl.id_product
categorical_distribution_plot(df_assmnt_orig[‘campaign_code’], df_assmnt_synth[‘campaign_code’], title=’campaign_code’)
categorical_distribution_plot(df_assmnt_orig[‘product_code’], df_assmnt_synth[‘product_code’], title=’product_code’)

You can see that even after joining three tables, the distributions are kept quite nicely. So you can see it for the campaign code and on the product code. And if we look at customers per product, we’ll see the same results.

df_vle_orig = dfs_orig[‘customerinfo’].merge(dfs_orig[‘customerdocuments’], on=’id_customer’, how=’left’)
df_vle_orig.head(5)
0
id_
customer
gender
region
highest_
education
imd_band
age_band
num-of-prev_
credit_lines
opened_
lines
disability
id_
product_x
id_document
date
sum_click
id_
product_y
0
11391
M
East Anglian Region
HE Qualification
90-100%
55<==
0
2
N
0
546669.0
-5.0
16.0
0.0
1
11391
M
East Anglian Region
HE Qualification
90-100%
55<==
0
2
N
0
546662.0
-5.0
44.0
0.0
2
11391
M
East Anglian Region
HE Qualification
90-100%
55<==
0
2
N
0
546652.0
-5.0
1.0
0.0
3
11391
M
East Anglian Region
HE Qualification
90-100%
55<==
0
2
N
0
546668.0
-5.0
2.0
0.0
4
11391
M
East Anglian Region
HE Qualification
90-100%
55<==
0
2
N
0
546652.0
-5.0
1.0
0.0

Now let’s take a look at the foreign keys. We want to make sure that we preserve the value counts when we’re talking about foreign keys. For example, in the case where we have the customer and customer transactions we would want to make sure that the number of transactions per customer is similar to ensure referential integrity is preserved.

This is known as cardinality which is plotted above. For example, here, all values are unique in the both original and synthetic databases. You can see that this distribution is quite accurate. Most of them only have one occurrence when ID customer number is one. There are 8,000 that only have one occurrence. So the majority are biased for one occurrence, but most of them have multiple occurrences. And you can see a similar situation where referential integrity is approximated quite nicely.

Last but not least, we also need to look at column distribution. If we look at one column, particularly will the distribution be preserved. And here you can see the submissions of the single column for the table customer information.

You can see that continuous and categorical values, are all preserved quite nicely. Obviously the primary key is not preserved because we generated new primary keys to avoid privacy leakage. So the primary keys are generated independently from the, from the original values. And you can see it.

The rest of the values are fairly similar. And similarly, if we look at the customer credit line table again and specifically the ID customer line, and ID customer are not approximated nicely because we generated new values for this, because otherwise these can be identifiers and the database could be attacked with these values, but for the rest of our values you can observe the distributions.

Conclusions

In this report we demonstrated how the DB Synthesizer is able to explore a given database, extract and learn all the information contained within it, and generate an entirely new database whilst maintaining the referential integrity necessary to ensure the proper linkage between multiple tables within the database.

Please feel free to reach out with any questions or if you’d like to learn more to: letschat@syntheszed.io