For many years, humanity has strived for more accurate depictions in painting. Thousands of years have passed from primitive rock art to masterpieces of the Renaissance and Enlightenment. However, around the 19th century, painting changed its direction. It shifted from portraiture and realism back to abstract representations of nature, people, and other everyday things. Furthermore, painting has moved towards blots, stains, smudges, and other non-representational forms smeared across the canvas or other surfaces, that are now worth millions of dollars.
Similar processes are currently taking place in the world of data. Synthesized, generated, and abstract data are gaining increasing value in the market. In this post, we will cover the main points of data generation from scratch and based on existing values. Since SQL has been the primary language for data for over 50 years, each of our steps will be supported by real examples using SQL-queries.
It is common to encounter situations where a data schema exists in our database, but there is currently no data in it. This can make it difficult to test our apps (including functional, integration, load, and acceptance testing) and to gain a general understanding of its purpose. Therefore, it is good practice to generate some data, whether just a little or a lot, before our application is put into production.
And of course, in the current climate, we have no choice but to start with ChatGPT. We will politely ask the bot to generate data for the Pagila (publicly available sample database schema):
As a result, we will obtain several valid SQL scripts in the correct order. However, we may need to request additional details, such as ensuring that all tables are included. On the one hand, this is useful and may already suffice for some cases. On the other hand, there are numerous nuances during data generation, such as specific data distribution and proximity to the subject area. Explaining all of these nuances to the bot may be challenging and labor-intensive. Plus, for sure we will need to generate a large amount of data, and in a very limited time and for private corporate schemas… So let’s roll up our sleeves and go through all the basic steps of generating data from scratch using good old SQL.
As we know, SQL was designed for working with real data stored in tables. However, the SQL:1999 standard introduced recursive queries, which allow, among other things, to generate an arbitrary number of rows without referring to any particular table. At the same time, different DBMSs may have their own (often more convenient) constructs for generating rows:
So, we already have rows, and now we need data for these rows. To the best of my knowledge, SQL standard does not provide a way to generate random values. However, most DBMSs have their own methods for doing so. With a little tinkering, we can obtain some random data that is remotely similar to real names, emails, dates, and so on. Let’s create 1000 employees for the employee table without leaving your warm SQL-console:
Thus, we can already generate tons of messy data, which in many cases will be quite enough. But we will not rest on our laurels and will try to generate something closer to real data. Let’s start with one of the most popular task: generating real people’s names. One simple and effective solution is to prepare two sets with common first and last names, respectively, and join them using a Cartesian join. I was impressed by this GitHub gist:
We were able to generate 1392 unique full names by joining 48 first names and 29 last names. This is a good start, and the same technique can be applied to generate other types of data, such as emails, addresses and so on, by wrapping it in a stored function or using a templating engine like Jinja for ease of use. We can also use numerous third-party advanced random data generation services and try to load the generated data into our database (e.g., in CSV format like Fake Name Generator). Some of them will even be able to generate a ready-made SQL script for you (like the Generatedata service):
The problem of generating data is not new, and there are many popular general-purpose libraries available for generating high-quality fake primitives such as names, addresses, and companies for various programming languages (e.g. Java, Python, JS, Ruby, etc.). Fortunately, some databases allow you to work with these libraries and generate more realistic data using SQL queries. For example, PostgreSQL Faker allows you to generate more realistic data using SQL queries like this:
And that’s not all. Extensions such as faker_fdw provide a true relational way to generate data, using tables, joins, and other relational features:
Although our data is random, this does not mean that there are no requirements for it. For example, we may only need unique data for certain columns (e.g. id, isbn, code, etc.). There are several ways to achieve this. For instance, many DBMSs support the upsert concept based on the values of one or more columns (clauses like merge, on conflict, etc.):
We can also eliminate duplicate values during generation with the help of the distinct clause or analytical functions:
The disadvantage of the two previous solutions is that we may end up with fewer rows than specified. For greater accuracy, we can use the unique data generation tools built into the DBMS, such as sequences, generators, UUIDs, etc.:
In specific cases, it may be necessary to generate the same random data on every run. This is particularly useful for running tests. To achieve this, many DBMSs and libraries allow you to set the initial value (seed) of the random generator:
We may need to generate not only textual data, but also images, such as avatars. There are many special services with APIs for avatar generation, ranging from funny cartoons to real people photos (e.g. dicebear.com, api.multiavatar.com, randomusers). Let’s try to generate some people with random avatars using the captivating robohash.org service:
It’s time to do something useful! To conduct our experiments, we require a “guinea pig” database. Let’s use the wonderful and well-known Pagila sample database, which we already used with ChatGPT at the very beginning of the post. In fact, Pagila already has data, but it is quite small — a maximum of 16k in just a couple of tables. So we will generate a lot of data ourselves for the empty schema with the help of pure SQL:
To make our SQL generation scripts simpler and more readable, and the generated data more realistic, we will use the PostgreSQL Faker Postgres extension in our SQL queries. Fortunately, there is an easy way to obtain this extension by using a Docker image:
Also, don’t forget to connect to the database and register the extension:
And set up Pagila’s schema, using only the pagila-schema.sql script without any data:
So let’s start with the easiest stuff — reference tables. In our case, these are things like tables of countries, languages, and so on. In such a situation, we can combine the generate_series and faker.unique_country functions (or unique_language_name, or any other suitable function from the PostgreSQL Faker extension):
A more complex case is the city reference table because the city table depends on the country table. It would be great if each country had a different number of cities, as is usually the case in reality. To achieve this, we can perform a cross join between the country table and a sequence of 1000 rows, and then randomly filter out some of the data, 90% in our case:
As a result, we get around 2000 cities with different distributions by country:
Okay, now comes the fun part — generating data for the staff table. This table has two parent tables, store and address, and we need to generate their random combinations somehow. The first thing that comes to mind is using a cross join (as we did already with small reference tables). However, in this case, we may end up with a very slow query, because the Cartesian product of two large tables will generate a huge number of rows that still have to be sorted in random order (and only after that can we cut off the extra rows). Fortunately, the SQL:2003 standard introduces the tablesample clause. This allows us to read not the entire table, but only a part of it as a percentage. The Bernoulli sampling method ensures that all blocks of the table are scanned and only some random records are read, which leads to a quite uniform distribution of randomly selected rows. This way, we can subtract only a portion of the random rows from the store and address tables, say 1%, and then confidently perform a cross join between them:
We often need to store time-series data in our tables. TimescaleDB has published three impressive articles (one, two, three) devoted to generating such data using SQL. However, in this post, we will only focus on generating rental_date values based on the current row number for the rental table:
In some cases, we need to generate data from scratch, which is what we have done so far. But in other cases, we need to generate new data based on existing data while preserving current distributions and relationships. Let’s try to increase the number of cities in the city table by four times while maintaining the percentage of cities in each country. The idea is very simple: count the number of cities for each country, multiply it by 3, and generate this number of new cities for each country:
As you can see, there are many more cities, but the top 5 countries have remained the same:
This is a viable option for generating data based on existing data, but in some cases, we may need more advanced methods, such as interpolation, prediction/extrapolation, or even machine learning. Perhaps we will discuss this separately in one of the following posts.
The full set of scripts is available in this GitHub repository. Among other things, you can find a docker-compose.yaml file there, which will allow you to easily set up a test database and run generation scripts for it with just a few commands:
On the one hand, we have described many useful tools and generated a lot of synthetic data for our schema. But on the other hand, we will face many other problems and challenges:
Fortunately, there are many ready-made solutions that can solve the aforementioned problems in various ways (an incomplete list can be found and supplemented here). We will attempt to generate data for our schema using the Synthesized TDK tool. TDK is a YAML-based tool, and for a quick start, we only need to prepare a small configuration file in YAML format and specify the mode and expected number of rows:
This configuration is already sufficient to generate test data for both a small Pagila database and for real production schemas with hundreds of tables and relationships. TDK will determine and apply the necessary generation parameters independently depending on the mode, type of columns, and initial data. As a result, TDK will scan all the tables and their data in the input database and generate 100K rows for each table in the output database, taking into account the type of each column and the existing distribution.
However, we can change the default behavior of TDK by adding custom instructions to the configuration file. Suppose we want movies that are suitable for the whole family (MPAA-rating G — General Audiences) to be the majority. To achieve this, we will add additional configurations to our configuration file to generate the values of the film.rating column with a specific distribution:
By applying this configuration, we can achieve the specified distribution among 10,000 films:
We can also further configure the generation of addresses, names, strings, various sequences, and much more. You can read more about this in the documentation on transformation types. TDK also provide easy integration with CI/CD pipelines and Testcontainers. And for an even easier start, we have prepared a small demo — pagila-tdk-generation. It can be run with just a couple of commands using docker-compose:
As a result, two PostgreSQL instances will be created, with port 6000 forwarded for the input database and port 6001 for the output database. The Pagila schema will be installed on each of them, and the TDK will be launched to generate data using a more advanced configuration. Once the TDK completes its work, control will return to the command line, and we can connect to the output database to examine the synthesized data (using 6001 port and postgres as the username, password, and database name).
In this post, we have demonstrated how easy it is to generate test data for a database using SQL and its extensions. With these tools, we were able to create a set of SQL scripts that generate test data for the Pagila schema. You can use these scripts as a basis for creating your own scripts for your own databases. However, as your database schema grows and your data quality requirements become more complex, maintaining and developing these scripts can become difficult and expensive. Fortunately, there are many ready-made solutions available, one of which is Synthesized TDK, which we examined in this post.