While there has been a lot of interest in synthetic data recently, we have encountered different views on what type of data “synthetic data” actually refers to. In the context of anonymisation, “synthetic” is frequently used interchangeably with “modified”, that is, original data altered in some systematic way to make it more difficult to identify original data points. We distinguish between three approaches to synthetic data currently available on the market: anonymised data, artificial data, and synthetic data.
In the scientific community, simulated data sometimes refers to either of the latter two, that is, the output of a generative model of some sort. We call data artificial when it originates from a generative model with an explicit probabilistic structure (often hand-built), and synthetic when it originates from a generative model learned from data via unsupervised machine learning. Real-world data is often extremely high-dimensional, exhibits complex correlations, and follows non-standard distributions, all of which conflict with the limiting assumptions underlying many explicit probabilistic models. However, recent breakthroughs in machine learning make it possible to learn such complex distributions purely from data.
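To make the distinction concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the "original" data is itself randomly generated, the hand-picked parameters are arbitrary, and a fitted normal distribution is only a crude stand-in for a model learned by unsupervised machine learning. It actually demonstrates the limitation described above, since a normal fit cannot capture the skew of the original values.

```python
import random
from statistics import NormalDist, fmean

# Stand-in for original data (hypothetical): skewed, income-like values.
rng = random.Random(0)
original = [rng.lognormvariate(10.8, 0.4) for _ in range(1000)]

# "Artificial" data: the model structure AND its parameters are chosen by hand.
hand_built_model = NormalDist(mu=50_000, sigma=10_000)
artificial = hand_built_model.samples(1000, seed=1)

# "Synthetic" data: the model parameters are estimated from the original data.
# A single normal distribution is a deliberately simple placeholder here.
learned_model = NormalDist.from_samples(original)
synthetic = learned_model.samples(1000, seed=2)

print(f"original mean:   {fmean(original):,.0f}")
print(f"artificial mean: {fmean(artificial):,.0f}")
print(f"synthetic mean:  {fmean(synthetic):,.0f}")
```

In both cases the released records are fresh draws from a model rather than transformed copies of individual original records.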
As we have seen, anonymised data is obtained by a 1-to-1 mapping from original data which, by its nature, has fundamental weaknesses. At its heart lies the trade-off between preserving as much useful information as possible and removing as many revealing details as necessary. This means that anonymisation techniques inherently degrade data quality as the price of preventing de-anonymisation attacks. The other approaches do not face the same problem, since generative models by definition abstract from individual data points and instead learn aggregate tendencies.
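The trade-off can be illustrated with a small sketch. It assumes a single numeric column, additive Gaussian noise as the 1-to-1 mapping, and a naive nearest-neighbour attacker who knows the original values; the function names and the generated data are hypothetical and not taken from any particular tool.

```python
import random
import statistics

def anonymise(values, noise_scale, rng):
    """1-to-1 mapping: every original value yields exactly one perturbed value."""
    return [v + rng.gauss(0.0, noise_scale) for v in values]

def mean_abs_error(original, released):
    """Utility loss: average distance between a record and its released version."""
    return statistics.fmean(abs(o - r) for o, r in zip(original, released))

def reidentification_rate(original, released):
    """Privacy loss: share of released records whose nearest original value
    is the very record they were derived from."""
    hits = 0
    for i, r in enumerate(released):
        nearest = min(range(len(original)), key=lambda j: abs(original[j] - r))
        hits += nearest == i
    return hits / len(released)

rng = random.Random(0)
salaries = [rng.uniform(30_000, 120_000) for _ in range(200)]  # made-up data

for scale in (0.0, 100.0, 1_000.0, 10_000.0):
    released = anonymise(salaries, scale, rng)
    print(
        f"noise={scale:>8.0f}  "
        f"error={mean_abs_error(salaries, released):>8.0f}  "
        f"re-identified={reidentification_rate(salaries, released):.2f}"
    )
```

With no noise, every record is trivially re-identified and nothing is lost. As the noise grows, re-identification becomes harder only because the released values drift further from the originals.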
It is thus not surprising that there have been a number of data breaches in the past despite the use of anonymised data, such as the attack on the anonymised Netflix Prize dataset. Such breaches can have horrendous implications for a company’s reputation, as illustrated by the Anthem, Inc. and JPMorgan Chase cases. Researchers have identified various methods of de-anonymisation attacks on anonymised data. Some recent examples include:
What unites these incidents is that anonymisation procedures are used carelessly as a panacea for any privacy-related concerns about data sharing, without an understanding of their fundamental limitations. However, with the advent of new regulations such as GDPR and increasing public attention to privacy issues, the costs of such accidental slips have increased dramatically, putting additional pressure on data sharing activities within and between companies.
A number of open-source data anonymisation frameworks are available, most notably ARX. ARX offers various obfuscating data transformations and analysis methods, such as k-anonymity. Note that perfect anonymisation in itself is not a difficult task: it can be achieved easily by heavily scrambling the original data and adding noise to it. The real challenge is therefore how to make anonymised or simulated data useful. As we have argued above, the usefulness of anonymised data is directly related to how closely the 1-to-1 mapping can track the original data given certain privacy requirements. In the case of simulated data, however, even the definition of usefulness is far from obvious, which will be the focus of our next blog post.
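As a rough illustration of the k-anonymity idea behind the k-anonymisation mentioned above, the sketch below measures the smallest group of records that share the same quasi-identifier values, before and after a simple generalisation step. It does not use ARX or its API; the records, the column names, and the `generalise` helper are made up for this example.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

def generalise(row):
    """Hypothetical generalisation: age becomes a decade band,
    postcode keeps only its first two characters."""
    decade = row["age"] // 10 * 10
    return {
        **row,
        "age": f"{decade}-{decade + 9}",
        "postcode": row["postcode"][:2] + "**",
    }

records = [
    {"age": 34, "postcode": "1010", "diagnosis": "A"},
    {"age": 37, "postcode": "1020", "diagnosis": "B"},
    {"age": 52, "postcode": "8010", "diagnosis": "A"},
    {"age": 58, "postcode": "8020", "diagnosis": "C"},
]
quasi_identifiers = ["age", "postcode"]

print(k_anonymity(records, quasi_identifiers))                           # 1
print(k_anonymity([generalise(r) for r in records], quasi_identifiers))  # 2
```

Raising k requires coarser age bands or shorter postcode prefixes, which is exactly the loss of usefulness discussed above.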
Next research post: How do we Define Usefulness of Simulated Data?