FAQ - Synthesized

Infrastructure, deployment, and integrations

Do you support on-premise and cloud data sources? What data sources are you able to connect to?

Synthesized is able to connect to both on-prem and cloud data sources.

On-premise available data sources:

Immediately: Oracle, Postgres, DB2, Sybase, SQL Server, MySQL
Available upon request: SharePoint / File Shares / SMB, Drill, Druid, Hive, Solr, CockroachDB, CrateDB, Exasol, Elasticsearch, Firebird, BigQuery, Google Sheets, Informix, Netezza, MonetDB, ASE, Hana, Snowflake, Teradata

Cloud-based available data sources:

Immediately: Azure Table Storage, Azure Database for PostgreSQL, Azure CosmosDB, Azure SQL Database
Available upon request: Azure Blob Storage, Azure Files, Data Warehouse, O365 SharePoint / OneDrive

Can I install the software on-premise? And on our private cloud?

Yes. Synthesized supports on-premise installation and can be easily deployed into a private cloud on MS Azure, AWS, or GCP.

How do your users interact with the software? Can I plug it into our data pipeline and integrate it with our current CI/CD process?

You can interact with the Synthesized platform through the Web UI, the API, or the SDK. The Web UI offers an intuitive, easy-to-use interface and supports team collaboration. The API lets the platform integrate with any external service, and the SDK makes it easy to integrate with CI/CD processes.

Does the product have a UI with access control and audit capabilities?

The solution supports user accounts and roles, and it can be integrated with an external single sign-on (SSO) service. It also provides full audit capabilities: all actions are written into a service table and can be queried using a REST API.

Synthesized data engine

What is synthetic data?

Synthetic data is data generated by a machine learning model that looks at the original data, learns and understands it, and is able to generate more data that looks, feels, tastes and even smells like the original data. The new data has the same high-level properties as the original, but at the row level it’s completely new and artificial.
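As a toy illustration of "learn the data, then generate new data", the sketch below fits a normal distribution to one numeric column and samples fresh values from it. It is not the Synthesized engine: the sample ages, the function names `fit_gaussian` and `sample_synthetic`, and the choice of a normal distribution are all assumptions made for this example.

```python
import random
import statistics

def fit_gaussian(values):
    """Learn two high-level properties of the column: mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def sample_synthetic(params, n, seed=0):
    """Draw n new values from the fitted distribution (hypothetical helper)."""
    mean, stdev = params
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(n)]

# Made-up "original" column, used only for illustration.
original_ages = [23, 35, 41, 29, 52, 38, 44, 31, 27, 49]
params = fit_gaussian(original_ages)
synthetic_ages = sample_synthetic(params, n=1000)

print("original mean: ", round(statistics.mean(original_ages), 1))
print("synthetic mean:", round(statistics.mean(synthetic_ages), 1))
```

The sampled values share the column's overall mean and spread, while no individual value is copied from an original row.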

What can I use synthetic data for? What are the benefits?

Synthetic data can be substituted for the original data, and the same results will be achieved. But it has many other benefits, to name a few:

As it’s synthetic, it is privacy-compliant and can be used and shared freely without going through long and complex compliance processes.
It can be manipulated and reshaped for different purposes, for example data rebalancing, bias mitigation, data imputation, or model validation with simulated data scenarios.
It can be used for machine learning and to optimize data coverage for testing.
Data generation is generally very fast, so one can generate large amounts of data for data augmentation and performance testing.
As it’s based on a model, storing the model parameters is much more lightweight than storing the actual data. Then, the data can be generated on-demand when needed.

What data types are handled by the Synthesized platform?

Synthesized can work with all structured data. This usually, but not exclusively, refers to tabular data, including flat files (such as Excel spreadsheets and CSVs) and relational databases. All usual data types (integers, floats, characters, UUIDs, JSON) are handled by the platform.

How can I compare my original data to my synthetic data?

For each Synthesized data product generated, Synthesized can automatically generate both a data utility report and a data privacy report, which can also be stored, versioned, and documented.

Furthermore, different privacy reports and monitoring features are available on demand, with an alert system that notifies users when a specific scenario occurs, for example a privacy-related one.
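For a quick manual check of a single numeric column, one common approach is the two-sample Kolmogorov-Smirnov statistic, which measures the largest gap between the two empirical distributions (0 means identical, 1 means completely separated). The sketch below is a generic illustration and says nothing about what the Synthesized reports contain; the function name `ks_statistic` and the sample values are assumptions.

```python
from bisect import bisect_right

def ks_statistic(original, synthetic):
    """Largest gap between the two empirical cumulative distribution functions."""
    a, b = sorted(original), sorted(synthetic)
    gaps = (
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )
    return max(gaps)

# Made-up column values, used only for illustration.
original = [21, 25, 30, 34, 40, 47, 52, 58]
synthetic = [22, 26, 29, 35, 41, 45, 53, 60]
print("KS statistic:", round(ks_statistic(original, synthetic), 3))
```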

How complex is the platform to use? Do I need extensive knowledge and training to be able to use it?

Running the engine against a given data source with the default configuration is straightforward: the user just needs to provide connection details for the input and output sources, and that’s all!

If needed, the user can still provide some extra configuration parameters to enforce certain behaviour (e.g. strict rules or implicit referential integrity).

Privacy and de-identification

Why is synthetic data more secure and privacy-preserving than traditional anonymization techniques?

With traditional anonymization techniques, each sample in the output data has a one-to-one mapping with the original set, which means that these techniques are not robust against complex attacks, such as linkage attacks or attribute inference.

The Synthesized approach, on the other hand, is to learn the data distribution of the underlying phenomena and sample new data points from it. This means that there’s no one-to-one mapping with the original data, which makes it robust against complex disclosure attacks. Read more about the topic here.

Does the solution have an option to guarantee that no synthetic data points coincide with any real data point?

Generally speaking, the chances of generating a data point that is present in the original set are insignificant. But if that happens, the Synthesizer can be set to remove those points from the output set.
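Conceptually, removing coinciding points is a set-membership filter. The sketch below shows the idea on rows represented as tuples; it is an illustration only, and the function name `drop_exact_matches` and the sample rows are assumptions rather than the Synthesizer's actual option.

```python
def drop_exact_matches(original_rows, synthetic_rows):
    """Keep only synthetic rows that do not appear verbatim in the original set."""
    original_set = set(original_rows)
    return [row for row in synthetic_rows if row not in original_set]

# Made-up rows, used only for illustration.
original_rows = [("Alice", 34, "London"), ("Bob", 45, "Leeds")]
synthetic_rows = [("Carol", 37, "London"), ("Bob", 45, "Leeds"), ("Dan", 41, "York")]

filtered = drop_exact_matches(original_rows, synthetic_rows)
print(filtered)  # the ("Bob", 45, "Leeds") row is removed
```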

Does your solution have a means to generate synthetic data with or without differential privacy depending on preference?

Yes! The Synthesized solution allows you to define and configure:

Differential privacy parameters (such as “epsilon” and “noise_multiplier”), which can be set depending on the user’s needs. More information here.
Linkage attack parameters (“t-closeness” and “k-distance”), with linkage attack mitigation applied to the data (on-demand feature).
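To give a feel for what “epsilon” controls, the sketch below uses the textbook Laplace mechanism on a simple counting query. This is a swapped-in, generic technique chosen for brevity: it does not describe how the Synthesized engine applies differential privacy and does not involve the noise multiplier. The function names and the sample data are assumptions.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(values, predicate, epsilon, rng):
    """Counting query (sensitivity 1) released with Laplace noise of scale 1/epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)

rng = random.Random(42)
ages = [23, 35, 41, 29, 52, 38, 44, 31, 27, 49]  # made-up data

for epsilon in (0.1, 1.0, 10.0):
    noisy = private_count(ages, lambda a: a >= 40, epsilon, rng)
    print(f"epsilon={epsilon}: noisy count = {noisy:.2f}")
```

A smaller epsilon adds more noise and gives a stronger privacy guarantee; a larger epsilon keeps the answer closer to the true count of 4.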

How does Synthesized handle columns that contain PII, such as names and addresses? Is it possible to generate “fake” but realistic values in such cases?

Synthesized supports two ways of attribute-level anonymization:

Data obfuscation: if required, the user can obfuscate any data field with traditional anonymization techniques such as masking, nulling, and generalization, among others.
Fake data: Synthesized also supports the substitution of production values with realistic “fake” data, while maintaining continuity across attributes in a row. Fake data keeps the underlying PII hidden while still providing realistic values for entities such as names, addresses, and bank accounts.
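The idea of continuity across attributes can be sketched as follows: a real name is mapped to a fake name, and the fake email address is derived from that same fake name. This is a minimal illustration, not the Synthesized implementation; the name lists, the `fake_identity` function, the keyed-hash approach, and the `example.com` domain are all assumptions.

```python
import hashlib

# Tiny hypothetical vocabularies; a real generator would use far larger ones.
FAKE_FIRST = ["Alex", "Jamie", "Morgan", "Riley", "Taylor"]
FAKE_LAST = ["Ashford", "Bennett", "Carver", "Dalton", "Ellis"]

def fake_identity(real_name, secret="demo-secret"):
    """Map a real name to a stable fake name and an email built from that fake name."""
    digest = hashlib.sha256((secret + real_name).encode("utf-8")).digest()
    first = FAKE_FIRST[digest[0] % len(FAKE_FIRST)]
    last = FAKE_LAST[digest[1] % len(FAKE_LAST)]
    email = f"{first}.{last}@example.com".lower()
    return {"name": f"{first} {last}", "email": email}

record = fake_identity("John Smith")
print(record["name"], "|", record["email"])
```

Because the mapping is deterministic for a given secret, the same real name always produces the same fake identity. With vocabularies this small, different real names can collide on the same fake name.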

Questions?

Which data sources is Synthesized capable of connecting to for the purpose of creating test/training data?

Available immediately: Oracle, Postgres, DB2, Sybase, SQL Server, MySQL. Available upon request: SharePoint / File Shares / SMB, Drill, Druid, Hive, Solr, CockroachDB, CrateDB, Exasol, Elasticsearch, Firebird, BigQuery, Google Sheets, Informix, Netezza, MonetDB, ASE, Hana, Snowflake, Teradata. Development of new data connectors is straightforward and we would be happy to work with you on any specific requirements.

For structured data, is Synthesized capable of reading the data schema so that the physical structure of the data can be understood?

Synthesized is capable of reading the data schema. Synthesized understands the underlying data model so that different protection techniques can be applied at an attribute level. Users can also provide additional information, either by annotating the data to indicate that a given field has a particular format (e.g. addresses, names) or by providing rules about the structure of the data itself (e.g. column A > column B).
Available immediately: Database Schemas/DDL, JSON Schemas.
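A structural rule such as "column A > column B" can be thought of as a predicate over a row. The sketch below checks generated rows against a list of such predicates and keeps only the rows that satisfy all of them. How Synthesized actually expresses and enforces rules is not described here; the `satisfies_rules` helper, the rule representation, and the sample rows are assumptions.

```python
def satisfies_rules(row, rules):
    """Return True when the row passes every rule predicate."""
    return all(rule(row) for rule in rules)

# One rule, mirroring the "column A > column B" example.
rules = [lambda row: row["A"] > row["B"]]

# Made-up generated rows, used only for illustration.
rows = [{"A": 5, "B": 3}, {"A": 2, "B": 7}, {"A": 9, "B": 1}]

valid_rows = [row for row in rows if satisfies_rules(row, rules)]
print(valid_rows)  # the row with A=2, B=7 is rejected
```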

Can Synthesized acquire and protect data from multiple data sources, i.e. multiple tables and databases? When creating cleansed versions of these tables / databases, will referential integrity be preserved?

Yes, Synthesized is able to acquire and protect data from multiple data sources including multiple tables and databases and can be integrated into ETL pipelines. In creating test/training data, referential integrity is preserved. This is also true if sensitive data is used for references/foreign keys.

Does Synthesized support attribute-level data anonymization, where information relating to a data subject (e.g. a client's name) is removed, thereby eliminating the possibility of identifying the data subject?

Yes, Synthesized supports two ways of attribute-level anonymization.

Data obfuscation.
- Partial masking. Values can be partially (or totally) substituted by a placeholder character, "x" by default. For example, the value "4905 9328 9320 4630" would be replaced by "xxxx xxxx xxxx 4630".
- Nulling. The contents of a column can be completely removed, and the output dataset would contain an empty column.
- Swapping. The output column contains the same unique values as the input one, but they are randomly shuffled so that correlations with other columns are completely lost.
- Random strings. Random strings are generated with a format similar to the input values; for example, "490GH830L" could be transformed into "L3N8O3H2M".
- Generalization. Individual attribute values are replaced with a broader category. For example, the value '19' of the attribute 'Age' may be replaced by '≤ 20', the value '23' by '20 < Age ≤ 30', etc.
Fake data. Synthesized also supports the substitution of production values with realistic "fake" data, and continuity can be maintained across attributes in a row using the Synthesized annotation feature (e.g. a real name is replaced by a "fake" name, and this same "fake" name can be used to create a "fake" email address).
Generated "fake" data is coherent across columns.
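Several of the obfuscation techniques listed above are simple enough to sketch in a few lines. The functions below are generic illustrations, not the Synthesized API: the names, the default parameters (keeping the last four characters, ten-year age bands), and the use of ASCII "<=" in place of the less-than-or-equal sign are assumptions.

```python
import random

def partial_mask(value, keep_last=4, placeholder="x"):
    """Replace every non-space character except the last keep_last with the placeholder."""
    cutoff = len(value) - keep_last
    return "".join(
        ch if i >= cutoff or ch.isspace() else placeholder
        for i, ch in enumerate(value)
    )

def null_column(values):
    """Nulling: the output column is empty."""
    return [None for _ in values]

def swap_column(values, seed=0):
    """Swapping: same values, randomly shuffled, so links to other columns are lost."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def generalize_age(age, width=10, first_upper=20):
    """Generalization: replace an exact age with a broader band."""
    if age <= first_upper:
        return f"<= {first_upper}"
    lower = first_upper + ((age - first_upper - 1) // width) * width
    return f"{lower} < Age <= {lower + width}"

print(partial_mask("4905 9328 9320 4630"))  # xxxx xxxx xxxx 4630
print(generalize_age(19))                   # <= 20
print(generalize_age(23))                   # 20 < Age <= 30
```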

Does Synthesized ensure referential integrity when it replaces sensitive values with "fake" equivalents?

Yes, Synthesized will ensure referential integrity is preserved when generating new data. In addition, Synthesized handles referential integrity across multiple data sources. Circular references may require additional configuration but we have handled these successfully with several customers.
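The core idea behind preserving referential integrity is that a replaced key must be replaced the same way everywhere it appears. The sketch below builds one mapping from original keys to surrogate keys and applies it to both a parent and a child table. It is a simplified illustration, not the Synthesized implementation; the table contents, the `build_key_map` helper, and the surrogate key format are assumptions, and only the key column is transformed here.

```python
import itertools

def build_key_map(keys):
    """Assign each distinct original key a new surrogate key, in first-seen order."""
    counter = itertools.count(1)
    return {key: f"C{next(counter):04d}" for key in dict.fromkeys(keys)}

# Made-up parent and child tables sharing a sensitive key.
customers = [
    {"customer_id": "UK-778", "segment": "retail"},
    {"customer_id": "UK-912", "segment": "business"},
]
orders = [
    {"order_id": 1, "customer_id": "UK-912"},
    {"order_id": 2, "customer_id": "UK-778"},
    {"order_id": 3, "customer_id": "UK-912"},
]

key_map = build_key_map(c["customer_id"] for c in customers)
new_customers = [{**c, "customer_id": key_map[c["customer_id"]]} for c in customers]
new_orders = [{**o, "customer_id": key_map[o["customer_id"]]} for o in orders]

# Every foreign key in the new orders table still points at an existing customer.
customer_ids = {c["customer_id"] for c in new_customers}
print(all(o["customer_id"] in customer_ids for o in new_orders))
```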

How is data security defined for derived data compared to original data points?

Synthesized is highly secure and we continuously work to ensure we adopt the latest security protocols and techniques. Security at the platform level applies universally to original and Synthesized data points and includes:

Technology
- We support up-to-date security technologies such as JWT-based tokens, SSL, and bcrypt hashing to keep data secure
Access Management
- Access to both sensitive original data and Synthesized data is controlled through role-based administration, including functional privileges, such as who can modify or edit data sets, and sharing privileges, such as who can see the original data attributes or the Synthesized ones
- Single sign-on integration with SAML 2.0, OpenID, and Active Directory

The derived data can then be exported to enterprise databases for consumption and usage, at which point it contains neither sensitive data nor links to the databases or connectors holding the original data.

When using another tenant’s data, is access control/approval managed by the original data owner or mirrored?

Access control and approval can be managed either by the original Data Product Owner or in a mirrored fashion, where the owner sets up a mirrored environment for any user.

Language support: what are the main languages supported by the platform? e.g. Python, Spark, Hive SQL, etc.

There are three ways to interact with the Synthesized engine:

The core SDK is a Python package, which can easily be integrated into any Python pipeline
The core TDK is a Java package
The engines can run in a Docker container so the user can communicate via API, supporting any language with network access. ETL integration is also straightforward in this case
Additionally, we offer a web interface for more interactive and user-friendly communication with the engines.

What are the logical data models vs. the physical models, e.g. tables, files, etc.?

The lineage plot includes the following types of nodes:

Physical models:

Host
Database
Schema
Dataset
Transformation

Logical models:

Data Model
Data Entity

The plot is a nested structure of physical and logical models connected with edges representing flows of data. We are happy to provide any additional detail.

Describe expected performance estimates for the following scenarios, assuming a reasonable component deployment:

Small & Simple: single table, independent attributes, 30 columns, 100k records
- Analysis: < 10 seconds
- Synthesis: < 10 seconds
Small & Connected: single table, several attributes that are dependent and must be processed consistently, 30 columns, 100k records
- Analysis: < 5 minutes
- Synthesis: < 1 minute
Medium & Simple: 10 tables with no circular references, references only by primary key (normalized), 20 columns per table, 50k records per table
- Analysis: < 20 minutes
- Synthesis: < 1 minute
Medium & Complex: 30 tables with circular references, references only by primary key (normalized), 20 columns per table, 100k records per table
- Supported by a separate module
- Analysis: < 1 hr
- Synthesis: < 10 minutes
Large & Simple: single table, 40 columns, 1TB of data
- Analysis: < 3 hr
- Synthesis: < 1 hr
Large & Complex: 40 tables with circular references, multiple dependencies between tables (data copied between tables), 30 columns per table, 1 million records
- Supported by a separate module only
- Analysis: < 3 hr
- Synthesis: < 1 hr
Very Large & Complex: full data warehouse, hundreds of tables, lots of interdependent data, 10TB of data
- Supported by a separate module only
- Analysis: < 24 hr
- Synthesis: < 4 hr

Questions?

If you would like a demo about our platform capabilities or would like to try it for free, please get in touch.
