Automated preservation of key properties within one database
Preserving properties between databases
Re-usable policies for data transformations
Enterprise data masking and obfuscation
Enterprise privacy-preserving data generation
Support & services
Bespoke end user training
Manuals, guides and refreshed materials
SLA options: Monday-Friday 9-5 or 24/7
Which data sources is Synthesized capable of connecting to for the purpose of creating test/training data?
Available immediately: Oracle, Postgres, DB2, Sybase, SQL Server, MySQL. Sources enabled upon request: SharePoint / File Shares / SMB, Drill, Druid, Hive, Solr, CockroachDB, CrateDB, Exasol, Elasticsearch, Firebird, BigQuery, Google Sheets, Informix, Netezza, MonetDB, ASE, Hana, Snowflake, Teradata. Development of new data connectors is straightforward and we would be happy to work with you on any specific requirements.
For structured data, is Synthesized capable of reading the data schema for the physical structure of the data to be understood?
Synthesized is capable of reading the data schema. Synthesized understands the underlying data model so that different protection techniques can be applied at an attribute level. Users can also provide additional information: either annotating the data with information about whether a given field is of a particular format (e.g: addresses, names) or providing rules about the structure of the data itself (e.g: column A > column B). Available immediately: Database Schemas/DDL, JSON Schemas.
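As an illustrative sketch only (this is not the Synthesized API), a user-supplied structural rule such as "column A > column B" can be thought of as a row-level predicate that is validated against the data. The function and field names below are hypothetical:

```python
# Hypothetical illustration: validating a user-supplied structural rule
# (e.g. "end_date > start_date") against rows of tabular data.

from datetime import date

rows = [
    {"start_date": date(2021, 1, 1), "end_date": date(2021, 6, 1)},
    {"start_date": date(2021, 3, 1), "end_date": date(2021, 2, 1)},  # violates the rule
]

def check_rule(rows, rule):
    """Return the indices of rows that violate a row-level rule."""
    return [i for i, row in enumerate(rows) if not rule(row)]

# The rule "end_date > start_date" flags the second row (index 1).
violations = check_rule(rows, lambda r: r["end_date"] > r["start_date"])
print(violations)  # [1]
```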
Can Synthesized acquire and protect data from multiple data sources, i.e. multiple tables and databases? When creating cleansed versions of these tables / databases, will referential integrity be preserved?
Yes, Synthesized is able to acquire and protect data from multiple data sources including multiple tables and databases and can be integrated into ETL pipelines. In creating test/training data, referential integrity is preserved. This is also true if sensitive data is used for references/foreign keys.
Does Synthesized support attribute-level data anonymization - where information relating to a data subject (e.g. a clients name) is removed, thereby eliminating the possibility of identifying the data subject?
Yes, Synthesized supports several techniques for attribute-level anonymisation.
Partial masking. Values can be partially (or entirely) substituted with a placeholder character, "x" by default. For example, the value "4905 9328 9320 4630" would be replaced by "xxxx xxxx xxxx 4630".
Nulling. The contents of a column can be completely removed, and the output dataset would contain an empty column.
Swapping. The output column contains the same unique values as the input one, but they are randomly shuffled so that correlations with other columns are completely lost.
Random strings. Generate random strings with similar format to input values, for example "490GH830L" could be transformed into "L3N8O3H2M".
Generalization. Individual values of attributes are replaced with a broader category. For example, the value '19' of the attribute 'Age' may be replaced by '≤ 20', the value '23' by '20 < Age ≤ 30', etc.
Fake data. Synthesized also supports the substitution of production values with realistic "fake" data, and continuity can be maintained across attributes in a row using the Synthesized annotation feature (e.g. a real name will be replaced by a "fake" name and this same "fake" name can be used to create a "fake" email address, etc). Generated "fake" data is coherent across columns.
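To make the techniques above concrete, here is a minimal, self-contained sketch of partial masking, swapping, and generalization in plain Python. This is a generic illustration of the concepts, not the Synthesized engine or its API:

```python
import random

def partial_mask(value, keep_last=4, placeholder="x"):
    """Mask all but the last `keep_last` alphanumeric characters,
    keeping separators (spaces, dashes) in place."""
    chars = list(value)
    positions = [i for i, c in enumerate(chars) if c.isalnum()]
    to_mask = positions[:-keep_last] if keep_last else positions
    for i in to_mask:
        chars[i] = placeholder
    return "".join(chars)

def swap_column(values, seed=None):
    """Shuffle a column so correlations with other columns are lost,
    while the set of unique values is preserved."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def generalize_age(age):
    """Replace an exact age with a broad band, e.g. 23 -> '20 < Age <= 30'."""
    if age <= 20:
        return "<= 20"
    lo = (age - 1) // 10 * 10
    return f"{lo} < Age <= {lo + 10}"

print(partial_mask("4905 9328 9320 4630"))  # xxxx xxxx xxxx 4630
print(generalize_age(23))                   # 20 < Age <= 30
```

A real masking engine applies such transformations per column according to a policy, which is what the re-usable policy feature above refers to.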
Does Synthesized ensure referential integrity when it replaces sensitive values with "fake" equivalents?
Yes, Synthesized will ensure referential integrity is preserved when generating new data. In addition, Synthesized handles referential integrity across multiple data sources. Circular references may require additional configuration, but we have handled these successfully with several customers.
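The core idea behind preserving referential integrity under substitution can be sketched in a few lines: every occurrence of a sensitive key is mapped to the same surrogate, so foreign-key references stay valid. The table and field names below are hypothetical, and this is a generic illustration rather than the Synthesized engine:

```python
import itertools

# Two related tables: orders reference customers by email (a sensitive key).
customers = [{"email": "alice@example.com", "name": "Alice"},
             {"email": "bob@example.com", "name": "Bob"}]
orders = [{"customer_email": "alice@example.com", "total": 42},
          {"customer_email": "alice@example.com", "total": 7},
          {"customer_email": "bob@example.com", "total": 13}]

counter = itertools.count(1)
mapping = {}

def fake_key(original):
    """Return the same surrogate for every occurrence of `original`."""
    if original not in mapping:
        mapping[original] = f"user{next(counter)}@fake.example"
    return mapping[original]

# Apply the same mapping to both tables: references remain consistent.
masked_customers = [{**c, "email": fake_key(c["email"])} for c in customers]
masked_orders = [{**o, "customer_email": fake_key(o["customer_email"])} for o in orders]
```

Because the mapping is shared across tables, a join on the surrogate key returns the same rows as a join on the original key, while the original values never appear in the output.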
How is data security defined for derived data compared to original data points?
Synthesized is highly secure and we continuously work to ensure we adopt the latest security protocols and techniques. Security at the platform level applies universally to original and Synthesized data points and includes:
We support modern security protocols and techniques such as JWT tokens, SSL/TLS, and bcrypt password hashing to keep data secure
Access to both sensitive original data and Synthesized data is controlled through role-based administration, including functional privileges, such as who can modify or edit data sets, and sharing privileges, such as who can see the original data attributes or the Synthesized ones
Single sign-on integration with SAML 2.0, OpenID and Active Directory
The derived data can then be exported to enterprise databases for consumption and usage, at which point it contains no sensitive data and no links to the databases or connectors holding the original data.
When using another tenant's data, is access control/approval managed by the original data owner or mirrored?
Access control and approval can be managed either by the original Data Product Owner or in a mirrored fashion, where the owner can set up a mirrored environment for any user.
Language support: What are the main languages supported by the platform? e.g. Python, Spark, Hive SQL, …
There are three ways to interact with the Synthesized engine:
The core SDK is a Python package, which can easily be integrated into any Python pipeline
The core TDK is a Java package
The engines can run in a Docker container so the user can communicate via API, supporting any language with network access. ETL integration is also straightforward in this case
Additionally, we offer a web interface for more interactive and user-friendly communication with the engines.
What are the logical data models vs the physical model, e.g. tables, files, etc. ?
The lineage plot is a nested structure of physical and logical model nodes connected with edges representing flows of data. We are happy to provide any additional detail.
Describe expected performance estimates for the following scenarios, assuming a reasonable component deployment:
Small & Connected: single table; several attributes that are dependent and must be processed consistently; 30 columns; 100k records. Analysis: < 5 minutes. Synthesis: < 1 minute.
Medium & Simple: 10 tables with no circular references; references only by primary key (normalized); 20 columns per table; 50k records per table. Analysis: < 20 minutes. Synthesis: < 1 minute.
Medium & Complex: 30 tables with circular references; references only by primary key (normalized); 20 columns per table; 100k records per table. Supported by a separate module. Analysis: < 1 hour. Synthesis: < 10 minutes.
Large & Simple: single table; 40 columns; 1 TB of data. Analysis: < 3 hours. Synthesis: < 1 hour.
Large & Complex: 40 tables with circular references; multiple dependencies between tables (data copied between tables); 30 columns per table; 1m records. Supported by a separate module only. Analysis: < 3 hours. Synthesis: < 1 hour.
Very Large & Complex: full data warehouse; hundreds of tables; many interdependent tables; 10 TB of data. Supported by a separate module only. Analysis: < 24 hours. Synthesis: < 4 hours.
Join our DataOps community on Slack
Learn about modern DataOps practices and connect directly with your peers, Synthesized users, and our engineers.