As you know, Baruch Sadogursky, Head of DevOps Advocacy at JFrog, recently joined our very own CTO Denis Borovikov and Engineering Lead Seva Brekelov to discuss the impact of DevOps on software development, how DataOps can transform the way companies work with data, and how closely DevOps and DataOps are related.
This is the second post in a series summarizing their conversation and key takeaways. In today’s article, they address the two main bottlenecks in the current data process and how to work around them. To catch the full podcast, click here!
Given the long list of specialties needed in a data project, building a DataOps platform might look daunting. It can seem like too much of everything, so let's start by looking at the two main bottlenecks in the current data process.
The first one is data engineering, because it is the stage that requires the most coding: from the actual groundwork and building pipelines to testing and checking data quality. If you can automate or improve that, you see large improvements instantly.
The second is compliance. If your data is sensitive, you have to deal with compliance.
The basic level of dealing with it is masking credit card numbers and the like, but that's not enough. What happens when you actually want to use that data? And sensitive data is not always obvious.
If sensitive data can be made non-sensitive, the compliance issue goes away. But that raises a harder question: how exactly is the anonymization done?
Some believe that removing identifiers anonymizes the data, but this is not true. For example, people run a masking script on a database and think they have anonymized the data, when in fact that is only de-identification.
Another example: imagine a salary column containing a person with a specific salary. You can remove all the identifiers, but if their salary is known, it is easy to find that person; just from recognising values you can tell who the data pertains to. This is why advanced techniques such as k-anonymity and l-diversity exist, but these standard processes are long, manual, and error-prone.
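To make the distinction concrete, here is a minimal, hypothetical Python sketch (the records and salary value are invented for illustration) showing how masking the direct identifier still leaves a row re-identifiable through a unique salary value:

```python
# Hypothetical toy dataset; names and salaries are invented.
records = [
    {"name": "Alice", "department": "Sales", "salary": 52000},
    {"name": "Bob", "department": "Sales", "salary": 54000},
    {"name": "Carol", "department": "Engineering", "salary": 97500},
]

# A typical "masking script": drop the direct identifier.
masked = [{k: v for k, v in r.items() if k != "name"} for r in records]

# This is de-identification, not anonymization: anyone who knows
# Carol's salary can still single out her row.
known_salary = 97500
matches = [r for r in masked if r["salary"] == known_salary]
print(len(matches))  # 1 -- the row is still uniquely identifiable
```

The point is that the salary acts as a quasi-identifier: no name is present, yet a unique value links the masked row back to a real person.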
And this is why a new technology is needed: synthetic data. It is the ultimate solution to the anonymization problem, and it is completely automatic. A model is trained on the original data and entirely new records are sampled from it, meaning the output contains no data from any original individual at all.
First, you must ensure that the synthetic data you're working with is faithful to the data you used to have. That means consistently running quality checks on your synthetic data: whenever you train a model and sample a new data set, you need to compare it with the original. We provide a set of built-in checks, which can be extended on a case-by-case basis. If you have an anonymized, synthetic view of the data, you have one less problem to solve: the data can be accessed and meets compliance requirements. So that challenge is solved.
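The idea behind such a comparison can be sketched in a few lines of plain Python. This is not Synthesized's built-in check, just a minimal illustration (with invented numbers) of comparing summary statistics of one numeric column between the original and synthetic data:

```python
import statistics

def column_drift(original, synthetic, rel_tol=0.1):
    """Return True if the mean and standard deviation of `synthetic`
    stay within `rel_tol` relative difference of `original`."""
    checks = [
        (statistics.mean(original), statistics.mean(synthetic)),
        (statistics.stdev(original), statistics.stdev(synthetic)),
    ]
    return all(abs(o - s) <= rel_tol * abs(o) for o, s in checks)

# Invented salary columns for illustration.
original_salaries = [52000, 54000, 61000, 97500, 88000]
synthetic_salaries = [53500, 55000, 60000, 95000, 90000]

print(column_drift(original_salaries, synthetic_salaries))  # True
```

Real checks would compare full distributions and cross-column correlations, not just two moments, but the workflow is the same: sample, compare against the original, and flag drift.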
Traditionally, data engineering means manually coding all the data pipelines. This is where the DataOps platform becomes relevant: we are building an interface where you can efficiently self-service and create data pipelines. When you have both an automated data pipeline and a good solution for anonymization, most of your projects run faster.
Think of it this way: the input is real data that may be unsuitable in both form and content, because it is not structured the way consumers expect and it is not anonymized, which can result in compliance issues. Synthesized acts as a black box that generates new data based on the original. The output is data that is randomly generated yet carries all the traits of the original data set, and is structured so that engineers and analysts can successfully consume it and do their jobs.
Synthesized solves the two big problems of the structure of the data and the anonymization of the data.
Let’s start with a comparison to the regular pipeline. In the DevOps world, we all know the pipeline: develop and build the app, test it to ensure the quality is satisfactory, then release it to the users.
For DataOps, we have a more complex pipeline to develop the data product. The data product is something that then manages the data resources. We also have testing and quality assurance: testing the quality is fine, but in the data world releasing involves users' data, so the most important thing is how we manage the usage of that data. We also monitor the usage and the results. Because all of this is more complex, we have to use more tools.
For any software, managing and monitoring usage is important. In most of the chats and Slack channels where I looked for people dealing with data, they were using dbt, but at the same time they were complaining about the lack of other solutions.
In the near future, building data platforms will become a priority for all large enterprise companies. Most top-tier banks will tell you they are building a data platform, and they are all doing it to solve this problem.
Looking back at the last 10-15 years: people were using relational databases, often many of them, and needed to put all of that data in one place, so data warehouses emerged. As data volumes grew and machine learning came into use, warehouses expanded into data lakes, and the amount of data processed grew with them. It is no surprise that at some point that volume became a big problem and a top priority for many companies. Many companies are investing right now, and in the near future good solutions will become available.
It sounds very inefficient for every company to come up with the solution to a shared problem, right? Why is that not the case in the DataOps world?
Many companies that are doing DataOps, such as Airbnb, Amazon and others, are building solutions in-house. There is a lot of repetitive work, and those in-house solutions are not easy for everyone else to adopt. We believe that open source projects and open protocols would make this much easier, but that has not happened yet. At Synthesized, we aim to be part of the open source world, and we plan to help lead standards and move in that direction.