Baruch Sadogursky, Head of DevOps Advocacy at JFrog, recently joined our very own CTO, Denis Borovikov, and Engineering Lead, Seva Brekelov, to discuss the impact of DevOps on software development, how DataOps can transform the way companies work with data, and how closely DevOps and DataOps are related. This is the first post in a series summarizing their conversation.
To catch the full podcast, click here!
DevOps is one of the biggest transformations the IT industry has ever seen. Even if it wasn’t the fastest, it was very, very successful and has taken the industry by storm. DevOps is a set of collaborative practices and a culture of collaboration between the Dev part (the developers), the Ops part (infrastructure and production) and the "Build" parts of the engineering organization. It is an effort to overcome the siloed, classical organization in which software is thrown over the fence from Dev to Ops. It has since evolved to include every part of software development, from testing and security to product and management. This is DevOps in a nutshell.
The DevOps practice started over seven years ago. There’s now more attention on DevOps and DevSecOps in the data science and data engineering world than ever before. The challenges in the two worlds are similar and need to be solved in a similar manner.
DevOps was about eliminating silos. Working with data brings similar problems. In the data world, there’s typically a silo between data analysts and data engineers.
On one side there are business-oriented people who work with dashboards and are not technically savvy. On the other side there are extremely technical people building very complex pipelines. The two groups are siloed from each other.
For example, when a dashboard needs building, the business-oriented team has to go to the data engineers. The data engineers might offer support, but schedule it for next quarter, after they finish a new data warehouse. Even then, the data still has to be pulled from the warehouse, so finishing the project is a time-consuming process. This is where the opportunity for DataOps lies.
The data consumers are the users, such as product managers, data analysts, business analysts, data scientists and ML engineers. The data has to be accessible to them in a format they know how to consume. At the same time, you need their input as well: what they know at the time, what they actually need and whether they can get the data they really need.
Additionally, there is a separation of knowledge. On one side there is the business knowledge of the data consumers, who know the semantics of the data. On the other side there are the data engineers, who are experts in the tools but typically do not understand the semantics of the data. This creates a significant tension: engineers are happy to help build pipelines, but they don't necessarily know the meaning of the data, while the various users know the meaning but find it too complex to start coding SQL transformations to create all the staged and modeled views of the data.
DevOps has become an intersection of software development, quality assurance and technology operations.
DataOps is the intersection of data engineering, data quality, data security and elements of data integration.
There is a distinction between DataOps and DevOps for data because the data pipeline is much longer than the development pipeline. In the development world, pipelines can be very short. In some cases a pipeline can be as short as deploying the master branch to production, but in the data world that's never the case.
A data system can be viewed as a set of complex pipelines. Sometimes people use the metaphor of a data fabric: a lot of data sources, data sinks and data pipelines shifting data around. You have lots of roles and a lot of different owners of the data.
And if an organization is large enough, departments such as compliance will also be heavily involved, with their own processes and policies around the data. This is why people in the DataOps community relate it to lean manufacturing.
The DataOps community places emphasis on lean techniques such as statistical process control. The work is seen as a pipeline, a real factory. People in DevOps see it differently: they lean more towards extreme programming practices, such as test automation and similar techniques. In methodology, DataOps is closer to lean manufacturing than DevOps is.
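To make statistical process control a little more concrete, here is a minimal Python sketch of how it might be applied to a data pipeline. It was not part of the conversation. The row counts, the function names and the three-sigma threshold are all assumptions made for illustration, and a real setup would likely track many more metrics than row counts.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) control limits computed from past measurements."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(baseline, new_values, sigmas=3.0):
    """Return the new measurements that fall outside the control limits."""
    lower, upper = control_limits(baseline, sigmas)
    return [value for value in new_values if value < lower or value > upper]


# Hypothetical row counts from earlier runs of one pipeline stage.
baseline_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

# Hypothetical row counts from the latest runs; one of them looks suspicious.
latest_row_counts = [1003, 400, 1015]

flagged = out_of_control(baseline_row_counts, latest_row_counts)
print(flagged)  # prints [400]
```

The idea is the same as on a factory line: rather than inspecting every output by hand, each stage is measured on every run and only the measurements that drift outside the expected range are flagged for a human to look at.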
From the classical software engineering perspective, lean manufacturing is a close metaphor for software engineering as well.
Additionally, the pipelines have evolved so that people who are further removed from the work can see them. There are many scenarios and roles involved in the DevOps pipeline as well, covering a variety of different aspects. The DevOps pipeline is not simpler, but the differences are significant enough to justify the distinction. The differences lie in what is integrated, not necessarily in how complex the integration is. It's not a matter of there being fewer circles on the DevOps diagram, because many more circles can be drawn on that diagram as well.
There is a lengthy list for DevOps too. The circles are simply different, and this is the most important difference between DevOps and DataOps. What really matters is that there is not a lot of overlap in the skills and the teams that participate. This is the biggest difference, not necessarily the complexity.
The roles and the nature of these roles are also different. This factors into the notion that the DataOps community is agile.
In the software world, people have become bored with agile. A common view is that there is already too much agile everywhere, and that it is entering its final stages. In the data world, agile is something new and interesting, because data projects tend to be more structured.
They tend to be organized as waterfall projects simply due to the nature of the work. Data analysts produce scientific insights for the business, which means there is a need to follow some scientific process. It becomes hard to skip stages and try things quickly to see what happens. That approach tends to falter in parts of data science, such as running experiments, which require systematic work and therefore make the process more staged. Additionally, the idea of a data sprint is a new concept, and people are still figuring out how to apply these kinds of sprints to the data world.
Ten years ago, there was no standard approach to continuously deploying everything in a single click. Now there is: setting up GitHub Actions allows developers to do all these things well and fast, and every single developer can do it.
Data is currently where software was back then: it is still too complex to set up databases with particular properties and data in certain states. It's not as easy as with software development.
The staged waterfall model is unavoidable, but you can run your process as a sequence of small, fast waterfalls. Even if each stage is made faster, the sprints are going to be longer than in the development world, but the process can still be agile in some form.
Ultimately, it is an iterative process everywhere. It can be called a waterfall, a small waterfall or an agile iteration. And even the concept of a pipeline means that things are done in sequence.
So it is not about abandoning sequence and order, but rather a matter of doing things collaboratively and consulting each other more frequently than before. It's very important to create empowered teams composed of people from different areas of expertise to unlock the full potential of this new way of working.
When it comes to team structures or team topologies, the most efficient way to actually implement DevOps is to empower a team of cross-functional specialists who come together and work together in some sequence. They will have their own processes, but communication is reduced to the level of a single team instead of trying to build an effective cross-team or cross-organization communication channel, which is much harder to achieve. The same thing happens in the data engineering world.
The expression "self-service" is also common. The idea is that the more you can do within a team, the better. Delays happen because people need to communicate with other teams, so the more that can be done within a team, the faster the process. And there are many people involved in the DataOps processes.
That is one of the biggest challenges when people are trying to build a data platform.
Our next post in this series will cover how to build DataOps teams and the skills they need to succeed. Thanks for reading!