Marc: I’m Marc Degenkolb, the Chief Operating Officer at Synthesized. Thank you in advance for participating in this series that we call Mind the Data Gap. I have the pleasure of moderating a conversation between two people that I've known for almost two decades.
Ironically, they were both customers of mine back in the day. Don Brown, who is now the field CTO at Synthesized, was my customer when he was at EarthLink. And Coty Rosenblath, who is the CTO of Katalon, was the Chief Architect at an organization called S1 Corporation. Fast forward and we're at two different organizations with the opportunity to have a conversation around data-driven testing, and the value of API testing with synthetic data. We’ll discuss the value of an automated testing software tool developed by Katalon and that integration.
When I was doing some due diligence on both sides, each of you seems to have split your career about 50/50 between the customer side and the vendor side. But Coty, start by defining your role and responsibilities, and what's happening at Katalon.
Coty: I'm CTO at Katalon, and under that, I run our technology organization. That goes from our software development to our operations management. Everybody who both builds and operates our software. I joined Katalon about two years ago, but we've grown a lot since then. There was an idea that started several years ago, which is effectively an IDE for quality engineers and testers to build automated testing environments. We’re expanding through our TestOps products which help to orchestrate and analyze test results to our test cloud product, which gives you a place to run tests against custom environments with operating systems, different browsers, and different mobile devices. Our goal over the long run is to continue expanding that platform and the capabilities we bring so that there's a unified experience for quality organizations.
Coty: Katalon spun out of a company called KMS, and I knew one of the founders of KMS, Josh Lieberman, from the Atlanta tech scene. We stayed in touch over the years, and I noticed that they were looking for a CTO. I’d known about Josh's co-founder through him, but I had never met him. And when I told Josh I'd be interested in hearing what they were doing and learning more about it, I met his co-founder, and it was his vision of expanding this platform, particularly his vision for how we could build a platform that consolidated the data that was generated from the testing process and turn that data into additional features for customers. That resonated with me because I had been at MailChimp before Katalon, and led the data science and data engineering teams there. We knew what we were able to do there in terms of turning data into new product features that were helpful for customers. I saw that as a great place to be in the long term. We are still on that, and it is why we are still expanding our product platform so that we can collect more data and turn that into insights and features that help our customers do better testing and ensure better quality delivery.
Don: My role here at Synthesized is field CTO. And as Marc mentioned early on, I've spent fairly close to equal parts of my career, both on the engineering side, as well as the field sales side. I’m mostly focused on operational elements, so both security and infrastructure as opposed to pure product engineering. I have done quite a bit of data engineering in my life. That confluence of both operational data, and big data, and applying that to real-world problems is where I like to focus my attention.
Synthesized was particularly interesting, as a couple of years ago, I was the Chief Information Security Officer (CISO) for a subsidiary of Ford. Marc asked me to evaluate Synthesized before I was even looking for a job. I had come in and looked at it from that perspective. I found that the idea of synthetic data has been rattling around for decades, and it means different things to different people.
I needed to wrap my head around what it meant in the context of late 2021 and early 2022, as opposed to 2002, which was the last time I'd deeply looked at synthetic data. And in my role at Ford Autonomic, one of my jobs every quarter would be to help the teams evaluate requests for data access to sensitive data sets and make sure that compliance and risk were all buttoned up. We had all the audit trails in place. Our most valuable data set there is the GPS telemetry from every vehicle. For Ford, since roughly 2006, it’s been stored in a single data warehouse.
The implications and value of that data are obvious. For example, I drove a Ford for a long time and could look at my history which shows that I went to go get coffee at 8:00 AM every Monday in a particular neighborhood. The requests for access to this data numbered in the hundreds over the two and a half years that I was there, and we never gave access other than to internal use, because of the reputational risk involved. That's how I came to synthetic data and how I joined the company. There’s an opportunity to be able to create a synthetic data set theoretically from a real one that is simply too sensitive or has too much risk attached to it from a reputational standpoint if it were to ever leak. The ability to turn that into net new top-line revenue is fascinating to me. There are also more intuitive benefits that we will get into shortly around higher quality data, better outcomes for data science projects in particular, with synthesized data than most people traditionally think.
Don: Yes, the primary driver we're seeing now is a continued collapse and focus on the developer. The developer is the single focus of everyone in an organization. There are names for it now such as digital transformation, and Marc Andreessen describes it as “software eating the world”. It continues to eat the world, and the developer is at the center of that world. I think of groups such as QA and security as extensions of that developer, and what we'll see is a continued compression of the responsibility of the QA, developer, and security folks into a single role. It will never exist in a single role, but it will start collapsing more, all in the interest of aligning the developers' productivity with the ability for them to deliver in a short amount of time.
There are a few different dynamics by which we talk to customers around synthetic data. The first is the normal license renewals, and expiries, and when people need to update the test management software they have. Interestingly, as people start to move into the cloud and start to explore what's possible with digital transformation, they look at their entire stack. In many cases, in the enterprises that we focus on which are in the top 20 to 30 in the world, mainly financial services and healthcare, which tend to be even more slow-moving, this is a change that they don't get to see in the market very often. They want to seize this opportunity because data has gravity and the tools that surround it have lots of gravity. Once you sink the data pipeline into place or sink one of these CI/CD processes into place, not only is there not much value in touching it but also there's a lot of risk in touching it. We want to seize that opportunity, and get ourselves embedded into these digital transformations and cloud migration projects using an API, as opposed to what's historically been done in this place, which is manual and tedious work that’s specific to the individual application groups.
We also see drivers around regulatory issues. So HIPAA and PCI, where financial services and healthcare tend to be big. Those are table stakes, but certainly interesting, and a big driver for why we see people jumping into the test data world. But we've noticed (and need to do deep R&D on) that there are significant boosts in model performance as we dip into that world. And some of that is due to simply better data preparation, cleansing hygiene, and standardizing imputation methods.
We're removing the nulls the same way, and you get a better quality product if everyone's centralized on the same framework. But additionally the GANs themselves, the AI underneath, appear to be boosting the underlying signal in ways that we had not anticipated. We have customers that are seeing improvements in fraud models of multiple points. And for these customers, which are substantial organizations, a multiple-point boost on a $100-$200 million problem equates to a $5-$10 million ROI in the first couple of months after they implement this. We've seen this more than once. This is not an anecdote or an outlier.
Finally, something I mentioned early on, but still in its infancy, is a move towards how we sell data. What’s interesting to see is that the response to the theoretical ability to sell data is foreign to people. I had one North American bank that brought me the problem. They said their marketing team has datasets, and they have third parties that they interact with and have been trying to figure out a way to sell them data in a clear, concise, and risk-free way for years. The next day, I had a conversation with another bank with a similar profile and I brought it up unprompted this time because it was interesting to the other bank. Their response was completely allergic. They said they would never do that or even consider it. The reputational risk is too great for them, even though they know it's completely synthetic. As a new market, that might be the most interesting thing in synthetic data.
Coty: I have a few queries based on that. Our customer base spans from small shops to enterprise financials and healthcare. Still, with those customers that are making that transition from an on-prem data center to the cloud, I wonder if they're having to revisit data security, and data privacy models, as they get comfortable with the fact that not everything is behind their firewalls anymore. What I see is that people are having to rethink what it means for us to have access to different data sets. We ought to be making sure that we're not using anything we don't need in a test environment or any type of pre-prod environment.
Don: What I've noticed there (and it's amazingly consistent both among prospects and people that we work with) is that in financial services and healthcare, the highly regulated industries, there's a modern data platform that's emerging within that group. And it is predicated around two tools that most organizations don't engage with unless they're tremendously mature data organizations. That is the data catalog and data governance tools. For example, a BigID or a Collibra paired with an honest data catalog. And using those as your sources of truth to push anything out to your developers.
Rather than having your developers go to a data warehouse and pull raw data, which is what I've historically seen people do to get test data, whatever the use case ends up being, all of that data flows through sophisticated data pipelines, and the ETL tools and orchestration will be different everywhere. But at the end of the day, you're producing things into a data catalog, and with that data catalog, developers will be able to interact with it in a variety of ways. I can't produce anything about those people in a certain data set, but I can produce synthetic data, and I can give the raw data for everything else. Being able to get arbitrarily complicated based on those dimensions in the data, and then build that into a pipeline is tremendously powerful. The other thing that I've noticed is that people are going after the data mesh abstraction now in earnest. For a long time, it was purely academic for me. Martin Fowler's writings on it are interesting, but I've never seen anybody go all the way to the implementation. I can think of one pharmaceutical company that has it in production, and another bank. Now their developers don't think about producing test data. They think about a data set existing in this data catalog, and they push a button and add some constraints. That's the extent to which they have to think about producing test data going forward. Optimizing for test data coverage etc. becomes a developer problem, not an organizational problem, which is how I now think about it.
Coty: That’s a great observation, focus on the developer. That is also the result of increased pressure on development organizations to deliver quickly. Consequently, you want to remove all the friction you can. That is something that I'll spend some more time thinking about as we’re developing our platform, in terms of how to help organizations know the details of where data resides and how to get the right set of data. That is part of what Synthesized’s synthetic data is; being able to create novel data sets that may need to be created to express certain cases that are in your test scenarios.
Don: Both. In both cases, they had an existing model that was running on non-synthesized data. And part of the proof of concept (POC) was to take it, run that same model on a synthesized data set, and see if there's anything we could do to improve the synthesis process to get slightly better. And iteratively we've been able to do so. Many of these data science teams, particularly in mature domains such as fraud and churn, don't get to eke a point out of those models. They've been working on them for decades. If we can come in and show even half a point of lift against those models, it's tremendously valuable for them. We are working internally to try to understand why exactly this happens. We have intuitions that the GAN itself is amplifying the signal that it sees in the data. But certain types of techniques tend to work better with synthesized data. Random forests, for example, tend to respond tremendously well as opposed to clustering techniques. We need to do research.
The idea that most people use similar data sets, but then apply a cascade of models or an ensemble of models behind it, is a compelling one. If we can start to define schemas for these people and say that if you get your data in a certain format, provide us a data set and we can synthesize it, with these techniques you should get to N percent of fraud identification, or churn. This is all pure R&D for us. We've started to notice this trend and have enough at-bats in the market to make it repeatable over six months or so. There are few times when I've seen people's eyebrows go up as much as when we move the needle on those models.
Coty: Something that is moving this interest in synthesized data is the increased criticality of data to some systems. For a long time, everything was defined in code, but now we have systems where the data that you feed it is defining how the system ultimately performs. That raises the criticality of having data that represents various distributions, and various forms, and I’m always interested in seeing how we can package some of that knowledge for customers trying to meet these new requirements.
Don: As an engineer, and someone who's built a few applications, if you're coming at the problem from a data-first perspective, everything changes. If I have to define my data sets, define my schema and everything is relatively well understood, what kind of inferences could be made for me as a developer, from a tool such as Katalon?
You could also give it some logic from the application too. But from the data types and their relations that I'm willing to define either through DDL or other ways, if that could do 80% of the boilerplate QA tests for me as a developer, at least on the data typing, then you have solved an enormous problem without having done any more developer work. All I have to do is send a REST call to a tool and those tests get implemented and ideally are embedded.
Coty: We've been thinking, particularly in our API testing, about how much we can generate from that. If you've got swagger documentation defining your API functions, that tells you a lot about what can happen in the API, but it doesn’t tell you a lot about what data might flow through that. The things you talked about in combination with the description of the API, start to form the constraints around what could be used to generate several test scenarios that would exercise logic and data in a way that would be useful.
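To make the idea of deriving boilerplate data-type tests from a declared schema a little more concrete, here is a minimal sketch. It is not a Katalon or Synthesized API: the `SCHEMA` layout, the column names, and the `derive_checks` and `validate_row` helpers are all hypothetical, and a real tool would more likely parse the schema from DDL or an API description.

```python
from datetime import date

# Hypothetical schema: column name -> (Python type, nullable).
# In practice this might be parsed from DDL or an API description.
SCHEMA = {
    "user_id": (int, False),
    "email": (str, False),
    "signup_date": (date, True),
}

def derive_checks(schema):
    """Build one check function per column from its declared type and nullability."""
    checks = {}
    for column, (expected_type, nullable) in schema.items():
        # Default arguments bind the current loop values to each check.
        def check(value, expected_type=expected_type, nullable=nullable):
            if value is None:
                return nullable
            return isinstance(value, expected_type)
        checks[column] = check
    return checks

def validate_row(row, checks):
    """Return the names of columns whose values fail their derived check."""
    return [col for col, check in checks.items() if not check(row.get(col))]

checks = derive_checks(SCHEMA)
good_row = {"user_id": 1, "email": "a@example.com", "signup_date": None}
bad_row = {"user_id": "1", "email": None, "signup_date": date(2022, 1, 1)}
```

Calling `validate_row(bad_row, checks)` flags `user_id` (a string where an integer is declared) and `email` (null in a non-nullable column). The developer writes none of these checks by hand, which is the "first 80%" being discussed.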
Don: I ask people what data types they have for a couple of reasons. One of the reasons is that I follow up by asking what tests they have disabled because their system has data types in it that it's not supposed to. That tends to indicate a highly brittle test process, and in almost all cases in an organization of any scale, they've done some sort of bulk load at one point into a database in the interest of speed. And they disable all the constraints when they do it. I had this conversation with a data engineer at a bank in South America a few weeks ago. He gave me a dump of his database and he said his tool was not working. And I said, “here's the error. You’ve got this data,” and he said that data couldn't possibly be there, because they don't allow enum types in that column. They may have done a bulk load, and that's why these data types are allowed in the database.
Coty: Even if you didn't do that bulk load, systems evolve over time. Now there's a constraint, but has that constraint always been there? I often see that customers run into it in their testing too, and have to disable some tests for a while as a result.
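The situation both speakers describe, data sitting in a table that the current rules say cannot be there, can be checked for directly. The sketch below uses an in-memory SQLite database as a stand-in. The `accounts` table, the `status` column, and the allowed values are invented for illustration and are not taken from the bank in Don's story.

```python
import sqlite3

ALLOWED_STATUS = ("active", "closed")  # the rule the team believes is enforced

conn = sqlite3.connect(":memory:")
# The table is created without a CHECK constraint, standing in for a bulk
# load performed with constraints disabled (or a constraint added later).
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO accounts (id, status) VALUES (?, ?)",
    [(1, "active"), (2, "closed"), (3, "PENDING_MIGRATION"), (4, None)],
)

def find_violations(connection, allowed):
    """Return ids of rows whose status is NULL or outside the allowed set."""
    placeholders = ", ".join("?" for _ in allowed)
    query = (
        "SELECT id FROM accounts "
        f"WHERE status IS NULL OR status NOT IN ({placeholders}) "
        "ORDER BY id"
    )
    return [row[0] for row in connection.execute(query, allowed)]

violations = find_violations(conn, ALLOWED_STATUS)
```

Here `violations` contains the ids of the two rows that break the rule. Running a query like this before switching a disabled test back on shows whether the data, rather than the test, is the problem.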
Marc: You both mentioned the importance, from a security and governance perspective, of getting production quality data. Don and I hear it every single week, as customers and prospects tell us that we're providing synthetic data of this production data source in a better quality format for use in the use cases you were talking about. It's interesting because we're also hearing this theme of traditional legacy masking technologies or traditional anonymization techniques that are no longer safe. This means there is no such thing as anonymous data. We're spending a lot of time educating the market on the ability to mask data, and then having the ability to create a synthetic copy and mask it.
Don: It’s important that you use the right transformation for the right use case. There are costs to doing generation, which is what we would typically consider the best and most unique in terms of capabilities. But masking works fine in many cases, and sampling works fine too.
It depends on the use case. For example, masking may work perfectly well in the scenario of application testing, where you're doing a functional test, and you're masking a social security number. In that case, I encourage keeping the first three digits when masking, because we want our developers to be able to see the first three as that gives you valuable data. The first three would indicate what state you were born in. And picking the right type of transformation is important. Generation presents its own set of challenges. When we say generation, we mean we're going to have an AI GAN effectively look at the data set and then reproduce one that looks as close to that data set as possible, with some constraints that you can apply on the side effectively via YAML. For example, if you only want to have people from Texas on the table then it would produce only those people. That's very valuable, it checks most of the boxes, but it is expensive in terms of performance.
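As an illustration of the masking example, here is a minimal sketch that keeps the first three digits of a social security number and replaces the rest with random digits so the value still has a valid shape for functional tests. This is one simple interpretation of "masking", not Synthesized's implementation. The `mask_ssn` helper and the `NNN-NN-NNNN` input format are assumptions.

```python
import random
import re

# Assumed input format: NNN-NN-NNNN.
SSN_PATTERN = re.compile(r"^(\d{3})-(\d{2})-(\d{4})$")

def mask_ssn(ssn, rng):
    """Keep the first three digits and replace the remaining six with random digits."""
    match = SSN_PATTERN.match(ssn)
    if match is None:
        raise ValueError(f"unexpected SSN format: {ssn!r}")
    area = match.group(1)
    group = "".join(rng.choice("0123456789") for _ in range(2))
    serial = "".join(rng.choice("0123456789") for _ in range(4))
    return f"{area}-{group}-{serial}"

rng = random.Random(0)  # seeded only so the example is repeatable
masked = mask_ssn("123-45-6789", rng)
```

The masked value keeps the `123` prefix and the overall format, so format validation in the application under test still passes while the identifying digits are replaced.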
It depends on the use case, and whether masking, sampling, or generation is the right choice. From a pure security perspective, people generally worry about generation and the concept of differential privacy, which is the idea that by using generation you're going to create a new data set that is so similar that you can do AI and BI modeling on top of it. To do that you give a couple of parameters. One of the parameters is called Epsilon and that represents the amount of noise that you're introducing to the data set. Lower values mean more noise, and higher values mean less noise. When the iPhone first tried to employ differential privacy back in 2016, its Epsilon score was about 41. We generally encourage financial services companies not to have an Epsilon of greater than one. That’s the level of security. It's a logarithmic scale, not a linear scale, so that's a significant difference.
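The relationship between Epsilon and noise can be seen in the classic Laplace mechanism, sketched below for a simple count query. This is a textbook illustration, not the GAN-based generation Don describes. The `noisy_count` and `mean_abs_error` helpers, the query, and the trial count are assumptions chosen for the example.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Laplace mechanism: add noise with scale = sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

def mean_abs_error(epsilon, trials=20000, seed=0):
    """Average absolute error of the noisy answer to a count whose true value is 100."""
    rng = random.Random(seed)
    total = sum(abs(noisy_count(100, epsilon, rng) - 100) for _ in range(trials))
    return total / trials

error_strict = mean_abs_error(epsilon=1.0)   # more noise, stronger privacy
error_loose = mean_abs_error(epsilon=41.0)   # much less noise, weaker privacy
```

With Epsilon at 1 the average error on the count is roughly 1, while at 41 it is a small fraction of that. In the formal definition, Epsilon bounds how much the probability of any output can change when one record changes, by a factor of at most e^Epsilon, which is why the gap between the two settings is far larger than the raw numbers suggest.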
The more important value is Delta, which represents the likelihood of any individual data set getting out into the wild. That's the number that people should worry about, because your data exists on spreadsheets, in every part of your organization, whether you know it and like it, or not. I think differential privacy is an academic exercise at the moment, where people are getting the opportunity to say they now have parameters that they can establish risk around for creating these synthetic data sets. What do they think the ROI would be? The question becomes “how forward-facing or forward-looking do you think those threat models are?”
Nothing is saying that somebody couldn't come out tomorrow with a new technique that completely upends the concept of differential privacy and makes your Epsilon value effectively worthless. All the data that you've been selling and reselling is now effectively in the wild. You would have to make sure that you had other controls involved, such as not selling it to the public, sharing it with trusted users, etc. As you think about governance and privacy, those are the things that rattle around in my head, and as a CISO, differential privacy is a tremendously interesting tool. We're yet to see a large enterprise that is willing to bet big on differential privacy in terms of offering it to their users and reselling.
Coty: One thing that strikes me is that from the very beginning, your description of generating data for a development organization, and having these ways of getting data is critical. In regards to enabling developers and your entire development organization to move quickly, I often see development stymied and QA stymied when they need data. It's held up because we don't have a way to give you what you need in a timely fashion. Having tools at hand is the real critical point so that you're not stuck with only production and some scripts to de-identify. To be able to have all of these options and make these calculations to quickly respond is what the executive teams want to get to. A place where things are not held up because you can't serve the developer or QA tester.
Don: I'll add a qualifier to that, which is trusted tooling. When you're dealing with test data, the risks are significant. If you end up producing a data set that doesn't abide by the security controls that your team wants, you put your team at significant risk. That's an important piece to call out in not having those teams. I almost think of it as encryption software. You shouldn't be rolling your own encryption software unless you deeply understand what you're doing. You're facing many of the same risks here as you do with encryption, hence why you need to think about it in much the same way.
Coty: It's a perfect analogy. I’ll apply the same rule to synthetic data generation. Anytime I see someone with a novel crypto algorithm, I say that there are a lot of people who have focused on this, and what you think is a novel insight is probably not. The same thing may be true in the data privacy world. You may think it's enough to do X or Y, but in fact, you have not thought through the problem as deeply as a lot of other people.
Coty: Simply put, Katalon Studio is an Integrated Development Environment (IDE) for a quality engineer, a tester, or a developer to develop automated tests. We support automation on many different fronts, such as web automation (simulating interaction with a browser), API automation for REST or GraphQL APIs, and mobile automation. We also do desktop. In some of the organizations that we're talking about, big financial institutions, in particular, there's a surprising amount of desktop software that is still a meaningful contributor to their infrastructure. We have an environment where an engineer, either a quality engineer or software engineer, can record tests by interacting with the browser manually and then capturing that or doing the same thing on a mobile device, and then take the scripts from that and augment the scripts for general purpose. Most relevant to our current discussion, we have built into the development environment the ability to attach data sets to a script or a set of scripts, that we call a test suite. You can then run the same test, but with different data, and we can connect to standard flat-file data formats such as CSV. We can connect to a JDBC database endpoint.
That's where our studio environment comes into play. It’s a place to author these tests. In regards to building a platform, from that, we can then take those test suites, connect them to our test operations platform, and let you monitor the execution, success, and failures. You can orchestrate them so they occur at the right point in your pipeline, and then we give you in the test cloud a place to execute those and get all the right environments to do that.
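The pattern Coty describes, attaching a data set to a test and running the same script once per row, looks roughly like the sketch below. It is plain Python rather than Katalon Studio's own scripting or data-binding features. The CSV columns, the toy `submit_payment` function, and `run_data_driven_test` are hypothetical.

```python
import csv
import io

# Stand-in for a CSV file attached to a test suite; the columns are hypothetical.
CSV_DATA = """username,amount,expected_status
alice,25.00,approved
bob,-5.00,rejected
carol,0.00,rejected
"""

def submit_payment(username, amount):
    """Toy system under test: approve only positive amounts."""
    return "approved" if amount > 0 else "rejected"

def run_data_driven_test(csv_text):
    """Run the same test once per data row and collect (username, passed) results."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        actual = submit_payment(row["username"], float(row["amount"]))
        results.append((row["username"], actual == row["expected_status"]))
    return results

results = run_data_driven_test(CSV_DATA)
```

The test logic is written once. Adding a scenario means adding a row, which is where a synthetic data source could supply rows that would be slow or risky to pull from production.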
Coty: There are several hard things, but we've simplified a number of them, and we've left some harder ones. Some of the hard things are understanding the system under test and being able to capture its behavior. We've addressed that with our recording capabilities and the ability to take a recording, extract the specific parts, and make it a more general-purpose script. Connecting that to data so that you can take a general test to expand and run it across several scenarios is legitimately difficult. We've tried to make that easier by giving you very simple ways within our tool to connect to the right data sets, but that still leaves the complexity of other things. Generating data that looks like production is a consideration.
One thing that people find particularly difficult that isn’t obvious is when you're generating a full web workflow or a complete API sequence, knowing what the connection between one data set and the other is. What is the connection between my set of users and the transactions they might invoke? Users often have specific permissions, and they may not be able to use every function. Being able to generate data that looks right on its own, but also connects is something that everybody finds challenging.
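The difficulty of keeping generated data sets connected can be sketched as follows: transactions must reference users that exist, and each user may only invoke functions their role permits. The roles, function names, and helper functions are invented for illustration and do not reflect either product.

```python
import random

# Hypothetical roles and the functions each role may invoke.
PERMISSIONS = {
    "viewer": ["view_balance"],
    "teller": ["view_balance", "deposit"],
    "admin": ["view_balance", "deposit", "close_account"],
}

def generate_users(count, rng):
    """Create users with sequential ids and randomly assigned roles."""
    roles = sorted(PERMISSIONS)
    return [{"user_id": i, "role": rng.choice(roles)} for i in range(1, count + 1)]

def generate_transactions(users, per_user, rng):
    """Create transactions that reference existing users and only permitted functions."""
    transactions = []
    for user in users:
        allowed = PERMISSIONS[user["role"]]
        for _ in range(per_user):
            transactions.append(
                {"user_id": user["user_id"], "function": rng.choice(allowed)}
            )
    return transactions

rng = random.Random(42)  # seeded only so the example is repeatable
users = generate_users(5, rng)
transactions = generate_transactions(users, per_user=3, rng=rng)
```

Generating the child records from the parent records, rather than generating each table independently, is what keeps the two data sets consistent. The same approach extends to longer chains such as accounts, transactions, and audit events.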
Don: I saw a graphic online the other day, which used not only the QA, app, and security teams, but also ML teams. It started with the app team and it had an app developer who was holding a cat and he was slowly petting it, and this represented his level of testing. He was slowly petting it, looking at it once a day, making sure it was eating, and that's the extent. But he has full context on that cat. He knows everything. He knows what it's eating and how it's doing on a day-to-day basis. QA teams miss a lot of that context, as they don't live with the application day to day, and are not in the code every single day. Provide as much tooling as you can, with the 80/20 rule, but for testing and then let them write the tests that are hard with a tool like Katalon. ML teams, in my opinion, miss fidelity. Data's a mess, it is painful to deal with, and they spend 80 to 90% of their time getting data into a position to deal with. In that cat analogy, they're the ones that are the pet sitters. They don't get any context, and they get a mess, but generally, their role is to play with the cat.
Then we have the QA teams, who are like the vets. Ideally, they should have all of the functional business requirements buttoned up in a document, and be able to write tests to address each one of them and go through it. They get their context provided externally as opposed to the developer who writes it internally.
Finally, you've got the security team, which needs specialized tools, because, in the cat analogy, they are going to throw everything at that cat that they possibly can. They're going to tip it upside down, flip it right side up, turn it at every angle and insert everything into it that they can think of to figure out what the response would be, or if they can break it. That’s how they deal with test data. I keep referring to this as primarily a developer problem, and everything else that flows and is an extension of that. If you think about it that way, you tend to get to better solutions. When I hear about Katalon, and building workflows, what I'm glad to hear is that we want to get to that first 80% that the developer has to do no matter what. In every application you have to do logging, you have to do a certain set of activities to be considered enterprise-class. How much of that activity can be handled by the tool, so that the developer can do the last 20%? It takes as long as 80%, but it's the part that has value and it's the part that's harder. When we think about developer toolkits and frameworks, that's almost always how I think about it.
Coty: That is a great characterization because that is what we aim to do. Simplifying things to avoid having to constantly gather a bunch of stuff to solve these problems that are relatively well known, and that we can tackle for you. But we can still give you connections to more complicated tools that you may need to introduce into the environment. We aren’t going to solve all of your data synthesis problems, but we can get you connected to that data. I want to introduce our customers to a place where they can go and get that data.
I've spent most of my life as a software developer, developing product features. What’s often lost on people is the different perspectives that specialties bring to the world. As a software developer, I'm trying to find a positive path through the system. I'm trying to create things that work. Conversely, my QA team is often thinking about what the edge cases are, or what things are going to break. This is a blind spot I often see for developers. QA and security teams consider real-world factors and introduce an interesting edge case or an element of a real-world data set that you haven't seen before because you thought it was all constrained.
Marc: In the conversations that I get pulled into around the API integration, and leveraging easy-to-use YAML configurations so there's no complex coding required, it sounds like there's an amazing scope for the joint offering between the two solutions to provide unlimited volumes of high-quality test data in minutes. I know you both have a history in ML. Unlike some of the legacy data, and virtualization technologies, where you have to make full copies of the data set, having the ability to either do a full copy or create a subset comes into play. Then having 100% complete coverage of the test data for all functional and nonfunctional requirements becomes more valuable and then having it completely automated through the platform with your technology as well.
Don: You brought up an interesting point. I own no small part of the liability in this, being an early big data person, where we insisted that all data has value. Fast forward 12 to 13 years and we've been doing this at scale across a plethora of organizations. Two years ago I was a CISO. Most data is a liability, either in terms of the cost of managing it, or as it sits there because no one uses it properly. In almost all cases, data acts as a liability, unless people mature their organizations and can harness the value. It means that you need to be working towards a mature strategy around classification and deprecation, especially if you are in a heavily regulated space.
I see a lot of people do classification, but I rarely see people doing deprecation properly, where they're aging data out and getting rid of it. Even if you think you have a use case that a bit of data from 2003 is going to present a ton of value around, are you ever going to get to implementing that use case in this decade? People need to think about that because they only have limited resources, and there are only a limited number of hours in the day for your application developers to work. What data should you be focused on? Where do you think you can find value and make sure that the data that you keep is applied with rigor?
Another point is being able to codify both your QA testing, as well as the test data you need to generate in YAML pipelines, and then have immutable audit logs on the other end of that. Either in one of these two platforms or the cloud providers themselves. It's hard to overstate how valuable that is to CISOs, auditors, and the risk folks of the world. Putting all that together, post hoc is an incredibly expensive and unfun activity as you might imagine. People need to think about the value of their data and, think about whether they need to keep this and cram yet another piece of data into data lakes.
Coty: I’ve worked with my CISO, and he shares those perspectives on the long-term value of data. From my own experience, the idea that someone's data from 2003 will be valuable is often flawed, because systems evolve. It probably doesn’t follow the same rules as data from 2022. The other thing that I have kept coming back to is the ability to take these tools in terms of synthesizing data, and using them to enable development organizations to be more effective and deliver quickly while staying within the constraints that you need to set as a business, particularly if it’s highly regulated. Don mentioned earlier Marc Andreessen’s comment on software becoming a core component of every industry.
We see our software being used by companies that we wouldn't have historically thought of as potential customers. We're trying to do more and more every day with software. Anything you can do to do that more intelligently is critical. Having intelligent data generation is important. On our side, we’re trying to help to determine what tests should be run given the level of risk, and the changes that are happening in your system. Part of that is R&D for us, but part of that will be based on the data that is being processed. Keeping an eye on what can streamline your delivery and all the tools that you can utilize there are going to be what makes your organization separate itself from the pack.
Don: We ran into a startup bank based in South America. They recognize the problem of older data tending to be a liability as it decays over time. Think about data types in the way that we think about partitions, especially for time series data. You'll break it down by year, month, and day. Think of it from a different perspective. Rather than storing all the data for a year or a week, we store a generator. From that data, you get rid of all the raw data, and you’re not carrying that in the enterprise. But I could generate a synthetic data set from 2003 and carry that forward and know that it still maintains all the statistical properties of the original without having to bear the costs of maintaining it.
This is 100% a net new capability that we've never thought of in the past. It's something I'm still trying to wrap my mind around, but giving people the ability to retire data, especially operational data, as opposed to the business data, is interesting because you don't necessarily need the specifics, but you want to capture the characteristics and be able to generate it going forward.
The costs in terms of compliance would be nothing, and storage cost goes down. The question is what level of trust does someone that comes in 10 years from now have if they’ve never seen the raw data, but have to use a generator that was created 10 years ago by someone who hasn't worked here in 7 years? This is more R&D that we need to do, but it's great to have these types of problems come to us, to help us be at the forefront of this at the enterprise level.
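The "store a generator instead of the raw data" idea can be illustrated with a deliberately simple stand-in: fit a few summary parameters to a numeric column, discard the raw values, and regenerate a synthetic column later. A real generator such as the GAN Don refers to would capture far more structure. The Gaussian assumption, the `fit_generator` and `generate` helpers, and the stand-in 2003 data are all assumptions made for this sketch.

```python
import random
import statistics

def fit_generator(values):
    """Summarize a numeric column by its mean and sample standard deviation."""
    return {"mean": statistics.fmean(values), "stdev": statistics.stdev(values)}

def generate(params, count, seed=0):
    """Regenerate a synthetic column from the stored parameters alone."""
    rng = random.Random(seed)
    return [rng.gauss(params["mean"], params["stdev"]) for _ in range(count)]

# Stand-in for a raw partition from 2003 (invented values).
raw_rng = random.Random(2003)
raw_2003 = [raw_rng.gauss(50.0, 10.0) for _ in range(5000)]

params = fit_generator(raw_2003)  # this small dict is all that gets retained
del raw_2003                      # the raw data is retired

synthetic_2003 = generate(params, 5000)
```

The regenerated column matches the mean and spread of the original without the original being stored. It also makes the trust question concrete: anyone using `synthetic_2003` years later is relying entirely on how well `fit_generator` captured the properties that matter, which in this sketch is only two numbers.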