This episode of the Mind the Data Gap podcast features Dr. Shruti Kohli, Head of Data Science and Innovation at the Innovation Lab at the Department for Work & Pensions (DWP) in the UK, where she leads innovation across the department. She previously served as Lead Data Scientist at the DWP Innovation Lab, holds a Ph.D. in Machine Learning, and specializes in data science, design thinking, and cloud computing. She has an interesting perspective on the Center of Excellence at DWP, its key pillars, and how it can be enabled.
Nicolai: We are joined today by Don Brown, who is the Field CTO at Synthesized. Don is joining us from Atlanta, Georgia, where we launched our U.S. operations as a company, and he is leading some of the fieldwork from our office there.
We have enjoyed collaborating with Shruti on different initiatives, and in this podcast we want to discuss several topics. One is the Center of Excellence (COE) and the challenges and bottlenecks associated with it. We also want to discuss data as a fuel to accelerate innovation across projects, and how we think about topics such as data governance, data quality, and data compliance. Both Don and Shruti will share valuable stories. Finally, we will touch on synthetic data, a new way to think about data and information for innovation, and we will learn about their experiences with it.
Shruti: I'm currently leading the DWP Innovation Lab. I've spent almost five years in the department and have moved across a couple of teams. When I joined, it was called Dojo, a tech accelerator focused on trialling and testing new technologies and where they could be used in the department. Looking back, we have evolved. We are now more focused on how we can solve complex problems in the department, how we can support our colleagues to provide better services, and how we, as an experimental space, can help them by doing a lot of technology scanning and experimenting with them. We want to align with our users' needs, working in agile design sessions and seeing what is possible in terms of the prototypes we build. Although we don't build production systems and work in an experimental space building prototypes, we have seen how our projects have helped our colleagues expand their vision of what is possible.
Often our findings provided learning that opened up colleagues' vision of what they want on their future roadmap. Sometimes it involves failing fast: a quick experiment saves a lot of time and effort for colleagues who may be thinking in the same direction. Those are some of our successes, where we work to find the best ways to deliver our services. At the same time, if something is not for us, we make sure we find that out quickly.
Shruti: As of now, the Innovation Lab is aligned with the strategic objectives of the department. We do a lot of external horizon scanning to understand how new technologies could help meet those objectives. Data sharing is one example: it is important for providing improved services to citizens because it enables data-based decisions. But that is hard if we are not able to share the data, and in a public sector department it is very tricky to share sensitive data.
What can be done in that space? We did a couple of projects to experiment with a variety of Privacy-Enhancing Technologies (PETs) and see which ones could work for the department. Differential privacy is one example: we explored it but didn't find much there, which led us to a big experiment with synthetic data instead. We had also been looking at how we could give citizens control of their data, and whether there are technologies available in the market for that. That led us to experiment with data ports with our colleagues, to find out how data ports could empower citizens to use their data and how it could be connected with other things.
Nicolai: Innovation needs data. We often see that data needs innovation, and the way we treat data, share data, and provision data needs to be innovated because traditional techniques such as data anonymization have some limitations. They can only work when you have clear access to an original data set. But sometimes you need to create completely new data points.
Shruti: In terms of synthetic data, we started the Synthetic data, Applied analytics, Innovation Lab (SAIL) program, where we wanted to see how good a fit synthetic data is for the department. Over the last year and a half, it has given us confidence that it is a tool that could help the department resolve some of its challenges around data. For example, if we want to test some new technology, we cannot expose our sensitive data to it; synthetic data can help us in that space. The same applies when we want to collaborate externally with universities or invest in crowdsourced innovation: we cannot expose our data. We went ahead and ran a data challenge with The Alan Turing Institute because, for synthetic data, we wanted both academic and industrial validation. This boosted our confidence that it could be done and helped us create more awareness, not only in the department but also in other government departments (OGDs). That is an area where we should build capability, and I even took this idea to the Civil Service Data Challenge, where it was accepted and we worked it through to one level.
The first major success was having confidence that it is going to work. The second was creating the right awareness, so that your colleagues trust that they can experiment. And when we experimented, we found there are still some challenges: around the governance of this data, and around how many iterations are required to get the right fit or right size of synthetic data. It has helped us nail down what questions we need to answer now.
Don: Yes, and I've also been the third pillar: the security person who prevents people from getting access to data. I've worked on the engineering side, the security side, and the field sales side, and they all perceive data differently. I started my career in 1992 working with a test team at IBM. I remember using LoadRunner and tools like it to produce our test data sets in those days. It was a highly manual process driven by a Graphical User Interface (GUI) that my friend Ming and I were in charge of running once a week or once a month, depending on the particular data set. Fast-forward 30 years and we find ourselves in a very different place. The primary value propositions I see for synthetic data and test data now are different for a couple of reasons.
The first reason is that the move from waterfall to agile completely changed the game in terms of people's need for test data, how they use it, and how they leverage it. Secondly, we are almost simultaneously seeing the advent of machine learning models placed inside applications. ML is no longer confined to the data science, statistics, or actuarial functions outside the application, depending on your particular industry.
Now we live in a world where people need test data in real time, and they need it curated. They would also prefer not to talk to people like me; in a former life, I was a CSO at a Ford subsidiary. I think we find ourselves at a neat confluence of a bunch of different areas of tech coming together simultaneously.
Synthesized can now act as an API and drop all of these sorts of requests, whether for machine learning models or application test data, into your CI/CD pipelines, giving you programmatic access to that data rather than having a junior intern, like me back then, generating it for years from a GUI. Why is that important? Beyond the fact that you no longer have a junior person generating test data for advanced ML projects, it matters for velocity and standardization. I can hit whatever velocity and frequency I need for a particular use case that demands synthetic data on the fly. More likely, the data gets produced by a CI/CD pipeline and dropped into some type of model management tool, like H2O, which doesn't know the difference. We're also at a place where people are trying to figure out how to measure developer productivity. Having run several teams of engineers in my career, I've found most of the metrics in that area to be mostly valueless. Lines of code are valueless.
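[Editor's note] The "programmatic test data in a pipeline" idea above can be sketched in a few lines. Everything below is illustrative, not Synthesized's actual API: the schema format and the `generate_test_rows` helper are invented for the sketch. The point is that a CI/CD step calls a generator with a schema and gets rows back, instead of a person exporting them from a GUI.

```python
import random

def generate_test_rows(schema, n, seed=0):
    """Generate n synthetic rows matching a simple column schema.

    schema maps column name -> ("int", lo, hi) or ("choice", [options]).
    This is a toy stand-in for a real synthetic-data API call; real
    generators also preserve the joint statistical properties of the
    source data, which this sketch does not attempt.
    """
    rng = random.Random(seed)  # fixed seed -> reproducible pipeline runs
    rows = []
    for _ in range(n):
        row = {}
        for col, spec in schema.items():
            if spec[0] == "int":
                row[col] = rng.randint(spec[1], spec[2])
            else:  # "choice"
                row[col] = rng.choice(spec[1])
        rows.append(row)
    return rows

# A CI pipeline step might call this instead of waiting on a manual export:
rows = generate_test_rows(
    {"age": ("int", 18, 90), "region": ("choice", ["NE", "SE", "SW"])},
    n=100,
)
```

Because the call is seeded and programmatic, the same pipeline run always produces the same data, which is part of the "standardization" Don describes.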
When I think about developer productivity (and this is consistent across both data science projects and applications), I ask: what is the end-to-end cycle time for you to create an application and take it to production? And not just an MVP in production, but delivered with fully qualified production data pipelines, with all of your instrumentation completed, having gotten past all of the security hurdles. For most enterprise organizations we deal with, it's going to be six months to a year at best. That's anecdotal, but it's what I've been asking about specifically over the last year.
On the data science side, it takes even longer and is more expensive, because those people tend to be more expensive. Somewhat anecdotally, it's somewhere between 12 and 18 months for an enterprise to take a fraud model or a churn model from ideation to production, at a cost somewhere between $1.5 million and $3 million all in. Given that we're focused on delivering capabilities, we can now give people this data in a pipeline they've never had before. We generate this data using AI, which is interesting because you're also feeding the data to AI. We sit at a place we've never been before in terms of our capability to deliver data. Now it's up to the enterprises, and companies like Synthesized, to figure out how the enterprises can consume this.
Going back to the original question about the Center of Excellence: Centers of Excellence are not about delivering specific technologies; they're about helping traditionally risk-averse organizations consume leading-edge technology. When we think about the problems a Center of Excellence is trying to solve, it's all about developer productivity, but more specifically it's about teaching people to fish and getting developers engaged early in the life cycle. Design a program that allows them to cycle through both applications and data science modeling with repeatability. There are certain phases you need to go through at every step, and then you layer on the people as you go. The first iteration is generally taught by a vendor or someone in academia, but each subsequent iteration needs to be led by more and more people internally; that's how you gain competence. It's the only way I've seen it work, especially at the high end of organizations. It's about rigor, process, and helping people understand the capabilities, then teaching them how to spread those capabilities within the organization rather than keeping them within a specific silo.
Nicolai: And Don, you mentioned something important: the innovation Center of Excellence. The reason it's such an interesting concept is that innovation has historically been difficult to measure, for various reasons. We need a framework and processes in place. This innovation framework allows us to combine all of that, equipped with the right inputs and the right variables, so we can achieve the right results and outcomes from innovation.
Shruti: Running an innovation lab, I believe we need a Center of Excellence where we can prove out different technologies. We need to show what good looks like, and most of this depends on a good flow of data, which is not always possible. For example, in the Innovation Lab we don't work on real data; it is an entirely non-production environment. The most important building block, the key stage, is access to synthetic data that is non-sensitive and intelligent, so that we can plug and play with different technologies and tools and, as Don mentioned, create testing environments to test services.
Second, an important block to build is trust and assurance. That will come through the governance process: for this Center of Excellence, even if we are using synthetic data, how is governance going to work, and how could we validate the reports provided by experts so that we can fully trust the system? Third, we need to create awareness, because when a Center of Excellence works, it proves its outputs by showing, not telling. For that, we need awareness of the data we are using and how we are using it, and that it is not simply fabricated data. I keep coming back to data because I feel it is the fuel for testing anything: for prototyping, for testing any technology, for external collaborations, and for internal collaborations. If we can sort these things out, they lead to one objective: gaining confidence within the department that something can help us and is secure.
Shruti: If we can keep in mind the opportunities from data, such as decision-making, and bring our colleagues to a space where data-driven decisions are made with intelligent, secure synthetic data, that would be enough. Even policymakers could make the right decisions, not only within the department but also by collaborating outside. That would give us a powerful data-driven platform for making the right decisions, all in a secure way, because no one is using any sensitive data.
Shruti: The users will be employees and colleagues at all levels, at the working level and at the decision-maker level, even when working with a third party. In terms of success measures, the first is how secure it is, and then how accurate it is. Those are the two definitions of success here. We recently did a process mining project, brilliant work with you, where we asked you to synthesize data similar to some data we had developed.
It took a good few iterations, but it showed that this is achievable. We need the problem statement ready, and we need several iterations to get to the point where accuracy, utility, and security are balanced. That is how I would measure success.
Nicolai: One of the metrics would be the amount of usage. We want to make sure that we maximize the number of users on the platform. We want to minimize time to data, maximize the amount of data being available, and maximize the projects being launched by those users with that data. We want to make sure it's kept secure, compliant, and scalable across the organization and that it's enterprise-ready.
Don: The first thing to take into account is Shruti's point: you need people using the platform, because if you don't have people using the platform, it's not going to be a platform for very long. It's important to realize that within a start-up environment versus an enterprise, results are essential to everything. What I typically do is find a problem that is approachable but also has a lot of impact within the organization, preferably with a lot of visibility as well. It was not done as part of a Center of Excellence effort, and there are lots of papers written about it, but Cloudera made its name back in 2010 because it improved search relevancy for eBay. That was a $100 million problem for them at the time and a $3 million ROI in a few months. More specifically, it gave them visibility into what could be done in the business that moves it in large ways, in terms of revenue or other types of impact.
When I look at use cases, I also look at the project sponsors, who are incredibly important. They need to be supportive and interested, and willing to champion you once you hit success. Find a project sponsor who will ideally be your advocate internally, and then help make both them and their team as successful as possible.
As a software vendor whose software may be implemented internally as a platform, it's now an internal selling proposition, and everyone needs to be on board with the motion being taken. On the back end of that, a COE generally follows a common pattern: training, requirements gathering, project management, architecture, delivery, and lather, rinse, repeat. What you want to focus on is bringing more people into the fold with each successive iteration. The project manager can generally remain constant, because they can usually handle 5 to 10 projects simultaneously, depending on complexity. But the engineers who cycle through are the most important people. One thing I learned early on is that one of the best places to get feedback is from the engineers who are working on the product and aren't voicing a lot of concerns, but also aren't leading the charge on implementation. Those tend to be the people who either don't get it or are having issues with uptake. Focus on that cohort of users rather than talking to the best and most vocal people; they are not the ones you actually need to make successful, and they're almost always successful on their own. Focus on the people in the middle and help them get skilled, and they become the champions to the rest of the organization.
Nicolai: It is also public information that we are currently partnering with Deutsche Bank to help them accelerate digital transformation with synthetic data, and we're looking to help with building this Center of Excellence. Don, you have a lot of expertise here, and there is a lot of similarity across Centers of Excellence, data hubs, and data platforms. One of the key enablers is data. Shruti, you mentioned how hard it is to get access to data from stakeholders at DWP. We spoke about synthetic data and the work we had done together, which was vetted by The Alan Turing Institute previously.
Shruti: When we ran the SAIL program and started exploring synthetic data, we wanted to see how we could do some explainable AI experiments in the lab and whether it would help us develop augmented intelligence tools that could empower our colleagues when they are making decisions. It is about improving decision-making within the department, and about spotting trends so that we can help citizens better or identify fraud and error. If we can create those connected synthetic data beds, we will be able to find better links, and maybe that would help us spot fraud trends better. That is one area that could help. Another is how we could quickly and deeply run good digital tests of different scenarios, using intelligent test data rather than the test data we are using now.
Also, it is about how we could onboard the innovations that are happening. If we have to onboard those capabilities within the department, and if synthetic-data-based capabilities serve the purpose of keeping our data secure while helping us build them, we can onboard them through crowdsourced innovation. These were the three or four themes we continue to speak about. I know there have been some proofs of concept within the department, and other OGDs are doing the same. I hope that we get the confidence to invest and develop that.
Nicolai: Having access to data is a key enabler for innovation. There can also be some low-hanging fruit, quick wins, but the overall strategy needs to be implemented to enable broader use. You need the right data sets across different teams and different units.
Shruti: It is about communication and awareness to bring people on board, and giving them satisfactory answers around governance, security, and risk, because in the public sector that is a major worry. Getting the right answers to the right people who can help us enable and develop this is what matters most: awareness, communication, and giving them confidence that it is secure, reliable, and accurate, and that it will do the job without exposing the original data. That will happen by giving them a lot of proof points, by developing those platforms and demonstrating them.
Nicolai: And that is how we support our partners, working together on adoption and awareness.
Don: To your last question, we're talking about two different constituencies within these organizations, each with an interest in synthetic data for entirely different reasons. There is somewhat of a category creation going on in the synthetic data world right now. Synthetic data has been around for at least 50 years, maybe longer, in the statistics world, where it has a very specific connotation that doesn't necessarily carry over to all of the others.
When you talk to application engineers, what leads is that we can get you compliant data quickly, so you do not have to worry about operating outside of GDPR. That's table stakes on the application testing side, because what you're trying to do as an application developer is get better test coverage: I'm trying to make sure my application performs as well as it can in all possible scenarios, and I'm optimizing for code coverage behind the scenes; this is one technique for doing so. Making sure they understand how synthetic data is going to move the needle for developers is important.
On the AI and ML side, it's even better. The privacy and compliance side is table stakes. One of the primary use cases we have found, especially in large financial institutions, because of the lead time required to get access to data, is synthesizing as fast as possible, quality aside, just to get something in hand that looks similar in structure to what my data scientists will eventually get. Six weeks from now, we'll give them the fully instrumented, fully built-out, privacy-preserving, statistical-property-preserving data set. That gives them agility. Again, I still say that's table stakes, because at the end of it, what we can also do on that side is deliver better business outcomes. Whether that's through rebalancing or standardized imputation doesn't matter; what we have found consistently is an actual improvement in model quality using synthetic data. It's an easy ROI to prove. You don't have to worry about developer productivity metrics. We can say we improved this churn model by 5%, churn for us is a $20 million problem, and this helps us show that we returned $1 million to the business.
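[Editor's note] Rebalancing, which Don names as one route to better model quality, deserves a concrete illustration. The sketch below is a toy, library-free version of the idea; the `rebalance` helper is invented for illustration, and real synthetic-data tools model the joint distribution of the data rather than jittering copies of real rows. The point is simply that when churners are rare, synthesizing extra minority-class rows gives a model a balanced training set.

```python
import random

def rebalance(rows, label_key, minority, seed=0):
    """Oversample the minority class by adding jittered copies of its rows.

    rows: list of dicts with numeric features plus a label column.
    Toy illustration of class rebalancing only; production synthetic-data
    tools fit the full joint distribution instead of perturbing records.
    """
    rng = random.Random(seed)
    minority_rows = [r for r in rows if r[label_key] == minority]
    majority_n = len(rows) - len(minority_rows)
    synthetic = []
    # Add synthetic minority rows until both classes are the same size.
    while len(minority_rows) + len(synthetic) < majority_n:
        base = rng.choice(minority_rows)
        new = {}
        for k, v in base.items():
            if k != label_key and isinstance(v, (int, float)):
                # Small Gaussian jitter so copies are not exact duplicates.
                new[k] = v + rng.gauss(0, 0.01 * (abs(v) or 1.0))
            else:
                new[k] = v
        synthetic.append(new)
    return rows + synthetic

# 9 retained customers and 1 churner become a balanced 9-vs-9 set:
data = [{"spend": 100.0, "churned": 0} for _ in range(9)] + [
    {"spend": 20.0, "churned": 1}
]
balanced = rebalance(data, "churned", minority=1)
```

A model trained on `balanced` sees equal numbers of each class, which is the mechanism behind the model-quality improvements described above, even though real tools achieve it far more faithfully than this jittering sketch.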
So, that's my two cents on how it needs to be introduced to an organization: how you can successfully bring it in and tell people we're going to do things differently. Some of it is table stakes, but it will improve your metrics from both a development standpoint and a business outcome standpoint.
Shruti: My experience dictates that it's always important to take the first step and work in an agile way, and then scale it by understanding it more. I look forward to seeing how we could take that first step within the department, what needs to be done, and how we could leverage this particular technique that can solve a lot of problems if we get it right.