IN COLLABORATION WITH THE ADGM DIGITAL LAB

Enabling Innovation With the Synthesized Financial Data Ecosystem

EXECUTIVE SUMMARY
In collaboration with the ADGM Digital Lab, Synthesized has provided the community with a financial ecosystem data product: a unique source of comprehensive financial data that data science and ML engineering teams can use to validate malicious-actor detection solutions and other use cases. This lets teams focus their time and attention on the more strategic parts of their projects rather than on obtaining or curating data.

As the product improves, more use cases can be incorporated and the ecosystem will deliver even greater value.

Introduction

When working with data and software, fast iteration and quick feedback are essential for staying effective. Initiatives that don’t conform to this can be expensive and struggle to improve. Building on this understanding, the DataOps mentality prioritises fast, iterative testing processes by delivering high-quality data products on demand for analytics. Enabling teams to react dynamically to changing requirements, to experiment, and to reproduce results is considered paramount to success.

Yet even within this framing, data collection remains a major bottleneck for many companies because of privacy concerns, high collection costs and slow approval procedures. Large financial institutions and FinTech startups alike are stymied by these issues, as data is usually expensive to collect and impossible to share.

The Synthesized Financial Data Ecosystem provides teams immediate access to a sophisticated, intelligent collection of fraud data products, which in turn allows ADGM Digital Lab participants to rapidly test different data hypotheses and validate solutions in a matter of minutes.

A quick intro to data as a product

At Synthesized we’re excited by the concept of “Data as a Product”, with the belief that the data team in each organisation plays a key role in facilitating good decision making and application-building through the data they provide. Data products are owned by specific data owners, have implicit Service Level Agreements, and are visible and understandable to other teams. Their owners are confident in their high quality and in their utility for improving wider initiatives.

At Synthesized we support a decentralized approach to data access and allow data teams to quickly convert their data into a safe and easily shareable data product.

The Synthesized platform allows for:

Driving innovation in the ADGM marketplace

Synthesized partnered with the ADGM Digital Lab initiative, which provides a platform for financial institutions and FinTech innovators to come together and experiment on solutions in a fully digital environment. It also provides a Digital Marketplace for open collaboration between innovators, institutions and regulators in the financial ecosystem, facilitating the testing and adoption of innovative digital financial products and services that can benefit the industry in the region.

Due to the cost of collection and sharing, publicly available, comprehensive financial data is difficult to find, yet it would be a crucial shared resource for the Digital Lab’s community. ADGM looked for a solution that would enable collaboration on the platform, particularly for the use case of fraud detection (money laundering and terrorism financing).

While numerous resources document the most common patterns in such illegal activities, not much data is publicly available for use.

In addition, available data usually represents only a small subsection of the financial activities taking place, so creating solutions with a thorough understanding of the problem is next to impossible. The lack of sample data, and the low signal in the data that does exist, present immense challenges when trying to build innovative fraud solutions.

To solve this specific challenge, our team has built a shared, comprehensive synthetic database, which can be accessed here. The Synthesized team also has the capability to provide additional databases for different use cases, as required by the ADGM Digital Lab community.

“Access to high-quality synthetic data is critical for the effective utilization of a testing environment such as the ADGM Digital Lab and we are excited to see how Synthesized’s database, which replicates the patterns and transactions within an entire financial ecosystem, will be used by members of our Digital Lab Community.”

— WAI LUM KWOK, SENIOR EXECUTIVE DIRECTOR, ADGM FINANCIAL SERVICES REGULATORY AUTHORITY.

Our approach

Firstly, to give a better understanding of what the financial ecosystem looks like, here is the basic structure of the database:

  • A query can generate data from the whole database or specify a subset, for example: “Show me a selection of business accounts with their loans” or “Show me a time-series of individual customers making transactions and their risk profiles”.
  • Once the query is submitted, the Synthesized data is immediately available to the user. The generation process is backed by the data we have in-house and informed by common patterns in customer and business behaviour. Some plots of this data are extracted below, for reference:
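
Separately from the plots, the first example query above ("business accounts with their loans") can be pictured as a join between two linked tables. The sketch below is only an illustration: the table names, column names and sample rows are assumptions made for this example and do not describe the actual schema or query interface of the Synthesized database.

```python
import sqlite3

# ASSUMPTION: hypothetical schema and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (account_id TEXT PRIMARY KEY, account_type TEXT);
    CREATE TABLE loans (loan_id TEXT PRIMARY KEY, account_id TEXT, principal INTEGER);
    INSERT INTO accounts VALUES
        ('A1', 'business'), ('A2', 'individual'), ('A3', 'business');
    INSERT INTO loans VALUES
        ('L1', 'A1', 1000), ('L2', 'A2', 800), ('L3', 'A1', 300);
""")


def business_accounts_with_loans(connection):
    """Return (account_id, loan_id, principal) rows for business accounts."""
    query = """
        SELECT a.account_id, l.loan_id, l.principal
        FROM accounts AS a
        JOIN loans AS l ON l.account_id = a.account_id
        WHERE a.account_type = 'business'
        ORDER BY a.account_id, l.loan_id
    """
    return connection.execute(query).fetchall()


rows = business_accounts_with_loans(conn)
for row in rows:
    print(row)
```

Because this is an inner join, business accounts without any loan (account A3 in the sample rows) do not appear in the result.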

Our vision of a fully customisable

We prioritise making the generation process highly adaptable — if you have specific requirements for the type of data you are interested in, we are open to modifying this process in order to best serve your data needs.

No more inflexible processes to work around: we welcome agile processes that are adaptable to any use case. That is the truly exciting and distinctive element of our financial data ecosystem!

Furthermore, the data generation can be informed by specific datasets, making this a powerful tool at your disposal. Users can now augment their existing datasets with more rows to increase the coverage of their data, and add new columns to their existing datasets, enabling entirely new initiatives!

At Synthesized we enable companies to start new projects quickly and to find out within weeks of starting whether their solution works. Data should be there to enable businesses, not to hinder them.

“Our sophisticated AI technology revolutionises the way organisations work by automatically creating unlimited, precise, intelligent data that is fit-for-purpose and free of the usual bias. At Synthesized, our mission is to help businesses untap their full potential and we are proud to be supporting ADGM in moulding Abu Dhabi’s digital economy in the same way.”

— NICOLAI BALDIN, CO-FOUNDER AND CEO, SYNTHESIZED

Incorporating more data

A unique characteristic of this data product is its ability to augment itself with specific user data. For example, suppose your initial dataset includes loan data, such as this Kaggle dataset, and you’re interested in generating more transaction data.

| Column | Type | Description |
|---|---|---|
| Loan_ID | String | An identification string. |
| loan_status | Category | Whether the loan has been paid off, is being collected, or has been paid off after some collection efforts. |
| Principal | Integer | Principal loan amount at origination. |
| terms | Integer | Payoff schedule in days. |
| effective_date | Date | When the loan originated. |
| due_date | Date | Date on which the loan is due to be fully paid off. |
| paid_off_time | Datetime | When the loan was actually paid off. |
| past_due_days | Integer | How many days the loan has been past due. |
| age | Integer | Customer age. |
| education | Category | Customer level of education. |
| Gender | Category | Customer gender. |

The ecosystem already generates loan data, so information such as the interest rate on each loan is added as distinct columns. Customer information is stored in a separate but linked table, which allows the datasets to be updated without issue.

Our technology generates new data similar to the loan dataset and can combine the earlier user requirements around terrorist financing activities with the new loan data. Users can now share and collaborate on the data without worrying about revealing information from the original dataset, thanks to Synthesized’s privacy features.

Although the original table contained no information about interest rates, the ecosystem already generates these and can apply them to give users additional useful information they wouldn’t originally have had access to!

This generation also fits into the rest of the ecosystem: it affects how entities transact with one another and preserves the malicious-behaviour use case. This way, teams can augment the ecosystem with a particular dataset they are interested in, and the ecosystem will adapt to fill in the blanks!

| loan_status | principal | terms | effective_date | due_date | ... | interest_rate |
|---|---|---|---|---|---|---|
| COLLECTION | 800 | 15 | 12/09/2016 | 26/09/2016 | ... | 2.29 |
| COLLECTION_PAIDOFF | 1000 | 30 | 09/09/2016 | 10/11/2016 | ... | 2.43 |
| PAIDOFF | 1000 | 30 | 13/09/2016 | 13/09/2016 | ... | 2.60 |
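
The idea of adding a generated column to an existing table can be sketched in a few lines. This is a rough illustration, not Synthesized's generation process: the helper `add_interest_rate` is hypothetical, and a seeded uniform random draw stands in for the real, model-driven generation. The 2.0 to 3.0 range is an assumption chosen only to resemble the example values above.

```python
import random

# Rows copied from the example table above; only a few columns are kept.
loans = [
    {"loan_status": "COLLECTION", "principal": 800, "terms": 15},
    {"loan_status": "COLLECTION_PAIDOFF", "principal": 1000, "terms": 30},
    {"loan_status": "PAIDOFF", "principal": 1000, "terms": 30},
]


def add_interest_rate(rows, seed=0, low=2.0, high=3.0):
    """Return new rows with a placeholder interest_rate column appended.

    ASSUMPTION: a uniform random draw stands in for the real generation;
    the low/high bounds are illustrative only.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [
        {**row, "interest_rate": round(rng.uniform(low, high), 2)}
        for row in rows
    ]


augmented = add_interest_rate(loans)
for row in augmented:
    print(row)
```

The function builds new dictionaries rather than modifying the input, so the original rows stay untouched and the augmented table can be shared separately.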

Delivering quality data

At Synthesized, we have full confidence in the quality of the data we provide and in its ability to replicate the patterns, thanks to our data evaluation process. This process uses a wide range of statistical techniques and Machine Learning models to validate that the Synthesized data can be used as a drop-in replacement for the real thing.
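
As one example of the kind of statistical check such an evaluation might include, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the empirical distributions of a real column and a synthetic column. It is offered as an assumption about what a reasonable check looks like, not as a description of Synthesized's evaluation process, and the sample values are made up.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the two empirical
    CDFs: 0.0 means identical distributions, 1.0 means no overlap.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # The gap between step functions can only change at observed values.
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


# Made-up principal amounts, for illustration only.
real_principal = [800, 1000, 1000, 300, 1000]
synthetic_principal = [800, 1000, 900, 300, 1000]
print(ks_statistic(real_principal, synthetic_principal))
```

A value close to 0 suggests the synthetic column follows the real distribution closely. In practice a check like this would be run per column and complemented by tests of relationships between columns.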

We incorporate user requirements into the generation process, such as documented common user behaviours. However, we firmly believe that the best test of the data is its actual usage by fellow data scientists in the ecosystem.

Further notes

While we are confident in the stability of the existing generation process for a comprehensive financial ecosystem, some unique case studies need more specific data that accurately captures the nuances of each situation. We are eager to support a wider scope of use cases and to make it much easier to bring innovative ideas to life: a world where data is easily accessible, and building it requires only an intuitive configuration with the ecosystem.

As the ecosystem grows, we are positive the value it provides can be further augmented.

Conclusions

This is only the starting point. We seek to iterate and incorporate more data points into the generation process, so that we can support more use cases and make the process even more adaptable. Finally, we want to provide a seamless integration procedure for fine-tuning the generation to any specific dataset. This could stop initiatives from being hindered at the first step and empower them to embrace the agile DataOps mentality!