Organizational and Third-party Data Sharing

Introduction to data sharing

Organizational and third-party data sharing have become even more important in recent years, especially with the exponential growth of data generated and stored within organizations. According to Statista, between 2020 and 2025, global data creation is projected to grow to more than 180 zettabytes. As organizations gather large amounts of data, the value of that data grows, and sharing it can yield many benefits. However, sharing data also carries risks, such as potential privacy and security breaches. In this post, we discuss the importance of data sharing (both within the organization and with third parties), the traditional techniques used to share data, and the drawbacks of each. We also discuss the potential of synthetic data as a key enabler for fast, safe, and secure data sharing in the future.

Why is data sharing an important topic?

Data sharing is a pivotal topic because it enables businesses to gain new insights and expand their knowledge in ways that may not be possible using their own data alone. In financial services, data sharing can result in more effective risk management, fraud detection, and decision-making to improve customer experience. For example, credit card companies can share their transaction data with fraud teams across existing silos to understand consumer spending patterns and identify potential fraud. Similarly, insurance companies can share their claims data to assess risk more effectively and optimize their policy pricing.

Sharing data with third-party vendors can provide opportunities for companies to gain access to new markets, identify growing trends and opportunities, and develop new products and services. Nevertheless, organizations must ensure that they are sharing data in a compliant and ethical manner, to protect the security of their customer data.

Exploring third-party data sharing

Third-party data sharing plays a crucial role in modern business strategies by offering companies access to new markets, enhanced capabilities for product development, and opportunities for innovation through collaboration. Unlike internal organizational data sharing, which primarily circulates data within an organization, third-party data sharing involves exchanging information with external entities, ranging from vendors and partners to independent researchers and other businesses.

One of the most significant benefits of third-party data sharing is the potential to tap into external expertise and technologies that can transform raw data into actionable insights, thereby driving business growth and innovation. For instance, a retail company might share customer shopping data with a marketing firm to create targeted advertising campaigns, or a healthcare provider might collaborate with a technology firm to analyze patient data and improve treatment plans.

However, third-party data sharing comes with its own set of challenges and complexities, primarily concerning security and compliance. Organizations must navigate a landscape filled with potential data breaches, privacy concerns, and stringent regulatory requirements such as GDPR in Europe or HIPAA in the United States. It is crucial to establish robust security measures like encryption, access controls, and regular audits to protect sensitive information from unauthorized access or leaks.

The methods employed for third-party data sharing are varied and must be chosen based on the specific needs and context of the data exchange. Common methods include:

APIs (Application Programming Interfaces): APIs allow for seamless and real-time data exchanges between systems, making them ideal for scenarios where data needs to be frequently updated or accessed on-demand. They are particularly useful in financial services for sharing transaction data or credit scores.
Cloud-based platforms: These platforms offer scalable and flexible data storage solutions that can be accessed from anywhere, facilitating collaboration between geographically dispersed teams. Cloud providers typically offer robust security measures, but organizations must remain vigilant about where their data is stored to avoid legal and compliance issues.
Direct data partnerships: These involve formal agreements with third parties to share specific data sets under clearly defined terms. Such partnerships must be governed by detailed contracts that specify data usage rights, privacy protections, and compliance obligations.

To ensure legality and trust in third-party data sharing, it is imperative to draft comprehensive data-sharing agreements. These contracts should clearly outline the scope of data use, the responsibilities of each party, and the measures in place to protect data privacy and integrity. They also need to address compliance with relevant laws and regulations, which can vary significantly across different regions and industries.

By carefully managing these aspects, companies can leverage third-party data sharing to not only enhance their operational capabilities but also maintain compliance and uphold high standards of data privacy and security. This strategic approach allows for the expansion of business horizons while safeguarding the company's and customers' critical information.

Traditional data-sharing techniques (and common pitfalls)

The key techniques for data sharing can be broken down into three main categories:

How data is shared

Sharing database keys: Data can be shared by sharing database keys, which are effectively passwords for a database. Sharing keys with third-party recipients lets them access the corresponding database for analysis and cooperation, but it carries the risk that recipients copy or transfer the data elsewhere.

Permissions and access must therefore be managed carefully, to ensure that highly sensitive data is not leaked to anyone who is not authorized to access it.

Emailing CSV files: An organization’s data can be shared by emailing Comma-Separated Values (CSV) files as an attachment. This method is common when sharing data within an organization due to its ease of use. The data can be exported from a data warehouse or database into CSV format, and then attached to an email that is sent to the recipient.

Nevertheless, this method carries many risks, including data entry and formatting errors as well as security and privacy risks. A CSV export loses valuable information such as the schema, and it is not version controlled. There are often no checks to see whether the data has been corrupted. Most importantly from a security standpoint, once a CSV file has been emailed, it is incredibly easy for that file to be forwarded without the original organization’s knowledge or consent.
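One way to add the missing corruption check is to send a checksum alongside the file. The sketch below is only an illustration using Python's standard library; the function names are hypothetical and not part of any tool mentioned in this post. It detects accidental corruption, but it does nothing to stop a file from being forwarded.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unmodified(path: Path, expected_digest: str) -> bool:
    """Recipient-side check: does the file still match the sender's digest?"""
    return sha256_of_file(path) == expected_digest
```

The sender would compute the digest before attaching the file and share it through a separate channel; the recipient then calls is_unmodified on the downloaded copy.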

Cloud-based storage: Cloud computing has evolved considerably since the concept first emerged in the 1950s and 1960s. Cloud-based storage allows organizations to share data by providing a centralized repository that authorized users can access from anywhere. This enables data and engineering teams, as well as other departments, to access data without complex network configurations. The cloud is also very scalable, as organizations can increase storage capacity to fit their needs. Cloud providers have strong security measures, such as access controls and encryption, to prevent unauthorized access.

However, there are a few drawbacks. Firstly, many organizations still have not moved to the cloud, or are in the middle of large and expensive cloud migration projects. Secondly, not all organizations are comfortable with putting their data in the cloud yet. Finally, when engaging in third-party data sharing via cloud-based storage, great care must be taken to understand exactly which jurisdiction the cloud infrastructure sits in. If a user accidentally saves data to a cloud storage system in another geographic region or jurisdiction, this can count as a movement of data outside of a jurisdiction and can be deemed a data breach.

APIs: The use of Application Programming Interfaces (APIs) allows organizations to engage in third-party data sharing. APIs enable systems to communicate with each other, which can make data sharing more secure and efficient than the other techniques. In financial services, APIs can enable the sharing of information such as transaction history or credit scores with fintechs or other financial institutions. This can help recipients of shared data to innovate, but there are concerns about data privacy and security compliance when sharing data through APIs. If API settings are configured incorrectly, data could be leaked during transit.
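As one small example of guarding against a misconfigured endpoint, a client can refuse to send data to any URL that is not served over HTTPS. This is a minimal sketch using Python's standard library; the function name and the example URLs are hypothetical, and a real integration would also need authentication and certificate validation.

```python
from urllib.parse import urlparse


def require_https(url: str) -> str:
    """Return the URL unchanged if it uses HTTPS, otherwise raise ValueError."""
    parsed = urlparse(url)
    # Reject plain HTTP and malformed URLs that have no host component.
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"Refusing to send data to a non-HTTPS endpoint: {url!r}")
    return url


# Hypothetical endpoint, for illustration only.
endpoint = require_https("https://api.example.com/transactions")
```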

Techniques used for security improvement

Pseudonymization: This is a technique used to enhance the privacy of sensitive data so it can be shared with third parties without revealing the identity of the data subject. Pseudonymization involves substituting personally identifiable information (PII), such as names and addresses, with a token or pseudonym. Once the identifiers have been replaced, data can be shared across organizations or with third parties with a reduced risk of data privacy breaches.

However, pseudonymization is susceptible to linkage attacks, in which the attributes left untouched are matched against other datasets to re-identify individuals.
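A minimal sketch of pseudonymization, assuming a keyed hash (HMAC-SHA256) is an acceptable way to derive tokens. The key, field names, and function names below are hypothetical and chosen only for illustration; they are not taken from any specific product.

```python
import hashlib
import hmac

# Hypothetical secret key. In practice it would be stored securely and
# never shared with the recipient of the data.
SECRET_KEY = b"replace-with-a-secret-key"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a value with a stable 16-character token derived from a keyed hash."""
    token = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return token[:16]


def pseudonymize_record(record: dict, pii_fields: set) -> dict:
    """Return a copy of the record with the listed PII fields replaced by tokens."""
    return {
        field: pseudonymize(str(value)) if field in pii_fields else value
        for field, value in record.items()
    }


# Hypothetical customer record, for illustration only.
customer = {"name": "Jane Doe", "postcode": "AB1 2CD", "birth_year": 1985}
shared = pseudonymize_record(customer, pii_fields={"name"})
```

Note that postcode and birth_year pass through unchanged. Fields like these are exactly what a linkage attack relies on when matching pseudonymized rows against another dataset.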

Information redaction: Information redaction techniques are applied to mask or remove sensitive information in a dataset. Redaction techniques include masking all but the last four digits of a credit card number, or removing sensitive columns altogether. This allows organizations to share data cross-divisionally or with third parties without disclosing sensitive information. Redaction can be done manually, but it can also be automated.

While redaction enables data sharing by masking sensitive data in compliance with regulations, it is not always effective, because the data that remains can still be susceptible to linkage attacks.
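The two redaction techniques mentioned above can be sketched in a few lines of Python. The function names and sample values are hypothetical and for illustration only.

```python
def mask_card_number(card_number: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask every digit of a card number except the last `visible` digits."""
    # Ignore spaces and dashes so that only digits are counted and kept.
    digits = [char for char in card_number if char.isdigit()]
    if len(digits) <= visible:
        # Too short to reveal anything safely, so mask everything.
        return mask_char * len(digits)
    return mask_char * (len(digits) - visible) + "".join(digits[-visible:])


def drop_columns(rows: list, sensitive: set) -> list:
    """Return copies of the rows with the sensitive columns removed entirely."""
    return [
        {key: value for key, value in row.items() if key not in sensitive}
        for row in rows
    ]


# Hypothetical test card number, for illustration only.
masked = mask_card_number("4111 1111 1111 1111")
```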

Enabling third-party data sharing

Data sharing agreements: These are agreements or contracts defining the terms and conditions that data can be shared under. The contracts define the usage rights of the data, restrictions, and responsibilities of the parties involved. Usually, the agreements address data privacy, compliance, and intellectual property rights transparently, and work in tandem with privacy and security enhancement techniques.

However, the comprehensiveness and enforcement of data-sharing agreements often depend on a company’s size, industry regulations, and geography.

In addition, legal regulations and data movement laws apply strictly to data sharing. Some datasets are not allowed to move across geographies (e.g. healthcare datasets often can’t leave the country they were generated in), and some of the existing techniques could breach these regulations if they are not applied carefully.

Synthetic data for data sharing

Synthetic data has emerged as a valuable tool to enable data sharing between silos. Tools such as the Synthesized SDK are safer than pseudonymization and information redaction techniques, while providing the same quality of data. Synthetic data produced by the SDK has no direct 1-to-1 mapping with the original data points and provides enhanced data protection features such as tunable differential privacy. This allows organizations to limit how much information can be learned from any single row in their datasets by putting strict mathematical constraints on the models during training.
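To give a feel for what a tunable privacy guarantee means, here is a minimal sketch of a much simpler, classic instance of differential privacy: the Laplace mechanism applied to a counting query. This is not how the SDK trains its models, which is not described in this post; the function names are hypothetical, and epsilon is the standard privacy parameter (smaller values mean more noise and stronger privacy).

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    # expovariate takes a rate, so rate = 1 / scale gives a mean of `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(rows, predicate, epsilon: float, rng=None) -> float:
    """Return a noisy count of the rows matching `predicate`."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for row in rows if predicate(row))
    # A counting query has sensitivity 1: adding or removing a single row
    # changes the true count by at most 1, so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

Lowering epsilon increases the noise scale, so any single row has less influence on what a recipient can infer from the published count.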

Synthesized’s Governor provides a platform for role-based access control over raw and synthetic data, making internal sharing of raw and synthetic data products easier and more transparent.

Synthetic data for cross-divisional data sharing (Internal sharing)

Synthetic data is a valuable solution to enable cross-divisional data sharing within financial services. For example, a large bank with multiple business lines, such as commercial, investment, and retail banking, is often siloed into its divisions with little data sharing between them.

There are several reasons why it is difficult to share data between divisions within a financial services organization. Primarily, there can be concerns about data privacy and security risks, particularly when it comes to sharing sensitive customer information. This ties into the rapidly changing landscape of regulatory and legal constraints, which restrict the sharing of particular types of data between divisions. In addition, different divisions may use different technologies or data systems, which makes it difficult to integrate data from various sources.

With synthetic data, a division can generate a synthetic dataset that is representative of an original dataset. The synthetic dataset can then be safely shared between divisions without risking the privacy of sensitive customer information. For example, the retail banking division of a large financial institution can generate synthetic data that is representative of customer behaviors, and this data can be shared with the institution's investment banking division to help inform investment decisions. This provides a level of data access and data utility that is not achievable with original data.

Synthetic data for data monetization for third-party data sharing

Data possesses tremendous value, and for financial services companies with vast amounts of data at their disposal, there is great potential for that data to be monetized. In recent years, there has been a growing discussion around the use of synthetic data for data monetization. The privacy-preserving features of synthetic data enable faster and safer sharing of information between silos, and with other data controllers such as third parties, without worrying about the restrictions that apply to sharing data containing personally identifiable information (PII).

Increased efficiency from the speed and improved safety of sharing synthetic data could allow for the creation of new revenue streams through the sale of synthetic versions of banking and insurance data to third parties. As the data monetization use case has become a more popular topic in the context of synthetic data, financial services companies have begun exploring its potential going forward.

Using synthetic data for data monetization could provide several business benefits such as:

Increased revenue: Synthetic datasets can stand in for real and valuable datasets, as they maintain the statistical properties of the original data. The synthetic datasets can subsequently be repackaged into assets that a financial services company can sell to a third party through third-party data sharing agreements.

Decreased risk compared to selling raw data: Using synthetic data mitigates the risk of sensitive data leaks, which helps companies comply with regulatory requirements on sensitive customer data.

However, there are challenges associated with using synthetic data for data monetization, including:

Data quality: Synthetic data must be representative of the original datasets (to provide similar value and insights), but should not risk the security of sensitive customer information. The quality of some synthetic datasets may be reduced as certain privacy-preserving features are activated and strengthened.

Regulatory standards: Regulations on the PII classification of synthetic data for third-party data sharing are not yet concrete, meaning it is not yet clear when sharing synthetic data is compliant. Equally, as regulation continues to evolve in financial services, the future of data monetization in the context of regulatory standards is not yet set in stone.

While synthetic data has potential benefits for data monetization, financial services firms must carefully consider the potential challenges and limitations before deciding to use it for this purpose.

Sharing synthetic data for academic research papers (Third party sharing/research)

Synthetic data’s use in academic research is increasing in importance due to its potential to overcome the problems associated with traditional data access and sharing. The use of synthetic data generation techniques enables researchers to generate data that is statistically representative and has the same correlations as the original data, without jeopardizing privacy or confidentiality. As a result, researchers have the opportunity to explore complex research questions that may have been difficult to address using original data alone.

For instance, in healthcare, synthetic training data has been proposed as a way to train machine learning models that detect early-stage lung cancer from CT scans, without compromising patient privacy and confidentiality when hospitals or medical researchers share the data.

Moreover, synthetic data can be used to address the issue of reproducibility in academic research, by allowing researchers to share data that can be used by other researchers without compromising the original data source. Overall, the use of synthetic data in academic research and third-party data sharing provides a promising avenue for addressing the challenges of data access and sharing while advancing research in many fields.

Conclusion

To conclude, data sharing is pivotal for organizations that want to harness the potential of data to enhance decision-making, revenue, and innovation. However, data sharing carries significant risks, such as data privacy and security breaches, which can result in legal issues and significant costs. Organizations must therefore consider carefully how they share data. Synthetic data is a promising enabler for secure data sharing, as it provides organizations with an opportunity to stay within the parameters of regulations. As synthetic data generation continues to evolve, organizations are likely to consider its use to enable fast, secure, and compliant data sharing.

FAQs

What are the specific risks associated with third-party data sharing, beyond general data breaches?

Beyond the obvious risk of unauthorized access leading to data breaches, third-party data sharing introduces unique concerns. These include misuse of data for purposes not agreed upon in the contract, accidental data alteration by the third party's systems, and legal complications if the third party operates under different data protection regulations than your organization. It's crucial to meticulously vet potential partners and have robust agreements in place.

Can synthetic data truly replace the need for third-party data sharing in all cases?

While synthetic data offers a promising solution for many third-party data sharing scenarios, it may not be a perfect substitute in every case. Highly sensitive data or scenarios requiring real-time updates might still necessitate some level of direct sharing. However, synthetic data significantly reduces risk and can be the preferred method for many use cases, particularly when combined with other privacy-enhancing techniques.

How does the rise of third-party data sharing impact data monetization strategies?

Third-party data sharing is revolutionizing data monetization. It opens up new avenues for generating revenue by securely sharing data with external parties interested in insights or analytics. This could involve selling anonymized datasets, offering API access to specific data streams, or partnering with other organizations for joint data-driven projects. Synthetic data further amplifies this potential, allowing for secure monetization without compromising sensitive information.

How can organizations evaluate the trustworthiness of potential third-party data sharing partners?

Thorough due diligence is crucial when selecting third-party data sharing partners. This includes assessing their data security practices, track record with data handling, compliance with relevant regulations, and the robustness of their data sharing agreements. Third-party certifications (like ISO 27001) can provide an added layer of assurance. Additionally, organizations should continuously monitor their partners' performance and have clear exit strategies in case the relationship doesn't meet expectations.