Business
April 11, 2023

Ethics, privacy, and regulations in data

Author:
Synthesized

Introduction

Data protection has become a critical component of business operations across industries. Rising threats to the privacy and security of personal data have driven the evolution of data privacy laws and regulations around the globe. At the same time, there is a growing focus on the ethical consequences of data usage and on the impact of systemic data bias on the utility and quality of the data organizations use. Synthetic data is one way these challenges have started to be addressed. In this article, we explore the importance of data privacy and protection through regulation, the ethical use of data, and the elimination of data bias.

The importance of data privacy & protection through regulation

Data protection regulations have become increasingly important in recent years, driven by the rising threat of cyberattacks and data breaches and by concerns about how organizations use personal information. One of the most notable developments in data privacy was the introduction of the EU General Data Protection Regulation (GDPR), which came into effect in May 2018. The law applies to all organizations that process the personal data of people in the EU, no matter where those organizations are based. GDPR is often described as the toughest privacy and security law in the world, given its strict rules on the right to access personal data, the erasure of personal data, and data portability. Laws such as these are vital to ensuring that everyone’s data is used responsibly and fairly. If personal or sensitive data were to fall into the wrong hands, it could harm both the individuals the data belongs to and the organizations that collect it.

Different laws and regulations apply depending on the jurisdiction and even the industry. For example, Japan has the Act on the Protection of Personal Information (APPI), and California has the California Consumer Privacy Act (CCPA). The CCPA, which took effect in January 2020, gives Californians the right to know what personal data organizations collect about them.

Below is a comparison between CCPA & GDPR:

| | CCPA | GDPR |
| --- | --- | --- |
| Date established | 1st January 2020 | 25th May 2018 |
| Type | Statutory & regulatory | Regulatory |
| Personal data | Information relating to an individual, household, or device (excludes public information) | Individual data for commercial purposes (excludes public information) |
| User rights | Right to delete personal info; right to opt out of the sale of personal info; right to know about the access to personal info; right to access personal info | Right to delete personal data; right to restrict personal data processing; rights related to automated data processing |
| Right to opt-out | Yes | Yes |
| Scope | For-profit businesses that hold the personal information of Californian residents and meet at least one of the following criteria: >$25m revenue; >50% of revenue from the sale of personal data; buys/sells/receives data of >50,000 Californian residents | Applicable to businesses that hold the data of EU residents |
| International data transfer | No restrictions | Requires non-EU recipient country to provide adequate protection: companies complying with similar agreements |
| Data security | No particular requirement but must have good security | Requires appropriate security measures according to risk |
| Enforcer | California Attorney General | EU Commission, EDPB, member state data authorities |
| Penalties | Up to $2,500 for each violation, and $7,500 for each intentional violation | Up to €20m or 4% of global annual revenue, whichever is higher, for severe violations |
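
To make the penalty figures in the comparison concrete, here is a minimal Python sketch of the two ceilings. The function names are hypothetical, the figures are taken only from the comparison above, and actual penalties depend on the regulator and the circumstances, so treat this as an illustration of the arithmetic rather than legal guidance.

```python
def gdpr_max_fine_eur(global_annual_revenue_eur: float) -> float:
    """Ceiling for a severe GDPR violation: the higher of EUR 20m or 4% of revenue."""
    four_percent = global_annual_revenue_eur * 4 / 100
    return max(20_000_000.0, four_percent)


def ccpa_max_penalty_usd(unintentional: int, intentional: int = 0) -> int:
    """Ceiling using the per-violation figures from the comparison above."""
    return 2_500 * unintentional + 7_500 * intentional


# A company with EUR 100m revenue hits the EUR 20m floor;
# at EUR 1bn revenue the 4% rule becomes the larger figure.
small_company_cap = gdpr_max_fine_eur(100_000_000)
large_company_cap = gdpr_max_fine_eur(1_000_000_000)
ccpa_cap = ccpa_max_penalty_usd(unintentional=10, intentional=2)
```

Because the GDPR ceiling scales with revenue while the CCPA figures scale with the number of violations, the two regimes can produce very different exposure for the same incident.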

As data privacy and protection continue to develop, there is a growing trend toward increased regulation in the data space. For example, GDPR has already begun to influence regulations in other regions, such as Brazil’s General Data Protection Law (LGPD). Furthermore, recent scrutiny of TikTok’s data privacy and security has reignited conversations about the need for more comprehensive federal data laws in the US. That scrutiny has included a congressional hearing with CEO Shou Zi Chew, a £12.7m fine from the UK Information Commissioner's Office (ICO), and bans on the use of the app on government devices in some countries.

Complying with data protection laws can benefit organizations significantly. Economically, staying within the parameters of the law saves organizations time and costs, and it can positively impact a company’s brand and reputation as a custodian of information. Compliant organizations can also benefit from business process improvements, as compliance prompts them to evaluate how they store and manage customer data.

The role of synthetic data in protecting data through privacy & security compliance

Data has become a critical part of business operations, and with the increased need for data comes an increased need for data privacy and security compliance. Organizations have to ensure that sensitive data is protected from unauthorized access and misuse. In recent years, synthetic data has emerged as a key solution to the problem of privacy compliance when utilizing data.

Traditional anonymization and pseudonymization techniques can be powerful tools for obfuscating and masking sensitive data attributes. However, a one-to-one relationship remains between the original and the anonymized data, so anonymized data is still susceptible to more advanced attacks such as linkage attacks and attribute inference.
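
To illustrate why a one-to-one relationship is a risk, the following Python sketch shows a simple linkage attack. All records, field names, and the `link` helper are hypothetical and invented for illustration: names have been stripped from one dataset, but the remaining quasi-identifiers (ZIP code, birth year, sex) can be joined against a second dataset that still carries names.

```python
# Hypothetical "anonymized" release: names removed, quasi-identifiers kept.
anonymized_records = [
    {"zip": "90210", "birth_year": 1985, "sex": "F", "diagnosis": "asthma"},
    {"zip": "90210", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "10001", "birth_year": 1985, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public dataset containing names and the same quasi-identifiers.
public_records = [
    {"name": "Alice Example", "zip": "90210", "birth_year": 1985, "sex": "F"},
    {"name": "Bob Example", "zip": "90210", "birth_year": 1990, "sex": "M"},
    {"name": "Carol Example", "zip": "10001", "birth_year": 1972, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(anonymized, public, keys=QUASI_IDENTIFIERS):
    """Re-identify anonymized rows whose quasi-identifiers match exactly one public row."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = []
    for row in anonymized:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the individual
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = link(anonymized_records, public_records)
```

In this toy example two of the three "anonymized" rows are linked back to a named person, even though no name was ever published alongside the diagnosis.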

By contrast, the Synthesized SDK utilizes deep generative models, along with mathematical paradigms such as differential privacy, to ensure that there is no one-to-one mapping between a synthetic data point and an original data point. When synthetic data is combined with traditional anonymization techniques, the result is high-quality data that retains the statistical properties and information of the original while complying with an organization's security and privacy standards.
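
For readers unfamiliar with differential privacy, the sketch below shows the textbook Laplace mechanism applied to a counting query. This is not the Synthesized SDK's implementation, which is not described in this article; the function names, the sample ages, and the choice of epsilon are assumptions made purely for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count: a counting query has sensitivity 1, so the noise scale is 1 / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only so the sketch is repeatable
ages = [23, 35, 41, 29, 52, 67, 38, 45]  # hypothetical data
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
```

A smaller epsilon adds more noise and gives stronger privacy at the cost of accuracy. Adding or removing any single person changes the true count by at most one, which is why the noise scale depends only on epsilon.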

Ensuring ethical data usage

Data ethics refers to the guidelines and principles that govern how organizations use, share, and analyze personal and sensitive data. It relates not only to data privacy and security, but also to fairness, transparency, and responsibility in the use of personal data.

Organizations such as financial institutions are responsible for ensuring that the data they collect is used ethically, and not for discriminatory purposes that can negatively affect consumers, shareholders, or other stakeholders. With the increasing use of data in various industries, including financial services and healthcare, data ethics has become a more important topic for data-driven organizations.

However, there are various threats to the ethical use of data by organizations, including:

  • Lack of transparency - Organizations may not be transparent about how they are using sensitive customer data, as they do not want people to know what it is being used for;
  • Data sharing - This relates to the issue of transparency. Organizations can share data with third parties that have data-sharing agreements in place, but often consumers are not privy to what the third parties will use the data for;
  • Algorithmic bias - Algorithms may be trained on biased data, resulting in a biased algorithm that gives biased recommendations, predictions, or outcomes that can discriminate against a subset of the population.
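
One common way to check for the algorithmic bias described above is to compare how often a model gives a favorable outcome to each group. The Python sketch below is a minimal illustration: the group labels, the decisions, and both function names are hypothetical, and a real fairness audit would look at more than one metric.

```python
from collections import defaultdict


def selection_rates(groups, decisions):
    """Share of positive decisions (1 = approved) for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, decision in zip(groups, decisions):
        totals[group] += 1
        positives[group] += decision
    return {g: positives[g] / totals[g] for g in totals}


def disparate_impact_ratio(rates):
    """Lowest selection rate divided by the highest; 1.0 means equal rates."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group received a positive decision
    return min(rates.values()) / highest


# Hypothetical model decisions for two demographic groups.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
decisions = [1, 1, 1, 0, 1, 0, 0, 0]
rates = selection_rates(groups, decisions)
ratio = disparate_impact_ratio(rates)
```

Here group A is approved three times out of four and group B once out of four, so the ratio is well below 1, which is a signal to investigate the training data and the model.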

Organizations need to be aware of these threats and combat them before they arise, to avoid experiencing negative consequences. Synthetic data has emerged as a tool to mitigate these threats. 

Eliminating data bias 

Datasets can be highly imbalanced and contain only a few example data points from specific groups and demographics. There are several reasons for this:

  • Data collection - The way data is collected can result in data imbalance. For example, if data is collected from only one geographic location or demographic group, it may not be representative of the larger population;
  • Sampling - The way that data is sampled can also result in data imbalance. If the sample is too small or not random, it may not accurately represent the population, which can perpetuate imbalances;
  • Data preprocessing - How data is preprocessed can also introduce bias. For example, removing outliers or missing values without considering the reasons for their occurrence can skew the data;
  • Labeling - How data is labeled can also result in data bias. If the labels are subjective or influenced by an individual’s personal biases, it can subsequently impact the accuracy of the model;
  • Data augmentation - This is a technique that could also introduce bias. If the augmentation techniques used are not diverse, it can lead to an over-representation of particular classes;
  • Inherent imbalances - Populations are not homogenous, as they are often made up of varying sizes of subgroups. For example, a country’s population will likely have imbalances in the number of different religious groups, meaning a representative subset of the population will also have these inherent imbalances. Additionally, in financial services, inherent data imbalance can be a result of rare occurrences of particular types of transactions, namely a lack of fraudulent transactions in comparison to legitimate bank transactions.

When a machine learning model is trained on an imbalanced dataset of this kind, its performance in production will likely be suboptimal when predicting outcomes for the underrepresented classes. The use of synthetic data is helping to mitigate this issue of bias within data, artificial intelligence, and machine learning.

Synthesized can rebalance such training datasets, producing synthetic data with the same correlations between features as the original, but with user-defined distributions of classes and subgroups present. The rebalanced synthetic data can then augment, or be used in place of, the original data when training models, de-biasing the model for underrepresented groups and demographics. Rebalanced and unbiased datasets are key to accurate analysis and decision-making within organizations, as they give a better picture of the population being studied. Additionally, unbiased data can help to foster responsible AI use and reduce discriminatory practices by ensuring that the data organizations use does not perpetuate systemic inequalities through bias.
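
The idea of a user-defined class distribution can be sketched with plain random oversampling in Python. This is a deliberately simple stand-in and not how Synthesized generates data: it duplicates existing minority rows rather than producing new synthetic ones, and the `oversample` function, the transaction records, and the target counts are all hypothetical.

```python
import random
from collections import Counter


def oversample(rows, label_key, target_counts, rng):
    """Resample rows with replacement until each class reaches its target count."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    balanced = []
    for label, target in target_counts.items():
        pool = by_class[label]
        balanced.extend(pool[:target])  # keep original rows first
        shortfall = target - len(pool)
        if shortfall > 0:
            balanced.extend(rng.choices(pool, k=shortfall))  # duplicate at random
    return balanced


rng = random.Random(42)  # seeded only so the sketch is repeatable
# Hypothetical transactions: every 20th one is fraudulent (5 of 100).
transactions = [{"amount": 10 * i, "is_fraud": i % 20 == 0} for i in range(1, 101)]
before = Counter(t["is_fraud"] for t in transactions)
rebalanced = oversample(transactions, "is_fraud", {False: 95, True: 95}, rng)
after = Counter(t["is_fraud"] for t in rebalanced)
```

The fraud class grows from 5 to 95 rows to match the legitimate class. Because the added rows are exact copies, a model can overfit to them, which is one reason generating new synthetic rows that preserve feature correlations is attractive.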

Conclusion

The importance of data protection through privacy, compliance, and ethical use cannot be overstated, especially in industries where privacy and trust are vital to organizational reputation. The regulatory frameworks discussed in this article serve as a necessary foundation for protecting sensitive data, but organizations must think beyond basic compliance measures to ensure the responsible and transparent use of customer data. Synthetic data provides the ability to mitigate data bias and address privacy concerns with datasets while enabling organizations to significantly increase data utility to reap business benefits.
