Business
September 2, 2020

Fairness and Algorithmic Biases in Machine Learning and Recommendations to Enterprise

Author:
Synthesized


A key observation from the previous article on common misconceptions about the impact of algorithmic bias and fairness on the industry is that bias is a property of the underlying dataset.

Nicolai Baldin argues:

In assessing the bias in data, particularly in banking and insurance, the following questions are critical:

  1. How was the dataset collected?
  2. What is the quality of this dataset?
  3. Has it been benchmarked against other larger samples of data?

JP Rangaswami comments:

That’s where we are heading. There is a lot to be gained by using AI/ML techniques in finance, in healthcare, in manufacturing; in load management, in routing, in traffic optimisation. But there’s no free lunch here either. Businesses have become inured to having to prove that they have the right to use the data, that they have title to the data, that informed consent has been obtained where required. Businesses are learning that they need to verify that the data is complete, accurate and timely, and data governance is now edging towards maturity. Driven by regulatory pressure, the issue of explainability has become more visible of late, as businesses are mandated to open up “black box” approaches in AI, something easier said than done.

Scrutinizing data quality and the way data was collected

Nicolai Baldin continues:

Why is data quality important? Data quality is naturally linked to the performance of a model or a service when it’s facing real people. Not surprisingly, algorithms work in such a way that they find whatever can help them to discriminate, and if no other information is provided, they will discriminate based on the available information. That is why it is absolutely crucial to collect and analyse as much data as possible to make a clear statement about what causes an algorithm to perform differently across groups.

To check whether the model’s outcome was right, one would have to run an experiment: replace that attribute with an alternative and then check the results in a few years... Unfortunately, that is an expensive experiment to do in reality. Luckily, it is possible to conduct it on historical data using so-called cross-validation techniques.
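As a rough illustration of conducting such a check on historical data, the sketch below compares the cross-validated performance of the same model with and without the attribute under scrutiny. The file name, column names and model choice are hypothetical placeholders, not part of the article.

```python
# Sketch: using cross-validation on historical data to estimate the effect of
# an attribute, instead of waiting years for a live experiment.
# "historical_loans.csv", "defaulted" and "gender" are hypothetical names;
# the target is assumed to be binary (0/1).
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("historical_loans.csv")
target = df["defaulted"]
features_full = pd.get_dummies(df.drop(columns=["defaulted"]))
features_reduced = pd.get_dummies(df.drop(columns=["defaulted", "gender"]))

model = GradientBoostingClassifier(random_state=0)

# Each held-out fold plays the role of "the results in a few years".
auc_full = cross_val_score(model, features_full, target,
                           cv=5, scoring="roc_auc").mean()
auc_reduced = cross_val_score(model, features_reduced, target,
                              cv=5, scoring="roc_auc").mean()

print(f"AUC with the attribute:    {auc_full:.3f}")
print(f"AUC without the attribute: {auc_reduced:.3f}")
# A large gap suggests the attribute (or its proxies) drives the model's predictions.
```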

JP Rangaswami adds:

Explainability of the decision process will, over time, be extended to cover the burden of proof that the datasets used were themselves “fit and proper”.

Recommendations to enterprise & idealistic scenarios

JP Rangaswami continues:

We are heading towards a place where businesses will have to prove that the datasets used to train their AI/ML capabilities are in themselves fit and proper. The datasets will have to be tested for bias, and, when found, the bias will have to be corrected. At a level of abstraction this is similar to the back- and stress-testing of risk models in capital markets.

Idealistic “futuristic” scenario

Nicolai Baldin argues:

It goes without saying that the best decisions would be made by an advanced algorithm which had access to as much data as possible - about every individual on the planet, even - as its training data set. In statistics, a universal algorithm such as this is often referred to as an oracle. A dataset like this would in fact be unbiased as it contains extensive, clear information that has not been polluted by data collection or data preparation techniques. To replicate this ideal ‘oracle’ with a local algorithm in a local setting, it’s important to ensure that the dataset used locally also represents this global knowledge, i.e., it has the same proportions of important attributes, correlations, etc. as the wider population. If the distribution of categories in a global setting is different from the distribution of categories in a local setting, that means the local dataset is biased towards the differing category, so this category should be removed from the analysis, as it may cause unfair decisions.

How can data scientists check whether their local dataset is biased? The ‘ideal’ global dataset is not easily available for comparison, with potentially only a few examples in the hands of governments, but even governments often don’t have access to it. As a result, regulated financial institutions often simply remove unbalanced categories in data and use simple algorithms to fill in the gaps - as an example, this is how European countries handle their credit scoring data systems.
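One simple way to approximate such a check, when at least some reference proportions are available (for example from census or regulatory statistics), is a goodness-of-fit test of the local category distribution against those reference proportions. A minimal sketch, with entirely hypothetical file, column and proportion values:

```python
# Sketch: testing whether a local dataset's category proportions match an
# assumed "global" reference distribution. All names and numbers below are
# illustrative placeholders.
import pandas as pd
from scipy.stats import chisquare

local = pd.read_csv("local_applications.csv")

reference_props = pd.Series({          # assumed global proportions (sum to 1.0)
    "18-30": 0.25, "31-45": 0.35, "46-60": 0.25, "60+": 0.15,
})

observed = (local["age_band"].value_counts()
            .reindex(reference_props.index, fill_value=0))
expected = reference_props * observed.sum()

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the local dataset is biased towards some categories
# relative to the reference population.
```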

Recommendations to enterprise

JP Rangaswami argues:

All datasets have biases. In fact Machine Learning relies on finding those biases. But the biases will themselves have to be reasonable and justified, as inherent properties of the data rather than inherited via collection or classification frailties. The datasets will then have to be tested for such bias, and, when found, the bias will have to be corrected.

Nicolai Baldin continues:

Furthermore,

  • ML is domain-specific and we need to understand the legal context; see the article on the legal consensus regarding biases in the UK and the US, and the article on the impact on banking and insurance
  • Besides monitoring decisions made by Machine Learning, we need to monitor the data itself and how it was provisioned (a minimal drift check is sketched after this list)
  • Besides analysing static one-shot problems, we need to study the long-term effects of decisions being made and feedback loops
  • We need to establish a qualitative understanding of when and why ML is the right approach to a given problem
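On the second point, monitoring the data itself could be as simple as regularly comparing the distributions of incoming data against the training snapshot. A minimal sketch, using a two-sample Kolmogorov-Smirnov test on hypothetical numeric features:

```python
# Sketch: monitoring the data a model receives, not only its decisions.
# File paths and column names are hypothetical; the KS test is one simple
# choice of drift check for numeric features.
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_csv("training_snapshot.csv")   # data the model was trained on
live = pd.read_csv("last_week_requests.csv")   # data the model is now scoring

for column in ["income", "loan_amount", "age"]:
    stat, p_value = ks_2samp(train[column].dropna(), live[column].dropna())
    flag = "possible drift" if p_value < 0.01 else "ok"
    print(f"{column:12s} KS={stat:.3f} p={p_value:.4f} {flag}")
# Persistent drift in the inputs is a signal to revisit how the data is
# provisioned and whether the model's own decisions are feeding back into it.
```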

JP Rangaswami adds:

We have an abundance of data sources and sensing capabilities; we have affordable price points for the infrastructure required to capture, store and process that data; we have the tools required to glean signal from that data, to refine the models by “learning”. All this is getting better every day.

As a next step, we will need to get better at “fit and proper” testing for the datasets. A key component of that is to test for bias.

Appendix:

What is data bias?

A dataset is considered to exhibit data bias when the target variable distribution of a potentially discriminated group is significantly different from that of the whole population. This difference is evaluated in two ways:

  • Distribution bias. A distribution distance metric is computed between the potentially discriminated group and the whole population. If this distance is above a threshold, and the difference is shown to be statistically significant, it is considered a bias.
  • Classification bias. A few classifiers are trained on the training set. The performance of each model on the whole test set and on different potentially discriminated subsets of the test set is compared. If the results differ substantially, the dataset is considered to be biased.

Two scores are given to each dataset: one that quantifies distribution bias, and a second one that quantifies classification bias.
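A minimal sketch of the two checks, assuming a tabular dataset with a binary target column "approved" and a protected attribute "group"; the distance metric, classifier and threshold choices are illustrative, not a specific product's methodology:

```python
# Sketch of distribution-bias and classification-bias scores under assumed
# column names ("approved" as the target, "group" as the protected attribute).
import pandas as pd
from scipy.stats import wasserstein_distance
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("applications.csv")               # hypothetical dataset
X = pd.get_dummies(df.drop(columns=["approved"]))
y = df["approved"]                                 # assumed binary 0/1 target
subgroup = df["group"] == "minority"               # potentially discriminated group

# 1) Distribution bias: distance between the subgroup's target distribution
#    and the whole population's.
distribution_score = wasserstein_distance(y[subgroup], y)

# 2) Classification bias: compare accuracy on the whole test set with accuracy
#    on the potentially discriminated subset of the test set.
X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(
    X, y, subgroup, test_size=0.3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
classification_score = abs(clf.score(X_te, y_te) - clf.score(X_te[s_te], y_te[s_te]))

print(f"distribution-bias score:   {distribution_score:.3f}")
print(f"classification-bias score: {classification_score:.3f}")
```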

What causes algorithmic bias and ML to discriminate?

  • Skewed data sample. A dataset is considered skewed when there is a significant difference in the number of measured outcomes of an ML algorithm for different groups of individuals. This often happens when there are fewer opportunities to observe outcomes that contradict predictions. The skew may compound over time and impact the algorithm.
  • Tainted examples. Oftentimes, the data is collected by humans or by systems designed by humans, and many humans are biased. A dataset labeled by a biased person will contain some tainted samples, and therefore any model trained on this dataset will preserve human biases.
  • Limited features. If the features that really explain the phenomenon we are trying to predict are not present in the dataset, the classification may rely on other features that may or may not have a direct relationship with the target. Imagine we try to predict salary from a dataset that contains only a sample of highly educated white people and an African-American sample that could not go to high school. If the education feature is deleted from the dataset, the classifier will rely heavily on race, while the cause of the salary discrepancy is most probably education.
  • Sample size disparity. If the data collection process was not carried out equally for all sub-samples, the dataset may end up containing some small populations. If any of these populations is not large enough, the model may not be able to approximate its behaviour properly and will end up making wrong decisions on that sub-sample.
  • Proxies. Some features are highly correlated with each other, so even if we remove the sensitive columns from a dataset, the information that we are trying to hide may still be there. A very explicit example is removing a “gender” column from a dataset while keeping “marital status”: a model can still infer that “wife” corresponds to the female gender and “husband” to the male gender. These proxies are not easy to detect, as finding them requires a deeper look into the dataset (see the sketch below this list).
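One way to hunt for such proxies is to test how well the remaining features can reconstruct the removed sensitive column. A minimal sketch, with hypothetical column names and a simple binary encoding of the dropped attribute:

```python
# Sketch: detecting proxies of a removed sensitive column by checking how well
# the remaining features predict it. Names below are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("applications.csv")
sensitive = (df["gender"] == "female").astype(int)   # column we intend to drop, 0/1 encoded
remaining = pd.get_dummies(df.drop(columns=["gender"]))

# If the remaining features predict the dropped column well above chance (AUC
# close to 1.0), proxies such as "marital_status" are still leaking it.
auc = cross_val_score(LogisticRegression(max_iter=1000),
                      remaining, sensitive, cv=5, scoring="roc_auc").mean()
print(f"AUC for reconstructing the removed column: {auc:.3f}")
```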