A key observation from the common misconceptions discussed in the previous article on the impact of algorithmic bias and fairness on the industry is that bias is a property of the underlying dataset.
In assessing bias in data, particularly in banking and insurance, a number of questions become critical.
That’s where we are heading. There is a lot to be gained from using AI/ML techniques in finance, healthcare and manufacturing; in load management, routing and traffic optimisation. But there is no free lunch here either. Businesses have become inured to having to prove that they have the right to use the data, that they have title to the data, and that informed consent has been obtained where required. They are learning that they need to verify that the data is complete, accurate and timely, and data governance is now edging towards maturity. Driven by regulatory pressure, the issue of explainability has become more visible of late, as businesses are mandated to open up “black box” approaches in AI, something easier said than done.
Why is data quality important? Data quality is directly linked to the performance of a model or a service once it faces real people. Not surprisingly, algorithms find whatever helps them discriminate between outcomes, and if no other information is provided, they will discriminate based on whatever information is available. That is why it is crucial to collect and analyse as much data as possible, so that a clear statement can be made about what causes an algorithm to perform differently for different groups.
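As an illustration only (the article itself does not prescribe any tooling), the sketch below shows one way a data scientist might surface such performance differences across groups. The dataset, column names ("approved", "group") and model choice are assumptions, not anything the article specifies.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical loan-application data; "approved" is the target and "group" a
# sensitive attribute kept aside purely for the per-group comparison.
# Features are assumed to be already numerically encoded.
df = pd.read_csv("loan_applications.csv")
features = df.drop(columns=["approved", "group"])
target = df["approved"]

X_train, X_test, y_train, y_test, g_train, g_test = train_test_split(
    features, target, df["group"], test_size=0.3, random_state=0
)

model = GradientBoostingClassifier().fit(X_train, y_train)
predictions = pd.Series(model.predict(X_test), index=X_test.index)

# Large gaps in per-group accuracy are a signal that the available data lets the
# model discriminate along lines that were never explicitly provided to it.
for group_name, idx in g_test.groupby(g_test).groups.items():
    print(group_name, accuracy_score(y_test.loc[idx], predictions.loc[idx]))
```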
To check whether the model's outcome was right, one would have to run an experiment: replace that attribute with an alternative and then check the results after a few years. Unfortunately, that is an expensive experiment to conduct in reality. Fortunately, it can be approximated on historical data using so-called cross-validation techniques.
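A hedged sketch of that "replace the attribute and re-check" experiment, run on historical data with cross-validation rather than waiting years for outcomes. The column names ("postcode" as the suspect attribute, "income_band" as the alternative) and the dataset are illustrative assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical historical decisions; features are assumed to be numerically encoded.
history = pd.read_csv("historical_decisions.csv")
y = history["defaulted"]

# Variant A keeps the suspect attribute ("postcode"); variant B swaps it for an
# alternative ("income_band").
X_with_suspect = history.drop(columns=["defaulted", "income_band"])
X_with_alternative = history.drop(columns=["defaulted", "postcode"])

model = LogisticRegression(max_iter=1000)
score_a = cross_val_score(model, X_with_suspect, y, cv=5).mean()
score_b = cross_val_score(model, X_with_alternative, y, cv=5).mean()

print(f"with suspect attribute:     {score_a:.3f}")
print(f"with alternative attribute: {score_b:.3f}")
```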
Explainability of the decision process will, over time, be extended to cover the burden of proof that the datasets used were themselves “fit and proper”.
We are heading towards a place where businesses will have to prove that the datasets used to train their AI/ML capabilities are themselves fit and proper. The datasets will have to be tested for bias, and, when found, the bias will have to be corrected. At a certain level of abstraction, this is similar to the back-testing and stress-testing of risk models in capital markets.
It goes without saying that the best decisions would be made by an advanced algorithm that had access to as much data as possible - about every individual on the planet, even - as its training dataset. In statistics, a universal algorithm such as this is often referred to as an oracle. A dataset like this would, in fact, be unbiased, as it would contain extensive, clear information that has not been polluted by data collection or data preparation techniques. To replicate this ideal ‘oracle’ with a local algorithm in a local setting, it is important to ensure that the dataset used locally also represents this global knowledge, i.e. that it has the same proportions of important attributes, correlations and so on as the wider population. If the distribution of a category in the global setting differs from its distribution in the local setting, the local dataset is biased with respect to that category, and the category should be removed from the analysis, as it may cause unfair decisions.
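A minimal sketch of such a comparison, under illustrative assumptions: the local dataset, the attribute ("employment_status") and the global shares (which in practice might come from census statistics) are all placeholders.

```python
import pandas as pd

# Hypothetical local dataset and assumed population-level shares for one attribute.
local = pd.read_csv("local_applicants.csv")
global_shares = pd.Series(
    {"employed": 0.62, "self_employed": 0.13, "retired": 0.18, "unemployed": 0.07}
)

local_shares = local["employment_status"].value_counts(normalize=True)

comparison = pd.DataFrame({"local": local_shares, "global": global_shares}).fillna(0)
comparison["gap"] = (comparison["local"] - comparison["global"]).abs()

# Categories whose local share deviates strongly from the global share are the
# ones the text suggests treating with care (or excluding from the analysis).
print(comparison.sort_values("gap", ascending=False))
```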
How can data scientists check whether their local dataset is biased? The ‘ideal’ global dataset is not readily available for comparison: a few approximations may exist in the hands of governments, but even governments often lack access to it. As a result, regulated financial institutions often simply remove unbalanced categories from the data and use simple algorithms to fill in the gaps - this, for example, is how European countries handle their credit scoring data systems.
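The sketch below illustrates that general practice - dropping categories that are too unbalanced to trust and filling the gaps with a simple algorithm. The threshold, column name and dataset are assumptions for illustration only, not a description of any particular country's credit scoring system.

```python
import pandas as pd

# Hypothetical credit-application data with one categorical attribute.
df = pd.read_csv("credit_applications.csv")
counts = df["residence_type"].value_counts(normalize=True)

# Treat categories below an (assumed) 5% share as too unbalanced to keep.
rare = counts[counts < 0.05].index
df.loc[df["residence_type"].isin(rare), "residence_type"] = pd.NA

# Fill the gaps with the simplest possible algorithm: the most frequent value.
df["residence_type"] = df["residence_type"].fillna(df["residence_type"].mode()[0])
```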
All datasets have biases; in fact, machine learning relies on finding those biases. But the biases themselves will have to be reasonable and justified - inherent properties of the data rather than inherited via frailties in collection or classification. The datasets will then have to be tested for such bias, and, when found, the bias will have to be corrected.
Furthermore, we have an abundance of data sources and sensing capabilities; we have affordable price points for the infrastructure required to capture, store and process that data; and we have the tools required to glean signal from that data and to refine the models by “learning”. All of this is getting better every day.
As a next step, we will need to get better at “fit and proper” testing for the datasets. A key component of that is to test for bias.
A dataset is considered biased when the target variable distribution of a potentially discriminated group of samples differs significantly from that of the whole population. This difference is evaluated in two different ways:
Two scores are given to each dataset: one that quantifies distance bias, and a second that quantifies classification bias.
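The article does not spell out how the two scores are computed, so the sketch below is one plausible realisation rather than the actual method. "Distance bias" is read here as the statistical distance between the flagged group's target distribution and the population's; "classification bias" as the gap in a simple classifier's predicted positive rate for the group versus everyone else. The dataset, column names and group label are assumptions.

```python
import pandas as pd
from scipy.stats import wasserstein_distance
from sklearn.linear_model import LogisticRegression

# Hypothetical dataset; features are assumed to be numerically encoded.
df = pd.read_csv("applications.csv")
y = df["approved"]
group_mask = df["group"] == "protected"

# Score 1: distance bias between the group's target distribution and the population's.
distance_bias = wasserstein_distance(y[group_mask], y)

# Score 2: classification bias, measured as the gap in predicted positive rates.
X = df.drop(columns=["approved", "group"])
preds = LogisticRegression(max_iter=1000).fit(X, y).predict(X)
classification_bias = abs(
    preds[group_mask.values].mean() - preds[~group_mask.values].mean()
)

print(f"distance bias:       {distance_bias:.3f}")
print(f"classification bias: {classification_bias:.3f}")
```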