Online Magazine

Bias: On the importance of flawless data sets

If the data underlying an algorithm has been collected as heterogeneously as possible and under clean conditions, we can speak of flawless data sets. But what does this mean exactly, and what are the consequences if the data's clean slate gets stained?

by Luca Furrer

People are at the center, especially when it comes to digitalization. At first glance, this seems paradoxical: the more digitalization, the fewer people are needed, one would think. On closer inspection, the opposite is true. Digitalization is fundamentally about data and its profitable use, but this data must first be generated, and this is (among other things) where humans come into play. An artificial intelligence (AI) does not create new data; it learns from existing data that we humans have provided. If we choose "unclean" data as the learning basis for the AI, the results will be correspondingly unclean.

Take the recruiting process as an example. A study by ETH Zurich, which examined the behavior of recruiters on "Job-Room", the largest job platform in Switzerland, concludes that (contrary to what many believe) digitalization does not increase discrimination in recruiting. The reason is banal: discrimination in this process existed long before digital tools were used. Ethnicity illustrates this. The photo, name and nationality on a CV say nothing about a person's qualifications for the job. Yet if you replace them with details that match those of the recruiter (same nationality, same skin color, similar-sounding name), the likelihood that the person will be invited for an interview increases. If you now train an algorithm on this biased behavior, the results of the AI will also be unfair and "biased."

From here, let's briefly jump into the topic of "artificial intelligence" and "algorithms". The critical voices against AI are no longer as loud as they used to be, but they are far from silenced. The doubts and fears mostly concern an artificial intelligence taking on a life of its own and slipping out of control. This fear has a kernel of truth, and the keyword is "black box". We speak of a black box when it is no longer possible to understand how an AI got from its input to its output. In sensitive areas, like the recruiting process mentioned above, such black boxes must be prevented.

Bias arises in the underlying data

However, an AI cannot simply emerge out of nothing and learn all by itself. At the beginning of every AI, there is a human being who, on the basis of various inputs, tells the algorithm what the AI should learn and how it should develop. If an AI makes biased decisions, we need to look closely at which biases are already present in the original data and possibly in the programmers themselves.

Two examples: In the documentary "Coded Bias," MIT graduate student Joy Buolamwini recounts how she examined facial recognition software and found that its founders were all white men. Of course, the inventors had no intention of programming biased or even racist software. They simply trained the AI with the data available to them, perhaps first and foremost with photos of friends and family. In the end, they had an AI that could recognize the faces of white men with over 95 percent accuracy. However, it recognized the faces of black women with an accuracy of just under 70 percent. As I said, it was almost certainly not the founders' intention to develop software that had difficulty identifying faces with dark skin. They simply trained their AI with data that did not cover the entire spectrum of people. This oversight resulted in so-called "biases" (systematic errors in the AI system) that can have serious consequences as the artificial intelligence continues to learn.
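A gap like this only becomes visible when accuracy is measured per group rather than as a single overall figure. The following is a minimal sketch of such a check, not the method used in the documentary. The function name `accuracy_by_group`, the group labels and the toy records are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return the share of correct predictions for each group.

    records: iterable of (group, predicted_label, true_label) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        if predicted == actual:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Hypothetical toy results, not real measurements.
records = [
    ("group_a", "match", "match"),
    ("group_a", "match", "match"),
    ("group_a", "no_match", "no_match"),
    ("group_a", "match", "match"),
    ("group_b", "match", "match"),
    ("group_b", "no_match", "match"),
    ("group_b", "match", "no_match"),
    ("group_b", "match", "match"),
]

per_group = accuracy_by_group(records)
# Difference between the best and the worst served group.
gap = max(per_group.values()) - min(per_group.values())
print(per_group, gap)
```

Averaged over all eight records, the toy system is right 75 percent of the time, which hides the fact that one group is served far worse than the other.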

Another example is that of the "Twitter trolls": bots that were "let loose" on the social media platform and immediately turned into racists. How did this happen? Quite simply, they learned from what was already on the platform, internalized it and became masters of their trade, so to speak: in this case, far-right racists who spread nasty hate speech. All their messages were based on tweets that had originally been written by humans. This means that an AI does not become a racist or a far-right extremist on its own, but simply learns from us humans and adopts what it learns. This is why the choice of data used to train an algorithm is so essential. If this data is already "biased", i.e. not heterogeneous enough, an AI learning from it can never act without bias. For that, the AI would need a consciousness with which it could question its actions. Since it has none, this task falls to us humans. An AI is only as fair and unbiased as we are.

In the US, the Federal Trade Commission (FTC) is on the trail of algorithms that make unfair decisions based on "unclean" data. Its work uncovered, among other things, unfair practices behind the app "Kurbo" (source: Destroying personal digital data). FTC staff found that some of the data sets on which the app is based were collected illegally, namely the data of users under 13 years old. By law, the operators of the app would have needed consent from these users' parents, but they had not obtained it. The company behind the Kurbo app therefore had to destroy all illegally collected data as well as the algorithm that worked with this data.

Actually, this is easier said than done: once you have trained an algorithm on several data sets, there is no longer just one data set to search for and then simply delete. This principle can be demonstrated with a paint box. We have three small pots of paint: red, blue and yellow. To paint a green meadow on the paper, we take some blue paint and mix it with the yellow. The result is a bit too light for us, so we add red and get a darker shade of green. This mixture is now perfect for our meadow. In retrospect, however, we can't quite tell how much paint we took from which pot to get exactly this green. It's much the same with the data sets on which an algorithm is based. For example, the marketing department takes part of data set A and mixes it with data set B, while the sales people mix part of data set C with another part of data set A, and so on. Data from one specific data set may thus end up in various new data sets and algorithms. Making specific data disappear entirely afterward borders on the impossible.


Figure 1: Deleting individual data from all algorithms after the fact is almost impossible – like tracking the exact color ratio in a painting. (Photo Credit: Susan Wilkinson on Unsplash)
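One way to keep the paint pots traceable is to record, for every derived data set and model, which inputs it was built from. The sketch below assumes such a record exists as a simple dictionary. The function name `downstream_of` and all data set and model names are hypothetical. It walks the record to find everything that would be affected if one source had to be deleted.

```python
from collections import deque


def downstream_of(source, derived_from):
    """Return every artifact that directly or indirectly contains `source`.

    derived_from: dict mapping each artifact to the list of inputs it was built from.
    """
    # Invert the mapping so we can walk from an input to what was built from it.
    children = {}
    for artifact, inputs in derived_from.items():
        for parent in inputs:
            children.setdefault(parent, []).append(artifact)

    affected = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


# Hypothetical lineage record, mirroring the marketing and sales example above.
lineage = {
    "marketing_mix": ["dataset_a", "dataset_b"],
    "sales_mix": ["dataset_a", "dataset_c"],
    "churn_model": ["marketing_mix"],
    "pricing_model": ["sales_mix", "dataset_c"],
}

affected_by_a = downstream_of("dataset_a", lineage)
print(sorted(affected_by_a))
```

In this toy record, deleting data set A touches both mixed data sets and both models. Without such a record, that list would have to be reconstructed after the fact, which is exactly the problem the paint box illustrates.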


Responsibility lies wherever data is collected

In addition to the almost immeasurable effort of removing specific data, deleting it and abandoning individual algorithms affects the entire application and user experience. The operators of the Kurbo app had to learn this the hard way: hard not only because of the sheer effort involved in deleting the data, but also because the regulatory sanctions meant losing an important competitive advantage. Certain algorithms are responsible for the user experience in the application or on the platform. The better such an algorithm is, the better the user experience, and that gives you an edge over the competition.

So in essence, it comes down to data that ought to be 100 percent correct and as heterogeneous as possible. We humans are responsible for this flawless data, because an algorithm has neither a consciousness nor a sense of justice that would enable it to accept only correct data. Although this is a tech issue, responsibility lies not just with programmers but also with project managers, CEOs and, in short, everyone in the company who collects data. A marketing department, for example, should think twice about whether it really wants to keep the consent hurdle as low as possible and thus collect as much data as it can, or whether it would rather make sure that the data it gets can actually be used.

For their part, programmers can monitor and control the AI's learning process and check that new data sets are "good" ones, so that the data's clean slate does not get stained. They can also regularly check the AI to make sure it is still behaving as expected and has not drifted.
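What such a regular check might look like in its simplest form: compare the share of positive decisions in a recent period with a reference period and raise a flag if the two differ by more than an agreed tolerance. This is only a rough sketch under stated assumptions. The function names, the 0/1 encoding of decisions and the tolerance of 0.1 are hypothetical, and real monitoring would typically look at more than one number.

```python
def positive_rate(decisions):
    """Share of positive decisions (encoded as 1) in a list of 0/1 values."""
    return sum(decisions) / len(decisions)


def has_drifted(baseline, recent, tolerance=0.1):
    """True if the positive rate moved by more than `tolerance` since the baseline."""
    return abs(positive_rate(recent) - positive_rate(baseline)) > tolerance


# Hypothetical decisions: 1 = invited to interview, 0 = not invited.
baseline = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
recent = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]

drifted = has_drifted(baseline, recent)
print(drifted)
```

A flag like this does not say whether the change is good or bad. It only tells the people responsible that it is time to take a closer look.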

The challenge of fair AI cannot be met by technicians or management alone. It requires the cooperation of all stakeholders and accompanies the AI from the start of the project to the end of its life.
