![](https://crypto4nerd.com/wp-content/uploads/2023/07/1taCPyXqmYYxSa1n5Jq2Lrg.jpeg)
Artificial Intelligence (AI) is no longer a futuristic concept; it has become an intrinsic part of our lives. With its pervasive influence, it is crucial to establish ethical guidelines for its responsible use: AI systems must meet strict criteria of fairness, reliability, safety, privacy, security, inclusiveness, transparency, and accountability. In this article, we will delve deeper into one of these principles: fairness.
Fairness is at the forefront of responsible AI, implying that AI systems must treat all individuals impartially, regardless of their demographics or backgrounds. AI solutions must be designed to avoid biases based on age, gender, race, or any other characteristic. The data used to train these models should be representative of the diversity of the population, preventing inadvertent discrimination or marginalization. This seems like an easy job; after all, we are dealing with a computer, and how the heck can a computer be racist?
Algorithmic Bias
AI fairness problems arise from algorithmic bias: systematic errors in a model’s output that disadvantage a particular group. While traditional software consists only of algorithms, machine learning models are a combination of algorithms, data, and parameters. No matter how good an algorithm is, a model trained on bad data is a bad model, and if the data is biased, the model will be biased. Some of the ways bias can be introduced into a model are:
Hidden biases
We have biases, there’s no question about that; stereotypes shape our view of the world, and if they leak into the data they will shape the model’s output. Today, July 17, 2023, I asked Google Translate to translate some professions from English to Portuguese. Professions such as teacher, nurse, seamstress, and secretary came back preceded by the Portuguese feminine article “A”, indicating the profession is practiced by a woman (”A” professora, “A” enfermeira, “A” costureira, “A” secretária), while professions such as professor, doctor, programmer, mathematician, and engineer were preceded by the Portuguese masculine article “O”, indicating the profession is practiced by a man (”O” professor, “O” médico, “O” programador, “O” matemático, “O” engenheiro).
GPT-4 has made some improvements, and I could not replicate the same behavior in my quick tests, but I did replicate it in GPT-3.5.
Unbalanced classes in training data
In the documentary Coded Bias, MIT computer scientist Joy Buolamwini exposes how many facial recognition systems would not detect her face unless she wore a white mask. This is a clear symptom that the training data heavily underrepresents some ethnic groups, and it is not a surprise: the datasets used to train these models are highly skewed, as demonstrated by FairFace [1]. When group proportions are misrepresented, the model can end up ignoring important features of the underrepresented classes.
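As a quick illustration, here is a minimal sketch (the table and column names are hypothetical) of how one could inspect group proportions in a training set and naively oversample underrepresented groups with pandas:

```python
import pandas as pd

# Toy training table standing in for a face dataset; "ethnicity" is the group label.
df = pd.DataFrame({
    "image_id": range(10),
    "ethnicity": ["white"] * 7 + ["black"] * 2 + ["asian"] * 1,
})

# Inspect how each group is represented in the training set.
group_counts = df["ethnicity"].value_counts()
print(group_counts / len(df))  # one group dominates the data

# Naive mitigation: oversample every group up to the size of the largest one.
# (Targeted data collection or synthetic data is usually preferable to plain duplication.)
max_size = group_counts.max()
balanced = pd.concat(
    [g.sample(max_size, replace=True, random_state=42) for _, g in df.groupby("ethnicity")],
    ignore_index=True,
)
print(balanced["ethnicity"].value_counts())
```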
Data leakage
Consider an electricity company creating a model to aid in bad debt collection. As a data-conscious company, it decides not to include name, gender, or any personally identifiable information in the training data, and instead aggregates clients by their neighborhood. Despite these efforts, the company has still introduced bias, because race and neighborhood are highly correlated. Data leakage can occur whenever a model can learn undesired features indirectly from the training data; in this example, the model can learn to discriminate by race through the neighborhood data.
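One way to catch this kind of proxy leakage before training is to measure how strongly a candidate feature is associated with a protected attribute. Below is a minimal sketch on toy synthetic data (the "neighborhood" and "race" columns and their distributions are assumptions for illustration), using a chi-squared test and Cramér's V for that check:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy audit table: "neighborhood" is a training feature, "race" is the protected
# attribute, kept only for this audit and never fed to the model.
rng = np.random.default_rng(7)
race = rng.choice(["group_a", "group_b"], size=2000)
# Neighborhood is made to follow race closely, mimicking real-world segregation.
neighborhood = np.where(
    race == "group_a",
    rng.choice(["north", "south"], size=2000, p=[0.9, 0.1]),
    rng.choice(["north", "south"], size=2000, p=[0.2, 0.8]),
)
audit = pd.DataFrame({"neighborhood": neighborhood, "race": race})

# Cross-tabulate the candidate proxy feature against the protected attribute.
table = pd.crosstab(audit["neighborhood"], audit["race"])
chi2, p_value, _, _ = chi2_contingency(table)

# Cramér's V: 0 means no association, 1 means the feature fully encodes the attribute.
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"Cramér's V between neighborhood and race: {cramers_v:.2f} (p = {p_value:.3g})")
```

A high value would be a warning sign that the "anonymized" feature still carries most of the protected information.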
Detecting fairness problems
There is no clear consensus on what fairness really means, but there are a few metrics that can help. When designing an ML model to solve a problem, the team must agree on the fairness criteria to use, based on the potential fairness-related problems they may face. Microsoft offers a great checklist to ensure fairness is prioritized in the project [2]. Some of the metrics are listed below, followed by a small sketch of how they can be computed per group:
- Demographic Parity: This metric asks if the probability of a positive prediction for someone from a protected group is the same as for someone from an unprotected group. For example, the probability of an insurance claim being classified as fraudulent is the same regardless of the person’s race, gender, or religion.
- Predictive Parity: This metric is all about the accuracy of positive predictions. In other words, if our AI system says something will happen, how often does it actually happen for different groups? For example, if a hiring algorithm predicts that a candidate will perform well in a job, the proportion of candidates who actually do well should be the same across all demographic groups. If the system is less accurate for one group, it could be unfairly advantaging or disadvantaging them.
- False Positive Error Rate balance: This metric is about the balance of false alarms. If the AI system is making a prediction, how often does it wrongly predict a positive outcome for different groups? For instance, in a credit card fraud detection system, a false positive would be when it flags a legitimate transaction as fraudulent. The False Positive Error Rate should be balanced across different demographic groups — it would be unfair if innocent transactions by people of a certain ethnicity are more likely to be falsely flagged as fraudulent.
- Equalized odds: This metric is about equal opportunity. It demands both equal true positive rates and false positive rates across groups. In essence, it combines the demands of Predictive Parity and False Positive Error Rate Balance. For a medical diagnostic tool, for example, the rate of correct diagnoses (true positives) and misdiagnoses (false positives) should be the same regardless of the patient’s gender, race, or other demographic characteristics.
- Treatment equality: This metric looks at how mistakes are distributed across different groups by comparing the ratio of false negatives to false positives. For instance, in a predictive policing context, the trade-off between wrongly flagging innocent people and missing actual offenders should not be systematically worse for a protected group than for an unprotected one.
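As announced above, here is a small, illustrative sketch (with made-up toy predictions and a hypothetical helper) of the per-group rates behind these metrics: demographic parity compares selection rates, predictive parity compares precision, and so on.

```python
import pandas as pd

def group_report(y_true, y_pred, group):
    """Per-group rates behind the fairness metrics above (illustrative helper)."""
    df = pd.DataFrame({"y_true": y_true, "y_pred": y_pred, "group": group})
    rows = {}
    for g, sub in df.groupby("group"):
        tp = ((sub.y_true == 1) & (sub.y_pred == 1)).sum()
        fp = ((sub.y_true == 0) & (sub.y_pred == 1)).sum()
        fn = ((sub.y_true == 1) & (sub.y_pred == 0)).sum()
        tn = ((sub.y_true == 0) & (sub.y_pred == 0)).sum()
        rows[g] = {
            "selection_rate": (tp + fp) / len(sub),                      # demographic parity
            "precision": tp / (tp + fp) if tp + fp else float("nan"),    # predictive parity
            "fpr": fp / (fp + tn) if fp + tn else float("nan"),          # false positive error rate
            "tpr": tp / (tp + fn) if tp + fn else float("nan"),          # with fpr: equalized odds
            "fn_fp_ratio": fn / fp if fp else float("nan"),              # treatment equality
        }
    return pd.DataFrame(rows).T

# Toy usage with made-up predictions for two groups.
report = group_report(
    y_true=[1, 0, 1, 0, 1, 0, 1, 0],
    y_pred=[1, 0, 1, 1, 0, 0, 1, 1],
    group=["a", "a", "a", "a", "b", "b", "b", "b"],
)
print(report)
```

Comparing each column across the rows gives the gaps that the metrics above ask us to keep small.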
Addressing fairness
While fairness must be on the mind of every data scientist throughout the entire project, the following practices can be applied to avoid problems:
- Data collection and preparation: Ensure your dataset is representative of the diverse demographics you wish to serve. Bias can be addressed at this stage by various techniques such as oversampling, undersampling, or generating synthetic data for underrepresented groups.
- Model design and testing: It is crucial to test the model with various demographic groups to uncover any biases in its predictions. Tools like Microsoft’s Fairlearn can help quantify and mitigate fairness-related harms (see the sketch after this list).
- Post-deployment monitoring: Even after deployment, the model should be continually monitored to ensure it remains fair as it encounters new data. Feedback loops should be established to allow users to report instances of perceived bias.
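To make the Fairlearn suggestion above concrete, here is a minimal sketch using Fairlearn's reductions API on toy synthetic data; the data, group names, and the choice of a demographic parity constraint are illustrative assumptions, not a prescription:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from fairlearn.metrics import demographic_parity_difference
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

# Toy synthetic data: the label is (unfairly) correlated with the sensitive attribute.
rng = np.random.default_rng(0)
n = 1000
sensitive = rng.choice(["group_a", "group_b"], size=n)
X = np.column_stack([rng.normal(size=n), (sensitive == "group_a").astype(float)])
y = ((X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.5, size=n)) > 0.4).astype(int)

# Unconstrained baseline model.
baseline = LogisticRegression().fit(X, y)
print("baseline DP difference:",
      demographic_parity_difference(y, baseline.predict(X), sensitive_features=sensitive))

# Same estimator retrained under a demographic parity constraint.
mitigated = ExponentiatedGradient(LogisticRegression(), constraints=DemographicParity())
mitigated.fit(X, y, sensitive_features=sensitive)
print("mitigated DP difference:",
      demographic_parity_difference(y, mitigated.predict(X), sensitive_features=sensitive))
```

The constrained model typically trades a little accuracy for a smaller disparity; which constraint to enforce should follow from the fairness criteria the team agreed on earlier.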
For a more complete set of practices, one can refer to the previously mentioned checklist [2].
References
[1] Kärkkäinen, K. and Joo, J., FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age
[2] Microsoft, AI Fairness Checklist