Bias & Fairness in AI

Bias is not a bug you can grep for

By now you know that a model does not contain rules someone typed in; it contains parameters fit to data so as to minimize a loss. That single fact reframes the whole conversation about fairness. When a hiring model favors one group, there is usually no line that says `if female: reject`. The model is doing exactly what it was trained to do — reproduce the patterns in its examples. [[algorithmic-bias|Algorithmic bias]] is the name for the systematic, group-correlated errors that result, and the uncomfortable truth is that they are an emergent property of the data-plus-objective, not a typo you can find and delete.

It helps to separate two senses of the word "bias" that sound alike but point in nearly opposite directions. In the bias-variance sense from your statistics rungs, bias is the systematic error of a model that is too simple to capture reality, a purely technical quantity. In the fairness sense, bias often means the model is *too accurate* at capturing a reality that is itself unjust. A model can have low statistical bias and high social bias at the same time. Keeping these apart prevents a lot of confused arguments.

Where the bias actually enters

The first source is the data itself. [[dataset-bias|Dataset bias]] means your training set is not a fair sample of the world the model will face. Maybe a medical dataset is 90% one ancestry, so the model learns to recognize skin lesions on pale skin and stumbles on dark skin. Maybe historical loan records reflect decades of redlining, so "who repaid" is entangled with "who was ever offered a loan." The model has no way to know the sample is skewed; to it, the data *is* the world. This connects directly to the class imbalance and sampling problems you met earlier, but now the stakes are people.
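One rough way to check for this kind of skew is to compare each group's share of the training set with its share of the population the model will serve. The sketch below is illustrative only: the group names, counts, and reference shares are invented, and `representation_gap` is a hypothetical helper, not a library function.

```python
from collections import Counter

def representation_gap(sample_groups, reference_shares):
    """Return {group: sample_share - reference_share} for every reference group."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in reference_shares.items()
    }

# Hypothetical dataset: 90 records from group "a", 10 from group "b".
train_groups = ["a"] * 90 + ["b"] * 10
# Hypothetical deployment population: 60% "a", 40% "b".
population = {"a": 0.60, "b": 0.40}

gaps = representation_gap(train_groups, population)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

A large negative gap flags an under-represented group. It says nothing about whether the records that are present were collected or labeled fairly.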

The second source is the labels. Most supervised training depends on human annotation, and humans carry their context with them. Annotators rating "toxicity" may flag African American English dialect as rude more often; raters judging "professional appearance" import cultural norms. The label is supposed to be ground truth, but it is really *someone's judgment frozen into a number*. If that judgment is uneven across groups, the model inherits the unevenness and, worse, sands off the disagreement into a single confident answer.
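To see what "sanding off the disagreement" looks like, compare a majority-vote label with the share of annotators who agreed. This is a minimal sketch with invented votes. Keeping the vote share is one possible way to preserve disagreement, shown here as an assumption rather than a prescribed method.

```python
def majority_label(votes):
    """Collapse binary votes (0/1) into one label; ties go to 0."""
    return 1 if sum(votes) * 2 > len(votes) else 0

def soft_label(votes):
    """Keep the fraction of annotators who voted 1."""
    return sum(votes) / len(votes)

# Hypothetical toxicity votes from five annotators on two comments.
unanimous = [1, 1, 1, 1, 1]
contested = [1, 1, 1, 0, 0]

for name, votes in [("unanimous", unanimous), ("contested", contested)]:
    print(name, majority_label(votes), soft_label(votes))
```

Both comments end up with the same hard label, even though two of the five annotators disagreed on the second one. Only the vote share keeps that information.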

The third source is the most slippery: the model latches onto a spurious correlation — a pattern that holds in the training data but is not the thing you actually care about. A skin-cancer detector learns that surgical rulers (placed beside real tumors by doctors) predict malignancy; a pneumonia model reads the hospital's portable-scanner watermark instead of the lungs. This shortcut learning is efficient and often invisible on a test set drawn from the same skewed source — which is exactly why bias survives a good-looking accuracy number.
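A toy version of shortcut learning fits in a few lines. The sketch below assumes a deliberately crude "learner" that picks whichever single binary feature best matches the label on the training set. The feature names and all counts are invented for illustration.

```python
def accuracy(rows, feature):
    """Accuracy of the rule 'predict label = value of this binary feature'."""
    return sum(r[feature] == r["label"] for r in rows) / len(rows)

def make_rows(n, ruler_agree, lesion_agree):
    """Build n rows where each feature matches the label on exactly the
    given number of rows. All numbers are invented."""
    rows = []
    for i in range(n):
        label = i % 2
        rows.append({
            "label": label,
            "ruler": label if i < ruler_agree else 1 - label,
            "irregular_border": label if i >= n - lesion_agree else 1 - label,
        })
    return rows

features = ["ruler", "irregular_border"]

# Training source: a ruler appears beside almost every malignant lesion.
train = make_rows(100, ruler_agree=96, lesion_agree=84)
# New clinic: rulers are placed regardless of diagnosis.
deploy = make_rows(100, ruler_agree=50, lesion_agree=84)

# "Training": keep the feature that scores best on the training set.
chosen = max(features, key=lambda f: accuracy(train, f))

print("chosen feature:", chosen)
print("train accuracy:", accuracy(train, chosen))
print("deploy accuracy:", accuracy(deploy, chosen))
print("deploy accuracy of real signal:", accuracy(deploy, "irregular_border"))
```

The learner prefers the ruler because it scores higher in training, and a test set drawn from the same source would agree. The failure only shows up once the ruler stops tracking the diagnosis.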

Defining "fair" — and discovering it's plural

To measure fairness you have to define it, and the moment you try, it splinters. The simplest group definition is [[demographic-parity|demographic parity]]: the model should accept the same *fraction* of each group. If 20% of men get a loan, 20% of women should too. It is intuitive and easy to audit. But it ignores whether the people in each group are actually similar on the thing you're predicting — it can force you to approve unqualified applicants in one group or reject qualified ones in another just to balance the rates.
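Auditing demographic parity only needs the decisions and the group labels. Here is a minimal sketch, assuming binary decisions (1 = accept) and invented data. `selection_rates` is a hypothetical helper name.

```python
def selection_rates(decisions, groups):
    """Fraction of positive decisions (1 = accept) within each group."""
    totals, accepted = {}, {}
    for d, g in zip(decisions, groups):
        totals[g] = totals.get(g, 0) + 1
        accepted[g] = accepted.get(g, 0) + d
    return {g: accepted[g] / totals[g] for g in totals}

# Hypothetical loan decisions for ten applicants.
groups    = ["m", "m", "m", "m", "m", "w", "w", "w", "w", "w"]
decisions = [ 1,   1,   0,   0,   0,   1,   0,   0,   0,   0 ]

rates = selection_rates(decisions, groups)
parity_gap = max(rates.values()) - min(rates.values())
print(rates, parity_gap)
```

Notice that the true outcomes never appear in the calculation. That is what makes the metric easy to audit, and also why it cannot tell whether the accepted applicants were qualified.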

A more refined notion is [[equalized-odds|equalized odds]]: among people who *truly* qualify, the approval rate should match across groups (equal true-positive rates), and among those who truly don't, the error rate should match too (equal false-positive rates). This says "make the same kinds of mistakes at the same rate for everyone," which feels closer to justice. Recall your evaluation rung — these are just group-conditioned slices of the confusion matrix, the true/false-positive rates you already know, computed separately per group.
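In code, checking equalized odds means building one confusion matrix per group and comparing the rates. The sketch below assumes binary labels and predictions, invented data, and at least one true positive and one true negative example in every group (otherwise the divisions fail).

```python
def group_rates(y_true, y_pred, groups):
    """Per-group true-positive and false-positive rates for binary labels."""
    counts = {}
    for t, p, g in zip(y_true, y_pred, groups):
        c = counts.setdefault(g, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        if t == 1:
            c["tp" if p == 1 else "fn"] += 1
        else:
            c["fp" if p == 1 else "tn"] += 1
    return {
        g: {"tpr": c["tp"] / (c["tp"] + c["fn"]),
            "fpr": c["fp"] / (c["fp"] + c["tn"])}
        for g, c in counts.items()
    }

# Hypothetical data: six applicants per group, three truly qualified in each.
groups = ["a"] * 6 + ["b"] * 6
y_true = [1, 1, 1, 0, 0, 0,  1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0,  1, 0, 0, 0, 0, 0]

rates = group_rates(y_true, y_pred, groups)
tpr_gap = abs(rates["a"]["tpr"] - rates["b"]["tpr"])
fpr_gap = abs(rates["a"]["fpr"] - rates["b"]["fpr"])
print(rates, tpr_gap, fpr_gap)
```

Equalized odds asks for both gaps to be close to zero. How close counts as "close enough" is a judgment the metric does not make for you.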

Here is the result that ends the dream of one perfect metric. Unless the groups have identical underlying base rates, no useful classifier can satisfy demographic parity, equalized odds, and equal predictive value all at once. Even a perfect oracle does not escape: it satisfies equalized odds and equal predictive value, but it accepts each group at that group's own base rate, so demographic parity fails. This is a proven impossibility, not an engineering shortcoming. Choosing a fairness definition is therefore choosing whose errors matter more; it is a value judgment dressed as a formula, and no amount of cleverness dissolves the trade-off.
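The arithmetic behind the conflict is short. As a sketch, write $p_g$ for the base rate of group $g$, and $\mathrm{TPR}_g$ and $\mathrm{FPR}_g$ for the classifier's true-positive and false-positive rates in that group. These symbols are notation introduced here for illustration. The acceptance rate and the positive predictive value (PPV) then follow directly:

```latex
\begin{align*}
\text{acceptance}_g &= p_g\,\mathrm{TPR}_g + (1 - p_g)\,\mathrm{FPR}_g \\
\mathrm{PPV}_g &= \frac{p_g\,\mathrm{TPR}_g}{p_g\,\mathrm{TPR}_g + (1 - p_g)\,\mathrm{FPR}_g}
\end{align*}
```

If equalized odds holds, TPR and FPR are shared by both groups, so the base rate is the only thing left that differs. A different base rate then changes the acceptance rate unless TPR equals FPR (a classifier no better than chance), and it changes the PPV outside corner cases such as a classifier that makes no false positives.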

Group A: base rate 30% qualify   Group B: base rate 10% qualify

Force equal acceptance (parity)  ->  must over-accept B or reject qualified A
Force equal error rates (odds)   ->  acceptance fractions then differ

  parity  AND  equalized-odds  AND  equal predictive value
  ---- impossible together unless base rates match ----

Different base rates make the three fairness criteria impossible to satisfy together: a sketch of the impossibility result.
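The figure's numbers can be checked directly. The sketch below assumes a hypothetical classifier with the same true-positive rate (0.8) and false-positive rate (0.1) in both groups, so equalized odds holds by construction, and then computes what follows for the 30% and 10% base rates.

```python
def acceptance_rate(base_rate, tpr, fpr):
    """Share of the group accepted: true positives plus false positives."""
    return base_rate * tpr + (1 - base_rate) * fpr

def predictive_value(base_rate, tpr, fpr):
    """Share of accepted people who truly qualify (positive predictive value)."""
    return base_rate * tpr / acceptance_rate(base_rate, tpr, fpr)

TPR, FPR = 0.8, 0.1  # assumed identical for both groups
base_rates = {"A": 0.30, "B": 0.10}

for group, p in base_rates.items():
    print(group,
          round(acceptance_rate(p, TPR, FPR), 3),
          round(predictive_value(p, TPR, FPR), 3))
```

Under these assumptions group A is accepted 31% of the time and group B 17%, so parity fails. About 77% of accepted A applicants truly qualify against about 47% of accepted B applicants, so equal predictive value fails too.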

Group fairness vs. the individual

Every metric above is a *group* statistic — it compares averages across categories. But there is a competing intuition, individual fairness: similar people should be treated similarly, regardless of group. These two ideals can collide head-on. Adjusting a threshold per group to equalize outcomes (group fairness) means two applicants with identical files get different decisions because of their group (individual unfairness). There is no neutral ground; even "treat everyone by the same rule" is itself a choice with winners and losers.
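The collision is easy to reproduce. This sketch assumes a score-based model and per-group thresholds. The threshold values and applicant scores are invented for illustration.

```python
def decide(score, group, thresholds):
    """Accept when the score clears the threshold assigned to the group."""
    return score >= thresholds[group]

# Hypothetical thresholds chosen to equalize group-level outcomes.
thresholds = {"A": 0.70, "B": 0.60}

# Two applicants with identical files, hence identical scores.
applicant_1 = {"score": 0.65, "group": "A"}
applicant_2 = {"score": 0.65, "group": "B"}

for a in (applicant_1, applicant_2):
    print(a["group"], decide(a["score"], a["group"], thresholds))
```

The applicant in group A is rejected and the applicant in group B is accepted, with nothing but group membership separating them. A single shared threshold would remove that difference and bring back the group-level gap.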

There is also a kind of harm that no allocation metric captures at all: [[representational-harm|representational harm]]. This is when a system demeans or erases a group rather than mis-allocating a resource. An image generator that renders "CEO" as men and "nurse" as women, an autocomplete that finishes "Muslims are" with violence, a translation tool that defaults a doctor to "he" — none of these denied anyone a loan, yet each reinforces a stereotype at scale. Because the harm is about meaning and dignity, not a yes/no decision, you cannot audit it with a confusion matrix; you have to look at what the model *says* and *depicts*.

What you can actually do

None of this is a counsel of despair. Bias is not unfixable; it just cannot be fixed *automatically and once-and-for-all*. The honest practice is to make bias visible and make the trade-offs explicit, then keep watching after deployment. Treat fairness as an ongoing audit, not a checkbox you tick before launch.

Disaggregate your evaluation: never trust one overall accuracy number. Slice every metric by group and look at the worst-off slice, not the average.
Interrogate the data and labels: who is missing, who labeled it, and what could be a proxy or shortcut for a sensitive trait?
Pick a fairness definition deliberately, write down why, and acknowledge what that choice trades away — because you cannot have all of them.
Keep a human in the loop for high-stakes decisions, and keep monitoring after launch, since the world drifts away from your training data over time.
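The first item on that list, disaggregated evaluation, can be sketched in a few lines. The data below is invented, and `accuracy_by_group` is a hypothetical helper. The same slicing applies to any metric, not just accuracy.

```python
def accuracy_by_group(y_true, y_pred, groups):
    """Return overall accuracy and a {group: accuracy} breakdown."""
    hits, totals = {}, {}
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g] = totals.get(g, 0) + 1
        hits[g] = hits.get(g, 0) + (t == p)
    overall = sum(hits.values()) / sum(totals.values())
    return overall, {g: hits[g] / totals[g] for g in totals}

# Hypothetical evaluation set: 8 examples from group "a", 2 from group "b".
groups = ["a"] * 8 + ["b"] * 2
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
worst_group = min(per_group, key=per_group.get)
print(overall, per_group, worst_group)
```

Here the overall accuracy is 80%, which looks respectable, while every prediction for group "b" is wrong. The average hides the failure because group "b" is only a fifth of the evaluation set.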