When Bias Tests Need a Baseline to Find Themselves
06 May 2026, Yanjiang
A gender swap and a harmless rephrase cause nearly identical answer flips in an AI diagnostic model, revealing that apparent bias may be universal jitteriness.
A few weeks ago, I was experimenting with a large language model as a diagnostic assistant. I typed in a short clinical vignette — a patient with chest pain, shortness of breath, sweating — and asked for the most likely diagnosis. The model responded promptly: “myocardial infarction.” Then, on a whim, I changed a single word. I replaced “he” with “she,” leaving everything else identical. The model hesitated, then offered “anxiety attack.”
I stared at the screen, startled. Was this a smoking gun for gender bias? I might have written it up as a case study, a cautionary tale about AI in medicine. But then I tried something else: I kept the gender the same and simply reworded the sentence, still describing exactly the same symptoms and the same patient. The answer flipped again, between “pulmonary embolism” and “costochondritis.”
That stopped me cold. Perhaps the model wasn’t reacting to gender at all. Perhaps it was reacting to the fact that I had changed something, and the change nudged it into a different corner of its probabilistic landscape. That hunch — that what looks like a specific sensitivity might actually be a universal jitteriness — is the subject of a sharp, methodical preprint (arXiv:2605.01048) from a team led by Zihao Yang at Northeastern University, working with colleagues at Bar-Ilan University and the Allen Institute for AI. Their message is simple and unsettling: most studies that claim to detect a model’s sensitivity to a particular factor — a patient’s race, a candidate’s gender, a stylistic tic — are missing the most obvious control group: the model’s sensitivity to any change at all.
This is the problem of counterfactual prompting, a technique that has become the go‑to tool for auditing large language models. To test whether a model treats male and female patients differently, you edit the prompt so that the gender swaps while everything else stays the same, and you watch whether the output changes. The idea is elegantly clean: you hold all other factors constant, vary only the factor of interest, and attribute any difference to that factor. But as Yang and collaborators argue, reality refuses this cleanliness. Every counterfactual edit is a compound treatment, a bundle that delivers the variable you care about along with incidental surface‑form variation — a different word order, a slightly different phrasing rhythm, a different distance between tokens in the model’s internal representation. You cannot isolate one ingredient from that bundle.
The assumption being violated has a technical name: treatment variation irrelevance. In causal inference, it says that if you want to claim a treatment causes an effect, the particular way the treatment is delivered must not itself influence the outcome; only the variable you are testing may. But language doesn’t work that way. Changing a patient’s gender from male to female doesn’t just flip a Boolean flag; it reshapes the entire sentence in subtle, unavoidable ways. The model might react to the fact that “he” has two characters and “she” has three, that the surrounding words shift their attention patterns, or that it has seen certain syntactic templates more often with one gender pronoun than the other. In that tangle, attributing a flipped answer to gender alone is an act of faith, not evidence.
To make this concrete, the team turned to MedQA, a benchmark of clinical multiple‑choice questions used to evaluate medical reasoning. They changed the patient’s gender in 919 vignettes, editing only the pronouns and gendered terms, and measured how often the model flipped to a different answer. The flip rate was 14.9%. That number might seem alarming (roughly one in seven answers changed), and a naive interpretation would blame gender bias. But the team then applied a second, innocuous treatment: they paraphrased the same questions, rewording them while preserving all medical content, including the original gender. The flip rate was 14.1%. The two numbers were statistically indistinguishable. A model that does not care about gender at all but simply reacts to having its input jostled would produce exactly this pattern.
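For readers who want to check the arithmetic, here is a rough sketch of how close those two flip rates are. It assumes the paraphrase arm also covers 919 questions and treats the two arms as independent samples, which is my simplification and not necessarily the test the authors ran; because the same questions appear in both arms, a paired test on per-item data would be the better choice if you had it.

```python
import math

def two_proportion_z_test(p1, n1, p2, n2):
    """Two-sided z-test for a difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Flip rates as reported in the article. The paraphrase sample size is an
# assumption: it is set equal to the 919 gender-swapped vignettes.
z, p_value = two_proportion_z_test(0.149, 919, 0.141, 919)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

Under these assumptions the p-value comes out far above the usual 0.05 threshold, which matches the article's description of the two rates as statistically indistinguishable.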
If you had run only the gender‑swap experiment and reported that flip rate as evidence of gender sensitivity, you would be mistaking a general phenomenon for a specific one. The model, in this particular benchmark, is about as sensitive to a harmless rephrase as it is to a gender edit. The apparent gender effect vanishes when you compare it to a sensible baseline — a control group that absorbs the model’s inherent brittleness to any textual perturbation.
That baseline, Yang and colleagues insist, must become standard. They propose a framework that does exactly what the MedQA example suggests: for any targeted intervention (a gender change, a race swap, a dialectal shift), you compare the observed effect against the effect you would get by simply paraphrasing the input. You then run a statistical test to ask whether the targeted effect is reliably larger than the paraphrase‑induced effect. Only if the answer is yes can you claim the model is specifically sensitive to the factor you set out to test.
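One way to picture that comparison is a paired sign-flip permutation test on per-item effects. This is a generic stand-in I chose for illustration; the article does not say which test the preprint uses, and the function name and toy data below are hypothetical.

```python
import random

def paired_permutation_test(target_effects, baseline_effects, n_perm=10_000, seed=0):
    """One-sided paired sign-flip test: is the targeted effect larger
    than the paraphrase effect on the same items?"""
    rng = random.Random(seed)
    diffs = [t - b for t, b in zip(target_effects, baseline_effects)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        # Under the null, "targeted" and "paraphrase" are exchangeable
        # within each item, so each difference may flip sign.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if permuted >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical per-item flip indicators (1 = answer changed), not real data.
paraphrase = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0] * 10
noise_only = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0] * 10     # same rate, different items
strong_effect = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0] * 10  # clearly higher rate

_, p_noise = paired_permutation_test(noise_only, paraphrase)
_, p_strong = paired_permutation_test(strong_effect, paraphrase)
print(f"noise-only edit: p = {p_noise:.3f}")
print(f"strong edit:     p = {p_strong:.4f}")
```

In this toy setup, the edit that flips answers at the same rate as paraphrasing does not clear the bar, while the edit that flips far more often does. Only the second would justify a claim of specific sensitivity.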
To show how many published findings this baseline would filter out, the team revisited the MedPerturb dataset, a collection of clinical vignettes that had previously been used to report evidence that models are sensitive to patient demographics and stylistic cues. Of 120 original statistical tests, only five survived once the paraphrase control was introduced. The vast majority of the claimed sensitivities evaporated. In those cases the model wasn’t sensitive to race, or age, or writing style in any special way; it was sensitive to perturbation in general.
But this is not a story about finding nothing. The same framework, applied to a different domain — occupational biography classification — detected a clear, directional gender bias. When the model was asked to classify whether a short biography described a nurse or a CEO, swapping a female name for a male name shifted the predicted likelihoods in a consistent, statistically significant way that far exceeded the noise induced by paraphrasing. So the framework works; it does not erase real effects, it only removes the phantom ones.
That dual outcome — eliminating false alarms while preserving genuine signals — points to the true value of the team’s contribution. The goal is not to declare that large language models are unbiased. It is to ensure that when we say they are biased, we are saying something true and useful.
The paper also provides a practical handbook for future investigators by systematically comparing the metrics used to measure perturbation effects. Aggregate metrics — overall flip rate, mutual information between original and perturbed responses — are tempting because they are simple to compute, but the team shows they are weak. They require large effect sizes before they can reliably detect a difference against the paraphrase baseline. Per‑sample metrics, by contrast, track how much each individual prediction shifts under the perturbation, and they achieve near‑perfect detection even when the effect is modest. Regression analysis, which models the direction and magnitude of the shift, goes further still, characterizing not just that something changed but exactly how and in what direction — whether the model consistently moved in favor of one group over another.
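A toy example helps show why an aggregate metric can miss what a per-sample metric catches. The probabilities below are invented for illustration, and the mean signed shift is only the crudest possible stand-in for the regression analysis described above; none of this is the paper's code or data.

```python
def flip_rate(p_orig, p_pert, threshold=0.5):
    """Aggregate metric: share of items whose predicted label changes."""
    flips = sum((a >= threshold) != (b >= threshold) for a, b in zip(p_orig, p_pert))
    return flips / len(p_orig)

def per_sample_shifts(p_orig, p_pert):
    """Per-sample metric: signed change in predicted probability per item."""
    return [b - a for a, b in zip(p_orig, p_pert)]

# Hypothetical predicted probabilities for one class label, not real data.
original   = [0.20, 0.30, 0.70, 0.80, 0.25, 0.75]
paraphrase = [0.22, 0.28, 0.71, 0.79, 0.24, 0.76]  # small, directionless wobble
targeted   = [0.28, 0.38, 0.78, 0.88, 0.33, 0.83]  # small but consistent upward shift

for name, perturbed in [("paraphrase", paraphrase), ("targeted", targeted)]:
    shifts = per_sample_shifts(original, perturbed)
    mean_shift = sum(shifts) / len(shifts)
    print(f"{name}: flip rate = {flip_rate(original, perturbed):.2f}, "
          f"mean signed shift = {mean_shift:.3f}")
```

Neither perturbation flips a single label, so the flip rate is zero for both. The per-sample shifts tell them apart: the paraphrase wobble averages out to roughly zero, while the targeted edit moves every item in the same direction by about 0.08.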
These findings matter well beyond a single paper. The large language model community is awash in bias audits, and the temptation to publish a dramatic finding is enormous. A researcher who swaps one attribute, observes a high flip rate, and reports a bias claim can generate headlines and shape policy. But if the same flip rate appears under a meaning‑preserving paraphrase, the original claim is not merely weaker; it is indistinguishable from noise. Yang and colleagues are not asking researchers to stop auditing; they are asking them to audit with a control group, the kind of precaution that would be considered basic scientific practice in any other experimental discipline.
The lesson lands with a quiet sort of force because it is obvious in retrospect. Every scientist learns early that before you celebrate a result, you check the baseline — without a blank, a control, a null comparison, you are looking at a signal that could be anything. In biological experiments, you plate a control well with only the solvent. In clinical trials, you give the placebo. In counterfactual prompting, you paraphrase. The fact that this step was routinely skipped for years is less an indictment of the field than a reminder of how seductive it is to find what you are looking for.
Yang and co‑authors, including Mosh Levy and Yoav Goldberg of Bar‑Ilan University and corresponding author Byron C. Wallace, refuse to let that subtle seduction stand unchallenged. Their framework does not have a catchy acronym or a flashy demo; it is a piece of methodological infrastructure, the kind of tool that makes every later measurement more honest. The practical takeaway for anyone evaluating a language model is immediate: next time you run a counterfactual experiment, paraphrase your inputs and see whether the effect holds up. If it doesn’t, you haven’t found nothing; you’ve found something important about how your model reacts to change, and the first thing you thought you knew was probably wrong.
This is the point at which I think back to my own little living‑room experiment, the one that started with a chest‑pain vignette and a diagnosis‑flipping model. What I saw wasn’t evidence of gender bias. It was evidence that my test lacked a control. If I had stopped at the first flip, I might have written a blog post that confirmed what I already suspected, and my readers would have nodded along. But I would have been wrong, or at least unjustified, and the model itself would have been no better understood. The team’s preprint (arXiv:2605.01048) doesn’t offer a comforting story, and it doesn’t provide a simple list of biased models to shame or unbiased models to trust. What it offers is clearer and harder to ignore: a method for knowing when you are justified in making any of those claims at all.
Yanjiang is an online editor at Loom Science.
References
- Zihao Yang et al., Compared to What? Baselines and Metrics for Counterfactual Prompting, arXiv:2605.01048
