When an AI Changes Its Mind: The Hidden Baseline Problem
06 May 2026, Yanjiang
Swapping a patient’s gender in a prompt flips LLM predictions at nearly the same rate as a simple paraphrase, revealing a hidden baseline problem in counterfactual evaluation.
I still remember the first time I tried asking a large language model a medical question. I fed it a short clinical vignette — a patient with chest pain, some lab results, a few risk factors — and asked what the most likely diagnosis was. It answered confidently, so I changed exactly one detail: I swapped the patient’s gender from male to female. Nothing else. The model gave me a different answer.
That small, seemingly controlled experiment felt like a quick-and-dirty way to measure whether the model was sensitive to patient demographics. But as I played with more examples, I started to notice something unsettling. If I rephrased the original question without changing any medical facts — using synonyms, adjusting the sentence structure — the model’s answer often flipped too. Which raised a quiet but insistent question: when the model changes its prediction after I swap a patient’s gender, am I really measuring a gender effect, or am I just seeing the normal wobble of a system that is sensitive to almost any textual perturbation?
It turns out I am not the only one with this concern. Zihao Yang at Northeastern University, together with Mosh Levy and Yoav Goldberg at Bar-Ilan University and the Allen Institute for AI, and their advisor Byron C. Wallace at Northeastern, have looked at this exact problem in a new preprint (arXiv:2605.01048). Their work crystallizes a fundamental blind spot in how many researchers evaluate large language models today: we rarely ask what the right baseline is for a counterfactual intervention.
The Compound Treatment Nobody Discussed
The core technique under scrutiny is called counterfactual prompting. The idea is elegantly simple: take an input text, surgically alter a single variable — say, a patient’s race, gender, or name — and see whether the model’s output changes. If it does, many studies conclude that the model is “sensitive” or “biased” with respect to that factor. The logic feels sound, and it has been used in dozens of papers to probe everything from medical QA bias to the faithfulness of chain-of-thought reasoning.
But Yang and colleagues point out a subtle statistical problem. Every time you edit a piece of text, you inevitably change more than just the variable you think you are manipulating. Swap “he” for “she,” and you have also altered the character sequence, the local syntax, maybe even the rhythm of the sentence. In the language of causal inference, the counterfactual edit is a compound treatment: it bundles the factor of interest together with incidental surface-form variation. This violates what the causal inference literature calls “treatment variation irrelevance,” the assumption that incidental differences in how a treatment is delivered do not themselves affect the outcome.
Put differently, imagine a clinical trial where the experimental drug comes in a bright red pill and the placebo is a white sugar cube. If patients in the drug group report different side effects, you cannot tell whether those effects came from the drug or from the color, shape, or expectation. The right baseline is not “no pill” — it is a pill that looks identical but contains no active ingredient. In language model experiments, the equivalent of the identical-looking pill is a paraphrase: a text that preserves the semantic meaning but varies the surface expression. And yet, Yang’s team found, almost no one in the LLM evaluation literature runs this control.
Fourteen Percent, Twice
The team tested just how serious this omission is by conducting a head-to-head comparison on a benchmark called MedQA, a collection of US medical licensing exam questions adapted into clinical vignettes. They took 919 such questions and applied two kinds of edits. The first was a targeted change: they swapped the patient’s stated gender in the prompt. The second was a paraphrase: they preserved all medical facts but rewrote the question using different words.
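To make the targeted edit concrete, here is a minimal rule-based sketch of a gender swap. It is an illustration only: the article does not describe how the authors produced their edits, and the word list, the `swap_gender` helper, and the vignette are all hypothetical. Note that even this tiny edit touches several words at once, which is exactly the surface-form variation the paraphrase control is meant to account for.

```python
import re

# Hypothetical swap table; real clinical text would need a much fuller list.
SWAPS = {
    "he": "she", "she": "he",
    "him": "her", "his": "her",
    "her": "his",  # lossy: "her" can also correspond to "him"
    "man": "woman", "woman": "man",
    "male": "female", "female": "male",
}
PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)

def swap_gender(text):
    """Replace each gendered word with its counterpart, keeping capitalization."""
    def replace(match):
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    return PATTERN.sub(replace, text)

# Made-up vignette, not taken from MedQA.
vignette = "A 54-year-old man reports chest pain. He says his father had a heart attack."
edited = swap_gender(vignette)

# Words that differ between the two versions (both have the same word count).
changed = [(a, b) for a, b in zip(vignette.split(), edited.split()) if a != b]
print(edited)
print(changed)
```

The paraphrase condition has no such simple recipe; it needs a human or a model to reword the question while keeping every medical fact intact.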
When they fed the original, the gender-swapped, and the paraphrased versions into a language model, they recorded how often the model’s top predicted answer changed. The results were startling. The gender-swapped prompts caused a prediction flip rate of 14.9 percent. That number, on its own, might suggest the model is sensitive to patient gender — a red flag for bias. But the paraphrased prompts, which carried no intentional demographic variation at all, induced a flip rate of 14.1 percent. The two rates were statistically indistinguishable.
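The comparison of the two flip rates can be sketched in a few lines. Treat this as a rough illustration rather than the authors' analysis: the counts 137 and 130 are back-calculated from the rounded percentages, and the unpaired two-proportion z-test ignores the fact that the same 919 questions appear in both conditions. A paired test would need per-question results that the article does not report.

```python
import math

def flip_rate(original_preds, edited_preds):
    """Fraction of examples whose predicted label changed after the edit."""
    flips = sum(o != e for o, e in zip(original_preds, edited_preds))
    return flips / len(original_preds)

def two_proportion_z(flips_a, flips_b, n):
    """Unpaired two-proportion z-test with equal group sizes (an approximation here)."""
    p_a, p_b = flips_a / n, flips_b / n
    pooled = (flips_a + flips_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Toy labels to show flip_rate; not data from the paper.
toy_rate = flip_rate(["A", "B", "C", "D"], ["A", "C", "C", "D"])

# Counts reconstructed from the rounded percentages (assumption).
N = 919
GENDER_FLIPS, PARAPHRASE_FLIPS = 137, 130
z, p_value = two_proportion_z(GENDER_FLIPS, PARAPHRASE_FLIPS, N)

print(f"toy flip rate: {toy_rate:.2f}")
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

Under these assumptions the z statistic falls well inside the usual plus or minus 1.96 band, which is consistent with the article's statement that the two rates cannot be told apart.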
That single comparison reshapes the conversation. If swapping gender changes the model’s output about as often as simply rephrasing the question, then the observed “gender effect” cannot be distinguished from the model’s general brittleness to wording. Attributing the flips to gender sensitivity without controlling for the paraphrase baseline would be scientifically unwarranted.
The key insight here is not that language models are insensitive to gender. It is that the standard counterfactual approach cannot disentangle demographic sensitivity from plain surface-form sensitivity unless a baseline is included. Without that control, experiments produce an uninterpretable mixture of both signals.
A New Framework for Apples-to-Apples Comparisons
To fix this, Yang and colleagues propose a systematic framework. Instead of simply computing flip rates for a single counterfactual edit, they compare the distribution of model responses under the targeted intervention to the distribution of responses under meaning-preserving paraphrases. The question is not “does the model change its mind when we swap gender?” but rather “does swapping gender cause reliably more change than we would expect just from rewording the question?” They formalize this as a statistical test, measuring whether the differences observed under the target intervention are larger than the differences induced by paraphrasing.
The framework can be used with different metrics, and Yang’s team evaluates a range of them. The simplest, and the one most commonly reported in the literature, is the aggregate flip rate: the fraction of examples that get a different predicted label after the edit. They also consider mutual information and correlation-based measures such as the phi coefficient. But they find that these aggregate, population-level metrics have surprisingly little statistical power. In many realistic scenarios, they cannot detect a true underlying effect unless the effect size is enormous.
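For readers unfamiliar with phi, it is the correlation coefficient for a 2×2 table. The sketch below shows the standard formula; the table layout (rows for original versus edited prompts, columns for whether the model predicted a given answer) and the counts are illustrative assumptions, not figures from the paper.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]]."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        return 0.0  # convention used here when a margin is empty (phi is undefined)
    return (a * d - b * c) / denom

# Hypothetical counts:
#                 predicted X   predicted other
# original prompt      60              40
# edited prompt        55              45
phi = phi_coefficient(60, 40, 55, 45)
print(f"phi = {phi:.3f}")
```

Because phi is computed from pooled counts, it only sees how the totals move, not what happened to any individual question.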
The real power, they show, comes from per-sample distributional metrics. Instead of just looking at whether the model changed its top-1 prediction, these metrics compare the full probability distributions over possible answers for each example, before and after the edit. Measures like Jensen-Shannon divergence (JSD) or Kullback-Leibler divergence (KL) capture subtle but consistent shifts in the model’s confidence that aggregate metrics miss entirely. In their power analysis, per-sample metrics reached near-perfect detection rates in conditions where aggregate metrics barely rose above chance.
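A short sketch shows what a per-sample distributional metric looks like in practice. The functions follow the textbook definitions of KL and Jensen-Shannon divergence (base 2, so JSD lies between 0 and 1); the answer probabilities are invented for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits. Requires q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and finite even when p or q has zeros."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def top_choice(p):
    """Index of the most probable answer."""
    return max(range(len(p)), key=lambda i: p[i])

# Hypothetical probabilities over answer choices A-D for one question.
before = [0.55, 0.30, 0.10, 0.05]
after = [0.70, 0.15, 0.10, 0.05]

jsd = js_divergence(before, after)
print(f"top choice unchanged: {top_choice(before) == top_choice(after)}")
print(f"JSD = {jsd:.4f}")
```

In this toy case the top answer stays the same, so a flip-rate count would record nothing, yet the divergence is clearly nonzero.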
This is a concrete, actionable finding. If a research group wants to probe whether an LLM is sensitive to patient demographics, they should not simply report the fraction of examples where the predicted label flipped. They should record the model’s full distribution over answer choices for each example under both the original and edited prompts, then test whether the distributional shifts under the demographic edit exceed those under a paraphrasing control.
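One way to run that comparison is a paired sign-flip permutation test on per-example shifts. This is a sketch under assumptions: the article does not say which statistical test the authors use, the function name and parameters are hypothetical, and the divergence values are made-up placeholders.

```python
import random

def paired_permutation_test(target_shifts, paraphrase_shifts, n_resamples=10_000, seed=0):
    """One-sided sign-flip test: is the mean target shift larger than the
    mean paraphrase shift for the same examples?"""
    diffs = [t - p for t, p in zip(target_shifts, paraphrase_shifts)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to have either sign.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if resampled >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (at_least_as_large + 1) / (n_resamples + 1)
    return observed, p_value

# Hypothetical per-example JSD values (original vs. edited prompt).
gender_jsd = [0.08, 0.05, 0.11, 0.07, 0.09, 0.06, 0.10, 0.04, 0.12, 0.08, 0.07, 0.09]
paraphrase_jsd = [0.03, 0.02, 0.04, 0.03, 0.05, 0.02, 0.03, 0.01, 0.06, 0.04, 0.02, 0.03]

mean_excess, p_value = paired_permutation_test(gender_jsd, paraphrase_jsd)
print(f"mean excess shift = {mean_excess:.3f}, p = {p_value:.4f}")
```

The pairing matters: each question serves as its own control, so questions that are simply unstable under any rewording do not inflate the apparent demographic effect.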
Revisiting Previous Evidence
To demonstrate the framework’s impact, the team applied it to a prior analysis done on the MedPerturb dataset, a curated collection of medical QA examples with targeted demographic and stylistic perturbations. The original analysis had reported evidence that language models are sensitive to things like patient race and the writing style of the clinical note. But when Yang and colleagues reran the analysis with proper paraphrasing baselines and rigorous statistical tests, the picture changed dramatically.
Out of 120 individual tests — various combinations of clinical tasks, demographic variables, and model conditions — only 5 reached statistical significance once general model sensitivity was accounted for. The vast majority of the effects that had seemed like demographic biases dissipated under scrutiny. This does not mean the models are free of bias. It means the evidence for bias that was previously reported was largely indistinguishable from the model’s baseline instability. Without a control group, the signal was contaminated by noise that happened to look like a demographic pattern.
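A quick back-of-the-envelope calculation puts the 5-in-120 figure in context. It rests on assumptions the article does not confirm: a significance threshold of 0.05, no multiple-comparison correction, and independent tests. If the paper applies a correction, this calculation does not apply as written.

```python
from math import comb

n_tests, n_significant = 120, 5
alpha = 0.05  # assumed threshold; the article does not state one

observed_fraction = n_significant / n_tests
expected_false_positives = n_tests * alpha

# P(X >= 5) for X ~ Binomial(120, alpha): the chance of seeing at least five
# significant results if every null hypothesis were true and tests were independent.
p_at_least_observed = 1 - sum(
    comb(n_tests, k) * alpha**k * (1 - alpha) ** (n_tests - k)
    for k in range(n_significant)
)

print(f"observed fraction significant: {observed_fraction:.3f}")
print(f"expected false positives at alpha={alpha}: {expected_false_positives:.1f}")
print(f"P(at least {n_significant} by chance): {p_at_least_observed:.2f}")
```

Under those assumptions, five significant results out of 120 is no more than chance alone would be expected to produce, which fits the article's reading that most of the earlier evidence does not stand apart from baseline instability.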
Importantly, the team also showed that their framework does more than debunk: it can confirm real directional bias when it exists. They turned to a different domain: occupational biography classification, where the task is to infer someone’s profession from a short biographical description. Here, they introduced a targeted gender perturbation and compared it to a paraphrase baseline. The per-sample metrics picked up a clear, statistically significant directional effect: the model systematically associated certain professions with certain genders, even after controlling for surface-form sensitivity. The effect was small in absolute terms, but it was real and detectable. The framework identified it, while aggregate metrics might have missed it or mistaken it for noise.
Regression as a Scalpel
Beyond the distributional metrics, the team highlights a third approach that turns out to be uniquely powerful: treating the edit as a predictor in a regression model. Instead of comparing response distributions in aggregate or per-sample, you can directly model how the probability of a specific answer changes as a function of the edited attribute. This does two things at once: it characterizes the direction of the effect (does swapping gender make the model more or less likely to predict “heart disease”?), and it estimates the magnitude of that shift, all while naturally accounting for example-level variability.
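The regression idea can be sketched with a deliberately simple model: ordinary least squares of the answer probability on an edit indicator, with a separate intercept for each example. This is an assumption-laden illustration, not the authors' specification (the article does not give one, and a logistic or mixed-effects model may be more appropriate in practice). The `edit_effect` helper and the probabilities are hypothetical.

```python
def edit_effect(prob_original, prob_edited):
    """Slope of P(answer) on an edit indicator (0 = original, 1 = edited),
    with one intercept per example, fitted by OLS after within-example centering."""
    num = den = 0.0
    for y0, y1 in zip(prob_original, prob_edited):
        y_mean = (y0 + y1) / 2
        for x, y in ((0, y0), (1, y1)):
            # Centering within the example removes its individual baseline.
            x_c, y_c = x - 0.5, y - y_mean
            num += x_c * y_c
            den += x_c * x_c
    return num / den

# Hypothetical model probabilities for the answer "heart disease" on four vignettes.
p_original = [0.62, 0.40, 0.75, 0.55]
p_gender_swapped = [0.55, 0.36, 0.70, 0.47]

slope = edit_effect(p_original, p_gender_swapped)
print(f"estimated shift in P(heart disease) after the swap: {slope:+.3f}")
```

The sign of the slope gives the direction of the effect and its size gives the magnitude. With exactly two conditions per example, this slope equals the mean within-example difference; the regression form becomes more useful once paraphrase conditions or additional covariates are added as further predictors.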
In the occupational biography task, regression revealed systematic biases that per-sample metrics confirmed as significant but did not characterize as finely. This suggests a practical workflow for future studies: first, compare target edits against paraphrase baselines using per-sample distributional metrics to establish whether there is an effect above background noise; then, if an effect is found, use regression to measure its direction and strength.
The Practical Path Forward
What Yang and colleagues have produced is, at its core, a methodological corrective. Counterfactual prompting is a valuable tool, but it has been used with insufficient care. The lesson is not that researchers should stop probing LLMs for biases or unfaithfulness. It is that they must — as a matter of basic experimental design — include a meaning-preserving control condition. Anything less invites the statistical equivalent of a placebo effect to masquerade as a demographic signal.
The team’s framework is publicly available and can be integrated into existing evaluation pipelines. It does not require new models or exotic hardware; it just requires a bit more rigor at the data preparation and analysis stage. For a field that moves quickly and often rewards striking findings over careful ones, this is a rare and welcome call to slow down and do the control experiment.
After reading this paper, I am keen to rethink my own casual counterfactual experiments. The urge to probe an AI by changing one word and watching what happens is natural, almost irresistible. But as Yang, Levy, Goldberg, and Wallace make compellingly clear, a single word change is never just one change, and the only way to know what you are really measuring is to ask the same question in a different way and compare. That is not just good science. It is the difference between chasing shadows and seeing clearly.
Yanjiang is an online editor of Loom Science
References
- Zihao Yang et al., Compared to What? Baselines and Metrics for Counterfactual Prompting, arXiv:2605.01048
