Learning to Trust a Crowd of Almost-Right Models

Lynn · June 8, 2026, 7:12am

08 Jun 2026, Yanjiang

Rashomon Partition Sets lay out the crowd of almost-right models that the data support, forcing researchers to confront what is robust and what is fragile.

In Akira Kurosawa’s film Rashomon, four witnesses recount the same crime, each telling a story that is coherent, plausible, and yet irreconcilable with the others. The film refuses to hand us a single correct account; it insists we sit with the discomfort of multiple truths. Statisticians building models of the world face a surprisingly similar dilemma. When several possible models all fit the data nearly as well, how do we decide which one to trust—and what do we lose if we pretend only one of them is real?

A team led by Tyler H. McCormick at the University of Washington, working with collaborators at Stanford University, has proposed a systematic answer. In a preprint (arXiv:2404.02141), they introduce a framework called Rashomon Partition Sets (RPSs), designed to map out the whole crowd of nearly‑equivalent models for a common class of problems: factorial data. The idea is not to pick the single best model and pretend the others never existed, but to enumerate every model that comes close—to give researchers a clear view of what is certain, what is fragile, and what remains genuinely unknowable from the data at hand.

A forest of possible partitions

Factorial studies are everywhere: a medical researcher measures an outcome—say, average telomere length—across combinations of factors like age, gender, education, and race. The natural question is which subgroups really differ from one another. In principle, the data can be carved up into an immense number of possible groupings, or partitions. Choosing a partition that is too coarse risks smoothing away real heterogeneity; choosing one that is too fine risks overfitting noise. Model uncertainty here is not a technical footnote—it is the central scientific issue.

McCormick and his colleagues tackle the problem within a Bayesian framework. They ask: which partitions of the factorial cells have high evidence, and how much do they differ in the stories they tell? Their answer is the Rashomon Partition Set: the collection of all partitions whose posterior probability falls within a chosen tolerance of the maximum a posteriori (MAP) model. Think of the MAP as the leading eyewitness; the Rashomon set includes every other witness whose testimony is almost as credible. The team works with an ℓ₀ prior, which encourages sparsity without imposing strong assumptions about which subgroups should be linked. Under this prior, models that avoid unnecessary complexity are naturally favoured, yet the data remain free to reveal intricate patterns of heterogeneity when they exist.
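
To make the definition concrete, here is a minimal Python sketch, not the authors' implementation. It assumes a stand-in score (negative within-pool squared error minus a penalty `lam` per pool, imitating an ℓ₀-style preference for fewer groups) and keeps every candidate partition whose score lies within `eps` of the best one. The function names, the toy cell means, and the values of `lam` and `eps` are all hypothetical.

```python
from typing import List, Sequence, Tuple

Partition = List[List[int]]  # each inner list is a pool of cell indices


def log_score(partition: Partition, cell_means: Sequence[float], lam: float) -> float:
    """Stand-in for an unnormalised log-posterior (an assumption, not the paper's formula):
    negative within-pool squared error minus lam for every pool used."""
    sse = 0.0
    for pool in partition:
        values = [cell_means[i] for i in pool]
        pooled = sum(values) / len(values)
        sse += sum((v - pooled) ** 2 for v in values)
    return -sse - lam * len(partition)


def rashomon_set(
    candidates: Sequence[Partition],
    cell_means: Sequence[float],
    lam: float,
    eps: float,
) -> List[Tuple[Partition, float]]:
    """Keep every candidate whose score is within eps of the best (MAP-like) score."""
    scores = [log_score(p, cell_means, lam) for p in candidates]
    best = max(scores)
    return [(p, s) for p, s in zip(candidates, scores) if s >= best - eps]


# Toy data (made up): three ordered cells, two of which look alike.
cell_means = [1.0, 1.1, 3.0]
candidates = [
    [[0, 1, 2]],      # pool everything
    [[0, 1], [2]],    # pool the two similar cells
    [[0], [1, 2]],    # pool the two dissimilar cells
    [[0], [1], [2]],  # no pooling
]
members = rashomon_set(candidates, cell_means, lam=0.1, eps=0.1)
for partition, score in members:
    print(partition, round(score, 3))
```

With these toy numbers, the partition that pools the two similar cells scores best, and the fully split partition is close enough to join it in the set; the other two candidates fall outside the tolerance.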

Critically, the team does not rely on random sampling to explore the space of partitions—an approach that might overlook models, especially those that are far from the MAP in structure but nearly as good in posterior probability. Instead, they construct a recursive algorithm that enumerates all partitions that respect a partial order among the factorial levels. When a partial order exists—for example, when higher education levels are expected to be associated with longer telomere length—the algorithm can prune the impossible and methodically list every viable partition. The upshot is a guarantee: within the constraints of the design, no plausible model is silently ignored.
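
The flavour of such a recursion can be sketched for a much simpler case than the one the paper handles. The Python below assumes a single factor with totally ordered levels, treats a partition as admissible only if each pool is a contiguous run of levels, and uses a cap on the number of pools (`max_pools`) as a stand-in for whatever pruning bound the real algorithm applies. The function name and the education labels are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple


def ordered_partitions(
    levels: Sequence[str], max_pools: int
) -> Iterator[List[Tuple[str, ...]]]:
    """Yield every split of `levels` into contiguous pools, using at most max_pools pools."""
    if not levels:
        yield []  # nothing left to place: one valid (empty) completion
        return
    if max_pools == 0:
        return  # prune: levels remain but no pools may be opened
    for cut in range(1, len(levels) + 1):
        head = tuple(levels[:cut])  # first pool takes the first `cut` levels
        for rest in ordered_partitions(levels[cut:], max_pools - 1):
            yield [head] + rest


# Hypothetical ordered education levels.
levels = ["<HS", "HS", "College", "Grad"]

all_splits = list(ordered_partitions(levels, max_pools=len(levels)))
print(len(all_splits))  # 8, i.e. 2 ** (4 - 1) contiguous splits

coarse_splits = list(ordered_partitions(levels, max_pools=2))
print(len(coarse_splits))  # 4: one single-pool split plus three two-pool splits
```

Because every admissible split is visited exactly once and nothing is sampled, the listing is exhaustive for this toy setting. Extending the idea to several factors and to a posterior-based pruning rule is where the real work of the paper lies.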

Reading the chorus

What does a Rashomon set look like in practice? The team re‑analysed a well‑known dataset on charitable giving first studied by Dean Karlan. The experiment varied whether a donation was matched, by what ratio, and the giver’s political leaning. Within the Rashomon set, some models showed a strong positive effect of matching among Democrats when certain suggested donation amounts were used, while other models—just as well supported by the data—indicated a substantial negative effect. The Rashomon set does not resolve this disagreement; it displays it openly. A researcher who examined only the MAP model would walk away with firm conclusions that, it turns out, the data themselves do not fully justify.

Figure (fig22): Dense blue bands highlight the most frequently estimated patterns, while the black line marks the true relationship. This reveals which findings are robust across many plausible models, narrowing uncertainty in complex data. (Source: arXiv:2404.02141)

The framework finds even richer application in public health. Using data from the National Health and Nutrition Examination Survey (NHANES), the team explored how telomere length—a biomarker linked to ageing—varies with hours worked, gender, age, and education, all further stratified by race. The Rashomon set reveals that certain patterns are robust: for instance, the direction of the age effect on telomere length appears consistent across nearly all high‑evidence partitions. But for other factor combinations, the sign and magnitude of the effect depend sensitively on which model one chooses. This is not a flaw in the method; it is a faithful map of what the data can and cannot resolve.

Figure (fig26): Telomere length differs by hours worked, gender, age, and education, and these differences vary across racial groups. This reveals true heterogeneity in health data, helping avoid false discoveries from overfitting. (Source: arXiv:2404.02141)
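
One simple way to read a Rashomon set along these lines is to ask how often its members agree on the sign of an effect. The sketch below is an illustration only: the effect estimates are made-up numbers, not results from the paper or from NHANES, and the function names and the all-models-agree threshold are assumptions.

```python
from typing import Dict, List


def sign_agreement(effects: List[float]) -> float:
    """Share of models whose estimate carries the majority sign (zeros count for neither)."""
    positive = sum(e > 0 for e in effects)
    negative = sum(e < 0 for e in effects)
    return max(positive, negative) / len(effects)


def classify(
    effects_by_contrast: Dict[str, List[float]], threshold: float = 1.0
) -> Dict[str, str]:
    """Label a contrast 'robust' if at least `threshold` of the models share a sign."""
    return {
        name: "robust" if sign_agreement(values) >= threshold else "fragile"
        for name, values in effects_by_contrast.items()
    }


# Made-up estimates, one per model in a hypothetical Rashomon set.
toy_effects = {
    "age": [-0.30, -0.25, -0.41, -0.28],
    "hours_worked": [0.12, -0.09, 0.05, -0.15],
}
labels = classify(toy_effects)
print(labels)
```

In this toy example every model agrees on the direction of the age effect, so it is labelled robust, while the hours-worked effect flips sign between models and is labelled fragile.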

The approach builds on earlier work by Kobylińska and colleagues, who showed that examining the Rashomon set can make medical predictions more trustworthy by revealing how much explanatory latitude the data genuinely allow. However, that earlier exploration used stochastic search, which can be haunted by the worry that some important model might have been missed. The University of Washington team’s exhaustive enumeration, although currently limited to designs with up to about eight factorial cells, eliminates that uncertainty entirely for small‑design problems.

Fairness and the limits of near‑equal models

There is, however, a subtler limitation. By definition, the Rashomon set is anchored to the MAP model; it collects all partitions whose posterior probability is within a chosen epsilon of that single peak. Partitions whose posterior probability falls further below the MAP are therefore excluded, even if they might be of special interest for fairness or equity. Earlier work by Parikh and colleagues highlights a tension here: a model that fits the data best for the majority can systematically overlook certain underrepresented groups. If the data offer only weak evidence about a disadvantaged subpopulation, then partitions that adequately describe that group may fall outside the Rashomon neighbourhood, simply because they are not as probable. The RPS, for all its transparency, inherits this focal bias. It is an open question whether one could extend the framework to encompass partitions that are less probable but still societally salient: a kind of Rashomon set that listens not only to the loudest voices but also to the quiet ones.

Still, what the team has achieved is a genuine advance in the craft of statistical explanation. For decades, scientific practice has largely privileged a single “best” model as if it were the truth. The Rashomon Partition Set replaces that comforting fiction with a map of plausible alternatives. It forces researchers to ask, not “What does the model say?” but “What range of conclusions do the data allow?” In an era when machine‑learning models are increasingly deployed in medicine, policy, and law, building this kind of epistemic humility into our tools is not a luxury; it is the foundation of trust.

The road from eight cells to wider factorial designs is far from straightforward, and the fairness questions remain open. But the direction is clear. A good model does not pretend to certainty where there is none. It shows us all the stories the data could be telling, and leaves us—wiser, humbler, more cautious—to decide what to do next.

— Yanjiang

Yanjiang is an online editor of LoomSci.com.

References

Venkateswaran et al., Robustly estimating heterogeneity in factorial data using Rashomon Partitions, arXiv:2404.02141
Kobylińska et al., Exploration of the Rashomon Set Assists Trustworthy Explanations for Medical Data, arXiv:2308.11446
Parikh et al., Who Are We Missing? A Principled Approach to Characterizing the Underrepresented Population, arXiv:2401.14512
