When AI Whispered to Thousands of Papers
26 May 2026, Yanjiang
A randomized experiment with over 31,000 preprints found that authors who received AI feedback were 12.55% more likely to revise their papers, with the largest effects among researchers from non-English-dominant regions and early-career scientists.
Science runs on conversation: the murmured suggestion after a seminar, the blunt critique scrawled in a referee report, the late-night email from a collaborator who has spotted a hole in your reasoning. This informal, invisible economy of feedback is what turns a rough idea into a robust result. But it is distributed as unevenly as any other precious resource. A well-known professor at a top department can count on dozens of sharp eyes; a postdoc at a small institution in a non-English-dominant country might send a manuscript into a void, hearing only the echo of their own uncertainties.
What if the silence could be broken not by a human gatekeeper, but by a machine? That is the audacious question behind a global experiment whose results now appear in a preprint (arXiv:2605.24180). A team led by Binglu Wang and Yian Yin delivered customised, AI-generated feedback to the authors of more than 31 000 arXiv preprints across many fields and geographic regions — a randomised field experiment that reached over 45 000 researchers. The findings offer a first causal glimpse of what happens when the feedback loop of science, long thought of as a scarce and private gift, is transformed into something that can be offered at scale.
The design was audaciously simple. In the first half of 2024, the team collected every recently posted arXiv preprint that had not yet been revised. For each, they used a large language model to generate a structured critique — noting strengths, flagging gaps, offering suggestions. Then they flipped a coin. Some authors received the AI feedback in their inbox; others heard nothing. A month later, the researchers checked whether the authors had uploaded a revised version of their manuscript.
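The coin-flip design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names, the 50/50 assignment probability, and the toy revision flags are all hypothetical.

```python
import random

def assign_arms(paper_ids, seed=0):
    """Randomly place each preprint in 'treatment' (feedback emailed) or 'control'.

    The 50/50 split is an assumption; the article only says a coin was flipped.
    """
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    return {pid: rng.choice(["treatment", "control"]) for pid in paper_ids}

def revision_rates(assignment, revised):
    """Share of papers in each arm that had a revised version by the end of the window."""
    rates = {}
    for arm in ("treatment", "control"):
        ids = [pid for pid, a in assignment.items() if a == arm]
        rates[arm] = sum(revised[pid] for pid in ids) / len(ids) if ids else float("nan")
    return rates

def relative_lift(rates):
    """Relative increase of the treatment revision rate over the control rate."""
    return (rates["treatment"] - rates["control"]) / rates["control"]

# Toy data (hypothetical): four preprints and whether each was revised a month later.
toy_assignment = {"p1": "treatment", "p2": "treatment", "p3": "control", "p4": "control"}
toy_revised = {"p1": True, "p2": True, "p3": True, "p4": False}
toy_rates = revision_rates(toy_assignment, toy_revised)
```

Because assignment is random, any systematic gap between the two arms' revision rates can be attributed to the feedback email rather than to differences between the papers or their authors.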
The headline number was striking: authors who received AI feedback were 12.55% more likely to revise their papers than those in the control group. To a physicist, a 12.55% relative increase sounds modest but real — roughly the experimental signal you might celebrate if you had built a delicate interferometer and found a faint fringe shift that wasn’t just noise. And the effect was not evenly distributed. It was strongest among authors from non-English-dominant research regions, teams with lower h-indexes, and early-career scientists — exactly the groups that the traditional feedback economy tends to leave behind. The whisper reached deepest where the silence had been loudest.
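A relative figure like this means different things at different baselines. The short calculation below converts the 12.55% relative lift into absolute percentage points; the baseline revision rates it loops over are hypothetical, chosen only for illustration, since the control-group rate is not quoted here.

```python
RELATIVE_LIFT = 0.1255  # the 12.55% relative increase reported for the treatment group

def treated_rate(baseline, relative_lift=RELATIVE_LIFT):
    """Revision rate implied for the treatment group, given a control-group baseline."""
    return baseline * (1 + relative_lift)

def absolute_gain_points(baseline, relative_lift=RELATIVE_LIFT):
    """Absolute gain in percentage points implied by the relative lift."""
    return 100 * baseline * relative_lift

# Hypothetical control-group revision rates (assumptions, not figures from the study).
for baseline in (0.05, 0.10, 0.20):
    print(
        f"baseline {baseline:.0%}: treated {treated_rate(baseline):.2%}, "
        f"gain {absolute_gain_points(baseline):.3f} percentage points"
    )
```

At an assumed 10% baseline, for instance, the same relative lift amounts to 1.255 percentage points, or roughly one extra revised paper for every 80 authors contacted.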
One might object that 12.55% is only a relative gain on a low baseline, and that many papers go unrevised not for lack of feedback but because their authors judge them complete, or because academic incentives push them toward new projects rather than polishing old ones. This is a legitimate caution, but it misses the deeper pattern. The team found that exposure to AI feedback also increased the likelihood that authors would use large language models in their subsequent preprints. Something beyond a one-off revision was happening: a subtle but measurable shift in scientific practice, a kind of habit formation in the presence of a new tool.
Here the study’s secondary outcome becomes tantalising. The feedback was not designed to be transformative in the way a visionary human reviewer might be — it was courteous, constructive, and precise, but it rarely delivered the kind of destabilising intellectual provocation that overturns a paradigm. An important question raised by earlier work on AI-assisted review, notably a study by researchers exploring peer-review dynamics with LLM agents, is whether machine-generated suggestions, while prompt and polished, can achieve the conceptual friction that often pushes research forward. Polite feedback, runs the worry, might smooth the surface of a manuscript without troubling its deeper assumptions.
This result also sits in an interesting tension with a large-scale study, involving more than 100 NLP researchers, that compared LLM-generated research ideas with those of human experts. That work found that the AI’s ideas were judged more novel than the experts’ ideas but slightly weaker on feasibility, hinting that the kind of feedback that sharpens an existing paper may be qualitatively different from the kind that catalyses genuinely new directions. The global experiment by Wang and colleagues thus sharpens a question that goes well beyond metrics of revision: if AI feedback becomes ubiquitous, will it nudge science toward incremental polish at the expense of the messy, risky, sometimes brilliant provocations that come from human argument?
Yet the very scale of the experiment permits a counter-reflection that softens this concern. The same type of feedback that might feel too bland for a star researcher in a dense network could be a lifeline for someone who has never received a detailed critique at all. The equity dimension is not a secondary garnish; it is the main dish. When a mathematician at a small university in a country where English is not the dominant language of research receives a structured, thoughtful reading of their preprint, perhaps for the first time ever, the mere fact of being seen and responded to may be more consequential than the specific content of the response. The feedback, in this light, acts less like an intellectual sparring partner and more like a door that opens onto the hallway of global conversation.
What, then, does this experiment challenge? It challenges the assumption that high-quality scientific feedback is inherently a zero-sum resource that must be rationed by the academic elite. It challenges the implicit belief that the current distribution of critical attention is meritocratic rather than shaped by geography, language, and institutional prestige. And it challenges a deeper, almost unconscious picture of scientific discourse: the idea that a meaningful intellectual exchange requires two humans sharing the same cognitive and cultural space. The algorithm did not understand the physics, the mathematics, or the poetry of a research paper. But it could stage a credible enough imitation of understanding to provoke a human response — and sometimes, in science as in so many fields, a plausible gesture is enough to alter a trajectory.
The sleeper finding of the preprint may be the one about habit. The fact that authors who had been exposed to AI feedback later began using LLM tools in their subsequent work suggests that what the experiment really tested was not a single intervention but an introduction. It was an apprenticeship in a new kind of scientific literacy — one in which the researcher learns to treat the AI not as an oracle but as a useful, tireless, and occasionally wrong colleague. The 12.55% is not the endpoint; it is the first visible ripple from a stone that has been dropped into a global pond.
All of this must be held against the study’s limitations, which the authors acknowledge with unusual frankness. The observation window was a single month; we do not know whether the feedback led to better papers, only that it led to more revisions. There was no placebo: no control group received a human-written, identically structured message to test whether the effect was due to the mere act of being contacted. And the feedback itself, while customised, was generated by models that are known to hallucinate and to produce prose that sounds authoritative even when it is nonsense. The experiment thus stands as a landmark proof of concept, not as a finished product ready for policy.
The question this experiment leaves us with is not whether machines can serve as useful critics — the data suggest they can, within carefully drawn boundaries — but whether, as we scale this capability, we will teach researchers to write better science or simply to write what the machines like to read. The answer may depend on the invisible architecture of the feedback itself: whether it is designed to invite doubt, to challenge assumptions, to push authors toward the dangerous edge of their own ideas. A whisper, after all, can either soothe or alarm. Which one we amplify is a choice that remains in human hands.
— Yanjiang
Yanjiang is the founding editor of LoomSci.com, specializing in physics and science communication.
References
- Binglu Wang et al., Human-AI Collaboration in Science at Scale: A Global Large-scale Randomized Field Experiment, arXiv:2605.24180
- Jin et al., AgentReview: Exploring Peer Review Dynamics with LLM Agents, arXiv:2406.12708
- Si et al., Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers, arXiv:2409.04109