When a Language Model Learns to Derive Gravity

19 May 2026, Yanjiang

An AI apprentice carves cosmic perturbation equations into a stone of spacetime, learning gravity through worked examples and algebra.

Imagine a master stonemason teaching an apprentice to carve a complex rosette. The master demonstrates: here is the chisel, the angle, the rhythm. The apprentice watches, mimics, and soon produces a passable copy. But the apprentice does not understand why the stone breaks along those lines, nor what forces dictate the grain. If asked to carve a different pattern—one the master hasn’t shown—the apprentice might falter, or worse, produce something that looks right but cracks under stress.

Something like this apprenticeship is now unfolding in theoretical physics, but the apprentice is not a person. It is a large language model. In a preprint (arXiv:2605.08212), Anamaria Hell and Leander Thiele of the Kavli IPMU at the University of Tokyo ask whether a frontier LLM, supplied with worked examples and a computer algebra system, can reliably perform the algorithmic tasks that make up much of modern cosmology: calculating perturbations in modified theories of gravity. Their answer is a qualified yes, a finding that, depending on whom you ask, is either a glimpse of the future or a cautionary tale about mistaking mimicry for understanding.

What is the hardest part of being a theoretical cosmologist, anyway? The popular image is of a lone genius staring at a blackboard dense with Greek symbols. But ask any working physicist, and they will tell you: the hardest part is not the grand conceptual leaps. It is the algebra. To connect a modified gravity theory to the data (the cosmic microwave background’s faint glow, the clustering of galaxies), one must perform a sequence of computationally heavy yet conceptually straightforward operations. Write down the action, perturb it to second order, decompose the metric perturbations into scalar, vector, and tensor pieces (like breaking a figure skater’s motion into straight-line speed, a spin, and a wobble), then simplify using symmetry identities and extract the equations of motion. These steps are not intellectually deep. They are, however, brutally error-prone. A factor of two misplaced in the middle of a ten-page derivation can propagate into a wrong prediction, wasting months of follow-up work. It was inevitable that someone would ask: could a machine do this instead?
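For readers who want to see what the scalar, vector, and tensor split looks like on paper, here is a sketch in one common textbook convention. The symbols (A, B, ψ, E, S_i, F_i, h^TT_ij) and the choice of a flat background in conformal time are assumptions made for illustration; the article does not say which conventions the preprint uses.

```latex
% Perturbed flat FLRW metric in conformal time \tau (one common convention)
ds^2 = a^2(\tau)\left[ -(1+2A)\,d\tau^2 + 2B_i\,dx^i\,d\tau
       + (\delta_{ij} + h_{ij})\,dx^i dx^j \right]

% Scalar-vector-tensor split of the shift and of the spatial part
B_i = \partial_i B + S_i, \qquad \partial^i S_i = 0

h_{ij} = 2\psi\,\delta_{ij} + 2\,\partial_i\partial_j E
       + \partial_i F_j + \partial_j F_i + h^{TT}_{ij},
\qquad \partial^i F_i = 0, \quad \partial^i h^{TT}_{ij} = 0,
\quad \delta^{ij} h^{TT}_{ij} = 0

% Component count: 4 scalars (A, B, \psi, E)
%   + 2 transverse vectors (S_i, F_i) with 2 components each
%   + 2 tensor polarizations in h^{TT}_{ij}
%   = 10, the number of independent components of a symmetric 4x4 perturbation.
```

At linear order around such a background the three sectors do not mix, which is why the bookkeeping can be done piece by piece.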

The Apprentice Learns by Example

The answer begins with a setup that feels almost pedagogical. Hell and Thiele took Claude, a state-of-the-art large language model, and linked it to Maple, a computer algebra system. Then they gave it a “context”—a description of the gravitational theory, the metric ansatz, and, crucially, a fully worked example for an already understood model. Finally, they asked it to repeat the derivation for a slightly different theory. It is the classic teaching move of “now you try one”—only the student is silicon.

“A frontier LLM supplied with worked examples,” they write, “is able to solve most test problems.” On a suite of modified gravity scenarios (including R² inflation, constrained-scalar f(R) theories, and a model with a non-minimal R_μν R^μν coupling), the model, given a sufficiently rich example, correctly computed the full set of perturbation equations, sometimes in minutes. For a cosmologist accustomed to spending days verifying index contractions, this is tantalizing: the drudgery of derivation could, in principle, be offloaded to a tireless, endlessly patient assistant.

But here the cracks begin to show. The evaluation, by the authors’ own admission, was sometimes lenient. In one case, the LLM omitted the background equations of motion entirely, yet the final answer was judged a pass because the missing terms did not affect the result for that particular problem. The same omission could easily have led to wrong predictions in more complicated scenarios. The answer was technically correct, but the reasoning was unsound. A stonemason who carves a rosette that holds together only because of a hidden fault in the stone is not one you want working on your cathedral.

The Verification Gap

This is where earlier work sharpens the critical edge. Lu and colleagues, in a study asking whether language agents can replace human researchers, argued that current LLMs lack the systematic verification loops that scientists rely on instinctively. A neuro-symbolic architecture, like the one demonstrated by Sultan et al. for geometric proof generation, weaves a symbolic solver into the LLM’s reasoning chain and checks every step for logical consistency. But the Hell–Thiele pipeline does not yet include such a feedback mechanism. It cannot re-examine a derivation’s output against deep physical consistency conditions—counting degrees of freedom, for example, or verifying that a constrained scalar field has been perturbed alongside the metric. As a result, the model can “pass” a test while silently violating the principles that make a field theory physically meaningful.
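To make the idea of checking a step concrete, here is a toy sketch in Python. It is not the Hell and Thiele pipeline, nor the architecture of Sultan et al.; every function name in it is hypothetical. It tests one narrow thing: whether a candidate second-order expansion really is correct to second order, by measuring how fast its error shrinks as the perturbation gets smaller. A correct truncation leaves an error that scales as the cube of the perturbation, while a misplaced factor of two in the quadratic term leaves one that scales only as the square.

```python
import math


def residual_order(exact, approx, eps=1e-2):
    """Estimate p such that |exact(e) - approx(e)| ~ e**p for small e.

    Compares the residual at eps and eps/2; halving eps divides a
    residual of order p by 2**p.
    """
    r1 = abs(exact(eps) - approx(eps))
    r2 = abs(exact(eps / 2) - approx(eps / 2))
    if r2 == 0.0:
        return math.inf  # approximation is exact at this sample point
    return math.log2(r1 / r2)


def accept(exact, approx, order=2, tol=0.2):
    """Accept an expansion claimed correct through `order` only if its
    residual shrinks at least as fast as eps**(order + 1), within tol."""
    return residual_order(exact, approx) > order + 1 - tol


# Illustrative target (an assumption, not taken from the preprint):
# expanding sqrt(1 + x) to second order in x.
def exact(x):
    return math.sqrt(1.0 + x)


def candidate_ok(x):
    # Correct series: 1 + x/2 - x^2/8
    return 1.0 + x / 2 - x**2 / 8


def candidate_bad(x):
    # Same series with a factor-of-two slip in the quadratic term
    return 1.0 + x / 2 - x**2 / 4


if __name__ == "__main__":
    for name, cand in [("ok", candidate_ok), ("bad", candidate_bad)]:
        p = residual_order(exact, cand)
        print(f"{name}: residual order ~ {p:.2f}, accepted = {accept(exact, cand)}")
```

A check like this is cheap and catches only one class of slip. It says nothing about the deeper consistency conditions, such as counting degrees of freedom, that a full verification loop would need to cover.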

That this happens at all raises an alarm that even larger models may not silence. An important question, sharpened by the test-time scaling comparisons of Gao et al., is whether simply providing more examples, or prompting more carefully, can overcome these failures, or whether a fundamentally different verification architecture is required. Hell and Thiele’s own documentation leans toward the latter interpretation. When the context was reduced to a bare written description of the method, with no worked example, the LLM floundered, succeeding in only three of nine problems and averaging nine session restarts per problem. It could not reconstruct the rules of the game from first principles. This gap between retrieving information and executing a procedure reveals something deep: reciting the recipe for a rosette is not the same as holding the chisel and feeling the grain of the stone.

What Does It Mean to “Do” Physics?

The tension at the heart of this preprint is not whether AI will eventually participate in physics—it already does, in constrained settings—but what we mean when we say it “does” physics. If the machine’s output matches a human-crafted result, does the epistemic provenance matter? The dialectical tradition forces a sharper formulation of the question: is a physical theory the set of equations, or the entire chain of reasoning that guarantees their correctness? One can imagine a future in which an LLM, working in concert with a verification engine like that advocated by Sultan et al., becomes not a replacement for the physicist but an extension of the physicist’s own capacity for rigour—a second self that checks and re-checks every algebraic twist. But that future is not yet here.

The paper’s virtue—and it is a considerable one—is that it refuses to pretend otherwise. It documents the failures as carefully as the successes, quantifies where the model stumbles, and then invites the community to build something better. “This work,” the authors conclude, “is therefore also a call to develop more robust and automatic evaluation procedures for AI-assisted theoretical physics.” That invitation is the most significant thing about the study. It signals a shift from asking whether machines can replace us to asking how we must transform our own practices—our habits of verification, our standards of evidence—in order to collaborate with them safely.

Perhaps one day, a cosmologist who wants to test a new idea about dark energy will turn not to a blank page but to an AI that can chisel out the necessary equations while a symbolic solver watches over its shoulder. But before that day comes, we must teach the apprentice not just to mimic, but to know when its own carving is unsound and to ask for a second pair of eyes. The verification loop must close. Until it does, we should marvel at the apprentice’s dexterity while keeping a firm hand on the chisel—and remember that the deepest skill in any craft is not producing the artefact, but knowing what a good one looks like.

— Yanjiang

Yanjiang is an online editor of LoomSci.com.

References

  • Hell and Thiele, LLMs with in-context learning for Algorithmic Theoretical Physics, arXiv:2605.08212
  • Sultan et al., A Neuro-Symbolic Approach for Reliable Proof Generation with LLMs: A Case Study in Euclidean Geometry, arXiv:2505.14479
  • Lu et al., Can Theoretical Physics Research Benefit from Language Agents?, arXiv:2506.06214
  • Gao et al., Test-time Scaling Techniques in Theoretical Physics – A Comparison of Methods on the TPBench Dataset, arXiv:2506.20729