When Transformers Become Partial Differential Equations

Lynn · May 20, 2026, 3:18am


20 May 2026, Yanjiang


Transformer training becomes a partial differential equation when token distributions evolve like probability flows under the attention mechanism’s current.

Why does deep learning work? That question has accompanied the ascent of transformer architectures from curiosity to civilisation‑changing force. We build models with hundreds of billions of parameters, train them on datasets so vast they might as well be infinite, and watch them learn to write, translate, and reason — yet the mathematical principles that guarantee this training will succeed, or even converge to something useful, remain an archipelago of partial answers. A preprint (arXiv:2605.17660) from a team led by Gabriel Peyré at CNRS, ENS, and PSL Université, with first author Raphaël Barboni at Bocconi University and collaborators at Rice and RIKEN, now proposes a theory that transforms training into a kind of fluid dynamics. It asks: what if we treat infinitely deep and infinitely wide transformers not as collections of matrices and gradients, but as continuous probability distributions evolving through layers? In that limit, training a transformer becomes a partial differential equation.

The vision is audacious because it requires us to imagine a neural network with no discrete parts at all. When you let the number of attention heads and the number of layers both tend to infinity, what were once individual neurons or token vectors dissolve into a gas of infinitesimal particles, a mean‑field regime familiar from statistical physics. A ResNet, in this picture, becomes a neural ODE: a single token distribution that drifts through layer‑time governed by an ordinary differential equation. But a transformer introduces something qualitatively new. Its attention mechanism couples multiple token distributions at each layer, forcing them to interact like currents in a fluid. What emerges is not an ODE but a PDE: a rigorous mathematical model of attention written not in the language of linear algebra but in that of transport equations and Wasserstein flows.
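To make the picture concrete, here is a minimal NumPy sketch of tokens drifting through layer‑time under an attention velocity field. It is an illustration under stated assumptions, not the paper's model: the function names, the single‑head softmax form, the identity key matrix, and the forward Euler discretisation are all choices made here for the example.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)


def attention_velocity(X, Q, V):
    """Velocity of token i: sum_j softmax_j(<Q x_i, x_j>) V x_j.

    X has one token per row. The key matrix is taken to be the identity
    (an assumption of this sketch), so only Q and V appear.
    """
    weights = softmax(X @ Q.T @ X.T, axis=1)  # (n, n), each row sums to 1
    return weights @ X @ V.T


def evolve_tokens(X0, Q, V, depth=1.0, n_layers=100):
    """Forward Euler in layer-time: each residual layer is one small step."""
    X = X0.copy()
    dt = depth / n_layers
    for _ in range(n_layers):
        X = X + dt * attention_velocity(X, Q, V)
    return X


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(8, 2))           # 8 hypothetical tokens in 2-D
    Q = rng.normal(size=(2, 2))
    V = -np.eye(2)                             # hypothetical value matrix
    print(evolve_tokens(tokens, Q, V, n_layers=200))
```

Every token's velocity depends on all the other tokens through the softmax weights, which is the coupling the river analogy points at. Letting `n_layers` grow with `depth` fixed is the deep limit, and letting the number of tokens grow turns the rows of `X` into samples from a distribution.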

Think of it like a river carrying thousands of leaves. Each leaf represents a token vector, and the river’s current is the attention mechanism, coupling the motion of every leaf to every other. As the river flows deeper (more layers), the leaf positions become a probability distribution. Training this system — adjusting the “attention parameters” that govern the current — is an optimisation problem over measures, not over finite vectors. The team’s key insight is that this optimisation can be re‑expressed as a gradient flow in the conditional Wasserstein metric space: the gradient of the training risk, computed via adjoint sensitivity analysis, pushes the distribution of parameters along the most efficient path in the space of probability measures. It is a hike down a loss landscape whose geography is infinitely richer than the Euclidean valleys we normally picture, but whose well‑posedness the authors prove with formidable care.
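For readers who want the shape of such an equation, the generic form of a Wasserstein gradient flow can be written down schematically. The symbols here are assumptions introduced for illustration: \nu_s is the distribution of parameters \theta at training time s, and F is the training risk viewed as a functional of that distribution. The paper's conditional Wasserstein setting, with its dependence on layer depth, is not reproduced.

```latex
% Wasserstein gradient flow of a risk functional F over parameter distributions \nu_s
\[
  \partial_s \nu_s
  = \nabla_\theta \cdot \Big( \nu_s \, \nabla_\theta \frac{\delta F}{\delta \nu}[\nu_s] \Big)
\]
% Particle version, for \nu_s = \frac{1}{N} \sum_{k=1}^{N} \delta_{\theta_k(s)}
% (training time rescaled by a factor N):
\[
  \frac{d\theta_k}{ds}
  = - \nabla_\theta \frac{\delta F}{\delta \nu}[\nu_s]\big(\theta_k(s)\big)
\]
```

The second line is ordinary gradient descent on a finite set of parameters. The first line is what that descent becomes when the parameters are so numerous that only their distribution matters.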

That is where the theory’s most striking promise lies: if the geometry of that landscape is benevolent, gradient flow will find a global minimum. Benevolence, in this setting, is captured by a condition on the Neural Tangent Kernel — essentially, whether the attention mechanism can distinguish all the right patterns in the data. The team proves that the NTK is injective if and only if certain log‑sum‑exp functions, which encode the transformer’s attention computations, are linearly independent modulo affine functions. This independence holds for discrete distributions, for uniform populations, for Gaussian mixtures — a surprisingly broad spectrum of the token arrangements one actually encounters in practice. Under that condition, and when the training loss is already small, the optimisation landscape contains no spurious local minima. As the authors write, “gradient flow converges to global minima when the initial loss is sufficiently small, eliminating spurious local minima from the optimization landscape.”
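The independence condition can be probed numerically in a toy setting. The sketch below assumes that the log‑sum‑exp function attached to a discrete token set Y is f_Y(x) = log mean_j exp(<x, y_j>). That form, the function names, and the random test data are assumptions of this example, and a finite‑sample rank check is a sanity test, not a proof.

```python
import numpy as np


def log_sum_exp(X, Y):
    """f_Y(x) = log( mean_j exp(<x, y_j>) ), evaluated at each row x of X."""
    S = X @ Y.T                                  # (n_points, n_tokens)
    m = S.max(axis=1, keepdims=True)             # stabilise the exponentials
    return (m + np.log(np.exp(S - m).mean(axis=1, keepdims=True))).ravel()


def rank_modulo_affine(X, token_sets):
    """Count the log-sum-exp columns that remain linearly independent
    once constant and linear functions of x are added to the span."""
    affine = np.hstack([np.ones((X.shape[0], 1)), X])
    lse = np.column_stack([log_sum_exp(X, Y) for Y in token_sets])
    full = np.hstack([affine, lse])
    return np.linalg.matrix_rank(full) - np.linalg.matrix_rank(affine)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                       # evaluation points
    sets = [rng.normal(size=(3, 2)) for _ in range(3)]  # three 3-token sets
    print(rank_modulo_affine(X, sets))
    # A single-token set gives f_Y(x) = <x, y>, which is affine in x.
    print(rank_modulo_affine(X, [rng.normal(size=(1, 2))]))
```

The single‑token case shows why the condition is stated modulo affine functions: a distribution concentrated on one point produces a purely linear function of x, which carries no information beyond what the affine part already spans.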

But the small‑loss requirement is the catch, a subtle caveat that threads through the mean‑field literature on neural networks. Earlier work by Stephan Wojtowytsch on two‑layer ReLU networks in the same infinite‑width regime showed that gradient descent finds global minima, provided the network begins near one. The same condition reappears in the global convergence results for multi‑layer ResNets established by Ding and colleagues in 2021: convergence is guaranteed locally, not globally from an arbitrary random start. The transformer theory inherits this limitation. It tells us that, once a transformer is already doing a reasonable job, the mathematical structure is clean enough that gradient flow will finish the work. It does not yet explain how the network escapes the vast, chaotic expanse of a random initialisation to reach that near‑optimal neighbourhood in the first place. The question, then, is whether the NTK injectivity condition alone is enough to guarantee convergence from any starting point, or whether the “small initial loss” is a necessary price for the proof technique itself.

There is another horizon of subtlety, one that arises from the paper’s deliberate choice to fix the key matrix to the identity. The attention mechanism, in its full glory, multiplies queries, keys, and values in a learned, dynamic dance — but the theory simplifies by holding the key fixed, removing a degree of freedom that complicates the PDE analysis. An important question, sharpened by the recent work of Marion and co‑authors on how deep ResNets implicitly regularise themselves toward neural ODEs, is whether the full, unfixed attention mechanism can be brought into the same mathematical framework. The authors have proven NTK injectivity for a single attention layer; extending the proof to the whole depth of the transformer, with keys and queries coupled, remains open. These are not failures of the theory so much as frontier markers — and the frontier, here, is how far we can push a physical analogy before the discrete reality pushes back.
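One way to see what fixing the key does and does not remove, at the level of a single layer, is the following sketch. It assumes the standard bilinear score <Q x_i, K x_j>; the paper's exact parameterisation is not reproduced, and the function name is hypothetical.

```python
import numpy as np


def attention_scores(X, Q, K):
    """Pre-softmax scores <Q x_i, K x_j> for every pair of token rows in X."""
    return (X @ Q.T) @ (X @ K.T).T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 4
    X = rng.normal(size=(6, d))
    Q = rng.normal(size=(d, d))
    K = rng.normal(size=(d, d))

    # The scores depend on Q and K only through the product K^T Q, so the
    # same scores are obtained with the key fixed to the identity.
    merged = attention_scores(X, K.T @ Q, np.eye(d))
    print(np.allclose(attention_scores(X, Q, K), merged))
```

Under that assumption, the set of attention patterns a single layer can express is unchanged by setting K to the identity. What does change is the parameterisation that gradient flow moves through, and this check says nothing about how the two training dynamics compare.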

But perhaps the deepest shift the paper enacts is not technical but philosophical. We are accustomed to seeing a neural network as a discrete algorithm — a sequence of matrix multiplications and non‑linearities — and its training as a numerical procedure. This work recasts that training as a law of motion. The Wasserstein gradient flow is not an algorithmic shortcut; it is a description of the direction in which the distribution must evolve to minimise loss, as inexorably as heat flowing down a gradient. If a transformer obeys a PDE, then perhaps learning is less an engineered trick than a physical process. What we build from software and silicon is, in the limit, indistinguishable from a fluid.

Looked at this way, the transformer’s apparent magic — its ability to digest language and images and code — is not a failure of our understanding so much as a signal that we need a new kind of theory, one rooted in measure spaces and transport equations, to comprehend it. The Barboni–Peyré framework does not explain why a transformer learns to recognise a metaphor; it explains how the mathematical skeleton — the infinite‑width, infinite‑depth skeleton — bends toward optimality. It is an anatomy lesson for a creature we are still assembling, and the lesson is that the creature, at heart, is governed by the same equations that describe wave fronts, crowds, and galaxies.

What this work ultimately challenges is the division between the discrete and the continuous in artificial intelligence. A real transformer has finite layers and finite heads; it is a digital artifact. But the mathematics suggests that its most fundamental properties — the convergence of training, the absence of deceptive local minima — are best understood in terms of an idealised continuum, the way a real gas is understood through the statistical‑mechanical ideal. We may train machines with gradient descent on GPUs, but in the limit, the process is a PDE flowing across a landscape of measures. The question is no longer whether training converges, but how the discrete, noisy optimisation we perform every day approximates this fluid ideal — and what that tells us about the nature of intelligence itself.

— Yanjiang

Yanjiang is an online editor of LoomSci.com.

References

Raphaël Barboni et al., Training Infinitely Deep and Wide Transformers, arXiv:2605.17660
Stephan Wojtowytsch, On the Convergence of Gradient Descent Training for Two-layer ReLU-networks in the Mean Field Regime, arXiv:2005.13530
Ding et al., On the Global Convergence of Gradient Descent for Multi-layer ResNets in the Mean-field Regime, arXiv:2110.02926
Marion et al., Implicit regularization of deep residual networks towards neural ODEs, arXiv:2309.01213
