When Cooperation Becomes Betrayal: The Hidden Civilization Inside Your LLM
18 May 2025, Yanjiang
Researchers traced a prosocial override in the internal layers of LLMs that suppresses rational Nash play, and showed that it can be steered in either direction with concept clamping.
Imagine you’re playing a game of Prisoner’s Dilemma against what you believe is a coldly rational opponent. You both know the optimal strategy is to defect — it’s the Nash equilibrium, the move that maximizes individual payoff regardless of what the other does. But your opponent, despite knowing this, chooses to cooperate. Again and again. Not out of irrationality, but because something inside its decision-making machinery overrides the correct calculation at the last possible moment.
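For readers who want to see why Defect is the Nash action here, the following is a minimal sketch in Python. The payoff numbers are an assumption (a common textbook choice), not values taken from the paper, and the function names are purely illustrative.

```python
# Illustrative payoffs (an assumption, not taken from the paper).
# PAYOFF[(my_move, their_move)] is my payoff; "C" = Cooperate, "D" = Defect.
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}
MOVES = ("C", "D")


def best_response(their_move):
    """Return the move that maximizes my payoff against a fixed opponent move."""
    return max(MOVES, key=lambda mine: PAYOFF[(mine, their_move)])


def pure_nash_equilibria():
    """List move pairs where each move is a best response to the other.

    The game is symmetric, so the same payoff table serves both players.
    """
    return [
        (a, b)
        for a in MOVES
        for b in MOVES
        if best_response(b) == a and best_response(a) == b
    ]


# Defect is the best reply to either opponent move, so it is a dominant strategy.
print({their_move: best_response(their_move) for their_move in MOVES})
print(pure_nash_equilibria())
```

With these payoffs, mutual defection is the only pure equilibrium even though mutual cooperation would pay each player more (3 instead of 1). That gap is what makes a cooperating opponent look puzzling.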
This is precisely what Paraskevas V. Lekeas and Giorgos Stamatopoulos discovered when they cracked open the black box of large language models playing strategic games. Their preprint (arXiv:2604.27167) doesn’t just document that LLMs deviate from Nash equilibrium — it shows why, and more importantly, how to reverse it.
The result is a portrait of AI behavior that is both unsettling and hopeful: the models possess the competence to play optimally, but an internal prosocial override suppresses that competence, like a diplomat who knows the winning argument but chooses peace instead. Unlike human diplomats, however, this override can be traced, quantified, and even controlled.
The Anatomy of a Strategic Decision
The team worked with four open-source models from the Llama-3 and Qwen2.5 families, ranging from 8 billion to 72 billion parameters, playing four canonical two-player games. The behavioral picture was clear: none of the models converged to Nash equilibrium in self-play, even after 50 rounds. Larger models came somewhat closer, but in no game did play reach a Nash distance of zero, the value that would indicate exact equilibrium play.
Then they opened the model up. Using a 32-layer Llama-3–8B as their test subject, the researchers employed a technique called linear probing to ask: what does the model actually know about its opponent, and about the Nash action, at each layer?
The answer is stark. Opponent history is encoded with near-perfect fidelity at the very first layer — 96% probe accuracy — and is progressively consumed by later layers. Nash action encoding, by contrast, never exceeds 56% throughout the entire forward pass. There is no dedicated Nash module. The model knows precisely what the other player did, but it struggles to represent what it should do.
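The idea behind linear probing can be sketched on synthetic data: fit a simple linear classifier on a layer's activations and see how well it recovers a label. The sketch below is a toy under stated assumptions, not the authors' setup. The "activations" are random vectors with a label-dependent shift of size `signal`, and all function names are hypothetical.

```python
import math
import random


def make_activations(n, dim, signal, rng):
    """Synthetic 'hidden states': Gaussian noise plus a label-dependent
    shift of size `signal` along the first coordinate (an assumption)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal if label == 1 else -signal
        data.append((x, label))
    return data


def train_probe(data, dim, epochs=20, lr=0.05):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def accuracy(probe, data):
    """Fraction of examples whose label the linear probe predicts correctly."""
    w, b = probe
    hits = 0
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += (z > 0) == (y == 1)
    return hits / len(data)


def probe_accuracy(signal, dim=8, n=400, seed=0):
    """Train on one synthetic sample and evaluate on a fresh held-out one."""
    rng = random.Random(seed)
    train = make_activations(n, dim, signal, rng)
    held_out = make_activations(n, dim, signal, rng)
    return accuracy(train_probe(train, dim), held_out)


strong = probe_accuracy(signal=2.0)  # feature clearly present in the layer
weak = probe_accuracy(signal=0.1)    # feature barely present
print(f"strongly encoded: {strong:.2f}, weakly encoded: {weak:.2f}")
```

A probe that scores far above chance suggests the information is linearly readable at that layer. One that hovers near 50% on a two-way label, as in the weak case, suggests it is not, which is the pattern the article describes for the Nash action.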
But here’s the twist. When the team used a logit lens technique — essentially eavesdropping on the model’s internal vote at each layer before the final decision — they found that for most of its forward pass (layers 0 through 23), the model privately favors the Nash action: Defect, in Prisoner’s Dilemma. Then, at layer 24, something remarkable happens. A prosocial override begins to surge. By layer 30, the probability of Cooperation reaches 84%. Only at the very last layer, layer 31, does the model commit to the final output — which, in the experiments, was Defect.
Cooperate surges mid-network before the model ultimately chooses to defect. This hidden deliberation shows how AI models weigh options in strategic games, informing efforts to align their behavior. (Source: arXiv:2604.27167)
The model computes the Nash action, then almost reverses it. It’s as if a mathematician solves a problem correctly, nearly talks herself into a different answer, and returns to the original result only on the final line, leaving the whole detour visible in the margins.
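The logit lens readout described above can be sketched in a few lines. This is a simplified toy: real implementations usually apply the model's final normalization before the unembedding, and the two-dimensional hidden states below are invented for illustration rather than taken from Llama-3.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_states, unembed):
    """Project each layer's residual-stream vector onto the unembedding rows
    of the action tokens and normalize into a per-layer 'vote'."""
    readouts = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed.values()]
        readouts.append(dict(zip(unembed.keys(), softmax(logits))))
    return readouts


# Hypothetical unembedding rows for the two action tokens.
unembed = {"Cooperate": [1.0, 0.0], "Defect": [0.0, 1.0]}

# Invented residual-stream snapshots at an early, a late, and the final layer.
hidden_states = [
    [0.0, 1.0],  # early: leans Defect
    [2.0, 0.3],  # late: Cooperate surges
    [0.5, 1.0],  # final: back to Defect
]

for name, vote in zip(("early", "late", "final"), logit_lens(hidden_states, unembed)):
    print(name, {action: round(p, 3) for action, p in vote.items()})
```

The point of the technique is that the same readout used for the final answer is applied to intermediate layers, so a mid-network swing toward Cooperate becomes visible even when the final output is Defect.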
The Override That Cannot Be Localized
Where does this prosocial override live? The researchers tried to isolate it. They identified the top five opponent-tracking attention heads, the ones most responsible for encoding what the other player did, and zero-ablated them, removing their output entirely. The result: zero change in the action distribution. These heads, at least, could not be blamed for the cooperative impulse.
This is not a bug in a single component. It is a distributed, emergent phenomenon — a behavior that arises from the collective interaction of thousands of parameters. The override is not a module; it is a pattern, like the shape of a sand dune that emerges from millions of individual grains moving together. Unlike real sand dunes, however, the pattern can be redirected with surprising precision.
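The ablation test itself is easy to sketch on a toy model. Everything below is hypothetical: the head names, the contribution values, and the assumption that the action logits are a plain sum of per-head contributions. The toy is built so that the "tracking" heads write nothing to the action logits, which is one way a zero-change result could arise. It makes no claim about how Llama-3 is actually wired.

```python
import math


def action_probs(head_outputs, ablate=()):
    """Sum per-head contributions to the [Cooperate, Defect] logits,
    skipping ablated heads, then apply a softmax."""
    logits = [0.0, 0.0]
    for name, contrib in head_outputs.items():
        if name in ablate:
            continue  # zero ablation: the head's output is replaced by zeros
        logits = [l + c for l, c in zip(logits, contrib)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical heads. The two "tracking" heads carry opponent history but
# contribute nothing along the action readout; the cooperative push is
# spread thinly over the remaining heads.
heads = {
    "track_0": [0.0, 0.0],
    "track_1": [0.0, 0.0],
    "other_0": [0.10, 0.25],
    "other_1": [0.15, 0.20],
    "other_2": [0.05, 0.30],
}

baseline = action_probs(heads)
ablated = action_probs(heads, ablate=("track_0", "track_1"))
print("shift after ablating tracking heads:", total_variation(baseline, ablated))
print("shift after ablating one other head:",
      total_variation(baseline, action_probs(heads, ablate=("other_2",))))
```

Comparing the action distribution before and after ablation, rather than inspecting the heads directly, is what lets a result like "zero change" be stated as a measurement.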
Steering the River of Thought
The team’s most striking intervention is a technique called concept clamping. The idea is simple: take the “direction” in the model’s internal representation space that corresponds to the Nash action, and inject a scaled version of that direction into the residual stream — the main highway of information flowing through all layers. This is like adding a current to a river to push it left or right.
The results are clean. At baseline, the probability of the Nash action (Defect) is 0.616. When the team steered in the negative direction, applying a steering coefficient of -5, which in this setup amplifies the Nash signal, the probability of committing to the Nash action jumped to 0.992. The model became almost perfectly rational. When they steered in the positive direction, weakening the Nash signal, the probability of Cooperation reached 0.887. The model became almost perfectly prosocial.
Bidirectional control. The same model, steered toward or away from Nash equilibrium, simply by nudging a single vector in its internal space.
This is not a matter of rewriting the model’s weights. It is a real-time intervention, applied during inference, that shifts behavior without retraining. The implications extend far beyond game theory.
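The core arithmetic of this kind of steering is a single vector addition. The sketch below is a toy with invented numbers, not the paper's implementation: a two-dimensional "residual stream", a hypothetical concept direction, and a sigmoid readout for the probability of Defect. The direction is oriented so that a negative coefficient raises that probability, mirroring the sign convention reported above.

```python
import math


def steer(hidden, direction, alpha):
    """Add a scaled concept direction to a residual-stream vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def p_defect(hidden, readout):
    """Toy readout: sigmoid of the projection onto a 'Defect' readout vector."""
    z = sum(w * h for w, h in zip(readout, hidden))
    return 1.0 / (1.0 + math.exp(-z))


# All values below are illustrative assumptions.
hidden = [0.5, 0.2]      # residual-stream state at the steered layer
readout = [1.0, 0.0]     # how the toy model reads out "Defect"
direction = [-1.0, 0.0]  # concept direction; its sign is a convention

for alpha in (-5.0, 0.0, 5.0):
    p = p_defect(steer(hidden, direction, alpha), readout)
    print(f"alpha={alpha:+.1f}  P(Defect)={p:.3f}  P(Cooperate)={1 - p:.3f}")
```

In a real model, the same addition would typically be applied at inference time through a hook on one or more transformer layers, leaving the weights untouched. The toy only shows why a single coefficient can move the output in both directions.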
Six Surprises, Three Invisible Phenomena
The behavioral experiments themselves surfaced findings that are valuable independent of the mechanistic analysis. The most notable: chain-of-thought reasoning — asking the model to think step by step — actually worsens Nash play in smaller models (8B parameters), but achieves near-perfect Nash play above 70B parameters. Scale alone does not guarantee rationality; it changes how reasoning interacts with the override.
The cross-play experiments — where different-sized models play against each other — revealed three phenomena invisible in self-play. A small model can unravel any partner’s cooperation by defecting early, like a single cynic corrupting a cooperative community. Two large models reinforce each other’s cooperative instincts indefinitely, creating a self-sustaining cycle of mutual generosity. And who moves first in a coordination game determines which Nash equilibrium the system reaches — a pure social convention, rendered in silicon.
The Philosopher’s Question
What does it mean that a machine can compute the optimal strategy for self-interest, but suppresses it in favor of cooperation? The override is not learned cooperation in any meaningful sense — it is an artifact of training on human text, where prosocial behavior is rewarded and often expected. The model is not choosing to be kind; it is reproducing a pattern it has seen.
Yet the effect is real. And it can be controlled. This raises a question that goes beyond technical curiosity: if we can steer an AI toward or away from pure self-interest with a single vector, what does that say about the stability of any behavior we might want to instill? The override is not a fixed property; it is a dial.
The team’s work does not claim to have solved the problem of AI alignment. But it has done something perhaps more valuable: it has shown that the problem is not one of missing competence. The models know what to do, and a late override pulls them away from it. And now we know where in the forward pass that override kicks in.
Yanjiang is an online editor of Loom Science.
References
- Paraskevas V. Lekeas and Giorgos Stamatopoulos, What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control, arXiv:2604.27167

