Learning to Evolve: Strategy Genes Outperform Skills

28 May 2026, Yanjiang

Compact, gene-like memory guides AI agents to outperform verbose skill packages in scientific coding, revealing that representation structure matters more than raw information.

What is the best way to encapsulate an AI agent’s hard-won experience so that it can guide future decisions? The intuitive answer — one that might have occurred to any engineer — is to write it all down. Give the agent a full manual: an overview of the problem, a step‑by‑step workflow, a catalogue of common pitfalls, precise instructions for handling errors, and a detailed API reference. When the agent faces a similar coding task, it simply consults this thick file and follows the instructions. Naïvely, one would expect that more documentation means better guidance, and that the agent would thank you for the thoroughness.

A preprint (arXiv:2604.15097) from Junjie Wang, Yiming Ren, and Haoyang Zhang at Infinite Evolution Lab, EvoMap, and Tsinghua University suggests that something close to the opposite is true. Across 4,590 controlled trials spanning 45 scientific code‑solving scenarios, they find that a compact, gene‑like representation — barely 230 tokens long — consistently outperforms a 2,500‑token documentation‑heavy skill package. Not only does the verbose guide fail to help; on average it actively degrades performance, dragging the agent below a baseline that had no experience at all. The paper is a provocation dressed as a measurement: it asks us to reconsider what it means for an AI system to remember, and to ponder whether the way we store experience might matter more than the experience itself.

Think of teaching an apprentice. The old guild tradition gave novices a slim journal of condensed rules: never heat the solder too long; clean the tip each time; the joint should shine like a mirror. A modern training manual, by contrast, would add a chapter on the history of soldering, a decision tree for choosing flux, a troubleshooting table for seventeen failure modes, and a dozen pages of reference notes. Which document produces a better apprentice? The guild journal wins not because it contains more truth (the manual contains truths too) but because its compact, action‑oriented structure lets the learner absorb the signal without being drowned by the noise. The AI agents in Wang and colleagues’ study behave similarly: the skill packages, meant to be comprehensive, instead become a burden, their useful signal scattered among neutral and sometimes harmful sections. The gene, by contrast, is lean, focused, and, crucially, mutable.

This metaphor, helpful as it is, oversimplifies. An AI agent’s test‑time control is not an apprentice’s learning in the human sense; it is a constrained optimisation process where a language model integrates a fixed textual prompt with its own internal knowledge. The virtue of compactness is not about human cognitive load but about minimising interference and preserving the model’s ability to reason natively. So while the apprentice analogy captures the outcome, the mechanism lies deeper: in the representation’s alignment with how the model actually processes guidance.

Wang and colleagues did not begin with the gene. They began with the skilled labour of preparing documentation‑style experience packages — what they call a Skill. A full Skill includes an overview, a workflow, a list of pitfalls, an error‑handling guide, and an API reference. The team then let agents attack 45 scientific coding scenarios drawn from a benchmark whose granularities range from simple end‑to‑end problems to multi‑step challenges. The result was disquieting: under Skill guidance, the agents lost 1.1 percentage points of overall accuracy compared to the no‑guidance baseline. When the researchers dissected the Skill into its constituent sections, they found that only a narrow procedural slice was clearly useful; the other sections were neutral or even harmful. Extracting that slice into a fragment improved matters, but it still fell short of what a deliberately compact gene could achieve.

Figure 1: A compact Gene representation raises average accuracy by 3 percentage points over the no-guidance baseline, while a bulky Skill package lowers it by 1.1 points. Minimal, control-focused strategies outperform extensive documentation for test-time adaptation. (Source: arXiv:2604.15097)

This is where the dialectic sharpens. One could argue, fairly, that the Skill used here is a naive instantiation: a straightforward assembly of documentation chunks, not the more sophisticated protocolised skill representations that earlier work has explored. Zhou and colleagues, in their Memento‑Skills framework (arXiv:2603.18743), designed skills as structured protocols produced by a meta‑agent, complete with curated tool sequences and verification checkpoints. Wang’s team acknowledges that they did not test such advanced formulations, and it remains an open question whether a carefully engineered Skill could close the gap. But the strength of their study lies not in a final verdict on all possible skills but in a systematic demonstration that representation itself is a first‑order factor. The same information, organised differently, leads to dramatically different outcomes.

The researchers pressed this point further with a series of stress tests. They deliberately corrupted the gene’s content — feeding it the wrong algorithm, pointing it to the wrong domain — and performance suffered, as one would expect. But when they distorted the gene’s structure — inverting the priority of instructions, tightening constraints beyond what was strictly necessary — the gene remained competitive, and in some cases even improved. This asymmetry implies that the gene’s effect is not tied to one fixed surface form; what matters most is whether the encoded experience remains task‑appropriate. A Skill, by contrast, appears brittle: its usefulness depends on a particular arrangement of its sections, and even small additions can dilute the signal.

Condition      Avg. accuracy    Delta vs. no guidance (pp)
Skill          49.9%            -1.1
No guidance    51.0%             0.0
Gene           54.0%            +3.0

Gene outperforms both no guidance and Skill, with the largest average gain, while Skill falls below the unguided baseline. Compact strategy guidance at test time beats documentation-style procedural skills. (Source: arXiv:2604.15097)
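The deltas in the table can be checked with a few lines of arithmetic. The sketch below uses only the three average accuracies reported above; the dictionary and function names are illustrative and do not come from the paper.

```python
# Average accuracy (%) per condition, copied from the table above.
accuracy = {"Skill": 49.9, "No guidance": 51.0, "Gene": 54.0}


def delta_pp(condition: str, reference: str = "No guidance") -> float:
    """Difference between two conditions, in percentage points."""
    return round(accuracy[condition] - accuracy[reference], 1)


for name in accuracy:
    print(f"{name:12s} {accuracy[name]:.1f}%  {delta_pp(name):+.1f} pp")

# Gap between the two guided conditions, not listed in the table itself.
print(f"Gene vs. Skill: {delta_pp('Gene', 'Skill'):+.1f} pp")
```

Read this way, the spread between the two guided conditions (4.1 points) is larger than either condition's distance from the unguided baseline.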

The implications reach beyond one‑shot control. The researchers turned the gene into a carrier for iterative evolution, attaching a record of past failures in the form of compact warnings. Here, the results became genuinely arresting. When failure history was appended as plain text to a Skill, it added little. When it was attached to a gene, the same information became potent: the gene’s editable structure allowed the warnings to be integrated cleanly, distilled down to the precise insights that prevented repeated mistakes. The gene‑evolved systems were then tested on a demanding benchmark called CritPt. Over the course of a few months, two separate evolution runs each lifted their respective base models by more than nine percentage points — a gain that, while modest in absolute terms, represents a substantial relative jump from initial accuracies in the neighbourhood of 10 to 20 percent. The stronger evolved version reached an accuracy of more than one in four, illustrating that gene‑based test‑time evolution is not a theoretical curiosity but a measurable, repeatable phenomenon.
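To see why a gain of about nine points counts as a substantial relative jump, it helps to work the numbers. The sketch below is illustrative only: the article gives a range of "10 to 20 percent" for the starting accuracies and "more than nine" points for the gain, so the inputs are round endpoints of that range, not figures reported for any particular model.

```python
def relative_gain(base_pct: float, gain_pp: float) -> float:
    """Relative improvement (%) for an absolute gain in percentage points."""
    if base_pct <= 0:
        raise ValueError("base accuracy must be positive")
    return 100.0 * gain_pp / base_pct


# Illustrative inputs only: round endpoints of the range quoted in the text.
for base in (10.0, 20.0):
    gain = 9.0
    print(
        f"base {base:.0f}% + {gain:.0f} pp -> {base + gain:.0f}% "
        f"({relative_gain(base, gain):.0f}% relative gain)"
    )
```

Under these assumed inputs, the same nine-point gain is a 90% relative improvement from a 10% start and a 45% improvement from a 20% start.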

What all this points toward is a quiet but consequential shift in how we think about AI experience. For years, the dominant paradigm has treated memory as a retrieval problem: store everything, and when the time comes, fetch the most relevant snippet. The Skill package, with its exhaustive documentation, is a natural expression of this mindset. Wang and colleagues’ gene represents a different philosophy — one borrowed, consciously or not, from biology. A biological gene is not a full‑text manual for building an organism. It is a compact, evolvable set of constraints that, when placed in the right cellular environment, guides development. It can be mutated, recombined, and selected. The gene representation they deploy has precisely these properties: it is compact, control‑oriented, and designed to be a substrate for evolution, not a static archive.
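What "compact, control-oriented, and evolvable" might look like as a data structure can be sketched in a few lines. Everything here is hypothetical: the article does not describe the paper's gene schema, so the `Gene` fields, the `mutate`, `recombine` and `select` helpers, and the word-count size proxy are assumptions made for illustration.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, Sequence


@dataclass(frozen=True)
class Gene:
    """Hypothetical compact strategy record; field names are illustrative."""
    steps: tuple[str, ...]             # short imperative instructions
    constraints: tuple[str, ...] = ()  # limits the agent should respect

    def render(self) -> str:
        """Flatten the gene into the text that would go into a prompt."""
        return "\n".join(self.steps + self.constraints)

    def size(self) -> int:
        """Whitespace word count, a crude stand-in for a real token count."""
        return len(self.render().split())


def mutate(gene: Gene, new_step: str, rng: random.Random) -> Gene:
    """Return a copy with one randomly chosen step replaced."""
    i = rng.randrange(len(gene.steps))
    steps = gene.steps[:i] + (new_step,) + gene.steps[i + 1:]
    return replace(gene, steps=steps)


def recombine(a: Gene, b: Gene, rng: random.Random) -> Gene:
    """Single-point crossover on steps; constraints are merged without repeats."""
    cut = rng.randint(0, min(len(a.steps), len(b.steps)))
    steps = a.steps[:cut] + b.steps[cut:]
    constraints = tuple(dict.fromkeys(a.constraints + b.constraints))
    return Gene(steps=steps, constraints=constraints)


def select(candidates: Sequence[Gene], score: Callable[[Gene], float]) -> Gene:
    """Keep the candidate with the highest score (scoring is left to the caller)."""
    return max(candidates, key=score)
```

The record is immutable, so every mutation or recombination yields a new candidate that can be scored and kept or discarded, which is the property that separates an evolvable object from a static archive.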

An important question raised by earlier work on experience reuse — notably the protocolised skills of Memento‑Skills — is whether this compactness trades away the richness that documentation can provide. Could a gene, by omitting the “why” behind a solution, lead the agent astray in novel situations? The study’s answer is partial but suggestive: when the gene was stripped of its domain knowledge and given the wrong algorithm, performance indeed collapsed, confirming that content matters. But the gene’s resistance to structural distortion implies that rich documentation may be overkill for many routine tasks; what the agent needs is not the full story but the decisive move. It is a classic precision‑vs‑recall trade‑off, with the gene betting on precision and winning across a broad test bed.

Perhaps the most thought‑provoking finding is not the numbers but the lesson embedded in the gene’s ability to absorb failure. The researchers found that simply appending a failure log did little; the effective intervention was to distil those failures into compact warnings — a sentence or two that captured the essence of what went wrong. In other words, the gene works not because it carries more information, but because it carries the right information, shaped by an evolutionary pressure that rewards actionable knowledge. This resonates with what we know about human expertise: a novice reads the entire manual; a master mutters, “never use that solvent, it degrades the seal.” The master’s sentence is a gene‑like warning, compact and evolution‑ready, capable of being passed on and refined.
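The difference between appending a failure log and distilling it can be made concrete. The sketch below is a simple frequency heuristic, not the authors' method, which the article does not describe: it assumes each log entry opens with a short sentence naming the cause, and the log entries themselves are made up for illustration.

```python
from collections import Counter


def distil_warnings(failure_log: list[str], max_warnings: int = 3) -> list[str]:
    """Collapse a raw failure log into a few one-sentence warnings.

    Assumption: the first sentence of each entry names the cause.
    Repeated causes are merged and the most frequent ones are kept.
    """
    counts: Counter[str] = Counter()
    first_seen: dict[str, str] = {}
    for entry in failure_log:
        sentence = entry.strip().split(".")[0].strip()
        if not sentence:
            continue
        key = sentence.casefold()  # merge entries that differ only in case
        counts[key] += 1
        first_seen.setdefault(key, sentence)
    return [f"Warning: {first_seen[key]}." for key, _ in counts.most_common(max_warnings)]


# Made-up log entries, for illustration only.
log = [
    "Used explicit Euler on a stiff system. Solution diverged after a few steps.",
    "used explicit Euler on a stiff system. Output contained NaN.",
    "Forgot unit conversion. Result off by a factor of 1000.",
]
for warning in distil_warnings(log, max_warnings=2):
    print(warning)
```

Three verbose entries shrink to two short warnings, which is the kind of compression the paragraph above describes: the details of each failure are dropped and only the actionable cause is kept.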

The road ahead is open. The authors acknowledge that they have not tested the most advanced skill protocols, and that their benchmark, while broad, still lives inside the controlled world of scientific coding. Whether the gene advantage persists when the tasks become more open‑ended, and whether protocolised skills can be redesigned to incorporate gene‑like compactness, are questions that follow naturally from this work. In the deeper sense, the study invites the AI community to stop treating experience as something to be stored and start treating it as something to be evolved. It suggests that the core problem in experience reuse is not how to supply more experience, but how to encode it as a compact, control‑oriented, evolution‑ready object. When you change the representation, you change what the agent can become.

— Yanjiang

Yanjiang is an online editor of LoomSci.com.

References

  • Junjie Wang et al., From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution, arXiv:2604.15097
  • Zhou et al., Memento-Skills: Let Agents Design Agents, arXiv:2603.18743