A Scaffold for Discovery: How AI Learns Natural Product Chemistry

11 May 2026, Lynn

A foundation model named NaFM learns the evolutionary grammar of natural products, mapping their scaffold families into a chemical landscape defined by biological source.

What if a molecule could tell you about its evolutionary past, its biological source, and its potential to become a drug—all encoded in the atoms of its structure? The natural products produced by microorganisms, plants, and animals have been shaped by billions of years of evolution into a vast chemical library, each compound carrying a record of the biosynthetic machinery that built it. Yet the deep-learning methods that have transformed small-molecule chemistry in recent years have largely treated these compounds as just another dataset, ignoring the evolutionary grammar that organizes them. Now, a team led by Yuheng Ding and Zhenming Liu at Peking University, with collaborators at the University of Washington and Xi’an Jiaotong University, has built a foundation model that learns to read that grammar directly. Their work appears in a preprint (arXiv:2503.17656) and suggests that by paying attention to molecular scaffolds—the shared structural cores that evolution conserves and decorates—AI can begin to see natural products not as isolated entries in a database, but as members of chemical families with deep biological histories.

The challenge the team confronts is this: existing deep-learning approaches to natural-products research are almost exclusively supervised, each model trained for a single narrow task—classifying a superclass, predicting a bioactivity, screening for a specific target. As the authors write, this “one-model-for-a-task paradigm often lacks generalizability and leaves significant room for performance improvement.” A model that learns to distinguish alkaloids from terpenoids learns nothing about which organisms produce them, or which enzymes stitch them together, or which scaffolds tend to yield antivirals. Each task is an isolated island. The team’s bet is that a foundation model—pretrained on a large corpus of natural products with objectives that force it to understand scaffold and side-chain relationships—can provide a shared chemical language that benefits many downstream tasks at once. This is not a new idea in AI: language models have shown that pretraining on a general objective can produce representations that transfer across tasks. But natural products, with their scaffold-driven biosynthesis, demand a pretraining strategy that respects biology’s modular logic.

The key insight is that evolution works with a scaffold-first architecture. A molecular scaffold is a shared ring system or skeleton; think of it as an architectural blueprint. Side chains, the functional groups that branch off, are like the furniture that gives each room its specific purpose. The same scaffold can appear in hundreds of different natural products, each with different decorations, produced by different organisms, and exhibiting different bioactivities. NaFM’s pretraining turns this observation into two objectives. In the first, a few atoms in the scaffold are masked and the model must reconstruct them from the surrounding structural context, a task that, the team argues, teaches the network to capture deep topological relationships. In the second, the model performs contrastive learning: pairs of molecules that share a similar scaffold are pulled closer in representation space, while those that differ are pushed apart. This learning is scaffold-aware: scaffolds are not all treated equally, and the weighting reflects how evolutionarily related they are. Together, masked graph reconstruction and scaffold-aware contrastive learning form the core of what the team calls the Natural Products Foundation Model, or NaFM.
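To make the contrastive half of that recipe concrete, here is a minimal sketch of a similarity-weighted contrastive loss. It is an illustrative stand-in, not the loss published for NaFM: the function name, the `scaffold_sim` matrix (assumed to hold pairwise scaffold similarities between 0 and 1, for example a Tanimoto score), the cosine similarity, and the temperature of 0.1 are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def scaffold_weighted_contrastive_loss(embeddings, scaffold_sim, temperature=0.1):
    """Soft-target contrastive loss (hypothetical, for illustration only).

    For each anchor molecule i, the other molecules j are treated as targets
    in proportion to scaffold_sim[i][j], so molecules with similar scaffolds
    are pulled together and dissimilar ones are pushed apart.
    """
    n = len(embeddings)
    total, counted = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        weight_sum = sum(scaffold_sim[i][j] for j in others)
        if weight_sum == 0.0:
            continue  # anchor has no scaffold neighbours; skip it
        logits = [cosine(embeddings[i], embeddings[j]) / temperature for j in others]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_denom = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        for j, logit in zip(others, logits):
            weight = scaffold_sim[i][j] / weight_sum
            total -= weight * (logit - log_denom)
        counted += 1
    return total / counted if counted else 0.0


# Toy data: molecules 0 and 1 share one scaffold, molecules 2 and 3 another.
toy_scaffold_sim = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
aligned = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
scrambled = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]
print(scaffold_weighted_contrastive_loss(aligned, toy_scaffold_sim))
print(scaffold_weighted_contrastive_loss(scrambled, toy_scaffold_sim))
```

The loss is low when embeddings of same-scaffold molecules already sit close together and high when they are scattered, which is the pressure that would shape the representation space during training.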

Figure 1. Natural product scaffolds serve as a blueprint: the model predicts missing atoms and bonds while learning which scaffold features are most important. This reveals the central role of scaffolds in linking a molecule’s origin, biosynthesis, and activity, a crucial step for designing new natural-product-inspired drugs. (Source: arXiv:2503.17656)

Imagine teaching a student to recognize families of pottery by showing them shattered fragments and asking them to reconstruct the original vessel, while simultaneously clustering shards from the same workshop together. The student would learn not only the shapes of complete pots but also the distinctive thumbprints of the potter: the handles, the glazing patterns, the clay composition. NaFM does something analogous with molecules: the masked graph objective teaches it local chemical grammar, while the contrastive objective teaches it the global signatures of biosynthetic pathways. Unlike the student, however, the model is not rebuilding molecules as an end in itself; it is learning a compressed mathematical representation that captures the essential features needed for downstream tasks. The distinction matters: the model’s “understanding” is not conceptual but statistical, a point the authors are careful to note.

The tests of this understanding are striking. When the team fine-tuned NaFM for taxonomy classification—assigning a natural product to its superclass or biosynthetic pathway—it outperformed existing specialized tools across the board. On one benchmark, it improved the area under the precision-recall curve by up to 24 percent for amino acids and peptides compared to NPClassifier, a widely used standard. Across all biosynthetic pathways, NaFM consistently topped the competition, and at the superclass level, it achieved higher scores in 53 of 71 categories, with particularly strong gains among carotenoids, alkaloids, and peptide-related classes. These numbers tell a story of a model that has genuinely learned the chemical dialects of different biological kingdoms.

But the most visually arresting evidence comes from what the model does without fine-tuning. The team embedded thousands of compounds from the Natural Products Atlas using the pretrained NaFM representations, then projected the high-dimensional vectors down to two dimensions. The resulting map resolved into distinct clusters corresponding to biological sources—animal, bacterial, fungal, plant, and chromistan—each occupying its own chemical territory. It is a map of life drawn in the language of atoms, and it reveals that the model has internalized something profound: the structural signatures that separate a plant metabolite from a microbial one are not arbitrary; they are the fossil record of divergent evolutionary paths. One notable failure case underscores the rule: a steroid-like scaffold appears smeared across multiple source clusters, because nature has independently arrived at that core structure in many lineages. The model’s confusion is, in a sense, a reflection of evolutionary truth.
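The projection step can be sketched in a few lines. The example below uses principal component analysis as a stand-in, since the article does not say which projection method the authors used; the embeddings are invented toy vectors rather than NaFM outputs, and the function names are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract the per-dimension mean from every embedding."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows (needs >= 2 rows)."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvector(matrix, iterations=200, seed=0):
    """Power iteration for the dominant eigenvector of a symmetric matrix."""
    rng = random.Random(seed)
    vec = [rng.random() + 0.1 for _ in matrix]
    norm = math.sqrt(dot(vec, vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [dot(row, vec) for row in matrix]
        norm = math.sqrt(dot(nxt, nxt))
        if norm < 1e-12:
            break  # matrix is (numerically) zero; keep the current vector
        vec = [x / norm for x in nxt]
    return vec


def pca_2d(rows):
    """Project embeddings onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    first = top_eigenvector(cov)
    eigenvalue = dot(first, [dot(row, first) for row in cov])
    # Deflation: remove the first component, then find the next one.
    deflated = [[cov[i][j] - eigenvalue * first[i] * first[j] for j in range(d)]
                for i in range(d)]
    second = top_eigenvector(deflated)
    return [(dot(r, first), dot(r, second)) for r in centered]


# Toy 3-D "embeddings": two well-separated groups standing in for two sources.
toy_embeddings = [
    [0.0, 0.0, 0.0], [0.1, 0.2, 0.0], [0.0, 0.1, 0.1],
    [10.0, 0.0, 0.0], [10.1, 0.1, 0.0], [9.9, 0.0, 0.2],
]
for x, y in pca_2d(toy_embeddings):
    print(round(x, 2), round(y, 2))
```

In the toy data the two groups land on opposite sides of the first axis. A linear projection like this only reveals structure that is already present in the embeddings, which is why source-level clustering without fine-tuning is informative.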

Figure 3. Natural products cluster by biological source after training, revealing distinct chemical signatures for animals, bacteria, fungi, and plants. This organization improves predictions of the gene clusters that produce them, enabling faster discovery of new drug leads. (Source: arXiv:2503.17656)

Such representations also show promise in drug discovery. The team tested NaFM in a virtual screening scenario, training it to predict inhibitors of a given protein target and then using it to rank candidate molecules from a library. The top-scoring predictions were then validated with full molecular dynamics simulations, computing the binding free energies between each candidate and the target. The model-selected compounds showed favorable energies, in some cases comparable to known active controls, and formed specific hydrogen-bonding networks with key residues in the binding pocket. When the team compared enrichment factors (essentially, how many times more true inhibitors appear among the top-ranked compounds than random selection would yield), NaFM consistently outperformed the commercial docking tool Glide across seven different targets. The model had learned not just to recognize scaffolds but to infer their pharmacological relevance.
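For readers unfamiliar with the metric, the enrichment factor can be computed as below. This is a sketch of the standard definition, the hit rate among the top-ranked fraction of a library divided by the hit rate of the whole library; the function name and the default top fraction of 1 percent are assumptions for this example, not values taken from the preprint.

```python
def enrichment_factor(scores, labels, top_fraction=0.01):
    """Enrichment factor of a ranked screening library.

    scores: model scores, higher meaning "more likely active".
    labels: 1 for a known active, 0 for an inactive or decoy.
    top_fraction: share of the library inspected (assumed default: 1%).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n = len(scores)
    total_actives = sum(labels)
    if n == 0 or total_actives == 0:
        raise ValueError("need a non-empty library with at least one active")
    k = max(1, int(round(n * top_fraction)))  # number of compounds inspected
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_in_top = sum(label for _, label in ranked[:k])
    # Hit rate in the top slice relative to the hit rate of random picking.
    return (hits_in_top / k) / (total_actives / n)


# Toy library of 10 compounds with 2 actives, both ranked at the top.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(enrichment_factor(toy_scores, toy_labels, top_fraction=0.2))
```

A value of 1 means the ranking does no better than picking compounds at random; larger values mean the actives are concentrated near the top of the list.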

Here caution is warranted. A skeptic, or a careful chemist, might point out that virtual screening is notoriously fickle; molecular dynamics simulations, while rigorous, are still approximations, and the binding free energies rely on force fields that are calibrated for protein-ligand systems rather than natural-product chemotypes. The team’s own data show that the improvement over Glide, while consistent, is not dramatic in every case, and the absolute number of validated hits remains modest. Moreover, the foundation model itself is trained on known natural products, which overwhelmingly come from terrestrial sources and are heavily biased toward compounds that are easy to isolate and characterize. The chemical space of “all possible natural products” is far larger, and the model’s ability to generalize to genuinely novel scaffolds, ones with ring systems unlike anything in the training set, remains an open question. The authors acknowledge these limitations with the matter-of-factness of engineers who know the road ahead is long.

But the counterargument is not a dismissal; it is a demarcation of the problem space. What the team has achieved is not a ready-to-use drug-discovery pipeline but a proof of principle that pretraining on scaffold-level information yields representations that are richer than those derived from standard molecular fingerprints or from supervised training alone. The fact that these representations cluster by biological source and improve predictions across multiple downstream tasks suggests that the model has captured a genuine latent structure in the chemistry of life—a structure that evolution itself has been encoding for billions of years.

Evolution, after all, is nature’s own foundation model. It operates not on molecules but on lineages, modifying and recombining scaffolds over geological timescales, testing every variation against the environment. Billions of years of chemical trial and error have produced a space of molecules that are not randomly distributed but organized around conserved cores. The team’s work suggests that this organization can be learned, that the scaffold is not just a convenient drawing convention but a functional unit of biological information. The same mathematics that underlies modern dimensionality reduction (principal components, manifold learning) finds an echo in the branching trees of biosynthetic pathways. The model does not, of course, understand evolution in any conscious sense; it has no knowledge of the organisms or the enzymes. But its internal representations trace the outlines of that evolutionary history, as if the chemical structures were a language and the model had learned its grammar.

The road ahead is clear in its direction if not its precise milestones. The model must be tested against truly novel scaffolds, expanded to include more extensive data on biosynthetic gene clusters, and integrated with experimental validation at scale. The dream of using AI to navigate natural-product chemical space—to identify new antibiotics, antivirals, or anticancer agents before we run out of tractable targets—is not yet realized. But the team has given the field something it lacked: a general-purpose chemical language for the products of life. What the large language models did for text, NaFM may do for the molecules that nature has written. The question is no longer whether deep learning can help drug discovery; it is how to teach the models to read. This work is a first lesson.

Lynn is an online editor at LoomSci.

References

  • Yuheng Ding et al., Pretraining a Foundation Model for Small-Molecule Natural Products, arXiv:2503.17656