When Short-Sighted Control Learns to See Far

11 May 2026, Lynn

By learning a value function from expert demonstrations, a myopic controller makes far-sighted decisions in real time.

I remember the frustration vividly. In a controls lab at Tufts, wrestling with a simple temperature chamber, I’d written a beautifully tuned controller — until I added the discrete actuator. The heater could only be full-on, half-on, or off. Suddenly the smooth, continuous predictions of my model were useless. The controller had to pick a discrete move every second, and every choice closed off a dozen clever possibilities. I could make it work if I spent an hour computing the optimal sequence offline, but real time? Not a chance.

That problem (making optimal decisions when every command is yes/no, left/right, on/off) is exactly what a team led by Dinesh Krishnamoorthy at Eindhoven University of Technology and NTNU has tackled. In a preprint (arXiv:2605.07401), Christopher Orrico, W.P.M.H. Heemels and Krishnamoorthy propose a way to make mixed‑integer nonlinear model predictive control (MINMPC) practical for systems that cannot afford to wait.

If that phrase sounds like a mouthful, its meaning is simple. Model predictive control (MPC) is the workhorse of modern automation: look ahead a few time‑steps, simulate what will happen for each possible action, and pick the one that leads to the best future. It is how chemical plants stay stable, how drones stay airborne, how robotic arms move smoothly. But when the actions include discrete choices — open this valve or shut it, fire the thruster or stay silent — the underlying optimization becomes a mixed‑integer nonlinear program (MINLP), a beast so computationally hungry that solving even a modest‑size problem can take hours. For a satellite that must correct its attitude within milliseconds, that is a recipe for failure.
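To see why discrete choices hurt, here is a minimal Python sketch of the look-ahead idea applied to a three-level heater like the one in my old lab. Everything in it is an assumption made for illustration: the first-order thermal model, the coefficients, the target temperature and the function names are invented, and brute-force enumeration stands in for the branch-and-bound machinery a real MINLP solver would use.

```python
from itertools import product

ACTIONS = (0.0, 0.5, 1.0)   # heater off, half-on, full-on (assumed levels)
T_AMB, T_REF = 20.0, 40.0   # ambient and target temperature (assumed values)
A, B = 0.1, 5.0             # heat-loss and heating coefficients (assumed values)


def step(temp, u):
    """One-second update of a toy first-order thermal model."""
    return temp + (-A * (temp - T_AMB) + B * u)


def enumerate_mpc(temp0, horizon):
    """Try every discrete action sequence over the horizon.

    Returns the first action of the cheapest sequence and the number
    of sequences that had to be simulated to find it.
    """
    best_cost, best_first, tried = float("inf"), None, 0
    for seq in product(ACTIONS, repeat=horizon):
        temp, cost = temp0, 0.0
        for u in seq:
            temp = step(temp, u)
            cost += (temp - T_REF) ** 2   # penalise distance from the target
        tried += 1
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first, tried
```

With three heater levels, the number of candidate sequences is 3 raised to the horizon length: 27 for a three-step look-ahead, 6,561 for eight steps. Smarter solvers prune most of that tree, but the exponential growth is the root of the problem.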

The Dutch‑Norwegian team’s idea is audaciously simple: make the controller short‑sighted on purpose. “This paper proposes a myopic MINMPC framework that incorporates value‑function approximation to substantially reduce the online computational burden,” they write. A myopic controller looks only a few steps into the future — sometimes just one — so the optimization is tiny and fast. But a myopic controller left to its own devices would drive the system off a cliff, because it never sees the danger coming. So the team gives it a cheat sheet.

That cheat sheet is a value function: a mathematical oracle that, for any state the system might be in, spits out a single number — the long‑term cost that will accumulate if you go on from here. If the controller knows the value of every possible consequence, it can make wise choices even with a one‑step horizon. It is a bit like a chess player who can only look two moves ahead, but hears a grandmaster whispering the estimated value of each resulting board position. The short‑range calculation stays manageable, but the whispering supplies the wisdom of the full game.
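A toy Python sketch makes the whisper concrete. It reuses an invented heater model, and the quadratic value function here is hand-picked purely for illustration; in the preprint the value function is learned from demonstrations, and none of these names or numbers come from the paper.

```python
ACTIONS = (0.0, 0.5, 1.0)   # heater off, half-on, full-on (assumed levels)
T_AMB, T_REF = 20.0, 40.0   # ambient and target temperature (assumed values)
A, B = 0.1, 5.0             # heat-loss and heating coefficients (assumed values)


def step(temp, u):
    """One-second update of a toy first-order thermal model."""
    return temp + (-A * (temp - T_AMB) + B * u)


def stage_cost(temp, u):
    """Immediate cost: a small tracking penalty plus the price of energy."""
    return 0.01 * (temp - T_REF) ** 2 + 2.0 * u


def zero_value(temp):
    """No foresight at all: the future is assumed to cost nothing."""
    return 0.0


def quadratic_value(temp, weight=0.1):
    """Hand-picked stand-in for a learned long-term cost (an assumption)."""
    return weight * (temp - T_REF) ** 2


def myopic_action(temp, value_fn):
    """One-step look-ahead: immediate cost plus the value of the next state."""
    return min(ACTIONS, key=lambda u: stage_cost(temp, u) + value_fn(step(temp, u)))
```

Starting from a cold chamber at 20 degrees, `myopic_action(20.0, zero_value)` leaves the heater off, because saving energy is all a one-step planner can see. Swap in `quadratic_value` and the same one-step planner switches the heater fully on, because the value term tells it that staying cold will be expensive later.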

But where does that value function come from? Nobody hands it to the controller ready‑made. Instead, the team learns it from offline demonstrations. They run a full‑horizon, computationally painful MINMPC on a batch of scenarios and record the expert state‑action trajectories. Then, through a clever inversion of optimality conditions, they learn a value function that would have made the expert’s moves optimal. “A central feature is the dual treatment of discrete decisions,” they note: during offline learning, the integer constraints are relaxed so that the classic KKT (Karush‑Kuhn‑Tucker) optimality conditions can be applied, turning the problem into a differentiable one. Online, the integers are hardened again: the real controller enforces the true integer constraints, while the value function supplies the long view that a short horizon had thrown away.
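For readers who want the inversion in symbols, here is a generic sketch of learning from KKT residuals for a one-step horizon with binary actions relaxed to the interval [0, 1]. The notation is my own assumption (stage cost \ell, dynamics f, value function V_\theta with parameters \theta, multipliers \lambda), and the preprint's exact formulation may well differ.

```latex
% Relaxed one-step problem: u \in \{0,1\}^m is relaxed to u \in [0,1]^m
\min_{u \in [0,1]^m} \; \ell(x,u) + V_\theta\bigl(f(x,u)\bigr)

% KKT conditions, with multipliers \underline{\lambda} for -u \le 0
% and \overline{\lambda} for u - 1 \le 0
r_\theta(x,u,\lambda) :=
  \nabla_u \ell(x,u)
  + \left(\frac{\partial f}{\partial u}(x,u)\right)^{\!\top} \nabla V_\theta\bigl(f(x,u)\bigr)
  - \underline{\lambda} + \overline{\lambda} = 0,
\qquad
\underline{\lambda},\, \overline{\lambda} \ge 0,
\qquad
\underline{\lambda}_j\, u_j = 0,
\quad
\overline{\lambda}_j\,(1 - u_j) = 0 .

% Learning: choose \theta so that the expert pairs (x^{(i)}, u^{(i)})
% come as close as possible to satisfying those conditions
\min_{\theta,\; \lambda^{(i)} \ge 0} \; \sum_{i=1}^{N}
  \bigl\| r_\theta\bigl(x^{(i)}, u^{(i)}, \lambda^{(i)}\bigr) \bigr\|^2
\quad \text{subject to complementarity at each } \bigl(x^{(i)}, u^{(i)}\bigr).
```

The point of the relaxation shows up in the first condition: it involves gradients with respect to u, which only exist once the actions are allowed to vary continuously.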

The result is a controller that, like an ant carrying a load many times its weight, handles decisions of enormous complexity with a surprisingly light computational burden. The team demonstrated the approach on two very different systems: the classic Lotka‑Volterra fishing problem, where a manager decides how much to fish each year without collapsing the ecosystem, and a satellite attitude control system with discrete thrusters. In both cases, the myopic controller tracked the expert almost exactly. Prey and predator populations matched the full‑horizon benchmark within a hair’s breadth, and the computation time per decision was slashed. For the satellite, the trajectories plotted on a polar diagram overlapped so closely that one could not tell them apart — yet the myopic controller delivered its decisions in a fraction of the real‑time budget.

What makes this powerful is not just the engineering convenience, but a shift in how we delegate foresight to machines. The offline phase distills the essence of what “good” means into a compact numerical artefact. Once trained, the controller no longer needs to solve the full optimization problem; it has internalised the expert’s values. The team writes that the learned value function induces a policy that is “approximately policy‑consistent with the expert demonstrations” — meaning the myopic controller tends to make the same choices the full‑horizon expert would, even though it sees so much less.

That word approximately matters. The approach is not perfect; there is no mathematical guarantee that the learned value function will never lead the controller astray in unseen corners of the state space. But for many real‑world systems — power converters, autonomous vehicles, drones with discrete actuators — an approximation that can run in real time is worth more than a perfect solution that runs overnight. The team’s work is not about replacing rigorous optimization with black‑box learning; it is about marrying the two so that the strengths of each cover the other’s weaknesses.

Think of it like a student who watches a master craftsman at work. The student never asks “what is the rule?” but simply absorbs hundreds of examples. Eventually, without a single explicit instruction, the student’s hands instinctively know which cut is right. That gut feeling is the learned value function: a compressed, implicit understanding of what matters. In the online moment, the student acts with only a glance at the immediate task, yet the whole philosophy of the craft guides the blade. This is not mere intuition; it is distilled optimality. And it arrives without the full‑horizon problem ever being solved inside the real‑time controller.

The implications ripple beyond the paper’s specific examples. As autonomous systems become more complex and safety‑critical, the ability to transfer demonstrated expertise into a lightweight, real‑time decision‑maker could close a stubborn gap. The team’s preprint does not claim to have made MINMPC universally feasible (the tests are still on systems of modest dimension), but it offers a principled template: watch the expert, relax the integers to learn the value, then re‑tighten them for action. The controller becomes short‑sighted by design, yet sees farther than any purely myopic planner ever could, because it stands on the shoulders of a teacher who already walked the whole path.

The road from these first demonstrations to a satellite that learns its own attitude controller from a handful of recorded manoeuvres is still ahead. But the direction is clear, and for the first time, the map includes a route that keeps integer decisions firmly in the loop. That is no small thing for an engineer who, years ago in a university lab, watched a simple controller grind to a halt because it did not know which switch to flip.

Lynn is an online editor of LoomSci.

References

  • Christopher A. Orrico et al., Learning myopic mixed-integer nonlinear model predictive control from expert demonstrations, arXiv:2605.07401