New AI Benchmark Tests Engineering Optimization Without Standard Answers

If you throw AI into a construction site with no standard answers, can it survive?

For a long time, AI agents appeared to be all-powerful, but in reality, most were simply “searching through memory” in known knowledge bases.

But the real engineering world is harsh: stability of underwater robots, lithium plating boundaries in high-capacity batteries, noise control in quantum circuits... these problems have no "perfect score," only "closer-to-the-limit optimizations."

Recently, Navers Lab under Einsia AI unveiled an agent benchmark, Frontier-Eng Bench, designed to move AI past the label of mere "exam problem-solver."

Auto Research

The research team did not have the AI solve outdated coding problems; instead, they gave it a complete "engineering loop": proposing solutions, integrating with simulators, handling errors, adjusting parameters, and rerunning.

Faced with 47 complex, interdisciplinary challenges, the AI must work like a seasoned engineer, seeking the best available solution within the "impossible triangle" of power consumption, safety, and performance.

This is not just a test set; it’s more like a rehearsal for the “evolution” of the Agent.

When AI begins to learn self-correction through feedback, the era of Auto Research—where humans set goals and AI iterates continuously around the clock—may be closer than we think.

AI is now taking on serious tasks.

Past large models were more like top students.

You pose a question, and it "searches its memory" across vast training data, then pieces together an answer that appears reasonable.

Under this model, large models are essentially playing a "word chain" game rather than solving real-world problems.

Frontier-Eng Bench, however, puts AI in the role of an engineering optimizer.

The process now has the AI propose a solution first, run experiments through a simulator, receive feedback and error reports, adjust parameters and code, and repeat the cycle so that performance keeps improving.
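
The loop described above can be sketched in a few lines. This is only a toy illustration, not the benchmark's actual harness: the `simulate` function, its quadratic objective, and the parameter range are all hypothetical, and a random perturbation stands in for the agent's proposal step.

```python
import random


def simulate(gain: float) -> float:
    """Toy stand-in for a simulator (hypothetical). Higher score is better.

    Raises ValueError for out-of-range input, mimicking an error report.
    """
    if not 0.0 <= gain <= 10.0:
        raise ValueError(f"gain {gain:.2f} is outside the range [0, 10]")
    return -(gain - 3.0) ** 2  # made-up objective with its optimum at gain = 3


def optimize(rounds: int, seed: int = 0) -> tuple[float, float, list[float]]:
    """Propose, simulate, read feedback, keep improvements, repeat."""
    rng = random.Random(seed)
    best_gain = 8.0                      # deliberately poor starting point
    best_score = simulate(best_gain)
    history = [best_score]               # best score seen after each round
    for _ in range(rounds):
        candidate = best_gain + rng.uniform(-1.0, 1.0)  # propose a change
        try:
            score = simulate(candidate)                 # run the experiment
        except ValueError:
            history.append(best_score)                  # error: discard candidate
            continue
        if score > best_score:                          # keep only improvements
            best_gain, best_score = candidate, score
        history.append(best_score)
    return best_gain, best_score, history


gain, score, history = optimize(rounds=200)
print(f"best gain {gain:.3f}, best score {score:.5f}")
```

The `history` list never decreases, because a candidate is accepted only when the simulator reports a better score than the current best.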

In this closed-loop system, the identity of AI has undergone a qualitative transformation.

Want to make your underwater robot more stable? The AI must start automatically tuning the controller.

Do you want to increase the robotic arm's speed even further? The AI needs to run its own simulation.

To some extent, AIs have moved beyond mere semantic understanding and now continuously optimize based on real-world feedback, much like professional engineers.

What's most interesting about Frontier-Eng Bench is that it doesn't measure whether AI gets answers right, but whether AI can continuously improve.

Real engineering optimization is never a multiple-choice question with a single correct answer.

Using battery fast charging as an example, the goal sounds simple—charge as quickly as possible—but reality isn’t that easy.

AI must precisely strike the right performance balance under strict constraints: temperature must not exceed limits, voltage must not surge, battery life must not degrade too quickly, and lithium plating must be avoided.
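
One way to picture this kind of task is as a score that only counts when every constraint holds. The sketch below is an assumption about how such a check could look, not the benchmark's scoring code: the `ChargeResult` fields and all three limit values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChargeResult:
    """Summary of one simulated fast-charge run (hypothetical fields)."""
    charge_minutes: float
    peak_temp_c: float
    peak_voltage_v: float
    capacity_fade_pct: float
    plating_detected: bool


# Illustrative limits only; real values depend on the cell and the task.
MAX_TEMP_C = 45.0
MAX_VOLTAGE_V = 4.2
MAX_FADE_PCT = 1.0


def violations(r: ChargeResult) -> list[str]:
    """List every constraint that the run breaks."""
    problems = []
    if r.peak_temp_c > MAX_TEMP_C:
        problems.append("temperature")
    if r.peak_voltage_v > MAX_VOLTAGE_V:
        problems.append("voltage")
    if r.capacity_fade_pct > MAX_FADE_PCT:
        problems.append("degradation")
    if r.plating_detected:
        problems.append("lithium plating")
    return problems


def score(r: ChargeResult) -> Optional[float]:
    """Return None for an infeasible run, else negative charge time
    so that a higher score means a faster charge."""
    if violations(r):
        return None
    return -r.charge_minutes


safe = ChargeResult(30.0, 41.0, 4.15, 0.8, False)
too_hot = ChargeResult(22.0, 52.0, 4.15, 0.8, False)
print(score(safe), score(too_hot), violations(too_hot))
```

Under this framing there is no single right answer: any feasible run is valid, and a faster feasible run is simply better.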

This means AI cannot pass by relying on clever exam tricks; it must show that it can keep improving under long-term feedback.

Can AI perform long-term optimization in real-world environments?

Overall, GPT5.4 performed the most consistently, but AI still has a long way to go before it can fully surpass the benchmark.

Auto Research enters the "iterative optimization" era

The research team mentioned a very interesting point in their paper:

True advanced intelligence fundamentally relies on long-term feedback loops.

AlphaGo, for example, defeated Lee Sedol not by memorizing fixed opening sequences, but through vast numbers of simulations and real-time feedback behind every move it made.

True scientific research is the same: top laboratories do not rely on a single moment of inspiration, but rather continuously formulate hypotheses, run experiments, analyze results, refine their approaches, and try again.

The same applies to engineering optimization—the first version can often be done by anyone; the real challenge lies in that final 1% performance leap.

The significance of Frontier-Eng Bench is that it is the first to systematically test AI's "iterative optimization capability," and it identifies two fairly harsh laws of AI evolution.

The first rule: the later the stage, the harder it becomes to improve.

The paper finds that both the frequency and the magnitude of the agent's improvements follow a power-law decay:

Improvement frequency ∝ 1 / number of iterations
Improvement magnitude ∝ 1 / number of improvements

In simple terms: gains come fastest in the early rounds, while in later rounds improvements become both rarer and smaller.
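
To get a feel for what these two proportionalities imply, one can take them literally with a constant of 1. That constant is an assumption made here for illustration; the article gives none. If the chance of an improvement at iteration t is 1/t, the expected number of improvements after T iterations is the harmonic number 1 + 1/2 + ... + 1/T, which grows only logarithmically.

```python
def harmonic(n: int) -> float:
    """Return 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def expected_improvements(iterations: int) -> float:
    """Expected improvement count if P(improve at iteration t) = 1/t.

    The proportionality constant of 1 is an illustrative assumption.
    """
    return harmonic(iterations)


def cumulative_gain(improvements: int) -> float:
    """Total gain if the k-th improvement has size 1/k (same assumption)."""
    return harmonic(improvements)


for T in (10, 100, 1000):
    print(T, round(expected_improvements(T), 2))
```

Under these assumptions, 10 iterations yield about 2.9 expected improvements, 100 yield about 5.2, and 1000 yield about 7.5. Each tenfold increase in iterations buys only about 2.3 more improvements, and each of those is smaller than the last.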

This closely mirrors real development: early versions quickly pick off the low-hanging fruit, but the further you go, the closer you get to the bottleneck, and squeezing out even a little more performance takes serious effort.

Would it be more cost-effective to try multiple paths in parallel? The answer lies in the second rule.

Second rule: Width is useful, but depth is indispensable.

Running multiple chains in parallel can keep the search from getting stuck, but under a fixed budget, each additional chain reduces the depth available to every chain.

Many engineering breakthroughs require sustained accumulation and continuous refinement to achieve structural leaps—they cannot be accomplished simply by "trying more times."
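
The width-versus-depth trade-off is simple arithmetic once the budget is fixed. The sketch below assumes the budget is counted in iterations and split evenly across chains; both the even split and the budget of 64 are assumptions for illustration, not details from the paper.

```python
def depth_per_chain(budget: int, width: int) -> int:
    """Iterations each chain gets when a fixed budget is split evenly."""
    if width < 1:
        raise ValueError("width must be at least 1")
    return budget // width


def allocations(budget: int, widths: list[int]) -> list[tuple[int, int]]:
    """Pair each candidate width with the depth it leaves per chain."""
    return [(w, depth_per_chain(budget, w)) for w in widths]


# Hypothetical budget of 64 iterations in total
for width, depth in allocations(64, [1, 2, 4, 8]):
    print(f"{width} chain(s) x {depth} iterations each")
```

Doubling the number of chains halves how far each one can go, so any gain that only appears after a long run of refinements becomes unreachable once the chains are too shallow.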

This actually points to the future direction of the next generation of agents: not models that provide an answer in a single step, but systems capable of continuous iteration and self-evolution through long-term feedback.

AI engineers may really be on the way.

The real significance of this study is that it sketches a first outline of an AI system that begins to approximate a real engineering cycle.

Imagine AI integrated with industrial software, simulation environments, CAD systems, chip design tools, scientific computing platforms...

A dramatic shift in productivity paradigms is on the horizon.

In future laboratories, there will likely be such a division of labor:

Human researchers are responsible for defining direction and objectives.

For example, "reduce the energy consumption of this component by 30%," "lower the GPU utilization during forward propagation of this model," "slightly improve the stability of robot control," or "further enhance the fidelity of quantum circuits toward the theoretical limit."

Meanwhile, AI is responsible for working out the path, continuously optimizing around these goals.

For example, it automatically runs simulations and experiments, reads feedback from the verifier and simulator, and keeps modifying and optimizing, iterating nonstop 24/7.

This evolutionary logic enables AI to move beyond the role of a "support tool" and begin tackling complex system problems like a true engineering team—without fatigue.

The question raised by Frontier-Eng Bench is straightforward:

Once AI begins to learn "long-term optimization," how far is it from true engineering intelligence?

Paper Title: Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Project homepage: https://lab.einsia.ai/frontier-eng/

arXiv: https://arxiv.org/abs/2604.12290

GitHub repository: https://github.com/EinsiaLab/Frontier-Engineering

This article is from the WeChat public account "Quantum Bit," authored by Yunzhong.