Recursive Superintelligence Unveils Its First Automated AI Research System

A few days ago, Anthropic published an article titled "When AI Builds Itself," which quickly sparked widespread discussion. The article revealed striking internal data: as of May 2026, more than 80% of the code in Anthropic’s codebase had been written by Claude, and engineers now merge eight times as much code per day as they did in 2024. In one internal test, Claude sped up a training script by approximately 52x, whereas a seasoned human researcher typically needs 4 to 8 hours to achieve only a 4x speedup.

Anthropic describes where this trajectory leads as "recursive self-improvement," in which AI systems autonomously design, build, and train their own successors without human guidance at every step. Notably, the company also calls for industry-wide coordination so that the option to pause or temporarily halt frontier AI development exists when the moment of recursive self-improvement arrives. Anthropic is already acting on this: it has restricted the latest Claude Fable 5 from being used in frontier AI research.

Now, Recursive Superintelligence has announced its first step toward automated AI research.

This newly founded company, co-founded by Tian Yuandong, emerged from stealth mode just one month ago and has now unveiled its first public technical result: an open-source automated knowledge discovery system that achieved SOTA results on three benchmarks. Put simply, they have gotten AI to run the experiments for you.

https://x.com/tydsh/status/2065062838255649082

First milestone: Let AI run experiments for you

The first publicly disclosed technical achievement is titled "First Steps Toward Automated AI Research."

Tweet: https://x.com/Recursive_SI/status/2064980090702962699
Repository address: https://github.com/recursive-org/first-steps-toward-automated-ai-research
Blog address: https://www.recursive.com/articles/first-steps-toward-automated-ai-research

In one sentence, the core of this work is: building a system that can autonomously drive the AI research cycle and achieving new state-of-the-art results on three benchmarks.

Before breaking down the results, it helps to understand the design logic of the system.

The traditional AI research process is a human-dependent closed loop: "generate ideas—write code—run experiments—analyze results—generate new ideas." Its efficiency bottleneck is not computational power, but human effort. Only a handful of researchers worldwide can design cutting-edge training pipelines, and each iteration of experimentation requires their deep involvement.

The Recursive system attempts to automate this feedback loop.

It works as follows: given a clearly defined optimization goal, the system automatically generates experimental ideas, implements the code, runs validations, learns from the results, and decides where to search next. Multiple research paths can be pursued in parallel, and effective discoveries can be reused across tasks. Reward-hacking detection is built into the entire loop to prevent the system from taking shortcuts that inflate evaluation metrics without any real improvement.
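The loop described above can be illustrated with a deliberately tiny sketch. Everything in it is a placeholder chosen for illustration and not taken from Recursive's repository: `propose`, `run_experiment`, and `looks_like_reward_hacking` are hypothetical names, the "experiment" is a toy quadratic objective, and the reward-hacking guard is reduced to a simple bounds check.

```python
import random

def propose(best_params, rng):
    """Hypothetical idea generator: perturb the current best configuration."""
    return {k: v + rng.gauss(0.0, 0.1) for k, v in best_params.items()}

def run_experiment(params):
    """Toy stand-in for a training run; lower is better (minimum at lr=0.3, wd=0.1)."""
    return (params["lr"] - 0.3) ** 2 + (params["wd"] - 0.1) ** 2

def looks_like_reward_hacking(params):
    """Placeholder guard: reject candidates outside the allowed search space."""
    return not all(0.0 <= v <= 1.0 for v in params.values())

def research_loop(n_rounds=200, n_parallel=4, seed=0):
    rng = random.Random(seed)
    best_params = {"lr": 0.9, "wd": 0.9}
    best_score = run_experiment(best_params)
    history = [best_score]
    for _ in range(n_rounds):
        # Several candidate ideas are explored per round (the "parallel paths").
        candidates = [propose(best_params, rng) for _ in range(n_parallel)]
        for cand in candidates:
            if looks_like_reward_hacking(cand):
                continue  # never let a suspicious result become the new baseline
            score = run_experiment(cand)
            if score < best_score:
                best_params, best_score = cand, score
        history.append(best_score)  # what was learned feeds the next round
    return best_params, best_score, history

if __name__ == "__main__":
    params, score, history = research_loop()
    print(f"start={history[0]:.4f} final={score:.6f} params={params}")
```

A real system would replace the random perturbation with model-generated ideas and the bounds check with genuine audits of whether a metric gain is real; the sketch only shows the shape of the propose, run, check, and keep cycle.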

This is not a specialized tool fine-tuned for a single issue, but rather a cross-domain general framework for research automation. Recursive demonstrates this through three distinctly different test scenarios.

Three battlefields, three new records

Scenario One: Training a Small Model Under a Fixed Budget (NanoChat Autoresearch)

The rules for this benchmark come from the autoresearch project initiated by Andrej Karpathy (creator of nanoGPT and a founding member of OpenAI): given a fixed training budget of five minutes on a single GPU, train a small language model to the lowest possible validation loss, measured in bits per byte (BPB, lower is better).

This scenario is naturally suited for automated research: short experiment cycles, low metric variance, and relatively easy detection of cheating behavior. For this reason, a community project called "autoresearch@home" has been running on this benchmark for a long time—dozens of human researchers and hundreds of AI agents collaborating to continuously drive the metrics lower.

Starting from the same initial code, Recursive's system lowered the validation BPB from the community's best of 0.9372 to 0.9109, a reduction of 0.0263 BPB. Put another way, Recursive's approach reaches the same training quality roughly 1.3 times faster than the previous best.
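The figures in this paragraph can be checked with a few lines of arithmetic. The helper name `bpb_gain` is made up for this sketch, and reading "1.3 times" as a speedup factor (so the same quality takes about 1/1.3 of the time) is an interpretation, not something the source spells out.

```python
def bpb_gain(baseline_bpb, new_bpb):
    """Return the absolute and relative reduction in bits per byte."""
    absolute = baseline_bpb - new_bpb
    relative = absolute / baseline_bpb
    return absolute, relative

absolute, relative = bpb_gain(0.9372, 0.9109)
print(f"absolute reduction: {absolute:.4f} BPB")
print(f"relative reduction: {relative:.2%}")

# Assumption: "1.3 times" is a speedup factor for reaching equal quality.
time_fraction = 1 / 1.3
print(f"time needed at equal quality: {time_fraction:.0%} of the baseline")
```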

The improvements discovered by the system are not a single silver bullet. They combine multiple changes, including architectural adjustments, auxiliary losses, modifications to the attention mechanism, optimizer behavior, weight decay scheduling, and compiler settings. The most important discovery is a richer short-context memory mechanism: bigram (pairs of adjacent tokens) and trigram (triples of adjacent tokens) information is looked up in hash tables and injected into the attention value path, combined through learnable gates. Different Transformer layers use distinct hash functions, which reduces the chance that the same hash collisions repeat across layers.
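To make the hashed n-gram idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the variant the system actually found: the class name `HashedNgramMemory`, the multiplicative hash, the table size, and the sigmoid gates are all invented for this example, and the gates are fixed here where a real model would learn them.

```python
import numpy as np

class HashedNgramMemory:
    """Toy hashed bigram/trigram lookup added to attention values (illustrative only)."""

    def __init__(self, layer_idx, table_size=1024, dim=8, seed=0):
        rng = np.random.default_rng(seed + layer_idx)
        self.table_size = table_size
        self.bigram_table = rng.normal(0.0, 0.02, size=(table_size, dim))
        self.trigram_table = rng.normal(0.0, 0.02, size=(table_size, dim))
        # Per-layer odd multipliers: each layer gets its own hash function,
        # so two n-grams that collide in one layer need not collide in another.
        self.multipliers = [int(m) for m in 2 * rng.integers(1, 2**20, size=3) + 1]
        self.gate_logits = np.zeros(2)  # would be learnable parameters in a real model

    def _hash(self, *token_columns):
        h = np.zeros(token_columns[0].shape, dtype=np.int64)
        for mult, col in zip(self.multipliers, token_columns):
            h += mult * col.astype(np.int64)
        return h % self.table_size

    def indices(self, tokens):
        tokens = np.asarray(tokens, dtype=np.int64)
        prev1 = np.roll(tokens, 1)
        prev2 = np.roll(tokens, 2)
        prev1[:1] = 0  # pad positions that have no previous token
        prev2[:2] = 0
        return self._hash(prev1, tokens), self._hash(prev2, prev1, tokens)

    def __call__(self, tokens, values):
        bi_idx, tri_idx = self.indices(tokens)
        gates = 1.0 / (1.0 + np.exp(-self.gate_logits))  # sigmoid gates
        return (values
                + gates[0] * self.bigram_table[bi_idx]
                + gates[1] * self.trigram_table[tri_idx])

if __name__ == "__main__":
    tokens = np.array([5, 17, 42, 17, 42, 99])
    values = np.zeros((len(tokens), 8))
    layer0 = HashedNgramMemory(layer_idx=0)
    print(layer0(tokens, values).shape)
```

Constructing the object with a different `layer_idx` draws different multipliers, which is the sketch's stand-in for "distinct hash functions per layer".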

This technique is conceptually related to works such as DeepSeek Engram, but the system deploys it in a specific variant not previously documented in public literature, tailored for fixed-budget scenarios.

Scenario Two: Training Speedrun (NanoGPT Speedrun)

If the previous scenario represents building upon the achievements of an active community, this scenario is much more challenging.

NanoGPT Speedrun is another benchmark initiated by Karpathy and continuously optimized by the community for over two years: the goal is the shortest time required to train a GPT model to a validation loss of 3.28 on eight H100 GPUs. Since mid-2024, the community has reduced the time from approximately 45 minutes to 79.7 seconds through 83 documented contributions. Each new improvement has to squeeze more time out of code that is already highly optimized, which makes further gains hard to come by.

Starting from the existing optimal solution, Recursive's system further reduced training time to 77.5 seconds, saving an additional 2.2 seconds—matching or exceeding the improvements recently achieved by human contributors.

The core techniques identified by the system this time include:

Attention computation with FP8 precision. Community solutions use FP8 (8-bit floating-point) computation only in the final layer (language model head), while the system extends FP8 into matrix operations within the attention layers—using FP8 in the forward pass to achieve double the Tensor Core throughput, and retaining BF16 in the backward pass to maintain stability.

Annealed exploration noise in the optimizer. The system injects zero-mean Gaussian noise into the update steps of the NorMuon optimizer, with the noise amplitude linearly annealed to zero as training progresses. This imparts a "first explore boldly, then converge stably" behavior to the optimizer, helping the final solution settle into a flatter loss basin.

A more streamlined fused MLP kernel. The system rewrote a Triton GPU kernel to store only the squared ReLU activations during forward propagation and recompute the unsquared intermediate values internally during backpropagation, eliminating a full read-write cycle of the activation tensor in high-bandwidth memory—directly accelerating performance at the hardware level.
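The memory-saving trick in the fused MLP item rests on a small identity: since relu(x) is never negative, it can be recovered from its square with a square root, so the backward pass does not need the unsquared activation to be stored. The NumPy sketch below shows only that identity. It is not the Triton kernel from the repository, and the function names are hypothetical.

```python
import numpy as np

def relu2_forward(x):
    """Return relu(x)**2; this is the only tensor kept for the backward pass."""
    return np.maximum(x, 0.0) ** 2

def relu2_backward(saved_sq, grad_out):
    """Gradient of relu(x)**2 with respect to x, using only the saved squared value."""
    relu_x = np.sqrt(saved_sq)        # recompute relu(x) instead of reading it from memory
    return grad_out * 2.0 * relu_x    # d/dx relu(x)^2 = 2 * relu(x)

if __name__ == "__main__":
    x = np.array([-1.5, 0.0, 0.5, 2.0])
    saved = relu2_forward(x)
    print(relu2_backward(saved, np.ones_like(x)))
```

On a GPU the benefit comes from skipping one write and one read of a large activation tensor in high-bandwidth memory, which this CPU sketch cannot demonstrate; it only confirms the gradient is unchanged.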

These three improvements span three distinct areas of expertise: precision strategy, optimizer design, and GPU kernel programming. That the system found further headroom on top of two years of community optimization speaks volumes.
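Of the three techniques, the annealed exploration noise is the easiest to sketch. The example below assumes a linear schedule from an initial amplitude down to zero and adds the noise to the update before the learning rate is applied; the source states only that the noise is zero-mean Gaussian and linearly annealed, so the exact injection point, the names `noise_scale` and `noisy_update`, and the default values are assumptions. The NorMuon update itself is not reproduced here.

```python
import numpy as np

def noise_scale(step, total_steps, initial_scale):
    """Linearly anneal the noise amplitude from initial_scale to zero."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return initial_scale * (1.0 - frac)

def noisy_update(param, update, step, total_steps,
                 lr=0.1, initial_scale=0.01, rng=None):
    """Apply an optimizer update with zero-mean Gaussian noise added to it."""
    if rng is None:
        rng = np.random.default_rng(0)
    sigma = noise_scale(step, total_steps, initial_scale)
    noise = rng.normal(0.0, 1.0, size=param.shape) * sigma
    return param - lr * (update + noise)

if __name__ == "__main__":
    for step in (0, 50, 100):
        print(step, noise_scale(step, total_steps=100, initial_scale=0.01))
```

Early steps get the largest perturbation and the final step gets none, which matches the "explore boldly first, then converge stably" behavior described above.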

Scenario Three: GPU Kernel Optimization (SOL-ExecBench)

The first two scenarios operate at the model training level, while the third delves deeper into optimizing GPU computation kernels.

SOL-ExecBench is a benchmark introduced by NVIDIA, comprising 235 kernel programming tasks that cover a wide range of real-world workloads, including matrix multiplication, reduction, normalization layers, attention components, quantization routines, and fused blocks. The scoring metric is the SOL score: 0.5 corresponds to the baseline PyTorch implementation, and 1.0 corresponds to the hardware's theoretical limit. The previous best public score was 0.699.

The Recursive system runs across all 235 kernels and reuses the optimization patterns it discovers from one task to the next (such as memory transfer strategies, tiling methods, and reduction techniques), ultimately raising the score to 0.754 and reducing the gap to the hardware limit by 18%.
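The 18% figure follows from the scores quoted above if the "gap" is measured as the distance from a score to the theoretical limit of 1.0. A short check, with `gap_closed` as a hypothetical helper name:

```python
def gap_closed(old_score, new_score, limit=1.0):
    """Fraction of the remaining distance to the limit that was closed."""
    old_gap = limit - old_score
    new_gap = limit - new_score
    return (old_gap - new_gap) / old_gap

print(f"gap to hardware limit reduced by {gap_closed(0.699, 0.754):.1%}")
```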

This scenario is particularly significant because kernel engineering is an extremely specialized field—engineers capable of writing efficient Triton/CUDA kernels are rare worldwide. The Recursive team openly acknowledged in their blog that they themselves are not experts in kernel development: “These ideas emerged from the system itself, not from our professional background.”

Recursive: Use AI to research how AI can recursively improve itself

The company behind this result, Recursive Superintelligence, was founded between late 2025 and early 2026 and emerged from stealth last month. Its founding team includes Tian Yuandong, former Research Scientist Director at Meta FAIR, among others:

Richard Socher, CEO of Recursive, former Chief Scientist at Salesforce

Alexey Dosovitskiy, former Google DeepMind research scientist and first author of the Vision Transformer, with over 160,000 citations on Google Scholar

Tim Rocktäschel, former Principal Scientist at DeepMind and Professor of Artificial Intelligence at UCL

Peter Norvig, former Director of Research at Google and co-author, with Stuart Russell, of the renowned AI textbook "Artificial Intelligence: A Modern Approach"

Caiming Xiong, former Vice President of AI at Salesforce

Tim Shi, former OpenAI researcher and co-founder and CTO of enterprise AI company Cresta

Josh Tobin, CTO of Recursive, former head of research at OpenAI and Uber ATG

Jeff Clune, former Vice President of Research at Google DeepMind and Professor of Computer Science at the University of British Columbia, Canada

Even before launching a public product, the startup had secured $650 million in funding at a valuation of $4.65 billion, in a round led by GV (Google Ventures) and Greycroft with participation from NVIDIA and AMD Ventures.

The company’s core proposition directly aligns with its name: building AI systems that recursively enhance their own research capabilities, enabling AI to participate in and accelerate its own development, ultimately forming a self-reinforcing feedback loop.

For more details, see the article "After Leaving Meta, Tian Yuandong Has Just Announced His Startup."

Of course, Recursive is not alone on this front. Yann LeCun’s AMI Labs raised $1 billion in March this year, and David Silver’s Ineffable Intelligence secured a $1.1 billion seed round in April. Both point in a similar direction: enabling AI systems to generate knowledge autonomously and reducing human involvement in the research process. In terms of the pace of public results, however, Recursive’s “first step” stands out as one of the most concrete and reproducible technical demonstrations among similar companies to date.

The Dawn of the Recursive Paradigm

The results released by Recursive represent an initial realization of a new AI research paradigm within the broader industry context: making the AI system itself the primary agent of research.

The core logic of this "recursive AI" is not complex: AI enhances its own research capabilities, and the improved AI can then more effectively enhance itself, in a continuous cycle. It does not rely on a single breakthrough, but rather on a system that continuously generates breakthroughs.

This approach has significant economic implications for AI research itself. Training state-of-the-art models still heavily relies on a small number of researchers with specialized skills—fewer than a few thousand people worldwide are qualified for this work. If an automated research system could take over even part of this workload, the pace and cost curve of AI advancement would change dramatically.

This judgment also echoes other recent voices in the industry. Anthropic’s “When AI Builds Itself,” mentioned at the beginning of this article, strikes a serious tone: it calls for industry coordination so that the option to pause or temporarily halt frontier AI development exists as recursive self-improvement approaches, giving societal structures and alignment research time to catch up. For more details, see “AI Self-Evolution Too Fast: Anthropic Calls for Global Pause on Development.”

https://www.anthropic.com/institute/recursive-self-improvement

Two things are happening simultaneously, and it’s intriguing. On one side, Anthropic is documenting and warning about the direction of this trajectory; on the other, teams like Recursive are steadily turning this trajectory into reality.

Of course, Recursive itself acknowledges that this is still just a "first step": the current system performs best in scenarios with clear metrics, rapid feedback, and detectable cheating, and it is still far from autonomously advancing open scientific problems. Preventing reward hacking will remain a core challenge on the path to scaling.

But a closed loop has already begun to operate. The only remaining question is how fast it will turn.

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), authored by Machine Heart in Recursive Evolution, edited by Panda.