Tsinghua alumnus Wang Guan's HRM-Text matches 2B–7B open models with up to 900 times fewer tokens and 432 times less compute.

Breaking with the traditional large-model pretraining paradigm, Wang Guan, a Tsinghua University alumnus born in the 2000s, and his team have unveiled a new result:

They replace the standard Transformer with a Hierarchical Recurrent Model (HRM) and propose HRM-Text, an efficient pretraining approach that goes beyond scaling.

Paper link: https://arxiv.org/abs/2605.20613

HRM-Text achieves performance comparable to open-source models with 2B to 7B parameters while using roughly 100–900 times fewer training tokens and 96–432 times less estimated compute.

With 1B parameters, 40B non-repeating tokens, and a training cost of approximately $1,500, HRM-Text achieves the following results on mainstream benchmarks: MMLU 60.7%, ARC-C 81.9%, DROP 82.2%, GSM8K 84.5%, MATH 56.2%.

Figure | Pretraining Efficiency.

On this basis, the team argues that structural priors and targeted training objectives can significantly lower the barrier to pretraining, which makes it feasible to train foundation models from scratch.

How is HRM-Text designed?

Pretraining of large language models (LLMs) increasingly depends on a small number of institutions with sufficient compute and data. Training a competitive foundation model often requires trillions of tokens, thousands of GPUs, and millions of dollars in compute.

The current training paradigm is also inefficient: substantial compute is spent on low-value tokens such as prompts, formatting filler, and web noise, so much of the training budget never goes toward learning to reason.

In this work, the research team redesigned both the architecture and the training objective to make the pretraining of HRM-Text more efficient.

Architecture: HRM is a hierarchical recurrent model with two time scales that splits computation into a slow H module and a fast L module. Whereas a standard Transformer performs a single forward pass per token, HRM applies multiple recurrent updates to the same token. The H and L modules each hold half of the recurrent core's parameters, and the overall compute cost is equivalent to four recurrent unrollings of the same parameter set, which increases computational depth without increasing parameter count.
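To make the two-time-scale idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the toy update rule, the names `make_module`, `hrm_forward`, and `equivalent_full_passes`, and the schedule (2 slow cycles of 3 fast steps each) are all assumptions, chosen only so that the call count works out to the "four unrollings" figure quoted above.

```python
import math

def make_module(weight):
    """Toy stand-in for a module: state <- tanh(weight * (state + inp))."""
    def update(state, inp):
        return [math.tanh(weight * (s + i)) for s, i in zip(state, inp)]
    return update

def hrm_forward(x, h_update, l_update, h_cycles=2, l_steps=3):
    """Update the fast L state l_steps times per cycle, then the slow H state once.

    The schedule (h_cycles, l_steps) is hypothetical, not taken from the paper.
    """
    z_h = [0.0] * len(x)
    z_l = [0.0] * len(x)
    module_calls = 0
    for _ in range(h_cycles):
        for _ in range(l_steps):
            # The fast module sees the input plus the current slow state.
            z_l = l_update(z_l, [a + b for a, b in zip(x, z_h)])
            module_calls += 1
        # The slow module updates once per cycle from the fast module's result.
        z_h = h_update(z_h, z_l)
        module_calls += 1
    return z_h, module_calls

def equivalent_full_passes(module_calls):
    """Each module holds half the core parameters, so one call is half a pass."""
    return module_calls * 0.5

z_h, calls = hrm_forward([0.5, -0.2], make_module(0.9), make_module(0.7))
print(calls, equivalent_full_passes(calls))  # 8 module calls, 4.0 full passes
```

The point of the sketch is the bookkeeping: depth grows with the number of module calls while the parameter count stays fixed at one H module plus one L module.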

Training objective: Instead of standard full-text autoregressive pretraining, the model is trained directly on instruction-answer pairs, with the loss computed only on the answer portion. A PrefixLM mask gives bidirectional attention over the instruction and causal masking over the generated answer.
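The masking and loss rules described above can be sketched in a few lines. This is an illustrative sketch, not the team's code: `prefix_lm_mask` and `answer_only_loss` are hypothetical names, and `token_nll[i]` is assumed to be the negative log-likelihood already computed for the token at position `i`.

```python
def prefix_lm_mask(prefix_len, total_len):
    """mask[i][j] is True when position i may attend to position j.

    Every position sees the whole instruction (j < prefix_len);
    answer positions additionally see earlier answer tokens (j <= i).
    """
    return [[j < prefix_len or j <= i for j in range(total_len)]
            for i in range(total_len)]

def answer_only_loss(token_nll, prefix_len):
    """Average the per-token loss over answer positions only."""
    answer_nll = token_nll[prefix_len:]
    return sum(answer_nll) / len(answer_nll)

# 3 instruction tokens followed by 2 answer tokens.
mask = prefix_lm_mask(prefix_len=3, total_len=5)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Instruction-token losses (9.0) are ignored; only 1.0 and 3.0 count.
print(answer_only_loss([9.0, 9.0, 9.0, 1.0, 3.0], prefix_len=3))  # 2.0
```

In the printed mask, the first three rows are identical (full visibility within the instruction), while the last two rows form the usual causal staircase.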

Figure | HRM-Text Architecture.

To improve the stability of recursive training, the research team introduced MagicNorm and Warmup Deep Credit Assignment.

MagicNorm is a hybrid normalization strategy that leverages the asymmetry between forward and backward computation depths under truncated backpropagation through time (BPTT). It applies PreNorm within modules and adds additional normalization at module outputs to enhance the stability of deep recurrent training.
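The placement pattern described here (PreNorm inside the module, one extra normalization at the module output) can be illustrated as follows. This only sketches where the normalizations sit, not the paper's MagicNorm definition: the use of RMS normalization and the names `rms_norm`, `pre_norm_block`, and `module_forward` are assumptions.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (assumed norm type)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def pre_norm_block(x, f):
    """PreNorm residual block: x + f(norm(x))."""
    return [a + b for a, b in zip(x, f(rms_norm(x)))]

def module_forward(x, layers):
    """Apply PreNorm blocks, then normalize once more at the module output."""
    for f in layers:
        x = pre_norm_block(x, f)
    return rms_norm(x)  # extra output normalization keeps recurrent state bounded

# Toy layers that inflate the residual stream on every application.
layers = [lambda v: [3.0 * a for a in v]] * 4
out = module_forward([1.0, -2.0, 0.5], layers)
print(math.sqrt(sum(v * v for v in out) / len(out)))  # close to 1.0
```

Without the final `rms_norm`, the residual stream in this toy example would keep growing each time the module is reapplied, which is the kind of drift that matters when the same module runs many times in a recurrence.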

Warmup Deep Credit Assignment initially backpropagates gradients through only the last 2 recursive steps during early training, then linearly extends to the last 5 steps. This training mechanism enables the model to converge stably over shorter credit paths before gradually incorporating longer dependencies.
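A schedule of this kind might look like the sketch below. The endpoints 2 and 5 come from the description above; the warmup length `warmup_steps`, the floor-based rounding, and the function names are assumptions made for illustration.

```python
def bptt_steps(step, warmup_steps, start=2, end=5):
    """Number of trailing recurrent steps that receive gradients at this training step."""
    if step >= warmup_steps:
        return end
    frac = step / warmup_steps
    return start + int(frac * (end - start))

def gradient_flags(total_steps, k):
    """True for the last k recurrent steps (gradients flow), False for earlier ones."""
    return [t >= total_steps - k for t in range(total_steps)]

# Hypothetical 1000-step warmup over a recurrence of 8 steps.
for step in (0, 500, 1000):
    k = bptt_steps(step, warmup_steps=1000)
    print(step, k, gradient_flags(8, k))
```

Early in training only the final two recurrent steps are on the gradient path; by the end of warmup the last five are, while the forward pass always runs the full recurrence.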

How effective is it?

Experimental results show that HRM-Text demonstrates clear advantages in architectural efficiency, training objectives, and overall performance.

1. Under fixed training compute, is a recurrent architecture more effective?

The results show that, under FLOPs-aligned conditions, HRM 1B outperforms Transformer 1B, Transformer 3B, Looped Transformer 1B, and RINS 1B on most benchmarks; comparisons with TRM also indicate that HRM exhibits more stable training.

Figure | Comparison of performance and stability with Transformer models. HRM maintains stable training dynamics across all scales, while the Transformer model exhibits severe instability at the 1B-parameter scale. Additionally, at the 0.6B scale, HRM achieves competitive performance on most benchmarks using only half the computational cost of the Transformer model.

2. Do task-completion objectives and PrefixLM help?

Ablation studies show that, under FLOPs alignment, the MMLU score of a 1B Transformer increases sequentially from 40.55 with standard autoregressive training, to 47.72 after introducing task completion objectives, to 53.15 after incorporating PrefixLM, and finally to 60.73 after switching to the HRM architecture.

Figure | Performance comparison between different model architectures and training objectives

3. How does HRM-Text compare in efficiency to contemporary open models?

HRM-Text 1B achieves scores of 60.7, 81.9, 82.2, 84.5, and 56.2 on MMLU, ARC-C, DROP, GSM8K, and MATH, respectively. It enters the performance range of 2B to 7B open-source models, which were trained on significantly larger budgets, using only 40 billion unique tokens and 1 billion parameters. That corresponds to up to 900 times fewer training tokens and up to 432 times less compute.
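As a quick sanity check on what these ratios imply, the sketch below multiplies the 40B-token budget by the reported 100x and 900x token ratios. It uses only figures quoted in the article; `implied_token_budget` is a hypothetical helper, and the results are implied training-set sizes for the comparison models, not numbers reported by the paper.

```python
HRM_TOKENS = 40_000_000_000  # 40B unique tokens, as stated in the article

def implied_token_budget(hrm_tokens, ratio):
    """Token budget of a comparison model trained on `ratio` times more tokens."""
    return hrm_tokens * ratio

for ratio in (100, 900):
    budget = implied_token_budget(HRM_TOKENS, ratio)
    print(f"{ratio}x -> {budget / 1e12:.0f}T tokens")
```

So the comparison models sit in the range of roughly 4 trillion to 36 trillion training tokens, which is consistent with the "trillions of tokens" scale mentioned earlier.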

Figure | Evaluation results of HRM-Text 1B compared with other fully open-source and open-weight models during the same period

4. Does the recurrent structure yield greater effective depth?

The results show that the standard Transformer and Looped Transformer stabilize at shallower layers, while HRM maintains more pronounced inter-block representation differences, lower cosine similarity, and higher logit lens KL values at deeper layers.
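The two diagnostics named above can be sketched as follows. This is an assumption-laden illustration, not the paper's evaluation code: `depth_profile` is a hypothetical helper, `unembed` stands in for the model's output projection, and the KL direction (final distribution versus each intermediate layer's logit-lens distribution) is an assumption.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def depth_profile(hidden_states, unembed):
    """Per-block cosine similarity to the next block, and logit-lens KL per block."""
    final_probs = softmax(unembed(hidden_states[-1]))
    sims = [cosine_similarity(a, b)
            for a, b in zip(hidden_states, hidden_states[1:])]
    kls = [kl_divergence(final_probs, softmax(unembed(h)))
           for h in hidden_states]
    return sims, kls

# Toy hidden states for three blocks and a toy 3-token vocabulary.
hidden = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
unembed = lambda h: [2 * h[0], 2 * h[1], h[0] + h[1]]
sims, kls = depth_profile(hidden, unembed)
print(sims, kls)
```

A model whose later blocks still do useful work shows lower similarity between consecutive blocks and a KL that stays above zero until near the final block; a model that has "finished early" shows similarity near 1 and KL near 0 well before the end.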

Figure | Effective Depth Analysis.

Figure | Layer-by-layer Logit Lens KL Analysis.

Shortcomings and Future Directions

Although HRM-Text demonstrates strong performance on reasoning-intensive tasks, this approach still has limitations and suggests directions for future research.

1. Decoupling "Knowledge" from "Reasoning"

Currently, broad factual knowledge coverage still depends mostly on model scale and data breadth. HRM-Text was trained on only 40 billion unique tokens, and explicitly knowledge-oriented sources make up only part of the task-formatted data mixture. A future direction is to design a compact reasoning core separately from external factual storage, delegating knowledge breadth to curated corpora, retrieval-augmented modules, or learnable memory.

2. Adaptive Computation Time

The recurrent schedule of HRM-Text enables greater effective serial depth, but it also means the model must execute a fixed number of recurrent steps during inference. A promising direction is an adaptive computation time mechanism that lets easy samples halt early while reserving the full recurrent budget for difficult samples, thereby reducing inference cost.
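Since this is described as future work, the following is only a sketch of what early halting could look like, using a simple threshold rule rather than any specific published mechanism. The names `run_with_halting`, `step_fn`, and `halt_fn`, and the toy dynamics, are all hypothetical.

```python
def run_with_halting(state, step_fn, halt_fn, max_steps=8, threshold=0.5):
    """Apply step_fn up to max_steps times, stopping once halt_fn(state) >= threshold."""
    steps_used = 0
    for _ in range(max_steps):
        state = step_fn(state)
        steps_used += 1
        if halt_fn(state) >= threshold:
            break  # confident enough: skip the remaining recurrent budget
    return state, steps_used

# Toy dynamics: each step halves the state; confidence rises as it nears zero.
step_fn = lambda s: 0.5 * s
halt_fn = lambda s: max(0.0, 1.0 - abs(s))

print(run_with_halting(0.8, step_fn, halt_fn))    # easy input: halts after 1 step
print(run_with_halting(100.0, step_fn, halt_fn))  # hard input: uses all 8 steps
```

The easy input stops after one step while the hard input consumes the full budget, which is the cost profile an adaptive mechanism would aim for.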

3. Limited Validation Scale

The current scaling experiments only cover a 3B-parameter Transformer baseline and a 1B-parameter HRM-Text. The research team states that whether similar efficiency advantages can be maintained at larger model scales remains to be further validated in future work.

4. PrefixLM and Inference Framework

Currently, PrefixLM still faces certain engineering implementation limitations in practical deployment. Although it can run on standard text generation inference frameworks such as vLLM, this requires the framework to support custom attention masks during the prefill stage. When extending it to multi-turn dialogue scenarios, further design of the KV-cache mechanism is needed to ensure bidirectional visibility within user segments while maintaining causal constraints for the assistant’s generation process.
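The multi-turn visibility rule described above can be written down as a mask. This is a sketch of the rule only, not of any inference framework's API: `multi_turn_mask` and the `(role, length)` segment format are assumptions.

```python
def multi_turn_mask(segments):
    """Build an attention mask for a dialogue given (role, length) segments.

    A token may attend to every earlier token; tokens inside a user segment
    may also attend to later tokens of the same segment (bidirectional),
    while assistant tokens stay strictly causal.
    """
    seg_of, is_user = [], []
    for seg_id, (role, length) in enumerate(segments):
        seg_of += [seg_id] * length
        is_user += [role == "user"] * length
    n = len(seg_of)
    return [[j <= i or (is_user[i] and seg_of[i] == seg_of[j])
             for j in range(n)]
            for i in range(n)]

dialogue = [("user", 2), ("assistant", 2), ("user", 2)]
for row in multi_turn_mask(dialogue):
    print("".join("x" if allowed else "." for allowed in row))
```

Under this rule no token ever attends to a later segment, so keys and values cached for completed turns remain valid when a new turn arrives; only the tokens of the newest user segment need to be processed together in one prefill.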

For more technical details, please refer to the original paper.

This article is from the WeChat public account "Academic Headline" (ID: SciTouTiao), authored by Xia Qiansi.