Sapient Intelligence has released the HRM-Text model with approximately 1 billion parameters, achieving scores of 56.2, 84.5, and 81.9 on MATH, GSM8K, and ARC-Challenge, respectively, at a training cost of just $1,500.

Article author and source: MachineHeart

A model with approximately 1 billion parameters achieved 56.2 on MATH, 84.5 on GSM8K, and 81.9 on ARC-Challenge. The training cost was around $1,500, using 16 H100 GPUs for less than two days.

This is the HRM-Text released by Sapient Intelligence on May 18, 2026; the team has also publicly released the paper, model weights, and pretraining code.

Looking at these numbers alone, the most intuitive reaction might be: could this be the result of fine-tuning an existing model? Standing on the shoulders of giants does, after all, make things easier.

But it is not. HRM-Text was pre-trained from scratch on only about 40B unique tokens (the experiment table lists roughly 60B total training tokens once repeated sampling is counted). Measured against the 40B unique tokens, that is roughly 1/225 of the 9T tokens used for Llama 3.2 3B and 1/900 of the 36T tokens used for Qwen3.5 2B.
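The ratios above can be checked with a few lines of arithmetic. The sketch below uses only the token counts quoted in this article; the variable and function names are illustrative.

```python
# Token counts as quoted in the article.
HRM_TEXT_UNIQUE = 40e9   # unique tokens
HRM_TEXT_TOTAL = 60e9    # total tokens, counting repeated sampling
LLAMA_3_2_3B = 9e12
QWEN3_5_2B = 36e12


def times_more(reference_tokens: float, model_tokens: float) -> float:
    """Return how many times more tokens the reference model consumed."""
    return reference_tokens / model_tokens


ratios_vs_unique = {
    "Llama 3.2 3B": times_more(LLAMA_3_2_3B, HRM_TEXT_UNIQUE),
    "Qwen3.5 2B": times_more(QWEN3_5_2B, HRM_TEXT_UNIQUE),
}
ratios_vs_total = {
    "Llama 3.2 3B": times_more(LLAMA_3_2_3B, HRM_TEXT_TOTAL),
    "Qwen3.5 2B": times_more(QWEN3_5_2B, HRM_TEXT_TOTAL),
}
# Average number of passes over the unique data implied by the two figures.
average_passes = HRM_TEXT_TOTAL / HRM_TEXT_UNIQUE

for name in ratios_vs_unique:
    print(f"{name}: {ratios_vs_unique[name]:.0f}x unique, "
          f"{ratios_vs_total[name]:.0f}x total")
print(f"average passes over the unique data: {average_passes:.1f}")
```

The 1/225 and 1/900 figures correspond to the 40B unique-token count; measured against the roughly 60B total training tokens, the same comparison gives 1/150 and 1/600.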

Comparison of HRM-Text with other models in terms of training FLOPs, training tokens, and benchmarks.

The natural question arises: How is this achieved?

Over the past few years, the large model industry has developed a nearly default growth logic: larger models, more data, and stronger computing power will continue to enhance intelligent capabilities.

This approach has been thoroughly proven effective. The ongoing evolution of models such as GPT, Claude, DeepSeek, and Qwen has relied on increases in parameter scale, data scale, and training compute power. At the same time, foundational model training is becoming increasingly like heavy industry: longer training cycles, more expensive GPU clusters, more complex data engineering, and rising barriers to entry.

But HRM-Text explores another approach: with limited data and limited compute, can the return on each unit of computation be improved through the joint design of architecture and training objective?

The paper title directly indicates the direction it aims to challenge: Efficient Pretraining Beyond Scaling.

Paper Title: HRM-Text: Efficient Pretraining Beyond Scaling
Paper URL: https://arxiv.org/abs/2605.20613
GitHub: https://github.com/sapientinc/HRM-Text
Hugging Face: https://huggingface.co/sapientinc/HRM-Text-1B
X Launch post: https://x.com/Sapient_Int/status/2056510383935172798

In simple terms, HRM-Text simultaneously adjusts both how the model computes and what it learns: on one hand, it enables limited parameters to perform multiple rounds of internal computation before output, increasing effective computational depth; on the other hand, it computes loss only on the answer portion, concentrating the training signal more effectively on task understanding and answer generation.

Note that HRM-Text is not a mature chat model that has undergone post-training or reinforcement learning optimization. The team defines the current version as a Proof of Concept: its value lies not in discovering the final form of a language model, but in providing a testable example that demonstrates significant potential for architectural innovation in the efficiency of foundational model pre-training.

Multiple rounds of internal computation before each output

The first change in HRM-Text is a reorganization of the model's internal computation.

Standard Transformers typically consist of a series of network layers whose parameters are independent of each other. The input propagates forward through the model's depth: passing through the first layer, then the second, and so on, until the output is produced. A straightforward way to enhance model capacity is to stack more layers, increase the hidden dimension, or train more parameters.

HRM-Text did not simply follow this approach. It introduced two modules operating at different time scales: a high-level module H and a low-level module L.

For a more intuitive analogy, a standard Transformer is like passing a document sequentially to several different editors, each making one round of revisions before passing it on; HRM-Text, by contrast, is like having two teams of editors repeatedly revise the same internal draft. The model doesn’t simply add more parameters—it enables limited parameters to engage in deeper, more effective computations.

According to team interviews, this design also differs from the "big brain, small brain" collaboration that is common in the industry. That approach typically trains two models of different scales: the larger model handles complex planning, the smaller model handles rapid execution, and the two communicate mainly through text-based interfaces.

The H and L in HRM belong to the same network. They are not two separate models, nor do they divide tasks through textual space; instead, they iteratively refine the same internal state within a shared latent space. What information is passed between modules and how they divide responsibilities are jointly determined by a unified optimization process.

More precisely, HRM does not append a planner and an executor externally to the model, but rather integrates hierarchical computation directly into a single model.

The low-level module updates more frequently, handling local computations and iterative corrections; the high-level module updates more slowly, maintaining a more stable semantic context and providing longer-term constraints for the low-level computations. In the paper's setup, each forward pass executes two high-level cycles. Each cycle first completes three L-module updates, followed by one H-module update.

In other words, before predicting a token, the model performs eight recursive updates: six low-level updates and two high-level updates.
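The fixed schedule described above can be written down as a short loop. This is a minimal sketch of the update order only: the cycle counts come from the article, while the function names, the update signatures, and the way each module reads the other's state are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def recursion_schedule(h_cycles: int = 2, l_steps_per_cycle: int = 3) -> List[str]:
    """List the update order: each H cycle is several L updates, then one H update."""
    schedule: List[str] = []
    for _ in range(h_cycles):
        schedule.extend(["L"] * l_steps_per_cycle)
        schedule.append("H")
    return schedule


def run_recursion(
    x: float,
    z_h: float,
    z_l: float,
    l_update: Callable[[float, float, float], float],
    h_update: Callable[[float, float], float],
    h_cycles: int = 2,
    l_steps_per_cycle: int = 3,
) -> Tuple[float, float]:
    """Apply the schedule to toy states (hypothetical update signatures)."""
    for _ in range(h_cycles):
        for _ in range(l_steps_per_cycle):
            z_l = l_update(z_l, z_h, x)   # fast, low-level refinement
        z_h = h_update(z_h, z_l)          # slow, high-level update
    return z_h, z_l


schedule = recursion_schedule()
print(schedule)
print(schedule.count("L"), "low-level updates,", schedule.count("H"), "high-level updates")

# Toy updates that simply count how often each module ran.
h_count, l_count = run_recursion(
    x=0.0, z_h=0.0, z_l=0.0,
    l_update=lambda z_l, z_h, x: z_l + 1,
    h_update=lambda z_h, z_l: z_h + 1,
)
```

With the default arguments the schedule has 2 × 3 = 6 low-level updates and 2 high-level updates, eight in total, matching the count given in the text.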

H/L dual-time-scale recursive structure, internal module structure, and PrefixLM attention mask.

It should be emphasized that "multi-round internal computation" does not mean the model can dynamically adjust its thinking time based on problem difficulty. The current version uses a fixed recursive schedule: regardless of whether a task is simple or complex, the model performs internal updates a predetermined number of times. Adaptive computation time will be an area for future exploration.

This also means that a 1B parameter model does not have the same inference cost as a standard 1B dense Transformer. Recursive calls improve parameter utilization but increase the sequential computation required before each token output. Therefore, parameter scale, training cost, and actual inference efficiency must still be discussed separately.

This path is not without cost.

The deeper the internal loop, the more opportunities the model has to continuously refine its representations; however, after the same set of modules is repeatedly invoked, activation variance may accumulate, and gradients are more prone to vanishing or exploding. Recurrent architectures are not a new concept—the real challenge lies in achieving stable training of deep recurrent networks on open-domain language tasks.

HRM-Text introduces two designs for this: MagicNorm and warmup deep credit assignment.

MagicNorm aims to ensure stability in both forward and backward propagation. The module retains the PreNorm structure, which facilitates gradient flow, but adds an additional normalization step each time the recursive module exits. This approach limits the growth of activation variance during repeated cycles while preserving a smooth gradient path.
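To make the idea concrete, here is a toy sketch of a PreNorm residual block with an extra normalization applied on exit. It illustrates the general pattern described above, not the paper's implementation: the use of RMS normalization, the absence of learned gains, and the stand-in sublayer are all assumptions.

```python
import math
from typing import Callable, List

Vector = List[float]


def rms(x: Vector) -> float:
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Rescale x to approximately unit RMS (no learned gain in this sketch)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]


def prenorm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """PreNorm residual update x + f(norm(x)); the residual path is not normalized."""
    update = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, update)]


def recur(x: Vector, sublayer: Callable[[Vector], Vector],
          steps: int, norm_on_exit: bool) -> Vector:
    """Call the same block repeatedly, optionally normalizing each time it exits."""
    for _ in range(steps):
        x = prenorm_block(x, sublayer)
        if norm_on_exit:
            x = rms_norm(x)
    return x


def toy_sublayer(h: Vector) -> Vector:
    """Hypothetical stand-in for attention/MLP: a fixed linear map."""
    return [2.0 * v for v in h]


start = [1.0, -1.0, 1.0, -1.0]
plain = recur(start, toy_sublayer, steps=8, norm_on_exit=False)
stabilized = recur(start, toy_sublayer, steps=8, norm_on_exit=True)
print(f"RMS after 8 steps without exit norm: {rms(plain):.2f}")
print(f"RMS after 8 steps with exit norm:    {rms(stabilized):.2f}")
```

In this toy setting the un-normalized residual stream grows with every repeated call, while the exit normalization keeps its magnitude near one. The sketch only shows the forward-pass effect; it says nothing about gradient behavior.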

Warmup deep credit assignment controls how far back the gradients need to be traced. At the beginning of training, the model backpropagates gradients only through the last two recurrent steps; as training gradually stabilizes, the backpropagation range increases linearly to encompass the final five steps.

It can be understood as a gradual accountability mechanism: in the early stages of training, the model is first held responsible for the internal computations closest to the output; once stable, responsibility is progressively extended to earlier computations. This approach enables the use of deeper recursive calculations while preventing the model from being exposed to excessively long gradient paths from the outset.
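A linear warmup of this kind might look like the sketch below. The endpoints (two steps, then five) come from the article; the warmup length, the floor-based rounding, and the assumption that the steps in question are the last of the eight recurrent updates are illustrative choices, not details taken from the paper.

```python
from typing import List


def backprop_steps(train_step: int, warmup_steps: int,
                   start: int = 2, end: int = 5) -> int:
    """Number of final recurrent steps that receive gradients at a training step.

    Grows linearly from `start` to `end` over `warmup_steps` (a hypothetical
    hyperparameter), then stays at `end`.
    """
    if warmup_steps <= 0:
        return end
    progress = min(max(train_step / warmup_steps, 0.0), 1.0)
    return start + int(progress * (end - start))


def steps_with_gradient(total_steps: int, k: int) -> List[int]:
    """Indices (0-based) of the last k recurrent steps, which keep gradients."""
    k = min(k, total_steps)
    return list(range(total_steps - k, total_steps))


for step in (0, 500, 1000, 5000):
    k = backprop_steps(step, warmup_steps=1000)
    print(step, k, steps_with_gradient(8, k))
```

In a real training loop, the earlier recurrent steps would typically be run without gradient tracking and only the last k steps would be part of the backward pass.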

The paper also analyzes this structure from the perspective of effective depth.

In standard Transformers or partially looped Transformers, as the number of layers increases, the changes to hidden states in subsequent layers may gradually diminish, causing the model to converge toward a relatively stable output distribution early on. However, HRM-Text’s analysis reveals that its deeper computations still maintain significant representation changes. This indicates that the recursive steps are not merely repeating operations but continuously modify internal states, allowing deeper computational steps to still contribute incremental information.
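One simple way to quantify "how much the representation still changes" is the relative size of the update between consecutive hidden states. The sketch below uses that measure on made-up toy trajectories; it is an assumption for illustration and may differ from the effective-depth metric the paper actually uses.

```python
import math
from typing import List, Sequence


def l2_norm(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def relative_changes(states: Sequence[Sequence[float]]) -> List[float]:
    """For consecutive states h_t, h_{t+1}, return ||h_{t+1} - h_t|| / ||h_t||."""
    changes: List[float] = []
    for prev, curr in zip(states, states[1:]):
        delta = [c - p for p, c in zip(prev, curr)]
        changes.append(l2_norm(delta) / max(l2_norm(prev), 1e-12))
    return changes


# Toy trajectories (illustrative values, not measurements from any model).
saturating = [[1.0, 0.0], [1.5, 0.0], [1.75, 0.0], [1.875, 0.0]]
active = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

print([round(c, 3) for c in relative_changes(saturating)])
print([round(c, 3) for c in relative_changes(active)])
```

A trajectory whose relative changes shrink toward zero corresponds to later steps contributing little, while one whose changes stay large corresponds to deeper steps that keep modifying the internal state.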

Comparison of Effective Depth across different architectures.

Predict less and focus the training signal on the response

In addition to architectural changes, the second modification to HRM-Text occurs in the pretraining objective.

Most language models use an autoregressive "next token prediction" approach: given a piece of text, they predict the next token. Regardless of whether the input is a webpage, book, forum reply, or code, the model must learn to predict every position in the sequence. While this objective is sufficiently general, it also means that a large amount of training signal is used to predict text that has little relevance to task completion.

HRM-Text chose a more targeted approach: it skips the large-scale raw text pretraining stage and instead trains from scratch using instruction-answer data pairs. Given an instruction and its corresponding answer, the model computes token-level loss only on the answer portion.

This does not mean the instruction component plays no role in learning. The answer loss still influences how the model understands and uses instructions along the attention path. However, the model no longer needs to predict the question itself; instead, update signals are more focused on generating appropriate answers.

For a more intuitive analogy: when grading exams, the teacher no longer scores students for copying the questions—only their answers are evaluated.
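In code, answer-only training comes down to masking the per-token loss. The sketch below is a minimal pure-Python version operating on per-token log-probabilities; the function names and the example probabilities are illustrative, not taken from the released training code.

```python
import math
from typing import List


def answer_only_loss(token_log_probs: List[float], is_answer: List[bool]) -> float:
    """Mean negative log-likelihood over answer tokens only."""
    if len(token_log_probs) != len(is_answer):
        raise ValueError("log-probs and mask must have the same length")
    selected = [-lp for lp, keep in zip(token_log_probs, is_answer) if keep]
    if not selected:
        raise ValueError("sequence contains no answer tokens")
    return sum(selected) / len(selected)


def full_sequence_loss(token_log_probs: List[float]) -> float:
    """Standard next-token loss: every position contributes."""
    return -sum(token_log_probs) / len(token_log_probs)


# Two instruction tokens followed by two answer tokens (made-up probabilities).
log_probs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.25)]
mask = [False, False, True, True]

print(f"answer-only loss:   {answer_only_loss(log_probs, mask):.4f}")
print(f"full-sequence loss: {full_sequence_loss(log_probs):.4f}")
```

The instruction tokens still pass through the network and are attended to by the answer tokens; they are only excluded from the terms that are summed into the loss.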

The PrefixLM mask is used together with the answer-only objective. In a standard causal mask, each token can only attend to preceding tokens. This design suits left-to-right generation, but the restriction is unnecessary when the full instruction is already provided.

HRM-Text allows tokens in the instruction portion to be mutually visible; after entering the response portion, it reverts to the standard causal generation method.

Thus, the model can first integrate the entire instruction as a complete context and then generate the answer step by step. In a decoder-only implementation, it achieves an approximate division of labor between encoder and decoder: the instruction side behaves more like encoding, while the response side behaves more like decoding.
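The mask described above is easy to construct explicitly. This sketch builds a boolean matrix in which entry (i, j) says whether position i may attend to position j; the function names and the 3-token instruction with a 2-token response are illustrative.

```python
from typing import List


def causal_mask(total_len: int) -> List[List[bool]]:
    """Standard causal mask: position i attends to positions j <= i."""
    return [[j <= i for j in range(total_len)] for i in range(total_len)]


def prefix_lm_mask(prefix_len: int, total_len: int) -> List[List[bool]]:
    """PrefixLM mask: instruction tokens see each other fully; the rest is causal."""
    return [
        [j <= i or (i < prefix_len and j < prefix_len) for j in range(total_len)]
        for i in range(total_len)
    ]


def render(mask: List[List[bool]]) -> List[str]:
    """Render each row as a string of 1s (visible) and 0s (hidden)."""
    return ["".join("1" if allowed else "0" for allowed in row) for row in mask]


# Three instruction tokens followed by two response tokens.
for row in render(prefix_lm_mask(prefix_len=3, total_len=5)):
    print(row)
```

The first three rows are identical because every instruction token sees the whole instruction, and no instruction token sees the response. The last two rows follow the usual causal pattern.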

The paper's attention analysis shows that, compared with a pure causal mask, PrefixLM yields higher attention entropy and more global, more diverse attention patterns. The change is more than a mask tweak: it improves how the model uses the instruction.
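Attention entropy here refers to the Shannon entropy of a query's attention weights: low when attention concentrates on a few tokens, high when it is spread out. The sketch below computes it for two made-up distributions; the numbers are illustrative and are not taken from the paper.

```python
import math
from typing import Sequence


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one row of attention weights."""
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("attention weights must sum to 1")
    return -sum(p * math.log(p) for p in weights if p > 0.0)


peaked = [0.97, 0.01, 0.01, 0.01]    # attention concentrated on one token
spread = [0.25, 0.25, 0.25, 0.25]    # attention spread evenly over four tokens

print(f"peaked: {attention_entropy(peaked):.4f} nats")
print(f"spread: {attention_entropy(spread):.4f} nats")
```

For n tokens the maximum possible entropy is log(n), reached by the uniform distribution.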

Computing loss only on the response, the PrefixLM attention mask, and the resulting differences in attention distribution.

The effects of these design choices can be clearly observed in the ablation studies.

Holding training FLOPs constant, the research team added answer-only prediction, PrefixLM, and the HRM architecture one after another and observed how model performance changed.

Taking ARC-Challenge as an example, a 1B Transformer scores 51.91 with full-sequence prediction and causal masking, 62.88 when predicting only the answer, 74.32 once PrefixLM is added, and 81.91 after switching to the HRM architecture.

On MATH, scores improved sequentially from 35.44 to 47.04, 48.36, and 56.16. On GSM8K, scores also increased sequentially from 48.37 to 69.75, 75.06, and 84.53.
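The per-stage gains implied by these numbers can be tabulated directly. The scores below are the ones quoted in this article; the stage labels and helper function are illustrative.

```python
from typing import Dict, List

STAGES = ["+ answer-only", "+ PrefixLM", "+ HRM"]

# Scores in order: baseline (full-sequence, causal), + answer-only, + PrefixLM, + HRM.
ABLATION: Dict[str, List[float]] = {
    "ARC-Challenge": [51.91, 62.88, 74.32, 81.91],
    "MATH": [35.44, 47.04, 48.36, 56.16],
    "GSM8K": [48.37, 69.75, 75.06, 84.53],
}


def stage_gains(scores: List[float]) -> List[float]:
    """Point gain contributed by each successive stage."""
    return [round(after - before, 2) for before, after in zip(scores, scores[1:])]


gains = {bench: stage_gains(scores) for bench, scores in ABLATION.items()}
totals = {bench: round(scores[-1] - scores[0], 2) for bench, scores in ABLATION.items()}

for bench in ABLATION:
    parts = ", ".join(f"{stage} {g:+.2f}" for stage, g in zip(STAGES, gains[bench]))
    print(f"{bench}: {parts} (total {totals[bench]:+.2f})")
```

Read this way, the answer-only objective accounts for the largest single step on MATH and GSM8K, PrefixLM for the largest step on ARC-Challenge, and the HRM architecture adds between roughly 7.6 and 9.5 points on all three benchmarks.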

These results show that HRM-Text's efficiency does not stem from a single modification, but from the combined effect of three changes: a hierarchical recursive architecture that increases effective computational depth; an answer-only objective that focuses the training signal on task completion; and PrefixLM, which improves how the model integrates instruction context.

To check the reliability of the results, Sapient Intelligence systematically tested for data contamination. HRM-Text was trained exclusively on publicly available, traceable data, and the evaluation sets were analyzed for contamination. Under the strictest Clean Split condition, the model still achieved gains consistent with the main experiment, indicating that the improvements come from the architecture and training design rather than test set leakage. See the paper for the detailed analysis.

When placing HRM-Text within a broader comparison of smaller models, its characteristics also become evident.

It excels on benchmarks focused on task execution and reasoning, such as MATH, GSM8K, DROP, and ARC-Challenge; on benchmarks like MMLU that rely more heavily on broad knowledge coverage, it is competitive but not leading.

For example, the paper lists Qwen3.5 2B at 64.5 on MMLU, higher than HRM-Text's 60.7; OLMo3 7B reaches 65.8. On MATH, however, HRM-Text's 56.2 is higher than the scores listed in the table for Qwen3.5 2B, Llama 3.2 3B, Gemma3 4B, and OLMo3 7B.

This difference is not hard to understand.

With limited training data and parameter scale, the model struggles to cover a sufficiently broad range of factual knowledge. HRM-Text is better understood as a compact model optimized for task execution and reasoning, rather than a general-purpose product model that has achieved comprehensive knowledge coverage, dialogue alignment, and engineering refinement.

The team also provided a more specific explanation in the interview: limited training data means the model has not adequately covered the long tail of the data; a smaller parameter scale means that even if the model has encountered some low-frequency information, it is harder to stably retain it in the parameters.

The paper proposes a subsequent direction: decoupling the reasoning core from the knowledge storage component. In the future, compact recursive models similar to HRM-Text can focus on computation, planning, and task execution, while factual coverage can be handled by retrieval systems, external knowledge bases, or learnable memory modules.

The team stated in the interview that they have achieved some early results in the direction of "reasoning-knowledge decoupling," but have not yet disclosed specific experiments.

This does not mean that knowledge can be easily stripped away from the model. How external knowledge enters multi-round internal computations, how retrieval results interact with latent space states, and how memory modules are trained still require systematic experimentation.

HRM-Text is also not the first model to explore recursive computation, latent space reasoning, or PrefixLM. Works such as Looped Transformer, RINS, Huginn, and Ouro have all investigated parameter reuse, internal loops, or latent space computation to varying degrees. Conditional generation and PrefixLM also have well-established research histories.

HRM-Text is better positioned as a low-budget, from-scratch pretraining framework that integrates hierarchical dual-time-scale recurrence, methods for stabilizing recurrent training, the answer-only objective, and PrefixLM, delivering reproducible results at the 1B scale.

Bringing HRM into an open language environment

HRM-Text is not Sapient's first exploration of hierarchical recursive computation.

In June 2025, the team proposed the HRM (Hierarchical Reasoning Model) architecture, which consists of the high-level and low-level modules, dual-time-scale computation, and latent space iteration mentioned earlier.

Paper Title: Hierarchical Reasoning Model
Paper URL: https://arxiv.org/pdf/2506.21734

The team subsequently open-sourced the first-generation model, HRM-Symbolic, in July 2025, primarily targeting symbolic reasoning tasks with well-defined boundaries. Through hierarchical modules, dual-time-scale computation, and latent space reasoning, it demonstrated the HRM architecture’s potential for handling combinatorial search problems in tasks such as complex Sudoku, maze navigation, and ARC-AGI.

But this is only the first step.

Whether it’s Sudoku or maze navigation, these tasks have relatively clear rules, state spaces, and verifiable answers. In contrast, the environment faced by language models is far more open: natural language is ambiguous, knowledge domains are broader, and output formats are more diverse. Models must not only perform reasoning but also understand context, organize language, and generate appropriate responses in open-ended scenarios.

More importantly, recursive architectures that work well on symbolic tasks cannot necessarily be transferred directly to language modeling. As recursive depth increases, activations and gradients become more prone to instability. HRM-Text introduces MagicNorm and warmup deep credit assignment specifically so that deep recursion can be trained stably in language models.

If HRM-Symbolic addresses the question, "Is this architectural approach feasible?" then HRM-Text begins to answer another, more critical question: Does this architecture remain effective when tasks enter an open-domain language environment?

Based on the current results, the answer is at least worth further exploration.

Notably, recursive latent space reasoning is also attracting attention from other research teams.

On May 19, 2026, the paper "Generative Recursive Reasoning," co-authored by Turing Award laureate Yoshua Bengio, was published. It introduces GRAM (Generative Recursive Reasoning Models), which builds directly on the hierarchical recursive reasoning framework established by HRM and adds a probabilistic multi-trajectory reasoning mechanism.

This work suggests that HRM is moving beyond a standalone model innovation and is starting to serve as a foundation for further research on reasoning models, drawing leading scholars into the direction.

Why did Sapient rebuild its architecture?

Sapient Intelligence's exploration of HRM traces back to the earlier technical paths of its two founders.

Wang Guan, founder of Sapient, has long been focused on reinforcement learning and has conducted related research and engineering work at Tsinghua University’s Brain and Intelligence Laboratory, the Shanghai Artificial Intelligence Laboratory, and Pony.ai. He is also a core developer of OpenOrca and the author of OpenChat. Co-founder Chen William has R&D experience at companies such as DJI and Hesai Technology, and previously led technology commercialization efforts at Tsinghua University’s Innovation and Entrepreneurship Center.

The two began their AGI exploration in 2020, when large language models had not yet demonstrated the influence they have today. Rather than focusing solely on scaling up, they were more interested in another question: Could intelligent systems learn continuously, like humans, by interacting with their environment and making progress with limited resources?

Therefore, the team initially focused on reinforcement learning, dedicating their efforts to scenarios such as autonomous driving and robotics. As GPT-3 and ChatGPT emerged, they shifted their direction to explore the potential of combining reinforcement learning with large language models—a pursuit that eventually led to the creation of OpenChat.

The success of OpenChat validated the value of optimizing post-training data quality and training objectives, but it also prompted the team to consider a more fundamental question: If the model’s underlying architecture remains Transformer-based, will performance gains continue to rely increasingly on more parameters, more data, and larger computational clusters, regardless of improvements in post-training methods?

For a startup, this is not just a theoretical issue. Continuing along the mainstream path means entering a race dominated by capital and computational power. Sapient ultimately chose to shift its focus to the underlying architecture: rather than merely optimizing how existing models are trained, it rethought how intelligent systems should organize computation.

HRM has thus become the team's core technology roadmap.

Sapient summarizes its long-term direction as Lean General Intelligence: not merely pursuing larger models, but seeking more efficient, accessible, and generalizable intelligent systems. HRM-Symbolic and HRM-Text are two milestone outcomes along this path.

HRM-Text provides a data-backed, reproducible, and further testable case: in a domain typically requiring massive tokens and large clusters, modifying the computational architecture and training objective enables a 1B parameter model to achieve performance within the range of certain 2B to 7B open-source models at a significantly lower budget.

The real challenges may still lie ahead. The team noted in interviews that if HRM is expanded to larger scales in the future—or integrated with MoE, retrieval systems, and learnable memory—the stability issues of the recursive architecture could compound with the training difficulties of new modules. Questions such as where to place expert modules within the network, how to optimize them, and how external knowledge should enter multi-round internal computations still require systematic experimentation.

Beyond scaling, another path has just begun

Undeniably, HRM-Text is not yet a mature alternative capable of fully replacing scaling laws. Its data mixture ratios, actual inference cost, potential for scaling to larger parameter counts, and performance on extremely complex open-ended tasks all still need time, and independent replication by the open-source community, to be validated.

It is also not a rejection of scaling. Over the past few years, increasing the scale of parameters, data, and computational power has repeatedly proven its effectiveness. Future model advancements will likely still require higher-quality data, more sufficient computational resources, and more systematic engineering efforts.

However, what HRM-Text demonstrates may be more than just a new model architecture.

If the primary growth axis of AI over the past decade has been the continuous expansion of parameter scale, data scale, and training compute, then what HRM explores is another, more fundamental question: can the computation process itself become a new growth axis?

The fundamental idea of the standard Transformer is to enhance the model's representational capacity by stacking more parameters. HRM, however, seeks to enable a limited number of parameters to engage in multiple rounds of hierarchical recursive computations in the latent space, allowing the model to perform deeper internal state updates before producing an output. Subsequent research, such as GRAM, further demonstrates that this approach can be extended toward probabilistic modeling, multi-trajectory computation, and increased width during inference.

From this perspective, the value of HRM-Text lies not only in the benchmark performance achieved by a model with approximately 1 billion parameters or in the GPU time saved through a low-cost pretraining experiment.

More importantly, it provides a reproducible, comparable, and potentially falsifiable or improvable case: redesigning the computational architecture, beyond simply scaling up model size, may also alter the relationship between performance, cost, and capability.

In an industry profoundly shaped by scaling, this possibility alone is significant. As next-generation intelligent systems develop, their advancement may come not only from more parameters, more data, and more compute, but also from answering a more fundamental question: how should models actually think?