DeNovoSWE Dataset Released for Long-Horizon Code Generation

As the capabilities of LLM code agents continue to improve, more researchers recognize that it is time to move to the next stage: long-horizon tasks that better reflect real-world requirements. Benchmarks for evaluating such tasks have emerged as a result, including NL2RepoBench and BeyondSWE. The expected role of code agents is gradually shifting from repository maintainer to architect, capable of planning and completing long-horizon coding tasks across entire repositories.

Recently, the Gaoling School of Artificial Intelligence at Renmin University of China completed research in this direction and released the DeNovoSWE dataset, which focuses on long-range software engineering tasks, in particular zero-to-one code generation at the repository level.

Paper link: https://arxiv.org/pdf/2606.10728

Repository link: https://github.com/AweAI-Team/DeNovoSWE

Data link: https://huggingface.co/collections/AweAI-Team/denovoswe

The team built the dataset using the Divide & Conquer and Critic & Repair mechanisms, scaling up the construction of long-range SWE tasks to produce an open-source, high-quality dataset of 4,818 real-world examples. It provides large-scale data for training the long-range capabilities of code agents and significantly improves their performance on long-range tasks.

The paper also proposes a score-based trajectory filtering method that accounts for task difficulty, easing the trade-off between the proportion of difficult tasks retained and trajectory quality.

Experiments show that Qwen3-30B-A3B-Instruct trained on DeNovoSWE improved from 5.8% to 47.2% on BeyondSWE-Doc2Repo and from 4.3% to 23.0% on NL2RepoBench, demonstrating a significant enhancement in repository-level code generation capability through long-range data.

Rebuild the entire repository starting from a document

Over the past year, as large-scale SWE datasets such as Scale-SWE have grown, code agents have made rapid progress on real-world software engineering benchmarks such as SWE-bench. However, as models become increasingly adept at “fixing an issue” or “changing a few lines of buggy code,” a more critical question emerges: do agents truly possess long-range software engineering capabilities? The results of cutting-edge models on BeyondSWE-Doc2Repo and NL2RepoBench suggest that performance remains inadequate.

Real-world software development often involves understanding requirements, planning architecture, creating files, designing APIs, managing dependencies, integrating modules, and ultimately ensuring the entire codebase passes testing—not just modifying a function or adding a conditional check.

In other words, the challenge lies in long-horizon, repository-level generation: producing a complete, executable, and verifiable software repository starting from a task document. This is precisely the problem DeNovoSWE aims to solve.

High-quality documentation for the "generate repository from scratch" task

In document-to-repository generation, the document is not just a README or a simple API list—it is the sole entry point for the intelligent agent to reconstruct the entire repository.

A high-quality task document must meet at least two core criteria.

First, it must be well-organized.

Repository-level tasks are inherently complex, involving multiple modules, interfaces, configurations, data structures, and interaction flows. If documentation merely stacks function descriptions together, agents can easily become lost in fragmented information. Documentation should therefore open with a clear overview of the repository, then divide its sections by capability or workflow so that each part corresponds to a well-defined functional boundary.

Second, it must support reliable evaluation.

The document cannot be too short, or the task becomes underdefined, forcing the model to rely on unfounded guesses to pass the evaluation; nor can it be too long, or it would directly reveal implementation details, removing the challenge from the task.

High-quality documentation should describe the key behaviors on which evaluation depends, including import paths, public APIs, inputs and outputs, default parameters, error behaviors, configuration options, pattern strings, and return fields. It should also outline the general functionality to be implemented. In other words, the documentation must be sufficient for an agent to reproduce testable behavior without becoming a copy of the implementation code.

This is also the core idea of DeNovoSWE: making documentation readable, implementable, and verifiable.

DeNovoSWE method

DeNovoSWE frames "generating a complete repository from documentation" as a large-scale, verifiable long-range software engineering task. Instead of relying on manually written documentation, it automatically constructs high-quality examples using a sandboxed multi-agent workflow. The approach can be summarized in two stages: Divide and Conquer.

In the Divide stage, the system first analyzes the target repository and breaks it down into multiple repository capabilities.

Each capability corresponds to a core function or workflow in the repository, such as authentication and connection, data reading and writing, batch processing, export workflows, and more. This breaks down the large repository generation problem into several well-structured document sections.

Meanwhile, DeNovoSWE runs the original unit tests and collects execution traces to identify which functions, classes, and interfaces genuinely impact evaluation, further distinguishing between direct components, core indirect components, and non-core indirect components: interfaces directly invoked by the tests must be thoroughly documented; core indirect components that affect observable behavior must also be covered; while non-core internal implementations can be left to the agent’s discretion.
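
The three-way classification described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `classify_components`, the shape of the traced call graph, the component names, and the `affects_behavior` set (standing in for whatever criterion decides that an indirect component influences observable behavior) are all assumptions.

```python
from collections import deque

def classify_components(test_calls, call_graph, affects_behavior):
    """Split traced components into direct, core indirect, and non-core indirect.

    test_calls:       components invoked directly by the unit tests (assumed input)
    call_graph:       dict mapping a component to the components it calls (from traces)
    affects_behavior: components assumed to influence test-observable behavior
    """
    direct = set(test_calls)
    reached = set()
    queue = deque(direct)
    # Breadth-first walk over traced calls to collect everything reached indirectly.
    while queue:
        name = queue.popleft()
        for callee in call_graph.get(name, ()):
            if callee not in direct and callee not in reached:
                reached.add(callee)
                queue.append(callee)
    core_indirect = reached & set(affects_behavior)
    non_core_indirect = reached - core_indirect
    return direct, core_indirect, non_core_indirect

# Hypothetical trace data for a small client library.
direct, core, non_core = classify_components(
    test_calls={"Client.connect", "export_csv"},
    call_graph={
        "Client.connect": ["_open_socket", "parse_config"],
        "export_csv": ["format_row"],
        "format_row": ["_pad"],
    },
    affects_behavior={"parse_config", "format_row"},
)
```

Under these assumptions, `direct` would need thorough documentation, `core` would need coverage, and `non_core` could be left to the agent.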

In the Conquer stage, DeNovoSWE generates documentation capability by capability using the Draft-Critic-Repair mechanism. The Draft agent first writes an initial draft; the Critic agent reviews the document for missing key APIs, behavioral contracts, or structural information; and the Repair agent then revises the document based on that feedback. This cycle iterates until each capability section is clear, complete, and aligned with the evaluation.
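
The Draft-Critic-Repair cycle can be pictured as a simple loop. The sketch below is illustrative only: the three agents are passed in as plain callables, the toy agents stand in for LLM calls, and the `max_rounds` cap and the `REQUIRED` checklist are assumptions that do not come from the paper.

```python
from typing import Callable, List

def draft_critic_repair(
    capability: str,
    draft: Callable[[str], str],
    critic: Callable[[str], List[str]],
    repair: Callable[[str, List[str]], str],
    max_rounds: int = 3,  # assumed safety cap on iterations
) -> str:
    """Write a section, then alternate critique and repair until no issues remain."""
    doc = draft(capability)
    for _ in range(max_rounds):
        issues = critic(doc)
        if not issues:
            break
        doc = repair(doc, issues)
    return doc

# Toy stand-ins for the three agents (hypothetical, for illustration only).
REQUIRED = ["connect(host, port)", "raises TimeoutError"]

def toy_draft(capability: str) -> str:
    return f"Capability: {capability}\n"

def toy_critic(doc: str) -> List[str]:
    # Report every required contract the section does not mention yet.
    return [item for item in REQUIRED if item not in doc]

def toy_repair(doc: str, issues: List[str]) -> str:
    # Append one line per missing contract.
    return doc + "".join(f"- {issue}\n" for issue in issues)

section = draft_critic_repair(
    "authentication and connection", toy_draft, toy_critic, toy_repair
)
```

Keeping the agents as injected callables makes the control flow easy to test without any model calls.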

Finally, the capability documents are merged into a single comprehensive task document, which serves as the sole basis for the agent to generate a repository from scratch.

Difficulty: Why is this a long-horizon task?

The difficulty of DeNovoSWE stems from a fundamental shift: the task is no longer about fixing individual issues, but about generating entire repositories.

In traditional SWE tasks, agents typically work with an existing repository, needing only to locate bugs, modify localized code, and pass tests.

In DeNovoSWE, the agent operates in a cleaned environment: the original source code and tests have been removed, the Git history has been reset, and potential leakage channels—such as caches, site-packages leftovers, pip wheels, and temporary compilation artifacts—have been purged. This means the agent must rely entirely on documentation to reconstruct the entire repository. It must plan the project structure, create module files, define public interfaces, implement cross-file interactions, manage dependencies and configurations, and iteratively fix errors through multiple rounds of editing and test feedback.
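
As a rough picture of what purging leakage channels might involve, the sketch below deletes cache-like directories and files from a workspace. The directory names and file suffixes are illustrative guesses, not the paper's cleanup list, and the helper `purge_leakage` is hypothetical. It does not cover removing the original source and tests or resetting Git history.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List

# Illustrative patterns only; the actual cleanup rules are not given in the article.
LEAKY_DIR_NAMES = {"__pycache__", ".pytest_cache", "build"}
LEAKY_FILE_SUFFIXES = {".pyc", ".whl"}

def purge_leakage(root: Path) -> List[str]:
    """Delete cache-like directories and files under root; return removed relative paths."""
    removed = []
    # sorted() materializes the listing first, so deletions do not disturb traversal.
    for path in sorted(root.rglob("*")):
        if not path.exists():
            continue  # already removed together with a parent directory
        rel = path.relative_to(root).as_posix()
        if path.is_dir() and path.name in LEAKY_DIR_NAMES:
            shutil.rmtree(path)
            removed.append(rel)
        elif path.is_file() and path.suffix in LEAKY_FILE_SUFFIXES:
            path.unlink()
            removed.append(rel)
    return removed

# Small demonstration on a throwaway workspace.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / "pkg" / "__pycache__").mkdir(parents=True)
    (workspace / "pkg" / "__pycache__" / "mod.cpython-311.pyc").write_bytes(b"")
    (workspace / "dist").mkdir()
    (workspace / "dist" / "pkg-1.0-py3-none-any.whl").write_bytes(b"")
    (workspace / "TASK.md").write_text("task document")
    removed = purge_leakage(workspace)
    remaining = sorted(p.relative_to(workspace).as_posix() for p in workspace.rglob("*"))
```

In this demonstration only the task document and the now-empty directories survive, which matches the idea that the agent sees the documentation and nothing else.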

Any deviation in API signatures, response fields, exception types, or default behaviors may cause test failures. Errors can also accumulate over time: a poorly designed module early on may affect multiple subsequent files and call chains.

To account for variations in repository difficulty, DeNovoSWE also introduces difficulty-aware trajectory filtering. In simple terms, easier tasks should require a higher pass rate, while difficult tasks should not be entirely discarded simply because they fail to achieve a perfect score. DeNovoSWE sets different filtering thresholds for various difficulty levels based on structural complexity and LLM-assessed difficulty, thereby achieving a balance between quality and diversity.
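
One way to picture difficulty-aware filtering is a per-level minimum pass rate. Everything concrete in the sketch below is an assumption made for illustration: the threshold values, the three difficulty levels, the equal-weight average of the two difficulty signals, and the task names. The article states only that thresholds differ by difficulty and that difficulty is based on structural complexity and LLM-assessed difficulty.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical minimum pass rates per difficulty level (not taken from the paper).
MIN_PASS_RATE: Dict[str, float] = {"easy": 1.0, "medium": 0.8, "hard": 0.5}

@dataclass
class Trajectory:
    task_id: str
    structural_complexity: float  # assumed normalized to [0, 1]
    llm_difficulty: float         # assumed normalized to [0, 1]
    pass_rate: float              # fraction of unit tests passed

def difficulty_level(t: Trajectory) -> str:
    """Bucket a task by the average of its two difficulty signals (assumed rule)."""
    score = (t.structural_complexity + t.llm_difficulty) / 2
    if score < 0.33:
        return "easy"
    if score < 0.66:
        return "medium"
    return "hard"

def filter_trajectories(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep a trajectory only if it meets the threshold for its difficulty level."""
    return [
        t for t in trajectories
        if t.pass_rate >= MIN_PASS_RATE[difficulty_level(t)]
    ]

candidates = [
    Trajectory("small-cli", 0.1, 0.2, 0.9),  # easy task, imperfect run
    Trajectory("small-lib", 0.2, 0.2, 1.0),  # easy task, perfect run
    Trajectory("orm-core", 0.9, 0.8, 0.6),   # hard task, partial success
    Trajectory("compiler", 0.9, 0.9, 0.3),   # hard task, mostly failing
]
kept = [t.task_id for t in filter_trajectories(candidates)]
```

With these made-up numbers, the imperfect run on an easy task is dropped while the partially successful run on a hard task is kept, which is the behavior the paragraph above describes.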

This is especially important for long-horizon tasks: the more complex the repository, the harder it is to pass all tests in a single attempt. Yet low-scoring, partially successful trajectories on difficult repositories still contain valuable information about long-horizon planning and execution.

Experimental results

DeNovoSWE ultimately constructed 4,818 high-quality document-to-repository task instances, creating an executable, evaluable, and trainable long-range software engineering environment.

The experimental results show that DeNovoSWE significantly enhances the model's long-range repository generation capability. On Qwen3-30B-A3B-Instruct, the original model achieved only 5.8% on BeyondSWE-Doc2Repo and 4.3% on NL2RepoBench. Training with conventional issue-level SWE data improved performance to 29.2% and 18.3%, demonstrating that standard SWE data does have transfer effects. However, when the model was trained with DeNovoSWE, performance further increased to 47.2% and 23.0%.
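
To make the comparison concrete, the short script below computes the absolute gains in percentage points from the scores quoted above. The dictionary keys such as `issue_level` are labels chosen here for illustration.

```python
# Scores (%) quoted in the article for Qwen3-30B-A3B-Instruct.
scores = {
    "BeyondSWE-Doc2Repo": {"base": 5.8, "issue_level": 29.2, "denovoswe": 47.2},
    "NL2RepoBench": {"base": 4.3, "issue_level": 18.3, "denovoswe": 23.0},
}

def gains(row):
    """Return absolute gains in percentage points, rounded to one decimal."""
    return {
        "issue_level_vs_base": round(row["issue_level"] - row["base"], 1),
        "denovoswe_vs_base": round(row["denovoswe"] - row["base"], 1),
        "denovoswe_vs_issue_level": round(row["denovoswe"] - row["issue_level"], 1),
    }

results = {name: gains(row) for name, row in scores.items()}
for name, row in results.items():
    print(name, row)
```

On these figures, DeNovoSWE training adds 18.0 points over issue-level training on BeyondSWE-Doc2Repo and 4.7 points on NL2RepoBench.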

This indicates that data oriented toward "fixing bugs" cannot fully replace long-range data oriented toward "generating complete repositories." To enable agents to truly master repository-level engineering, training environments must be specifically designed for long-range tasks.

On the stronger Qwen3.5-35B-A3B backbone, DeNovoSWE also delivers consistent gains: BeyondSWE-Doc2Repo increases from 43.8% to 50.0%, and NL2RepoBench rises from 23.5% to 27.1%. This further demonstrates that DeNovoSWE’s benefits are not due to accidental adaptation to a specific model, but stem from the high-quality, long-range data itself.

Conclusion

The next stage of code agents is not just to fix individual issues faster, but to understand documentation, plan architecture, organize modules, implement interfaces, and ultimately generate a complete, runnable software repository.

DeNovoSWE systematically turns this goal into a trainable, verifiable, and scalable dataset. It addresses a key question: what kind of data is truly needed to train agents with long-horizon software engineering capabilities?

The answer is not more fragmented code or simpler problems, but high-quality, structured, evaluation-aligned, leakage-resistant full-repository generation tasks.

Start from a document and rebuild the entire repository. This is the hurdle that long-range code agents must overcome.

This article is from the WeChat public account "New Intelligence Yuan," edited by LRST.