Anthropic trained Claude Code through the Marlin project, recruiting approximately 1,000 external software engineers via the data company Snorkel AI to perform A/B testing on code generated by the model at a rate of $280 per task.

Author and source: AI New Era

Recently, a report brought to light the "secret to progress" behind Claude Code.

Business Insider reports that Anthropic has a dedicated project to improve Claude Code, refining it through feedback from approximately 1,000 software engineers.

This project, internally codenamed "Marlin," is run through the data company Snorkel AI.

Back in January, Boris Cherny, head of Claude Code, revealed that he hadn’t handwritten a single line of code in over two months, with Claude submitting 22 pull requests in one day and 27 the day before—all written by the model.

There have also been reports that most of Anthropic's internal code is generated by AI.

This is where it gets interesting.

On one hand, Anthropic’s own core engineers have delegated a significant amount of coding work to the model; on the other hand, it is spending money to hire approximately 1,000 external engineers to personally teach Claude Code what constitutes “good code.”

What exactly does $280 per task buy?

According to Business Insider, the external engineers hired by the Marlin project all have software engineering backgrounds. Their work sounds very much like a real code review.

The process is roughly as follows. First, the engineer selects a GitHub repository from a list of thousands. Then they create a pull request, the step where developers submit code changes. Finally, they write a prompt that clearly outlines the task.

The model then generates two versions of the code, and the external engineer runs an A/B test, comparing the two outputs and selecting the better one.

Each task pays $280 and takes about an hour. Some require multiple rounds of review with Snorkel.

The evaluation criteria assess the correctness, security, reliability, and maintainability of production-grade code.
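To make the shape of this feedback concrete, here is a minimal sketch of how one such A/B comparison could be recorded. The four criteria come from the report; everything else (the Comparison class, the "A"/"B"/"tie" encoding, and the majority rule in preferred) is a hypothetical illustration, not Snorkel's or Anthropic's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Criterion(Enum):
    # The four criteria named in the report.
    CORRECTNESS = "correctness"
    SECURITY = "security"
    RELIABILITY = "reliability"
    MAINTAINABILITY = "maintainability"


@dataclass
class Comparison:
    """One engineer's verdict on two model outputs (hypothetical schema)."""
    repo: str
    prompt: str
    # Per-criterion verdict: "A", "B", or "tie" (assumed encoding).
    verdicts: dict = field(default_factory=dict)

    def preferred(self) -> str:
        """Assumed aggregation: the output that wins more criteria."""
        a_wins = sum(1 for v in self.verdicts.values() if v == "A")
        b_wins = sum(1 for v in self.verdicts.values() if v == "B")
        if a_wins == b_wins:
            return "tie"
        return "A" if a_wins > b_wins else "B"


example = Comparison(
    repo="example-org/example-repo",  # placeholder, not a real task
    prompt="Refactor metadata handling without changing behavior.",
    verdicts={
        Criterion.CORRECTNESS: "tie",
        Criterion.SECURITY: "A",
        Criterion.RELIABILITY: "A",
        Criterion.MAINTAINABILITY: "B",
    },
)
print(example.preferred())
```

In a real pipeline the engineer's overall choice would probably be recorded directly rather than derived from a vote count; the majority rule here only keeps the sketch short.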

Here are two real examples.

In one task, an external engineer asked the model to refactor how the system handles execution metadata, aiming to make the code clearer and more maintainable without changing functionality.

In another task, an external engineer performed a security fix for MLflow, an open-source machine learning platform, addressing a command injection vulnerability that could occur when it downloads Python packages during model loading. The requirements were clear: block command injection without interfering with legitimate pip (Python package manager) options.
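For readers unfamiliar with this class of bug, the sketch below shows one common way to meet both requirements at once. It is not MLflow's actual patch: the function name, the option allowlist, and the requirement pattern are all assumptions made for illustration. The idea is to pass pip an argument list rather than a shell string, accept only known options, and reject anything that does not look like a plain requirement.

```python
import re
import subprocess
import sys

# Hypothetical allowlist: pip options this sketch treats as legitimate.
ALLOWED_OPTIONS = {"--no-deps", "--upgrade", "--no-cache-dir", "--pre"}

# Conservative pattern for requirements such as "numpy" or
# "scikit-learn==1.4.2". It must start with a letter or digit, so a
# value can never be mistaken for an option or carry shell syntax.
REQUIREMENT_RE = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._-]*"          # package name
    r"(\[[A-Za-z0-9._,-]+\])?"             # optional extras
    r"((==|>=|<=|~=|!=|<|>)[A-Za-z0-9.*+!_-]+)?"  # optional version pin
)


def build_pip_command(requirements, options=()):
    """Return an argv list for pip, or raise ValueError on bad input."""
    for opt in options:
        if opt not in ALLOWED_OPTIONS:
            raise ValueError(f"disallowed pip option: {opt!r}")
    for req in requirements:
        if not REQUIREMENT_RE.fullmatch(req):
            raise ValueError(f"suspicious requirement: {req!r}")
    return [sys.executable, "-m", "pip", "install", *options, *requirements]


def install(requirements, options=()):
    # An argv list with the default shell=False means no shell ever
    # interprets characters like ';', '|', or '$(...)'.
    subprocess.run(build_pip_command(requirements, options), check=True)
```

A call such as build_pip_command(["numpy"], ["--no-deps"]) passes, while a requirement like "numpy; rm -rf /" raises ValueError before any process is started.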

The requirements for these tasks go beyond data annotation; they are more like asking a seasoned engineer to directly transfer their internal sense of “this is better” to the model.

Clearly, Anthropic did not purchase code, but rather the judgment of experienced programmers on how to write safer, cleaner code.

Why does it have to be an engineer?

Why is Anthropic going to such lengths? Because Claude Code is no longer just a chatbox for writing code.

Anthropic officially defines it as a project-level AI agent. It can read an entire codebase, plan across files, directly execute modifications, run tests, and iterate on its own based on failed results.

Anthropic's official definition of Claude Code: A set of agents capable of reading codebases, making cross-file changes, running tests, and delivering committed code.

This means it will actively modify files, execute tasks, and interact with the entire codebase.

Anthropic is well aware of the significance of this issue, so it repeatedly addresses Claude Code’s permissions, sandboxing, and approval fatigue in its engineering blog.

By default, high-risk file modifications or command executions require user approval; to reduce approval fatigue from repeated authorizations, Anthropic has also introduced sandboxing, allowing Claude Code to run more securely within predefined file system and network boundaries.
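The idea of a file system boundary can be illustrated with a few lines of Python. This is only a sketch of the general technique, not how Claude Code implements its sandbox; the function names and the rule "anything outside the project root needs approval" are assumptions.

```python
from pathlib import Path


def is_within(root: str, target: str) -> bool:
    """Return True if target, resolved relative to root, stays inside root."""
    root_path = Path(root).resolve()
    # resolve() collapses ".." segments, so traversal attempts are
    # compared by their real destination rather than their spelling.
    target_path = (root_path / target).resolve()
    return target_path == root_path or root_path in target_path.parents


def needs_approval(root: str, target: str) -> bool:
    """Assumed policy: writes inside the boundary are pre-approved."""
    return not is_within(root, target)


print(needs_approval("/tmp/project", "src/main.py"))
print(needs_approval("/tmp/project", "../other/secret.txt"))
```

The trade-off is the one the article describes: a clear boundary lets routine edits proceed without a prompt, so the approvals that remain are the ones worth reading.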

When an AI can execute commands and modify live code, the cost of mistakes becomes entirely different. The training objective also shifts: from "writing correctly" to "writing securely, reliably, and maintainably."

These things cannot be learned from ordinary code corpora. They were once hidden in the code reviews of senior engineers, passed down as human-to-human experience. Now, Anthropic aims to turn them into purchasable data by recruiting human programming experts.

Snorkel: The Underappreciated 'Data Arms Dealer'

The real star of the whole story is Snorkel.

This company emerged from Stanford’s AI Lab in 2019, betting exclusively on the idea that data—not models or computing power—is what truly determines the success of machine learning.

Snorkel's academic roots trace back to its two key founders, Alex Ratner and his Stanford advisor Chris Ré.

Alex Ratner, Co-founder and CEO of Snorkel AI

In 2015, Snorkel was merely an "afternoon project" during Ratner's PhD: instead of spending large sums hiring people to manually label data, why not use programs and rules for "weak supervision," enabling models to learn without requiring manual annotation of each example?

With this approach, Snorkel accumulated over 60 papers, and its open-source tools were adopted by Google and Intel, until it was officially spun out as a company in 2019.

Chris Ré, co-founder of Snorkel AI and professor at Stanford

Ratner’s mentor, Chris Ré, is also a formidable figure.

He is a Stanford professor, a MacArthur Fellow, a serial entrepreneur whose projects have been acquired by Apple, and the founder of SambaNova, which once reached a valuation of $5 billion.

The most interesting thing is the company's transformation.

Back then, Snorkel aimed to solve the long-standing problem that manual labeling was slow, expensive, and inconsistent, at a time when hand-labeling data consumed about 80% of AI development time. Snorkel's original vision was therefore to free humans from the burden of labeling as much as possible.

But in the era of frontier models, the most scarce and valuable resource has returned to people—specifically, the taste and judgment of experts such as PhDs, doctors, lawyers, and senior engineers. This company, which began by “using fewer people,” now finds its most profitable business to be assembling an expensive army of experts to train frontier AI—Marlin is just one such project.

Its workflow aligns perfectly with the needs of the Marlin project.

The Snorkel website describes this workflow as follows: first, define the task, scoring criteria, and validators to establish "what constitutes good"; then run the expert review pipeline, in which the author, multiple reviewers, and a final decision-maker each review the work while a complete audit trail is maintained throughout.

The Snorkel website states: After disagreements in review scores are resolved through arbitration, the changes are documented in the scoring criteria revision log, with every modification traceable to who made it, when, and based on what rationale.

It also sets up the evaluation environment and data so that the same tasks can be repeatedly run across different model versions, yielding reproducible and comparable scores. To ensure the scores are clean and comparable, evaluators must not be influenced by the version being tested. This is why these external engineers do not know which version they are evaluating.
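A blinded comparison of this kind is simple to sketch. The code below is an illustration of the general technique, not Snorkel's tooling; the function names and version strings are hypothetical. Outputs are shown under neutral labels in random order, and the mapping back to model versions is kept away from the evaluator until the vote is recorded.

```python
import random


def blind_pair(outputs, rng):
    """Hide version names behind neutral labels.

    outputs: dict mapping a version name to its generated code
             (exactly two entries).
    Returns (shown, key): shown is what the evaluator sees, key is the
    private label-to-version mapping.
    """
    versions = list(outputs)
    if len(versions) != 2:
        raise ValueError("expected exactly two outputs")
    rng.shuffle(versions)  # random order, so position carries no signal
    labels = ["A", "B"]
    shown = {label: outputs[v] for label, v in zip(labels, versions)}
    key = dict(zip(labels, versions))
    return shown, key


def unblind(choice, key):
    """Map the evaluator's choice ("A" or "B") back to a version name."""
    return key[choice]


outputs = {"candidate-old": "def f(): ...", "candidate-new": "def g(): ..."}
shown, key = blind_pair(outputs, random.Random())
winner = unblind("A", key)  # known only after the vote is in
```

Because the evaluator only ever sees "A" and "B", any preference for a particular version has to come from the code itself.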

The pricing also speaks volumes.

Snorkel's publicly listed legal contract roles pay $10 to $100 per high-quality task; by contrast, Marlin's software engineering tasks pay $280 each and take about an hour, roughly two and a half times the industry rate (Scale AI and Mercor pay engineers up to $110 per hour). Top experts can earn over $3,000 per week.
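The "two and a half times" figure is easy to check from the numbers quoted above. The short calculation below assumes each task takes exactly one hour and, for the weekly figure, that the $3,000 is earned entirely on $280 tasks; both are simplifying assumptions, not details from the report.

```python
import math

task_pay = 280              # dollars per Marlin task (from the report)
hours_per_task = 1.0        # "about an hour"; assumed to be exactly one
competitor_hourly_cap = 110 # top hourly rate cited for Scale AI and Mercor

effective_hourly = task_pay / hours_per_task
ratio = effective_hourly / competitor_hourly_cap
print(round(ratio, 2))  # 2.55, i.e. roughly two and a half times

# Smallest whole number of tasks whose pay exceeds $3,000 in a week.
tasks_needed = math.ceil(3000 / task_pay)
print(tasks_needed)  # 11 (11 * 280 = 3080)
```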

The feedback from these external engineers recruited by Snorkel is truly expensive.

The client list includes Google, Mistral, and Anthropic. In May 2025, Snorkel completed its Series D funding round at a $1.3 billion valuation.

Kate Jensen, Anthropic’s head of revenue, said that fully unlocking Claude’s potential requires new evaluation methods incorporating domain experts and human feedback, and Anthropic will continue collaborating with companies like Snorkel.

Companies like Snorkel, Scale, and Mercor were once regarded as "labeling platforms." Today, they have become the invisible supply chains behind cutting-edge model companies.

It is this invisible, global army of experts that feeds the smartest AI.

Several major players

They are competing for the same data.

It's not just Anthropic buying real engineering talent. Several major players are participating in this competition, each with different strategies.

Cursor takes the product-data route.

Its policy clearly states that once users enable privacy mode, their code will never be used for training by Cursor or any third party; only when privacy mode is disabled may Cursor use codebase data, prompts, editing behaviors, and code snippets to improve AI features and train models.

Cursor's Tab model generates over ten billion edited characters per day, with request volume roughly 100 times that of the initial version. Composer goes a step further: trained with reinforcement learning (RL), it learns to invoke tools such as editing and search across a large number of coding task environments, which lets it handle longer-horizon engineering tasks.

The latest Composer 2.5 is now primarily focused on long-running tasks requiring hundreds of steps.

Musk's approach pairs a capital commitment with an acquisition option.

In February this year, xAI was merged into SpaceX. At the end of April, SpaceX secured the right to acquire Anysphere, the parent company of Cursor, for $60 billion this year, or alternatively, to pay $10 billion upfront for deep collaboration. What Musk values most is the world’s most active real-time developer behavior data held by Cursor.

On May 25, Musk announced on X that the training of the next-generation foundational model, Grok V9-Medium, has been completed, with 1.5 trillion parameters—three times that of the current production model. He specifically noted that this performance was achieved before additional fine-tuning with Cursor data, and that programming capabilities will improve significantly after this step. The model is expected to be released in mid-June.

As a result, V9 will be the first Grok to systematically "consume" real developer behavior data.

OpenAI’s subsequent Codex also followed this path. The Codex released in 2025, powered by codex-1, is trained via reinforcement learning on real coding tasks, aiming to generate code that closely resembles human style, adheres to PR conventions, and repeatedly runs tests until they pass; each task runs in an isolated sandbox preloaded with your codebase.

Codex has now been upgraded to OpenAI’s agentic coding platform, powered by its state-of-the-art coding models, with over five million users weekly.

What they are competing for is actually the same thing: process data, just through different paths.

Anthropic first had models but lacked real-world development feedback, so they paid approximately 1,000 engineers to break down the software engineering process into learnable data;

Cursor already has products and real user behavior, as well as its own proprietary programming models such as Tab and Composer. However, compared to OpenAI and Anthropic, it lacks a general-purpose foundational model infrastructure and large-scale training compute power;

Musk also lacks data, so he is prepared to spend tens of billions of dollars to acquire a product entry point that continuously generates developer behavior data;

OpenAI has both models and products, so it built its own sandbox, letting the model repeatedly try, test, correct, and iterate on real coding tasks through reinforcement learning.

Several companies employ different approaches but ultimately converge on using data that increasingly resembles real-world engineering environments to train their AI programming models.

The real moat

It's about human taste and judgment.

A paper titled SWE-chat presents the first large-scale collection of real agent coding conversations: 6,000 sessions, over 63,000 user prompts, and 355,000 tool calls.

It yields a sobering statistic: only 44% of the code generated by agents ultimately made it into users' commits. More than half was deleted, modified, or overturned.

SWE-chat real-world test: Vibe coding accounts for 41% of conversations, but only 44% of the code generated by the agent is ultimately submitted; users counteract the model’s output in 44% of interaction turns through corrections, error reports, or interruptions.

This indicates that older benchmarks like HumanEval have been saturated, and relying solely on scores is no longer meaningful. The real battleground lies in data from real-world development: cycles of iteration, trial and error, and rework.

The stronger the model, the more it costs to acquire the one thing in which humans have not yet been replaced: engineering intuition.

Anthropic pays $280 per task for around 1,000 engineers to cast A/B votes. That seemingly clumsy effort buys exactly this.

Whoever can turn data from real engineering work into something models can learn from holds the ticket to the next stage of AI programming.

Anthropic Hires 1,000 Engineers at $280 per Task to Improve Claude’s Code
