Editor’s Note: This report, based on approximately 400,000 Claude Code sessions, explores how AI programming tools are transforming the relationship between humans and code.

The core finding of the article is that in agent programming, humans primarily determine “what” to do, while Claude primarily handles “how” to do it. Users take responsibility for most planning and decision-making, while Claude handles the majority of execution tasks. In other words, AI is taking over implementation tasks such as writing code, modifying files, running commands, and debugging, but goal setting and outcome evaluation still rely on humans.

More importantly, the effectiveness of Claude Code does not depend solely on whether the user is a programmer. The report shows that non-technical users in fields such as law, finance, management, and research have achieved success rates in code-generating tasks that are nearly comparable to those of software engineers. What truly impacts the outcome is whether the user understands the problem they are trying to solve.

This means that AI programming lowers the barrier to implementation, not the barrier to judgment. In the future, those who understand business, understand context, and can clearly articulate requirements and evaluate outcomes may be better equipped to leverage AI than those who simply know how to write code. AI will not automatically replace domain expertise; rather, it will amplify the value of domain expertise.

The following is the original text:

Key Findings

Building on existing research, we propose a framework for studying interactive agent programming. This framework is based on a privacy-preserving analysis of approximately 400,000 Claude Code sessions between October 2025 and April 2026, evaluating task composition, human-AI collaboration patterns, and task success rates.

In a typical conversation, humans handle most planning decisions (determining “what” to do), while Claude handles most execution decisions (determining “how” to accomplish it). The more expertise a user has in a given domain, the more work Claude performs in response to each instruction. In coding tasks, the average success rate across major professional groups is nearly equivalent to that of software engineers, where success is measured by whether the user’s intended outcome was achieved and supported by verifiable evidence such as passing tests or submitted code.

The stronger a user’s domain expertise, the more likely a conversation will succeed. However, the gap between intermediate and expert users is relatively small. Over the seven months we observed, the proportion of sessions used for debugging nearly halved, with usage shifting toward more end-to-end agent workflows: deploying and running code, analyzing data, and writing non-code documentation.

Over these seven months, the value of typical tasks has increased across nearly all types of work. By comparing with freelance job postings, we estimate an average increase of approximately 25%.

Introduction

Agent-based programming is rapidly gaining momentum. Since late 2025, the proportion of GitHub projects showing agent coding activity has more than doubled, and Claude Code users now average 20 hours of tool usage per week. Can individuals without formal programming experience successfully direct an agent to perform complex technical tasks? How will the rapid adoption and enhanced capabilities of these tools impact knowledge work more broadly? We don’t yet have complete answers, but early signals are emerging from Claude Code usage data.

This report provides evidence of how Claude Code is actually being used, based on privacy-preserving analysis of approximately 235,000 users and around 400,000 interactive sessions between October 2025 and April 2026. It builds upon our prior research into autonomy metrics in Claude Code sessions and how Claude Code is transforming work within Anthropic. This report introduces a framework for describing the use of interactive AI programming assistants: what work people are doing, who is doing it, and whether the work succeeds. We focus on usage of Claude Code via the command-line interface (CLI), Claude.ai, or the Claude Code desktop application. By tracking how agent-based programming usage evolves as model capabilities improve, we can better understand the impact of these tools on the labor markets for professional programmers and knowledge workers.

What’s happening with Claude Code may foreshadow the future of knowledge work: agents are gradually being integrated into non-coding tasks. We’ve found that Claude is handling more complex and higher-value tasks. Meanwhile, a clear division of labor still exists in agent programming: humans decide what to build, and agents decide how to build it.

We also see evidence that domain expertise, rather than programming proficiency, is what truly enhances the effectiveness of these tools. In particular, domain experts are more likely to succeed and recover more easily from errors and misunderstandings. However, the gap between experts and intermediate users is not large, suggesting that with sufficient proficiency in a domain, users can utilize these tools almost as effectively as deep experts.

These findings allow us to observe early signs of potential shifts in the labor market. In our data, success depends on whether a person understands the problem they are trying to solve, not on whether they have received programming training. If these patterns hold across the broader economy, it suggests that agent programming tools, while potentially absorbing some implementation-focused tasks, are simultaneously rewarding those who truly understand the problems inherent in their work. Coding agents are not replacing domain expertise; rather, the more understanding a worker brings to the agent, the more high-quality work the agent can accomplish.

Division of labor

What do people use Claude Code for?

To understand how people use Claude Code, we categorize each session into one of nine work modes, where a mode is the single activity that best describes the session’s goal. Four of these modes directly involve writing or maintaining code: building something new, fixing broken things, testing code, and orchestrating other agents or automation pipelines. A fifth mode is software operations, including deployment, configuration, running pipelines, and monitoring systems. Two additional modes focus more on figuring out “what to do”: understanding how an existing system works, and planning changes before making them. The final two modes cover work that is unrelated to code, or in which code is only a secondary component of the final output: analyzing data, and communicating through presentations and other text-based documentation.

Approximately 56% of sessions consist of writing code (25%), fixing code (26%), or testing and orchestrating code (5%). Software operations account for 17%, planning or exploration for 14%, and analyzing or writing text for 13% (see Figure 1).

We first have the model review the conversation logs and classify each session accordingly; we then use our privacy-preserving analysis tool to cross-validate these classifications against the telemetry data automatically recorded for each session, including whether code lines were added or removed. There is high consistency between the two data sources. For example, in sessions classified by our classifier as creating or modifying code, over 90% also show code changes in the telemetry data. See the appendix for details.
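The cross-check described above can be illustrated with a minimal sketch. It assumes each session carries a classifier label plus telemetry counts of added and removed lines. The field names, the mode labels, and the sample rows are all hypothetical and are not taken from the report's pipeline.

```python
from dataclasses import dataclass

@dataclass
class Session:
    work_mode: str      # label from a (hypothetical) work-mode classifier
    lines_added: int    # telemetry: lines of code added
    lines_removed: int  # telemetry: lines of code removed

# Assumption: these labels stand for "creating or modifying code".
CODE_MODES = {"build", "fix"}

def code_change_agreement(sessions):
    """Share of code-mode sessions whose telemetry shows any code change."""
    code_sessions = [s for s in sessions if s.work_mode in CODE_MODES]
    if not code_sessions:
        return None  # nothing to compare
    changed = sum(1 for s in code_sessions if s.lines_added + s.lines_removed > 0)
    return changed / len(code_sessions)

# Invented sample data, for illustration only.
sample = [
    Session("build", 120, 4),
    Session("fix", 3, 3),
    Session("fix", 0, 0),      # labeled as a fix, but telemetry shows no edits
    Session("analyze", 0, 0),  # not a code mode, so it is ignored
]
agreement = code_change_agreement(sample)
print(agreement)  # 2 of the 3 code-mode sessions show code changes
```

A high value of this ratio is what the text refers to as consistency between the two data sources. It only checks one direction (classifier label implies telemetry change), not the reverse.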

Who makes the decision?

How autonomous is Claude Code? Capability assessments show that its upper limits are already high and continue to rise. For example, in benchmark tests such as METR’s time-horizon evaluation, state-of-the-art models can now autonomously complete software tasks that previously required humans to spend hours, overcoming obstacles along the way. But how does this play out in real-world use? Here, we examine the extent to which humans and Claude each contribute to guiding the conversation in actual interactions.

We examine this issue from two perspectives. First, we assess the extent to which people delegate decisions to Claude; second, we observe how much action they assign to Claude. To understand the division of decision-making within a conversation, we built a privacy-preserving decision attribution classifier based on conversation content. We instructed the classifier to list all meaningful decisions in the conversation and categorize them as planning decisions or execution decisions. Planning decisions include what to do, which approach to adopt, and what constitutes completion; execution decisions include which files to modify, what code to write, which programming language to use, and which commands to run. The classifier then attributes each decision to either Claude or the user, generating two metrics for each conversation: the percentage of planning decisions handled by the user, and the percentage of execution decisions handled by the user.
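As a rough sketch of how the two per-conversation metrics could be computed once decisions have been listed and attributed, consider the following. The `Decision` type, its field names, and the example conversation are assumptions made for illustration. The report does not publish its classifier's output format.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Decision:
    kind: str     # "planning" or "execution"
    made_by: str  # "user" or "claude"

def user_share(decisions: List[Decision], kind: str) -> Optional[float]:
    """Percentage of decisions of the given kind attributed to the user."""
    relevant = [d for d in decisions if d.kind == kind]
    if not relevant:
        return None  # the conversation contains no decisions of this kind
    by_user = sum(1 for d in relevant if d.made_by == "user")
    return 100.0 * by_user / len(relevant)

# Invented example conversation.
conversation = [
    Decision("planning", "user"),     # what to build
    Decision("planning", "user"),     # what counts as done
    Decision("planning", "claude"),   # which approach to adopt
    Decision("execution", "claude"),  # which files to modify
    Decision("execution", "claude"),  # what code to write
    Decision("execution", "claude"),  # which commands to run
    Decision("execution", "user"),    # user dictates a specific command
]
planning_pct = user_share(conversation, "planning")    # 2 of 3 decisions
execution_pct = user_share(conversation, "execution")  # 1 of 4 decisions
```

Averaging these two percentages over many conversations would give figures comparable in kind to the ones reported for Figure 2.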

On average, humans make about 70% of planning decisions but only 20% of execution decisions (see Figure 2). In practical use, agent programming establishes a clear division of labor: humans decide what to build, while agents decide how to build it.

To understand the degree of action delegation in a session, we examine the session structure rather than the content. A Claude Code session consists of back-and-forth interactions between Claude and the user: the user sends a prompt, Claude performs actions; then the user sends the next prompt, and so on. In a typical session, this cycle occurs about four times. Based on our historical data from October to April, each user prompt triggers, on average, about 10 actions by Claude, sometimes exceeding 100 actions. In each round, Claude reads files, edits code, runs commands, and outputs an average of 2,400 words.

The amount of work Claude completes between user checks largely depends on who is making the decisions. When users retain control over the execution process—making over 80% of the execution decisions—Claude performs fewer actions per round, approximately 8. In contrast, when Claude holds planning control—making over 80% of the planning decisions—it performs the highest number of actions, approximately 16.
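The comparison above amounts to filtering sessions by who controls which kind of decision and then averaging actions per prompt. The sketch below uses invented sessions and hypothetical field names. Only the "over 80%" threshold comes from the text.

```python
from statistics import mean

# Invented sessions. Shares are fractions of decisions attributed to each party.
sessions = [
    {"user_exec_share": 0.85, "claude_plan_share": 0.10, "actions": 32, "prompts": 4},
    {"user_exec_share": 0.90, "claude_plan_share": 0.20, "actions": 24, "prompts": 3},
    {"user_exec_share": 0.05, "claude_plan_share": 0.90, "actions": 64, "prompts": 4},
    {"user_exec_share": 0.10, "claude_plan_share": 0.85, "actions": 45, "prompts": 3},
    {"user_exec_share": 0.20, "claude_plan_share": 0.30, "actions": 40, "prompts": 4},
]

CONTROL_THRESHOLD = 0.8  # "over 80%" of decisions, as stated in the text

def mean_actions_per_prompt(rows):
    """Average, over sessions, of Claude actions triggered per user prompt."""
    return mean(r["actions"] / r["prompts"] for r in rows)

user_controls_execution = [
    r for r in sessions if r["user_exec_share"] > CONTROL_THRESHOLD
]
claude_controls_planning = [
    r for r in sessions if r["claude_plan_share"] > CONTROL_THRESHOLD
]

user_led = mean_actions_per_prompt(user_controls_execution)
claude_led = mean_actions_per_prompt(claude_controls_planning)
print(user_led, claude_led)
```

This sketch averages per-session ratios. Dividing total actions by total prompts would weight long sessions more heavily, and the report does not say which convention it uses.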

Expertise level

Based on each session record, Claude evaluates the user’s apparent level of expertise on the task using a five-point scale, ranging from novice to expert. The expertise classifier focuses on three signals: the precision of the user’s instructions, what the user asks Claude to verify, and whether the user frequently corrects Claude or vice versa. It is important to note that this level of expertise is entirely distinct from job title or general ability; crucially, it is task-specific. A senior engineer asking their first question about Rust is still a novice on Rust tasks. An accountant who has never used Python can be an expert on a Python task if they can precisely tell Claude which reconciliation rules the script must follow and identify boundary cases it mishandles during month-end closing.

The table below shows how we define different levels of expertise in our classifier, along with example requests from the publicly available coding agent conversation dataset SWE-chat. Conversations classified as "beginner" provide general instructions without demonstrating domain-specific knowledge; conversations classified as "expert" convey a deep understanding of the codebase and technical environment.

We quantified the relationship between professional expertise and the output volume and number of actions generated per prompt from Claude. In a typical novice conversation, each prompt triggers approximately 5 actions and generates around 600 words; in expert conversations, the chain of actions is more than twice as long—about 12 actions—and the output reaches approximately 3,200 words, five times that of novice conversations (see Figure 3). This gap between novices and experts appears across every type of work and every task value range.

These metrics complement our prior research on Claude Code’s autonomy, which tracked the duration of agent operation and how frequently users automatically approved its actions. In contrast, our decision attribution metrics capture who makes substantive decisions throughout an entire session, while the volume of outputs and number of actions triggered by each prompt measure the extent to which each human instruction elicits autonomous activity from Claude.

Who is using Claude Code, and what are they using it for?

User

To understand who is performing these tasks, we infer each user’s occupation from conversation logs and map it to one of the 23 major categories in the U.S. Bureau of Labor Statistics Standard Occupational Classification (SOC) system. The classifier is instructed to rely solely on the following signals: the context of items loaded at the start of the conversation, file names and structures, references to materials or outputs provided by the user—such as legal documents, clinical data, financial reports, or course materials—and the vocabulary used by the user. The classifier is explicitly instructed not to treat “writing code” alone as evidence that the user is engaged in a programming occupation. A conversation is assigned to a coding-related SOC category—“Computer and Mathematical Occupations”—only when there are clear signals indicating that software or data work constitutes the user’s profession. For example, if a lawyer builds a script to automatically check whether certain clauses are missing from a set of contracts, the conversation will still be classified under the legal profession, even if the primary activity involves writing software. If no signals about the user’s occupation are present, the conversation remains unclassified.

We were able to infer occupations in approximately 70% of conversations. Among these classified conversations, “Computer and Mathematical Occupations” was the largest group, which is not surprising, as this category encompasses most software-related jobs. It was followed by Business and Financial Operations; Arts, Design, and Media; Management; and Life, Physical, and Social Sciences. In our sample, the fastest-growing non-software occupational groups were Management, Sales, and Legal.

Work

From October 2025 to April 2026, the composition of work completed using Claude Code changed significantly. The most notable change was a decline in sessions focused on fixing broken code, which dropped from 33% to 19% (see Figure 4). In its place, activity grew in the work that surrounds code. The share of sessions involving software operations rose from 14% to 21%, and writing and data analysis roughly doubled, from approximately 10% to approximately 20%.

The intrinsic value of tasks is also increasing. We approximate the economic value of each session by estimating the cost of similar work on freelance markets and calibrate using real, publicly available job data. According to this metric, the estimated value of an average session rose by 27% between October and April. This increase occurred across multiple types of tasks: the value of building, operating, and repairing tasks increased by approximately 43%, 34%, and 32%, respectively. These price estimates are approximate, so we primarily use them to compare trends across task types over time rather than as direct dollar values. For details on how the task value estimator was constructed, see the appendix.

Success depends on what the user brings

Estimating the value of tasks is one way to understand how Claude Code helps people get work done. Another perspective is to examine how many sessions succeed and which session characteristics correlate with success. Across all success metrics, we observe a clear pattern: the higher the user’s level of expertise during a session, the greater the likelihood of success. Most of the improvement is concentrated at the lower end of the expertise spectrum—meaning the gap between novice and intermediate users is larger than the gap between intermediate and expert users.

Before analyzing the characteristics of successful sessions, we must precisely define how success is measured. We cannot observe users’ real-world outcomes or directly ask whether they accomplished their intended goals using Claude. Therefore, we rely on two complementary, session-record-based metrics. The first is “judged success,” where a classifier reviews the full session record and determines whether the user achieved their original goal, with options including success, partial success, failure, or no clear goal. Subsequently, two accompanying classifiers assess the strength of evidence for this judgment to determine “validated success.” The success signal classifier looks for verifiable evidence of success, in particular git activity matching the task (such as commits and pull requests), passing test suites, and explicit user acknowledgment. It scores each session on a scale that runs from “no signal” through “weak signal” (1 point) up to “multiple strong signals” (5 points). A parallel failure signal classifier scores evidence of failure, including errors, test failures, repeated attempts at the same task, and user objections to outputs. Validated success requires both conditions to be met: the session is judged as successful, and at least one strong, verifiable success signal is present. The following analysis focuses on the degree of success or failure in sessions; therefore, we exclude sessions classified by the success outcome classifier as “no clear goal,” which account for approximately 7.7% of the full sample.
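The two-condition rule for validated success can be sketched in a few lines. This is an illustration under stated assumptions, not the report's implementation: the report does not say which score counts as a “strong” signal, so the cutoff of 4 below is a guess, as is the use of 0 for “no signal”. All names and sample rows are hypothetical.

```python
from dataclasses import dataclass

# Assumption: a score of 4 or 5 counts as a "strong" signal. The report does
# not state the cutoff, so this threshold is purely illustrative.
STRONG_SIGNAL_MIN = 4

@dataclass
class SessionJudgment:
    outcome: str         # "success", "partial", "failure", or "no_clear_goal"
    success_signal: int  # assumed 0 (no signal) to 5 (multiple strong signals)
    failure_signal: int  # same assumed scale, for evidence of failure

def is_validated_success(j):
    """Judged successful AND backed by at least one strong success signal."""
    return j.outcome == "success" and j.success_signal >= STRONG_SIGNAL_MIN

def success_rates(judgments):
    """Return (validated rate, at-least-partial rate), excluding no-goal sessions."""
    scored = [j for j in judgments if j.outcome != "no_clear_goal"]
    if not scored:
        return None
    validated = sum(is_validated_success(j) for j in scored)
    partial_or_better = sum(j.outcome in ("success", "partial") for j in scored)
    return validated / len(scored), partial_or_better / len(scored)

# Invented sample data.
sample = [
    SessionJudgment("success", 5, 0),        # judged success with strong evidence
    SessionJudgment("success", 1, 0),        # judged success, only weak evidence
    SessionJudgment("partial", 4, 2),        # strong signal, but not a full success
    SessionJudgment("failure", 0, 4),
    SessionJudgment("no_clear_goal", 0, 0),  # excluded from the denominator
]
validated_rate, partial_rate = success_rates(sample)
```

Note how the second row counts toward the lenient metric but not the validated one. This is why validated success rates are much lower than at-least-partial success rates.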

Returns to expertise

So, which sessions are most likely to succeed? The results show that the session-level expertise ratings described above are strongly associated with session success.

Some may worry that expertise level is not the true driving factor—perhaps experts simply chose different tasks or differed in other ways. In this section, we partially address this concern by comparing conversations with the same job type, same estimated value, same month, same topic, and from the same broad occupational group, examining how varying user expertise levels affect the results.
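One simple way to picture a like-for-like comparison is exact-match stratification: group sessions that share the same values of the control variables, then compare expertise levels only within each group. The sketch below uses that technique as a stand-in. The report's own analysis is a regression described in its appendix, and the column names and rows here are invented.

```python
from collections import defaultdict

CONTROLS = ("work_mode", "value_bucket", "month", "topic", "occupation")

def stratified_gap(rows, low="novice", high="expert"):
    """Session-weighted average of within-stratum success gaps (high minus low).

    Strata that lack either expertise level are skipped, because they offer
    no like-for-like comparison.
    """
    strata = defaultdict(lambda: defaultdict(list))
    for r in rows:
        key = tuple(r[c] for c in CONTROLS)
        strata[key][r["expertise"]].append(r["success"])

    weighted_sum, total_weight = 0.0, 0
    for groups in strata.values():
        if low not in groups or high not in groups:
            continue
        rate_high = sum(groups[high]) / len(groups[high])
        rate_low = sum(groups[low]) / len(groups[low])
        weight = len(groups[high]) + len(groups[low])
        weighted_sum += (rate_high - rate_low) * weight
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None

def row(mode, expertise, success):
    # Helper for the invented sample: every control except work_mode is fixed.
    return {"work_mode": mode, "value_bucket": "mid", "month": "2026-01",
            "topic": "web", "occupation": "legal",
            "expertise": expertise, "success": success}

rows = [
    row("build", "novice", 0), row("build", "novice", 1),
    row("build", "expert", 1), row("build", "expert", 1),
    row("fix", "expert", 1),  # no novice counterpart, so this stratum is skipped
]
gap = stratified_gap(rows)
```

Exact matching discards strata without both groups, which a regression would not. It is shown here only because it makes the idea of holding the other variables fixed easy to see.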

Across all success metrics, the higher the user’s demonstrated level of expertise during a session, the more likely the session is to succeed. Sessions rated as novice achieved a validated success rate of 15% and at least partial success in 77% of cases. In contrast, sessions rated as intermediate or higher achieved validated success rates of 28% to 33% and at-least-partial success rates of 91% to 92% (see Figure 5).

In each metric, most of the gains come from the improvement from beginner to intermediate levels; the slope flattens from intermediate to expert. For details of the regression analysis behind Figure 5, see the appendix.

A similar gradient appears in sessions that run into trouble. We consider a session “problematic” when the failure signal classifier records verifiable evidence of failure. Such evidence may include errors, test failures, multiple attempts to accomplish the same task, or users expressing frustration and dissatisfaction. In problematic sessions, after controlling for all the above variables, the validated success rate rises from 4% in novice sessions to 15% in expert sessions (see Figure 5). Using a more lenient success metric, we find that the proportion of at least partial successes is 60% among novice users and 80% to 81% among intermediate to expert users.

We also examined the inverse relationship between expertise level and various failure metrics. Note that in this analysis, sessions classified as failures are those that did not achieve even partial success. If a session encountering issues was deemed a failure and no code lines were written, we labeled it as abandoned. Among sessions where users appeared to be beginners, 19% were ultimately abandoned; in other user groups, this rate ranged from 5% to 7%. In other words, users with the least experience are more likely to abandon their efforts when struggling to achieve their goals. One key value of expertise appears to be the ability to guide the agent back on track.

Occupation may be less important than domain expertise

Users in software-related professions had a validated success rate of approximately 30% across all sessions, compared to about 26% for users in other professions. In sessions that generated code (that is, sessions involving at least one new or modified line of code), these figures were 34% and 29%, respectively (see Figure 6). With a more lenient definition of success, the gap between software-related and other professions narrows further: in code-generating sessions, 89% and 88% of users in the two groups achieved at least partial success. The five-percentage-point gap in validated success is modest, and it remained stable over the seven months even as success rates increased for both groups.

Among the ten largest professional groups in our dataset, each differed from software engineers by no more than seven percentage points in success rate. Management-related professions achieved the highest validated success rate, slightly exceeding that of software engineering roles. The higher success rate among managers may reflect that managerial skills transfer well to the task of directing agents. However, it may also be partially attributable to our measurement approach: validation depends to some extent on explicit user confirmation during the session, and managers may be more accustomed to expressing approval once they achieve their desired outcome.

Outlook

The results of this report paint a picture in which agent-based programming is amplifying certain knowledge and skills while replacing others. In coding-related conversations, success rates across major professions are close to those in software-related roles. It appears that coding agents are reducing the importance of having a programming background for successfully completing programming tasks.

Meanwhile, successful sessions are more likely to demonstrate domain expertise. Sessions rated as expert-level have more than twice the success rate of novice sessions. When encountering problems, novices are several times more likely to give up than other users. The collaboration style itself further clarifies this picture: domain experts can guide Claude to accomplish more with each instruction. Therefore, the ability to steer Claude toward success stems more from mastery of a domain than from coding proficiency. Anyone with this kind of domain mastery can now accomplish technical tasks that were previously out of reach. Those lacking such domain understanding, even when using the same tools, achieve significantly less. Moreover, the primary gains come from competence, not mastery. A practical, operational understanding of a domain is sufficient to capture most of the benefits; deeper specialization adds only marginal additional advantage.

These findings are still preliminary. As with most of our research, we cannot measure real-world outcomes, such as whether code written during a session was later used or discarded, or whether it generated economically valuable results. Additionally, non-interactive usage, which constitutes a significant portion of overall activity, is excluded from this report. Developing a framework to measure such usage is one of the key priorities for future work. Furthermore, all our session classifications rely on the model’s interpretation of session logs. In the appendix, we show that the classifier aligns with independent telemetry data in the expected direction and agrees with strong reference models on most sessions. However, validating the classifier at scale remains challenging; Claude Code sessions themselves add difficulty, as they may be too long and complex for human annotators to label reliably as ground truth.

As models, users, and the division of labor between them continue to evolve, the picture presented in this report will keep changing. We hope these metrics will help us track significant shifts underway. For example, if the returns from domain expertise begin to decline, it would indicate that models are increasingly providing the critical judgment currently offered by users, extending the benefits of these tools beyond domain experts to a broader population. If the proportion of users outside software professions successfully completing coding sessions continues to rise, it may signal that software production is becoming a routine part of work across disciplines, rather than the exclusive domain of a single profession. These shifts will alter who benefits from agent-based programming, how much they benefit, and which skills the labor market values most.