
Novices quit. Experts adapt. 400k Claude Code sessions say so.

Anthropic's 398k-session study finds Claude Code widens expertise advantage — 15% novice success vs. 33% for experts.

Jun 23, 2026

For three years the promise of AI coding tools has been leveling: hand a beginner an agent and they ship like a senior. Anthropic just analyzed roughly 398,000 Claude Code sessions and found close to the opposite.

Does Claude Code Narrow the Novice-Expert Divide?

No. On the largest evidence available, agentic coding widens the gap rather than closing it. Anthropic's June 16, 2026 study "Agentic coding and persistent returns to expertise" examined 398,198 interactive sessions from 234,751 people between October 2025 and April 2026, and concluded that Claude Code raises the ceiling for people who already know what they are doing while leaving novices behind. The scarce input does not disappear; it moves.

Quick Answer: No. Anthropic's analysis of 398,198 Claude Code sessions (Oct 2025–Apr 2026) found agentic coding widens the expert advantage rather than flattening it. Users still make ~70% of "what to do" decisions while Claude handles ~80% of "how to do it", so the leverage favors whoever already understands the problem.

What changes is the bottleneck. As Claude absorbs more of the implementation work, raw hands-on coding ability matters less and specification, oversight, and domain knowledge matter more, and those are exactly the skills that already correlate with seniority. The tool democratizes typing code, not the judgment about which code is worth writing.

The clearest signal is the division of cognitive labor. Across the dataset, users make about 70% of the planning decisions ("what to do") while Claude makes about 80% of the execution decisions ("how to do it"). A typical session runs about four user turns, with each prompt triggering roughly 10 Claude actions and around 2,400 words of output. The human stays in the loop as the director; Claude operates an observe→think→act harness until the prompt's completion criteria are met. That split structurally rewards anyone who can frame a problem precisely and verify the result.

"Agentic coding tools do not flatten the gap between beginners and veterans — they widen it, with the scarce input shifting from hands-on coding ability toward task and domain expertise." — Anthropic, Agentic coding and persistent returns to expertise (source: Anthropic Research, 2026-06)

This reframes the rest of the data. If you already understand a domain, Claude Code is a force multiplier; if you don't, it is a faster way to produce confident-looking output you can't evaluate. The sections that follow break down exactly where that gap appears — in how experts prompt, how often sessions complete, and why occupation barely predicts success.

Who Directs, Who Implements: The 70-80 Labor Split

The division of labor inside Claude Code is measurable: users make roughly 70% of planning decisions ("what to do") while Claude makes about 80% of execution decisions ("how to do it"). You stay responsible for scope, constraints, and acceptance criteria; the agent owns the implementation path. A typical session runs about four user turns, and each prompt triggers roughly 10 Claude actions and about 2,400 words of output on average. The human rarely types the implementation; they define it and check it.

Mechanically, Anthropic frames Claude Code as an agent harness running an observe→think→act loop locally, iterating until the prompt's completion criteria are met while keeping the developer in the loop. That loop is why a single turn fans out into roughly ten tool calls: the agent reads files, edits, runs tests, and re-checks without waiting for you between each step. Your leverage is front-loaded into the prompt that opens the loop and the verification that closes it.
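The loop described above can be sketched in a few lines of Python. This is a toy illustration only, not Claude Code's implementation: `run_turn`, `TurnResult`, the four callbacks, and the `max_actions` cap are all hypothetical names chosen for the example. It shows how one prompt can fan out into several actions, and how the completion check decides when control returns to the user.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TurnResult:
    actions: List[str] = field(default_factory=list)  # actions taken this turn
    done: bool = False                                 # completion criteria met?


def run_turn(
    observe: Callable[[], str],
    think: Callable[[str], str],
    act: Callable[[str], object],
    is_done: Callable[[], bool],
    max_actions: int = 50,  # hypothetical safety cap, not a documented setting
) -> TurnResult:
    """Run one user turn: observe, think, act, repeat until done or capped."""
    result = TurnResult()
    for _ in range(max_actions):
        if is_done():               # completion criteria from the prompt
            break
        state = observe()           # e.g. read files or test output
        action = think(state)       # choose the next step
        act(action)                 # e.g. edit a file or run the tests
        result.actions.append(action)
    result.done = is_done()
    return result


# Toy usage: three failing tests, and each action fixes one of them.
failing = ["test_a", "test_b", "test_c"]
result = run_turn(
    observe=lambda: f"{len(failing)} failing",
    think=lambda state: f"fix {failing[0]}",
    act=lambda action: failing.pop(0),
    is_done=lambda: not failing,
)
print(len(result.actions), result.done)  # 3 True
```

In this sketch a precise `is_done` plays the role of the prompt's acceptance criteria: the loop keeps working without the user until the check passes or the cap is hit.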

Decision type	Who decides	Typical footprint
Planning ("what to do")	~70% user	~4 user turns per session
Execution ("how to do it")	~80% Claude	~10 actions, ~2,400 words per prompt

The practical consequence is that prompt quality — how precisely you scope the task, which constraints you impose, and what verification you demand — becomes the primary lever for both output volume and correctness. Typing speed is no longer the bottleneck; specification is. A vague prompt still launches the same observe→think→act loop, but with looser completion criteria, so the agent produces more code you then have to evaluate yourself. A tightly framed prompt lets the loop run longer and further before it needs you again, because the success condition is unambiguous. That asymmetry — same harness, very different yield depending on who is driving — is what the next section quantifies through the actions and output gap between novice and expert sessions.

How Expert Instructions Differ From Novice Ones: 12 Actions vs. 5

Expert prompts make the agent do roughly 2.4× as much work and produce more than 5× the output per turn compared with novice prompts. In novice-rated sessions, each prompt triggered about 5 Claude actions and roughly 600 words of output; in expert-rated sessions, about 12 actions and roughly 3,200 words. The harness and the observe→think→act loop are the same in both cases; what differs is how the prompt is framed.

Anthropic did not ask users to self-report skill. It rated apparent task-specific expertise on a five-point novice-to-expert scale using a Claude Sonnet 4.6 classifier, inferred from three behavioral signals: how precisely users framed instructions, what they explicitly asked Claude to verify, and the direction of corrections (whether the user corrected Claude, or Claude corrected the user). A prompt that names the file, the constraint, and the acceptance test reads as expert; a vague "fix this" with the user later walking back Claude's output reads as novice.

The per-prompt gap by expertise level looks like this:

Signal per prompt	Novice session	Expert session
Claude actions triggered	~5	~12 (≈2.4×)
Words of output	~600	~3,200 (>5×)

The raw counts could be confounded: experts might just take on bigger tasks, or work in heavier languages, or arrive after a model upgrade. So Anthropic ran a controlled regression to isolate the expertise effect on its own. After controlling for work mode, task value, month, occupation, and model family, each additional level of expertise independently predicted about +9% Claude actions and +13% output, significant at p < 0.001. The effect survives the controls, so it is not simply an artifact of experts picking larger jobs.
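To put the raw and controlled figures side by side, the short calculation below compares the unadjusted expert-to-novice ratios with what the per-level regression effects would imply across the whole scale. It rests on two assumptions the article does not confirm: that the per-level effects compound multiplicatively, and that "novice" and "expert" sit four steps apart on the five-point scale.

```python
# Per-prompt averages quoted in the article.
novice = {"actions": 5, "words": 600}
expert = {"actions": 12, "words": 3200}

# Unadjusted expert-to-novice ratios.
raw_ratio = {k: expert[k] / novice[k] for k in novice}

# Assumption: the regression effects (+9% actions, +13% output per level)
# compound multiplicatively over the four steps from level 1 to level 5.
per_level = {"actions": 0.09, "words": 0.13}
steps = 4
controlled_ratio = {k: (1 + v) ** steps for k, v in per_level.items()}

for k in novice:
    print(f"{k}: raw {raw_ratio[k]:.2f}x, controlled {controlled_ratio[k]:.2f}x")
# actions: raw 2.40x, controlled 1.41x
# words: raw 5.33x, controlled 1.63x
```

If those assumptions hold, the expertise effect stays clearly positive after the controls but is smaller than the raw gap, which would mean part of the raw difference travels with the controlled factors such as task value and work mode.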

The mechanism is delegation depth. Experts can let the loop run longer and more autonomously because they specify constraints up front, hand Claude a checkable success condition, and steer with corrections rather than rewrites, so the agent does more per turn before it has to stop and ask. This is consistent with Anthropic's separate autonomy work, where Claude Code's 99.9th-percentile turn duration nearly doubled from under 25 minutes to over 45 minutes between October 2025 and January 2026, and experienced users with around 750 sessions ran full auto-approve in over 40% of sessions versus roughly 20% for new users. Longer leashes are something developers earn through better specification and verification habits, not something the model hands out by default.

Completion Probability by Expertise Bracket — and the Abandonment Cliff

Completion probability rises with rated expertise, and the steepest gains arrive early. Under Anthropic's strictest "verified success" measure (a session backed by hard evidence such as passing tests, matching git activity, pull requests, or explicit user affirmation), only about 15% of novice sessions succeed, against roughly 28% for intermediate and 33% for advanced/expert users. The bulk of the lift comes from moving novice to intermediate (about 13 points), not intermediate to expert (about 5 points).

The pattern softens on a looser bar. "At least partial success" climbs from about 77% for novices to 91–92% for intermediate users and above. In other words, most experienced users get something usable out of nearly every session; the expertise premium shows up mainly when you demand provable, finished work.

Expertise level	Verified success	At least partial success
Novice	~15%	~77%
Intermediate	~28%	91–92%
Advanced / expert	~33%	91–92%

Source: Anthropic, "Agentic coding and persistent returns to expertise", 2026-06.
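The two success bars can be expressed as simple predicates over a session record. The sketch below is an illustration under stated assumptions, not Anthropic's classifier: the `SessionEvidence` fields and the function names are hypothetical, and the only thing taken from the article is the list of evidence types that count toward "verified success".

```python
from dataclasses import dataclass


@dataclass
class SessionEvidence:
    # Hypothetical outcome labels for one session.
    judged_success: bool = False
    partial_output: bool = False
    # Evidence types the article lists for "verified success".
    tests_passed: bool = False
    git_activity_matches: bool = False
    pull_request_opened: bool = False
    user_affirmed: bool = False


def has_hard_evidence(e: SessionEvidence) -> bool:
    """True if at least one detectable signal backs the outcome."""
    return any(
        [e.tests_passed, e.git_activity_matches, e.pull_request_opened, e.user_affirmed]
    )


def is_verified_success(e: SessionEvidence) -> bool:
    """Strict bar: judged successful AND backed by hard evidence."""
    return e.judged_success and has_hard_evidence(e)


def is_at_least_partial(e: SessionEvidence) -> bool:
    """Loose bar: fully or partially successful, with or without evidence."""
    return e.judged_success or e.partial_output


quiet_win = SessionEvidence(judged_success=True)  # worked, but left no trail
tested_win = SessionEvidence(judged_success=True, tests_passed=True)
print(is_verified_success(quiet_win), is_at_least_partial(quiet_win))    # False True
print(is_verified_success(tested_win), is_at_least_partial(tested_win))  # True True
```

The `quiet_win` case shows why the strict measure can undercount: a session that worked but left no tests, commits, pull request, or affirmation clears only the looser bar.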

The sharpest divide is recovery. Among troubled sessions (those that go off the rails mid-task), verified success rises from about 4% for novices to roughly 15% for experts. That near-fourfold gap is the starkest signal in the dataset: when an agent run goes wrong, knowing how to diagnose, constrain, and re-steer it is what separates a salvaged session from a dead one.

Then there is the abandonment cliff. Roughly 19% of troubled novice sessions are abandoned with zero lines of code written, versus just 5–7% for intermediate users and above. Novices who hit trouble are roughly three to four times as likely to walk away empty-handed: not with broken code to debug, but with nothing at all.
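The multiples quoted for troubled sessions follow directly from the percentages. As a quick check, the snippet below recomputes them from the rounded figures in the text; the variable names are illustrative, and because the inputs are approximate the ratios are too.

```python
# Rounded percentages for troubled sessions, as quoted in the article.
novice_abandoned = 19             # % abandoned with zero lines of code
others_abandoned = (5, 7)         # % range for intermediate users and above
novice_recovered = 4              # % reaching verified success
expert_recovered = 15

# How many times as likely novices are to abandon, at each end of the range.
abandon_ratios = [novice_abandoned / pct for pct in others_abandoned]
# How many times as likely experts are to recover a troubled session.
recovery_ratio = expert_recovered / novice_recovered

print(", ".join(f"{r:.2f}x" for r in abandon_ratios))  # 3.80x, 2.71x
print(f"{recovery_ratio:.2f}x")                         # 3.75x
print(expert_recovered - novice_recovered, "points")    # 11 points
```

The recovery gap is large as a ratio but modest in absolute terms: even expert-rated users verifiably rescue only about one troubled session in seven.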

"Claude Code raises the ceiling for competent users but does not remove the need for judgment," Anthropic concludes, noting that recovery from troubled sessions is where the novice-to-expert gap widens most (source: Anthropic, 2026-06).

The practical read for teams: the agent reliably produces partial output for almost anyone, but converting that into verified, mergeable work — and rescuing runs that derail — is a learned skill. Onboarding novices with explicit training in specification and verification likely moves the needle more than any model upgrade.

Field Knowledge as the Lever: Why Occupation Barely Predicts Completion

Subject-matter knowledge, not a software credential, is what separates a finished artifact from an abandoned one. Anthropic inferred occupation in about 70% of sessions using U.S. Bureau of Labor Statistics major occupation groups, and explicitly instructed the classifier not to treat the act of writing code as proof that the user was a software professional. That guardrail matters, because it lets the data answer a sharper question: once someone understands their own problem well enough to direct and check the agent, does their job title still predict whether they ship?

It barely does. In code-producing sessions, users in software and math occupations reached verified success about 34% of the time, versus about 29% for everyone else, a spread of only five percentage points. On the looser "at least partial success" measure the gap nearly vanishes: 89% for software/math against 88% for non-software occupations. More striking, every one of the ten largest inferred occupation groups (including managers, analysts, scientists, designers, and legal and business roles) landed within about seven percentage points of software/math users on verified success. The occupational hierarchy that governs hiring for engineering roles does not reproduce itself in completion rates.

The mechanism is the same one that distinguished expert from novice prompting: people who know their own domain specify constraints precisely, ask the agent to verify the things that actually matter, and catch wrong output because they recognize it. A lawyer who understands the edge cases in a contract-review workflow, or an analyst who knows where a forecasting model breaks, can direct Claude to a correct artifact at rates close to a software professional — not because they write better code, but because they frame and check the work better.

"The scarce input is shifting from hands-on coding ability toward task and domain expertise — people still mostly decide what to build, while Claude mostly decides how to implement it." — Anthropic, Agentic coding and persistent returns to expertise (source: Anthropic, 2026-06).

The practical implication for technical founders is that the hiring filter loosens on one axis and tightens on another. You can hand implementation to a domain expert who has never shipped production code, provided they can articulate requirements and judge correctness. What you cannot delegate is the understanding of the problem itself — that remains the binding constraint, and it is exactly what an agent harness cannot supply.

From Repair-Heavy to Construction-Heavy: How the Activity Mix Shifted

Over Anthropic's six-month window, Claude Code sessions shifted measurably from fixing things to building and operating them. Fixing broken code fell from about 33% of sessions in October 2025 to roughly 19% by April 2026, while operating software climbed from about 14% to 21%, and writing plus data analysis roughly doubled from about 10% to 20%. The agent harness is being pointed at construction and runtime work, not just cleanup.

Averaged across the full corpus, the mix is still code-dominated but broader than a debugging tool. Anthropic classified about 56% of sessions as code work (25% writing or building, 26% fixing, and 5% testing or orchestrating), alongside 17% operating software, 14% planning or exploring, and 13% analysis or prose. The fixing-to-building ratio inverts over time even though both remain large, which tracks the earlier expertise pattern: as users learn to specify and verify, they delegate net-new construction rather than firefighting the agent's earlier output.

Task category	Oct 2025	Apr 2026	Direction
Fixing broken code	~33%	~19%	Down ~14 pts
Operating software	~14%	~21%	Up ~7 pts
Writing + data analysis	~10%	~20%	Roughly doubled
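The shifts in the table can be read either as percentage-point changes or as relative changes in share, and the two tell slightly different stories. The snippet below computes both from the rounded figures, and also checks that the corpus-wide category shares quoted earlier add up; the dictionary names are illustrative.

```python
# Share of sessions (%) in October 2025 and April 2026, from the table.
shares = {
    "Fixing broken code": (33, 19),
    "Operating software": (14, 21),
    "Writing + data analysis": (10, 20),
}

changes = {}
for task, (oct_2025, apr_2026) in shares.items():
    points = apr_2026 - oct_2025           # percentage-point change
    relative = points / oct_2025 * 100     # change relative to the starting share
    changes[task] = (points, relative)
    print(f"{task}: {points:+d} pts, {relative:+.0f}% relative")
# Fixing broken code: -14 pts, -42% relative
# Operating software: +7 pts, +50% relative
# Writing + data analysis: +10 pts, +100% relative

# Corpus-wide averages (%) quoted in the text.
code_work = {"writing or building": 25, "fixing": 26, "testing or orchestrating": 5}
other_work = {"operating software": 17, "planning or exploring": 14, "analysis or prose": 13}
print(sum(code_work.values()), sum(code_work.values()) + sum(other_work.values()))  # 56 100
```

In point terms fixing moved the most, while in relative terms writing plus data analysis did, which is why "roughly doubled" and "down ~14 pts" are both fair summaries of the same table.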

The composition change lines up with a value signal. Anthropic's relative task-value proxy, benchmarked against freelance-marketplace postings rather than literal dollars, rose about 27% on average across the window, with building tasks up roughly 43%, operating up about 34%, and fixing up about 32%. Anthropic explicitly cautions against reading the dollar figures literally, so treat these as directional weightings rather than revenue estimates. The pattern is internally consistent: the work that grew fastest in share (building and operating) is also the work whose proxy value rose most.

Separate Economic Index reporting frames the same shift at a higher altitude, noting heavy speedups by task complexity (roughly 9x for high-school-level tasks and about 12x for college-degree-level tasks) and coding increasingly migrating into programmatic and agentic API workflows. For a developer or technical founder, the practical read is that Claude Code is maturing from a remediation aid into a general construction and operations surface. The migration from repair-heavy to build-heavy usage is what you would expect once users trust the agent enough to start things, not only patch them, a trust curve documented elsewhere in Anthropic's autonomy work.

Methodology Limits: Observational, Not Causal

Every headline figure in this study rests on machine-generated labels, not human annotation, a constraint that shapes how confidently you should read the expertise gradient. Anthropic classified all 398,198 sessions with Claude Sonnet 4.6 rather than having researchers read transcripts, so the novice-to-expert ratings and inferred occupations carry measurement error. The company says it cross-checked labels against telemetry and applied privacy-preserving aggregation thresholds, but a classifier inferring "expert" from how precisely someone phrases a prompt is still a proxy, not ground truth.

The deeper limit is design: this is observational, not a randomized experiment. It cannot show that Claude Code causes productivity gains, because the correlation between rated expertise and outcomes may reflect selection: people who already specify, constrain, and verify well may also be the ones who reach for harder tasks. Anthropic is explicit that the work describes usage patterns, not causal effect. As the study frames its own boundary:

"Our analysis is observational and cannot establish whether Claude Code causes productivity improvements," — Anthropic Economic Index team (source: Anthropic, 2026-06).

"Verified success" inherits a third blind spot. The strictest measure only counts wins it can detect — passing tests, matching git activity, pull requests, or explicit user affirmation . A developer who quietly ships working code without a commit trail, or who abandons a session that was actually fine, is invisible to that lens. The study also does not observe whether generated code is later kept, reverted, or creates real economic value, and its dollar estimates are proxied against an imperfect freelance-marketplace benchmark .

Finally, scope. The dataset deliberately excludes non-interactive headless claude -p runs, SDK usage, and third-party IDE integrations, which are precisely the channels where a lot of professional, automated production work likely happens. For a technical founder, that means the sample skews toward hands-on terminal sessions and probably underweights the agentic, API-driven workflows that Anthropic's broader Economic Index reporting flags as a growing share of coding activity. The directional findings hold; the exact magnitudes deserve caution.

The METR Discrepancy: Why Two Studies Point in Opposite Directions

If Claude Code makes capable users faster, why did a respected randomized trial find the opposite? In early 2025, METR ran a randomized controlled trial in which 16 experienced open-source developers completed 246 tasks on repositories they already knew well, and they worked about 19% slower with AI tools, even though they expected a 24% speedup beforehand and still perceived a 20% speedup afterward. That is a real, measured slowdown, and it sits squarely against Anthropic's June 2026 finding that experts delegate longer chains and finish more often.

The two results are less contradictory than they look once you line up the experimental populations. They differ on three axes that plausibly flip the sign of the productivity effect:

Repo familiarity. METR studied experts on mature codebases they had deep context on, the regime where a human already holds the mental model an agent has to reconstruct from scratch.
Tooling maturity. METR used early-2025 AI tooling; Anthropic's 398,198 sessions span October 2025 to April 2026, a window of markedly more capable agentic harnesses.
Problem novelty and population. Anthropic captured diverse occupations tackling problems that were often novel to the user, where the agent's breadth offsets the user's missing implementation detail rather than competing with established expertise.

Read together, the likely reconciliation is contextual: a familiar, mature codebase plus early-2025 tooling produces a slowdown, while novel-to-user problems plus post-October-2025 agentic tooling may produce a speedup, although the Anthropic study itself makes no causal claim. As Anthropic's researchers frame it in the study, "agentic coding tools do not flatten the gap between beginners and veterans — they widen it," with the scarce input shifting from hands-on coding toward task and domain expertise. One plausible reading is that METR's experts were slowed because, on home turf, their own fluency was the asset the tool partly displaced.

The robust conclusion both studies share is the one that should anchor your decision: AI assistance does not flatten expertise. The magnitude — and even the sign — of the productivity effect depends heavily on repo familiarity and tooling generation. The concrete takeaway for a technical founder: don't generalize from a single benchmark. Pilot agentic coding on greenfield or unfamiliar work where breadth pays off, keep a human with deep context steering changes to your mature core, and re-measure as tooling advances — because in this field, last year's result expires fast.

Frequently asked questions

Why do experts get so much more out of Claude Code than novices?

Experts specify tasks more precisely, constrain scope, know exactly what to verify, and correct Claude more effectively, which lets the agent run longer autonomous chains without drifting off-task. The scarce skill is directing and checking the agent, not hand-writing the implementation. Anthropic's data shows it concretely: expert-rated prompts triggered about 12 Claude actions and roughly 3,200 words of output, versus about 5 actions and 600 words for novices. A controlled regression isolates the effect at roughly +9% actions and +13% output per expertise level, significant at p < 0.001.

What counts as "verified success" in Anthropic's Claude Code study?

"Verified success" is the study's strictest measure: a session judged successful and backed by hard evidence — passing tests, matching git activity, a pull request, or explicit user confirmation. Under that bar, only about 15% of novice sessions qualify, versus roughly 28% for intermediate and 33% for advanced/expert users . The looser "at least partial success" measure is far less discriminating, climbing from about 77% for novices to 91–92% for everyone above them .

Can a non-developer produce working code with Claude Code?

Yes. Inferred occupation barely predicts completion; domain knowledge, not coding pedigree, is the dominant lever. In code-producing sessions, software/math occupations reached verified success about 34% of the time versus about 29% for non-software occupations, and every one of the ten largest inferred occupation groups landed within about seven percentage points of software/math users on verified success. Lawyers, managers, analysts, and scientists can ship software artifacts when they understand the problem well enough to direct and verify the agent.

Does Claude Code make development objectively faster?

The Anthropic study cannot answer that. It is observational, not a randomized experiment, so it makes no causal productivity claim, and it does not track whether generated code is later kept or reverted. Pointing the other way, METR's 2025 randomized controlled trial found 16 experienced open-source developers completed 246 tasks about 19% slower with early-2025 AI tools, despite expecting a 24% speedup. Context (repo familiarity, tooling maturity, and problem novelty) dominates the outcome.

How was expertise measured in the Anthropic Claude Code study?

A Claude Sonnet 4.6 classifier rated each session on a five-point novice-to-expert scale; the transcripts were not read by humans. The rating was inferred from three signals: how precisely the user framed instructions, what the user asked Claude to verify, and the direction of corrections (whether the user corrected Claude or Claude corrected the user). Because the labels are model-generated, they carry measurement error, though Anthropic says it cross-checked them against telemetry and applied privacy-preserving aggregation thresholds across the roughly 400,000 sessions.