In July 2025, a small research organisation called METR ran a study that nobody in the AI industry wanted as their summer reading. Sixteen experienced open-source developers were given 246 real tasks on codebases they had each contributed to for years. The tasks were randomly split: half completed with state-of-the-art AI assistance available, half without. Before the experiment began, the developers predicted on average that AI would make them twenty-four percent faster. After the experiment ended, they still believed they had been twenty percent faster. The measured result was that they were nineteen percent slower.¹

That gap — between perceived productivity and measured productivity, in the same engineer on the same day — is the single most important data point in the AI-coding conversation, and it points to why the question "does AI coding work?" has resisted a clean answer for eighteen months. The honest reading is that it works, and it doesn't, and the two are very easy to confuse.

Developers expected to be 24% faster. They felt 20% faster. They were 19% slower.
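To make the three figures concrete, here is a minimal sketch that applies them to a single hypothetical task. It assumes each percentage is read as a change in completion time relative to working without AI. The 60-minute baseline and the function name are illustrative and do not come from the study.

```python
def completion_minutes(baseline_minutes: float, time_change: float) -> float:
    """Apply a fractional change in completion time (negative means faster)."""
    return baseline_minutes * (1.0 + time_change)


BASELINE = 60.0  # hypothetical task length without AI, in minutes (assumption)

expected = completion_minutes(BASELINE, -0.24)   # forecast before the study
perceived = completion_minutes(BASELINE, -0.20)  # self-report after the study
measured = completion_minutes(BASELINE, +0.19)   # what the clock recorded

# Distance between how long the task felt and how long it actually took.
perception_gap = measured - perceived

print(f"expected  {expected:.1f} min")
print(f"perceived {perceived:.1f} min")
print(f"measured  {measured:.1f} min")
print(f"gap       {perception_gap:.1f} min")
```

Under those assumptions, a task that takes an hour unaided was expected to take about 45.6 minutes, felt like 48, and took about 71.4, so the developer's impression was off by roughly 23 minutes on a one-hour task.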

This essay is an attempt to assemble the public evidence on what AI code generation actually does to engineering output.

What the controlled evidence says

Fig. 1 The METR finding. Developers expected +24%, felt +20%, were measured at −19%. The gap between perception and measurement is the article's spine.

The first thing the public evidence makes clear is that controlled studies of AI coding tools disagree with each other — and that the disagreement is structured, not random.

On one side, GitHub's own randomised study, published in 2022 and replicated in expanded form in 2024, found that developers using Copilot completed a constrained scripting task fifty-five percent faster than a control group.² The finding has been widely cited and is real. The constraints matter: the task was bounded, the codebase was greenfield, the developers were not deeply familiar with the surrounding system. In conditions that approximated a clean exercise, the productivity gain was substantial.

On the other side, the METR result and the DORA 2024 State of DevOps report point in the opposite direction. DORA, the longest-running quantitative study of software delivery, found that increased AI adoption among surveyed teams correlated with a 1.5 percent decrease in delivery throughput and a 7.2 percent decrease in delivery stability for every twenty-five percent increase in AI tool use.³ The DORA authors were careful: correlation is not causation, and AI adoption may track other factors that depress throughput. The headline finding nevertheless survives scrutiny: among the teams they measured, more AI was associated with worse, not better, delivery outcomes.
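The DORA figures above are stated per twenty-five percent step of adoption, which makes them easy to misread. The sketch below shows the arithmetic only. It assumes the association scales linearly and that the step is a 25-point rise in adoption; neither assumption is confirmed here, and the result describes an association, not a predicted effect.

```python
# Figures as quoted from the DORA 2024 report in the text above.
THROUGHPUT_CHANGE_PER_STEP = -1.5  # percent change per adoption step
STABILITY_CHANGE_PER_STEP = -7.2   # percent change per adoption step
ADOPTION_STEP = 25.0               # size of one step (assumed to be points)


def associated_change(adoption_increase: float) -> tuple[float, float]:
    """Return (throughput %, stability %) associated with an adoption rise.

    Linear scaling is an illustrative assumption, not a claim by DORA.
    """
    steps = adoption_increase / ADOPTION_STEP
    return (THROUGHPUT_CHANGE_PER_STEP * steps, STABILITY_CHANGE_PER_STEP * steps)


throughput, stability = associated_change(50.0)
print(f"throughput {throughput:+.1f}%, stability {stability:+.1f}%")
```

The point of writing it out is the asymmetry: on these numbers the association with stability is almost five times larger than the association with throughput.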

The reconciliation between these results is now reasonably well understood. AI coding tools deliver large measurable gains on tasks that are bounded, well-specified, and detached from a deep system context. They deliver flat or negative gains when the task is the opposite: unbounded, ambiguous, deeply entangled with an existing codebase and the developer's accumulated knowledge of it. Most enterprise software work, particularly in regulated industries, sits closer to the second category than the first.

A useful corrective is Anthropic's 2026 Agentic Coding Report, which surveyed engineering teams running Claude Code in production. The report found that AI was being used in roughly sixty percent of coding work, but that the share of work being fully delegated to AI sat between zero and twenty percent.⁴ The gap between assisted and delegated is where the entire operational question now lives.

Where the evidence converges

Across studies, certain categories of work emerge repeatedly as durable wins.

Boilerplate and structural transformation. Generating Terraform modules, REST or GraphQL endpoint scaffolding, data class mappings, configuration files, test fixtures, build configuration — work that is well-specified, pattern-following, and tedious to produce by hand. The output requires review, but editing a near-correct draft is consistently faster than typing from scratch, and the productivity effect compounds when the work is repetitive.

Code exploration and navigation. Developers, particularly those new to a codebase, use AI tools to ask "what does this service do" or "where is this exception handled" and receive useful starting answers. This is closer to a faster grep than to authorship. Atlassian's 2024 developer experience research found that the largest reported time savings from AI tools were in code understanding, not code production — a finding consistent with the broader pattern.⁵

First-pass test generation. AI tools produce serviceable initial test suites for pure functions and data transformations. The catch, which the literature is now clear on, is that test coverage produced by AI is not equivalent to test coverage written with judgement. Coverage is generated; whether the tests assert the correct behaviour remains a human question.

Acceleration of legacy migration. Several large-scale industry case studies — Airbnb's documented migration of its testing infrastructure, Google's published reports on AI-assisted code modernisation, and others — describe AI tools meaningfully compressing structural transformation work where the source and target patterns are well-known. The effect is repeatable and measurable.

Where it converges on real failures

The same body of evidence is unambiguous about where AI coding tools currently underperform.

Complex domain logic. Anything where correctness depends on understanding why and not just what — financial rounding rules, cross-border tax handling, regulatory constraints, clinical decision logic, insurance underwriting rules — produces outputs that look plausible and pass casual review but fail in the cases that matter. Published incident reports from 2024 and 2025 trace production defects to AI-generated code that handled the canonical case correctly and the regulatory edge case incorrectly.⁶ The dangerous property is not that the code is wrong; it is that the code looks right.

Architectural and cross-system decisions. AI tools operate effectively at the file and function level, less effectively at the service boundary, and poorly at the architectural level. They do not have a working model of why a system was designed with the boundaries it has, which scaling constraints shaped it, or which decisions were trade-offs against future flexibility. When developers lean on AI for structural decisions, the result is locally reasonable code that is globally incoherent — a pattern visible in published post-mortems and increasingly common in code-review datasets.

Cumulative codebase health. A 2024 study by GitClear, analysing roughly 153 million changed lines of code across an industry sample, reported a measurable rise in code churn — code added and then reverted or substantially rewritten within two weeks — coinciding with widespread AI tool adoption.⁷ The finding is correlational and contested, but it points at something the qualitative evidence also suggests: AI tools make adding code cheaper than removing it, and the long-term effect on codebase complexity is not yet a settled question.
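The churn definition used above, code added and then reverted or substantially rewritten within two weeks, can be sketched as a simple ratio. This is not GitClear's methodology, which is not reproduced here. The `LineChange` record and its fields are hypothetical, and a real pipeline would have to derive them from version-control history.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

CHURN_WINDOW = timedelta(days=14)  # "within two weeks", per the text above


@dataclass(frozen=True)
class LineChange:
    """One added line and, if it happened, the date it was reworked."""
    added_on: date
    reworked_on: Optional[date]  # reverted or rewritten; None if untouched


def churn_rate(changes: list[LineChange]) -> float:
    """Share of added lines that were reworked inside the churn window."""
    if not changes:
        return 0.0
    churned = sum(
        1
        for change in changes
        if change.reworked_on is not None
        and change.reworked_on - change.added_on <= CHURN_WINDOW
    )
    return churned / len(changes)


sample = [
    LineChange(date(2024, 3, 1), date(2024, 3, 5)),   # reworked after 4 days
    LineChange(date(2024, 3, 1), date(2024, 4, 20)),  # reworked after 50 days
    LineChange(date(2024, 3, 2), None),               # never touched again
    LineChange(date(2024, 3, 3), date(2024, 3, 17)),  # reworked on day 14
]
print(churn_rate(sample))  # 0.5
```

Tracking this ratio over time, rather than reading any single value, is what would show whether adding code has become cheaper than maintaining it.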

The hidden tax

The unbudgeted cost of AI-generated code is review, and the literature on it has thickened considerably in the past year.

When a developer submits a pull request containing AI-generated code, the reviewer has less context than they would for hand-written code, and the author has less context than they normally would about their own submission. Practitioner studies from Faros AI and Codacy in 2025 have documented an increase in review cycle time on pull requests containing significant AI-generated content, with the magnitude varying widely by team and language.⁸ The exact percentage is contested. The direction is not.

The value of an AI coding tool in production is bounded above by the quality of the review system surrounding it.

Teams that have adopted AI coding tools successfully report having tightened, not loosened, their review and CI practices around them — stricter linting, mandatory annotation of AI-generated sections, smaller and more frequent pull requests, stronger pre-merge testing.
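A pre-merge gate along these lines might look like the sketch below. Everything in it is an assumption for illustration: the `PullRequest` fields, the 400-line threshold, and the messages are hypothetical, and a real check would read these values from the team's CI system.

```python
from dataclasses import dataclass

MAX_CHANGED_LINES = 400  # hypothetical team threshold for "small" pull requests


@dataclass(frozen=True)
class PullRequest:
    changed_lines: int
    contains_ai_code: bool
    ai_sections_annotated: bool
    tests_passed: bool


def merge_blockers(pr: PullRequest) -> list[str]:
    """Return the reasons a pull request may not merge; empty means clear."""
    blockers = []
    if pr.changed_lines > MAX_CHANGED_LINES:
        blockers.append("pull request too large; split it into smaller changes")
    if pr.contains_ai_code and not pr.ai_sections_annotated:
        blockers.append("AI-generated sections are not annotated")
    if not pr.tests_passed:
        blockers.append("pre-merge tests did not pass")
    return blockers


candidate = PullRequest(
    changed_lines=620,
    contains_ai_code=True,
    ai_sections_annotated=False,
    tests_passed=True,
)
for reason in merge_blockers(candidate):
    print(reason)
```

Returning a list of reasons instead of a single boolean keeps the gate explainable to the author, which matters when the aim is to tighten review without making it opaque.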

There is also a less measurable but widely reported phenomenon that engineers have begun calling the autocomplete trap: the habit of accepting plausible suggestions because they are available, rather than because they are correct. The Stack Overflow Developer Survey 2024 found that while seventy-six percent of developers were using or planning to use AI tools, only forty-three percent trusted the accuracy of their output, and only two percent expressed high trust.⁹ That gap between use and trust is itself a data point. Used reflectively, AI tools accelerate good engineers. Used reflexively, they can erode the habits that produced the good engineering in the first place.

From autocomplete to delegation

The picture above describes AI coding in its first form: an autocomplete-style assistant inside an IDE. The picture has begun to change.

Over the past six months, the leading edge of AI coding has shifted from autocomplete-style suggestions to agent-style delegation. Claude Code's Swarm Mode, officially launched alongside Claude Opus 4.5 in November 2025, spawns a lead agent that plans the work and specialist sub-agents that execute parts of it in parallel, each in an isolated git worktree.¹⁰ Anthropic's lead engineer for the product, Boris Cherny, has publicly described running thousands of such sub-agents overnight, monitored from his phone. Cursor, Replit Agent, and Cognition's Devin operate on similar lines. The work product is no longer a suggestion in the editor; it is a completed branch, a passing test suite, a draft pull request.
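The isolation idea, one git worktree per sub-task, can be illustrated with the standard `git worktree add` command. The sketch below only builds the command lines and does not run them. It is not a description of how Claude Code or any other product is implemented, and the branch naming scheme, directory layout, and function name are hypothetical.

```python
from pathlib import Path


def worktree_commands(
    subtasks: list[str], root: Path, base: str = "main"
) -> list[list[str]]:
    """Build one `git worktree add` command per sub-task.

    Each sub-task gets its own branch and its own checkout directory, so
    parallel workers never edit the same working tree.
    """
    commands = []
    for name in subtasks:
        branch = f"agent/{name}"  # hypothetical branch naming scheme
        path = root / name        # one directory per sub-task
        commands.append(
            ["git", "worktree", "add", "-b", branch, str(path), base]
        )
    return commands


for command in worktree_commands(["parser", "api-client"], Path("worktrees")):
    print(" ".join(command))
```

Because every sub-task ends up on its own branch, the unit of review becomes the branch, which is the operational shift the rest of this section turns on.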

This shift changes the operational question. In the autocomplete era, the question was "is this suggestion good?" In the delegation era, the question is "is this branch trustworthy enough to merge after review?" The two questions sound similar; they require very different organisational answers. Anthropic's own report found that even in the most favourable domain — engineering work at AI-native companies — full delegation still sits at the lower end of the zero-to-twenty-percent range. The capability is real and the deployment overhang is also real. Most organisations are using systems that are more capable than their review processes currently allow them to be.

The regulated edge

For technology leaders in Switzerland, and particularly in the financial services, pharma, insurance, and medical-technology sectors that anchor the Romandy economy, the AI coding question carries an additional set of considerations that the global literature only partially captures.

The first is data flow. AI coding tools, in their default configurations, transmit code to provider-hosted infrastructure outside Switzerland and outside the European Union. Each of the major tools — GitHub Copilot, Cursor, Claude Code, Codeium — now offers enterprise tiers with stricter data residency, retention, and training-opt-out guarantees. The default tier does not. For code that touches regulated data, regulated processes, or trade secrets, the configuration is not a footnote.

The second is auditability under the EU AI Act, whose binding date for high-risk Annex III systems is 2 August 2026.¹¹ AI-generated code that is incorporated into a high-risk AI system inherits the conformity-assessment, documentation, and human-oversight obligations of that system. Provenance — knowing which segments of a codebase were AI-generated, which model produced them, and what review they received — is no longer a documentation nicety. For an increasing share of regulated software, it will be a compliance artefact.
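As a rough illustration of provenance as a compliance artefact, the sketch below records the three things named above: which segment was AI-generated, which model produced it, and what review it received. The schema is hypothetical. The EU AI Act does not prescribe these field names, and the example values are placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """One AI-generated segment of a codebase (hypothetical schema)."""
    path: str              # file containing the segment
    start_line: int        # first line of the segment
    end_line: int          # last line of the segment, inclusive
    model: str             # model that produced the segment
    reviewed_by: str       # person accountable for the review
    review_reference: str  # pointer to the review, e.g. a pull request id


def to_audit_json(records: list[ProvenanceRecord]) -> str:
    """Serialise records in a stable order suitable for an audit export."""
    ordered = sorted(records, key=lambda r: (r.path, r.start_line))
    return json.dumps([asdict(r) for r in ordered], indent=2, sort_keys=True)


records = [
    ProvenanceRecord(
        path="billing/rounding.py",
        start_line=10,
        end_line=42,
        model="example-model-v1",       # placeholder model name
        reviewed_by="reviewer@example.com",
        review_reference="PR-0000",     # placeholder reference
    )
]
print(to_audit_json(records))
```

Keeping the record immutable and the export deterministic is a design choice: an audit artefact is more useful when two exports of the same state are byte-for-byte identical.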

The third is the standardisation track. ISO/IEC 42001:2023, the AI management system standard, is becoming a procurement gate for European enterprise buyers. Engineering organisations that can demonstrate disciplined practices around AI-assisted development — provenance, review, secure-coding boundaries, training-data isolation — will find themselves with a structural advantage in regulated sales cycles over the next two years.

What this means operationally

The patterns above suggest a practical posture rather than a set of rules, and the posture is closer to "treat AI coding as supervised production capability" than to either "ban it" or "deploy it everywhere."

The supervised-production posture has five components. First, treat AI-generated code under the same review standard as human-written code, with the additional requirement that the submitting developer be able to explain every line they ship, regardless of authorship. Second, keep security-sensitive paths (authentication, authorisation, payment, key management, cryptographic primitives, regulated data access) under human authorship and human review by default, with AI assistance permitted only in restricted, well-instrumented forms. Third, instrument the engineering pipeline to actually measure what AI is doing to it: cycle time, defect escape rate, rework, code churn, and review duration, on a quarterly cadence and not on a vibe. Fourth, tighten review practices rather than loosening them, because the productivity gain from AI is real only if review quality is preserved. Fifth, treat provenance as an artefact: which code is AI-generated, which model produced it, what configuration the tool was running under, and what review it received.
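The rule about security-sensitive paths is the easiest of these to enforce mechanically. A minimal sketch, assuming a repository layout in which sensitive code lives under recognisable directories: the patterns and function names below are hypothetical, and each team would substitute its own.

```python
from fnmatch import fnmatchcase

# Hypothetical layout; replace with the paths that matter in a given codebase.
SENSITIVE_PATTERNS = [
    "src/auth/*",
    "src/payments/*",
    "src/crypto/*",
    "src/keys/*",
    "src/regulated_data/*",
]


def requires_human_authorship(path: str) -> bool:
    """True if the file sits on a security-sensitive path.

    fnmatchcase treats '*' as matching any characters, including '/',
    so each pattern also covers nested subdirectories.
    """
    return any(fnmatchcase(path, pattern) for pattern in SENSITIVE_PATTERNS)


def policy_violations(ai_touched_paths: list[str]) -> list[str]:
    """Files changed with AI assistance that the policy reserves for humans."""
    return [path for path in ai_touched_paths if requires_human_authorship(path)]


changed_with_ai = [
    "src/auth/tokens/jwt.py",
    "src/reports/monthly.py",
    "src/payments/refund.py",
]
print(policy_violations(changed_with_ai))
```

A check like this depends on knowing which files were changed with AI assistance, so it is only as reliable as the provenance data feeding it.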

The cost of getting this right is mostly cultural and procedural, not financial. The licensing cost of AI coding tools, at roughly twenty to forty US dollars per developer per month for the established tools and somewhat more for the frontier agentic products, is trivial against engineering salaries. The cost of getting it wrong is also mostly cultural: a slow erosion of the habits — careful reading, deep system knowledge, judgement under uncertainty — that produced the engineering quality in the first place. That erosion is not hypothetical, and it is the cost most likely to be invisible on a dashboard until it is expensive on a roadmap.

The summary is unromantic. AI code generation is genuinely useful for mechanical, well-specified, pattern-following work, where it produces measurable and durable productivity gains. It is not yet a shortcut to shipping faster on hard problems, and the public evidence increasingly suggests that treating it as one is associated with measurable downside on delivery throughput, stability, and codebase health. The agentic shift of the past six months is changing the question — from "good suggestion?" to "trustworthy branch?" — and the organisations that adapt their review and governance practices to that new question will compound an advantage that the organisations betting only on faster typing will not.

The intuition to retain is the one that the METR study quietly demonstrated. The feeling of going faster is not the same thing as going faster. The teams that distinguish between the two, and measure honestly, will be the ones who actually do.

References & sources


¹ METR (Becker et al.), Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. July 2025. Randomised controlled trial: 16 developers, 246 real tasks. Expected speedup: +24%. Post-experiment self-report: +20%. Measured: −19%.
² GitHub, Quantifying GitHub Copilot's Impact on Developer Productivity and Happiness. 2022, expanded 2024. Randomised study finding 55% faster completion of a bounded scripting task with Copilot vs. control.
³ DORA, 2024 State of DevOps Report. Increased AI adoption associated with a 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability per 25% increase in AI tool use. Correlational.
⁴ Anthropic, 2026 Agentic Coding Report. AI used in ~60% of coding work in surveyed engineering teams; full delegation 0–20%.
⁵ Atlassian, State of Developer Experience 2024. Largest reported AI-tool time savings concentrated in code understanding and navigation rather than code production.
⁶ Published incident reports and post-mortems, 2024–2025. Recurring pattern of AI-generated code passing canonical-case tests but failing on regulatory or edge-case logic in production.
⁷ GitClear, Coding on Copilot: Code Quality Findings Across 153 Million Lines of Changed Code. 2024. Rise in two-week code churn coinciding with AI tool adoption; correlational.
⁸ Faros AI and Codacy, 2025 practitioner studies on PR review cycle time in AI-augmented engineering organisations.
⁹ Stack Overflow, Developer Survey 2024. 76% of developers using or planning to use AI tools; 43% trust output accuracy; 2% report high trust.
¹⁰ Anthropic, Claude Code Swarm Mode (TeammateTool). Released alongside Claude Opus 4.5, November 2025. Lead agent plans and delegates; specialist agents execute in isolated git worktrees.
¹¹ EU AI Act, Annex III. Binding compliance date for high-risk AI systems: 2 August 2026.

AI code generation: what the evidence actually shows