A year ago, the question boards put to most European CTOs about AI coding tools was "are your engineers using Copilot?" The implicit framing was that AI coding was a productivity opportunity, a question of adoption rate, a thing to be encouraged from the top down. In the boardroom conversations now happening across Romandy, Zürich, Frankfurt, and London, the question has changed shape. It is now "how are you governing it?" The shift tells a story about how the discourse has matured, where the risks have surfaced, and what an operationally serious answer now needs to look like.
This essay is the brief version of that answer. A longer companion piece, AI Code Generation: What the Evidence Actually Shows, walks through the controlled-study evidence in depth.[1] The brief version is for the executive summary slide.
The tools are real. So are the new problems.
The evidence on AI coding tools, after eighteen months of widespread enterprise adoption, points in two directions at once.
On the productivity side, GitHub's controlled study found developers using Copilot completed a bounded scripting task fifty-five percent faster than a control group.[2] Industry case studies (Airbnb's testing-infrastructure migration, Google's published reports on AI-assisted code modernisation) document meaningful acceleration of structural-transformation work. Atlassian's 2024 State of Developer Experience research found the largest reported time savings concentrated in code understanding and navigation, not code production.[3] For bounded, well-specified, pattern-following work, the productivity gain is real and durable.
On the outcome side, the picture is materially different. METR's mid-2025 randomised controlled trial gave sixteen experienced open-source developers 246 real tasks on codebases they had each contributed to for years. Beforehand, the developers expected AI to make them twenty-four percent faster. After the experiment they believed they had been twenty percent faster. They were measured at nineteen percent slower.[4] DORA's 2024 State of DevOps Report, part of the longest-running quantitative study of software delivery, found that every twenty-five percent increase in AI tool adoption among surveyed teams was associated with an estimated 1.5 percent decrease in delivery throughput and a 7.2 percent decrease in delivery stability.[5]
Both can be true at once: AI coding tools dramatically accelerate certain categories of work while leaving measured engineering throughput flat, or making it worse. The bounded productivity gain is real. The unmeasured cost is in review burden, codebase complexity, and the slow erosion of the careful-reading habits that produced engineering quality in the first place. Anthropic's 2026 Agentic Coding Report documents that even in the most favourable domain (engineering at AI-native companies), AI is used in roughly sixty percent of coding work, but fully delegated work sits at zero to twenty percent.[6] The gap between assisted and delegated is where the entire governance question now lives.
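To make the METR figures above concrete, the short sketch below applies them to a hypothetical 60-minute task. It assumes each percentage can be read as a change in task completion time relative to working without AI, which is a simplification for illustration. The 60-minute baseline and the function name are assumptions, not figures from the study.

```python
# Illustrative arithmetic only. The 60-minute baseline is a made-up example,
# and each percentage is treated as a signed change in completion time.

BASELINE_MINUTES = 60  # hypothetical task duration without AI


def minutes_after_change(baseline: int, percent_change: int) -> float:
    """Return the task duration after applying a signed percentage change."""
    return baseline * (100 + percent_change) / 100


forecast = minutes_after_change(BASELINE_MINUTES, -24)   # expected beforehand
perceived = minutes_after_change(BASELINE_MINUTES, -20)  # believed afterwards
measured = minutes_after_change(BASELINE_MINUTES, +19)   # observed in the trial

print(f"forecast:  {forecast:.1f} min")
print(f"perceived: {perceived:.1f} min")
print(f"measured:  {measured:.1f} min")
print(f"gap between perceived and measured: {measured - perceived:.1f} min")
```

Under these assumptions, a developer would believe the task took about 48 minutes when it actually took about 71. The point of the sketch is the size of the perception gap, not the absolute numbers.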
Where it actually works, in compressed form
Across the controlled and observational evidence, four categories emerge as durable wins.
Boilerplate and structural transformation: Terraform modules, REST/GraphQL scaffolding, data class mappings, configuration files, test fixtures. The output requires review, but editing a near-correct draft is consistently faster than typing from scratch.
Code exploration and navigation: asking the tool to explain a service, find an exception handler, or locate the relevant test. This is closer to a faster grep than to authorship.
First drafts of complex one-shot artefacts: SQL queries, regex patterns, and transformation scripts where the blank-page cost is high.
Well-defined legacy migrations: work where the source and target patterns are both well known.
Where it creates problems, in compressed form
The same evidence is unambiguous about three failure categories.
Test generation with teeth: AI-generated tests tend to assert against the implementation rather than the behaviour; they look complete, they pass on every run, and they often do not catch the bugs that matter. Coverage is generated; whether the tests assert the correct behaviour remains a human question.
Complex domain logic and over-confident completions: anything where correctness depends on understanding why (Swiss franc rounding rules, cross-border tax handling, regulatory edge cases, clinical decision logic, session-management subtleties in authentication code) produces output that looks plausible, passes casual review, and fails in production weeks or months later. The dangerous property is not that the code is wrong; it is that the code looks right.
Review fatigue and the hidden tax: when AI generates more code faster, review queues grow. Practitioner studies from Faros AI and Codacy in 2025 documented increases in PR review cycle time on AI-generated code, with the magnitude varying widely but the direction consistent.[7] If the review culture was not strong before AI adoption, it becomes the bottleneck after.
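The first two failure categories above can be illustrated together. The sketch below assumes, purely for illustration, a cash-rounding rule that rounds Swiss franc amounts to the nearest 0.05 with halves rounded up; the function names and test cases are hypothetical. It contrasts a test that derives its expectation from the code under test with a test that states the expected behaviour independently.

```python
from decimal import Decimal, ROUND_HALF_UP

STEP = Decimal("0.05")  # assumed rule: round to the nearest 5 centimes


def round_chf_buggy(amount: Decimal) -> Decimal:
    """Plausible-looking but wrong: rounds to the nearest centime."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def round_chf(amount: Decimal) -> Decimal:
    """Round to the nearest multiple of 0.05, halves rounded up."""
    steps = (amount / STEP).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return steps * STEP


def implementation_mirroring_test(fn) -> bool:
    """Expected value comes from the code under test, so it cannot fail."""
    amount = Decimal("10.02")
    expected = fn(amount)
    return fn(amount) == expected


def behaviour_test(fn) -> bool:
    """Expected values are written down from the rule, not from the code."""
    cases = {
        Decimal("10.02"): Decimal("10.00"),
        Decimal("10.03"): Decimal("10.05"),
        Decimal("10.075"): Decimal("10.10"),
    }
    return all(fn(amount) == expected for amount, expected in cases.items())


print("mirroring test on buggy code:", implementation_mirroring_test(round_chf_buggy))
print("behaviour test on buggy code:", behaviour_test(round_chf_buggy))
print("behaviour test on fixed code:", behaviour_test(round_chf))
```

The mirroring test passes against the buggy function and contributes coverage, while only the behaviour test exposes the defect. Auditing generated tests largely means checking which of these two shapes they take.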
What the strongest deployers are doing
The European enterprises with the most successful AI-coding deployments share a small set of characteristics.
Explicit review standards for AI-generated code: the same bar as for human-written code, with the additional requirement that the submitting engineer be able to explain every line they ship, regardless of authorship.
AI-tool training as part of engineering onboarding: not "here's how to use it" but "here's how it fails, here's how to verify what it generates, here's what the published evidence says about where it is and is not trustworthy."
Updated estimation discipline: velocity appears to go up, but the shape of work changes; time saved on generation gets partly absorbed in verification, and the net gain is real but smaller than vendor marketing suggests.
Tighter, not looser, review and CI practices: stricter linting, mandatory annotation of AI-generated sections, smaller and more frequent pull requests. The productivity gain is real only if review quality is preserved; loosening review to ride the velocity bump is the most common failure mode and produces the worst long-term outcomes.
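The review and CI practices above could be encoded as a simple pre-merge policy check. The following is a minimal sketch under stated assumptions: the record fields, the 400-line threshold, and the function name are all hypothetical, and a real pipeline would gather these values from its own tooling.

```python
from dataclasses import dataclass

MAX_CHANGED_LINES = 400  # hypothetical size limit; tune per team


@dataclass
class PullRequestSummary:
    """Hypothetical summary of a pull request; field names are illustrative."""
    changed_lines: int
    ai_assisted: bool            # declared by the submitting engineer
    ai_sections_annotated: bool  # AI-generated sections are marked as such
    author_can_explain: bool     # reviewer confirmed a line-by-line walkthrough


def policy_violations(pr: PullRequestSummary) -> list[str]:
    """Return human-readable policy violations (empty if the PR passes)."""
    violations = []
    if pr.changed_lines > MAX_CHANGED_LINES:
        violations.append(
            f"PR changes {pr.changed_lines} lines; limit is {MAX_CHANGED_LINES}"
        )
    if pr.ai_assisted and not pr.ai_sections_annotated:
        violations.append("AI-generated sections are not annotated")
    # Applies to every PR, regardless of who or what wrote the code.
    if not pr.author_can_explain:
        violations.append("author has not walked the reviewer through the change")
    return violations


example = PullRequestSummary(
    changed_lines=650,
    ai_assisted=True,
    ai_sections_annotated=False,
    author_can_explain=True,
)
for problem in policy_violations(example):
    print("BLOCKED:", problem)
```

Note that the explain-every-line rule is deliberately not conditional on AI use, which matches the "same bar, regardless of authorship" framing.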
The honest number
The published controlled evidence does not support a single industry-wide productivity figure for AI coding tools, and any CTO claiming one is overstating their data. What the evidence supports is a more useful framing: substantial gains in bounded, well-specified tasks; flat or negative gains in unbounded, deeply contextual tasks; and significant variance across teams, languages, and codebases. The Anthropic deployment-overhang finding (sixty percent assisted, zero to twenty percent delegated) is the closest the field has to a defensible aggregate. The CTO who can articulate which categories of their organisation's work fall into the bounded bucket and which fall into the unbounded one is several conversations ahead of the CTO citing the vendor's productivity figure.
What to do now, in operational form
For organisations that have not rolled out AI coding tools: start with a structured pilot on one team and one codebase, with a measurement framework that includes review cycle time, defect escape rate, and rework percentage, not just generation speed. For organisations that have rolled them out: audit what AI-generated tests are actually testing; this category of risk is easy to miss until it is expensive. For organisations being asked by their board whether they are "using AI": the honest answer is not yes or no. The honest answer describes the specific workflows that have changed, what was measured, what improved, what worsened, and what governance was added. That answer is materially more credible than a percentage.
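The three pilot metrics named above can be computed from very little data. The sketch below is one possible set of definitions, not a standard: it assumes review cycle time is the median time from PR opened to PR merged, defect escape rate is the share of all recorded defects first found in production, and rework percentage is the share of merged lines rewritten or reverted within a window the team chooses. The record type, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class PullRequest:
    """Hypothetical per-PR record; field names are illustrative."""
    opened_at: datetime
    merged_at: datetime
    lines_merged: int
    lines_reworked: int       # rewritten or reverted soon after merge
    defects_pre_release: int  # caught in review, CI, or QA
    defects_escaped: int      # first found in production


def review_cycle_hours(prs: list[PullRequest]) -> float:
    """Median hours from PR opened to PR merged."""
    hours = [(pr.merged_at - pr.opened_at).total_seconds() / 3600 for pr in prs]
    return median(hours)


def defect_escape_rate(prs: list[PullRequest]) -> float:
    """Fraction of all recorded defects that reached production."""
    escaped = sum(pr.defects_escaped for pr in prs)
    total = escaped + sum(pr.defects_pre_release for pr in prs)
    return escaped / total if total else 0.0


def rework_percentage(prs: list[PullRequest]) -> float:
    """Percentage of merged lines that were later reworked."""
    merged = sum(pr.lines_merged for pr in prs)
    reworked = sum(pr.lines_reworked for pr in prs)
    return 100 * reworked / merged if merged else 0.0


# Made-up sample data for a two-PR pilot.
pilot = [
    PullRequest(datetime(2025, 1, 6, 9), datetime(2025, 1, 6, 15), 200, 20, 3, 1),
    PullRequest(datetime(2025, 1, 7, 9), datetime(2025, 1, 8, 9), 300, 80, 5, 1),
]

print(f"review cycle time: {review_cycle_hours(pilot):.1f} h (median)")
print(f"defect escape rate: {defect_escape_rate(pilot):.0%}")
print(f"rework: {rework_percentage(pilot):.1f}%")
```

Computing the same three numbers for a comparable period before the pilot gives the board a before-and-after comparison rather than a single adoption percentage.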
The tools are good enough to use and imperfect enough to respect. The CTOs who internalise both halves of that sentence will compound an advantage over the CTOs who only internalise the first half. The board has already moved on from the first question. The brief above is the start of the answer to the second.
