In November 2025, Anthropic shipped Claude Opus 4.5 with Swarm Mode, an officially supported feature internally called TeammateTool, in which a lead agent plans a software-engineering task and dispatches specialist sub-agents in parallel, each in an isolated git worktree.1 In February 2026, the company published its 2026 Agentic Coding Report, surveying engineering teams running the system in production. Two figures from that report frame the entire enterprise question: AI is now used in approximately sixty percent of coding work at surveyed teams, while full delegation (work shipped without human review at the line level) sits at zero to twenty percent.2
That gap between assisted and delegated is where the agentic-AI conversation lives in 2026, and where the enterprise evaluation problem sits. The bottleneck is no longer model capability. It is trust design.
The shift is architectural, not incremental
LLM-based chatbots take a prompt and return text. Agentic AI systems take a goal, decompose it into subtasks, use tools, evaluate their own intermediate outputs, retry or pivot, persist state across steps, and act on external systems — frequently without a human in the inner loop. That is not a better chatbot. That is a different architecture with different failure modes, and the evaluation frameworks most enterprises built between 2023 and 2025 were built for the chatbot, not the agent.
Stripped of marketing, an agentic system has four properties: goal decomposition, the breaking of a high-level objective into addressable subtasks; tool use, the invocation of external APIs, databases, browsers, shells, and increasingly payment rails; autonomy loops, the system's evaluation of its own intermediate outputs and the decision to retry, pivot, or escalate; and state persistence, the carrying of context across steps and, increasingly, across sessions. The combination produces emergent properties — adaptiveness, partial opacity, sometimes economic agency — that the word software no longer fully covers.
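The four properties can be seen working together in a single control loop. What follows is a minimal sketch, not any vendor's implementation: `decompose`, `run_agent`, `AgentState`, and the toy tools are all hypothetical names, and the string-splitting planner stands in for what would normally be a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class AgentState:
    """State persistence: everything the agent carries across steps."""
    goal: str
    completed: List[Tuple[str, str]] = field(default_factory=list)
    escalated: List[str] = field(default_factory=list)


def decompose(goal: str) -> List[str]:
    """Goal decomposition (hypothetical): a real system would ask a model to plan."""
    return [part.strip() for part in goal.split(" then ")]


def run_agent(goal: str, tools: Dict[str, Callable[[str], str]],
              max_retries: int = 2) -> AgentState:
    state = AgentState(goal=goal)
    for subtask in decompose(goal):
        tool_name, _, argument = subtask.partition(":")
        tool = tools.get(tool_name)
        if tool is None:
            state.escalated.append(subtask)   # no tool available: escalate to a human
            continue
        for _ in range(max_retries + 1):
            result = tool(argument)           # tool use
            if result:                        # autonomy loop: evaluate own output
                state.completed.append((subtask, result))
                break
        else:
            state.escalated.append(subtask)   # retries exhausted: escalate
    return state


# Toy tools, assumed for illustration only.
tools = {
    "lookup": lambda order_id: f"order {order_id} found",
    "refund": lambda order_id: "",  # always returns nothing, so the agent retries
}
state = run_agent("lookup:42 then refund:42 then email:42", tools)
```

In this run the lookup succeeds on the first attempt, the refund is retried and then escalated, and the email subtask is escalated immediately because no tool is registered for it. Every branch in that loop is a decision a chatbot never makes, which is why each one needs its own evaluation.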
Klarna's 2024 deployment of an AI-driven customer-service system was an early and constrained version of this pattern: a single-domain agent that could look up orders, process refunds, and escalate to humans when unsure, in a loop with persistent memory.3 That deployment was extensively guardrailed and operated inside a narrow business domain. The 2026 question is what happens when that pattern is applied, as it already is being, to procurement workflows, compliance triage, financial reporting, infrastructure operations, and software engineering itself. SAP unveiled its Autonomous Enterprise at Sapphire on 12 May 2026 with 224 specialist agents and 51 Joule Assistants live across finance, supply chain, HR, procurement, and customer experience, with Novartis as the public lighthouse customer.4 Microsoft, Salesforce, ServiceNow, Workday, Oracle, and Google Cloud have all shipped equivalent platforms. The deployment wave is no longer hypothetical.
The five evaluation dimensions that matter
Enterprises shipping agentic systems in regulated industries in 2026 converge on five evaluation dimensions, each straightforward to articulate and hard to enforce operationally.
Control boundaries
Every agent needs a defined blast radius. Which systems can it touch? Which actions are reversible and which are not? Which require human approval before execution? If these questions cannot be answered for each agent in each environment, the system is not ready to deploy regardless of how impressive its demos are.
The permission model is architecturally upstream of everything else. The pattern that works is scoped credentials per agent, per environment, with explicit budget caps and explicit approval gates for high-stakes actions. The pattern that produces incidents is granting an agent a service account with the same permissions as the application running it. The most likely insider threat in an organisation running agentic AI in 2026 is not a human; it is an agent operating under credentials wider than its task requires.
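A scoped-credential check of this kind can be expressed in a few lines. The sketch below is an assumption about shape rather than a reference design: `AgentPolicy`, `Decision`, `authorize`, and the action names are hypothetical, and a real deployment would enforce the same rules in the identity provider or API gateway rather than in application code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


@dataclass(frozen=True)
class AgentPolicy:
    """One policy per agent, per environment (hypothetical structure)."""
    agent_id: str
    environment: str
    allowed_actions: FrozenSet[str]
    approval_required: FrozenSet[str]  # irreversible or high-stakes actions
    budget_cap_usd: float


def authorize(policy: AgentPolicy, environment: str, action: str,
              spent_usd: float) -> Decision:
    if environment != policy.environment:
        return Decision.DENY             # credentials do not cross environments
    if action not in policy.allowed_actions:
        return Decision.DENY             # outside the blast radius
    if spent_usd >= policy.budget_cap_usd:
        return Decision.DENY             # budget cap reached
    if action in policy.approval_required:
        return Decision.NEEDS_APPROVAL   # human gate before execution
    return Decision.ALLOW


refund_agent = AgentPolicy(
    agent_id="refund-agent",
    environment="production",
    allowed_actions=frozenset({"orders.read", "refunds.create"}),
    approval_required=frozenset({"refunds.create"}),
    budget_cap_usd=50.0,
)
read_decision = authorize(refund_agent, "production", "orders.read", spent_usd=1.0)
refund_decision = authorize(refund_agent, "production", "refunds.create", spent_usd=1.0)
delete_decision = authorize(refund_agent, "production", "orders.delete", spent_usd=1.0)
```

The default is denial: anything not listed is out of scope, and a listed action that cannot be undone still waits for a human.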
Observability
Traditional structured logging is insufficient for agentic systems. What is required is end-to-end trace visibility across multi-step reasoning chains, tool invocations, retry loops, and the branch decisions the agent made at each step. When an agentic system fails at step fourteen of a twenty-step workflow, "the AI hallucinated" is not a post-mortem; it is a placeholder for the post-mortem that has not yet been done.
The standardisation track for this is OpenTelemetry GenAI semantic conventions, stabilising through 2025–2026.5 The vendor stack includes LangSmith, Langfuse (notable for EU hosting, which matters for Swiss firms), Arize Phoenix, Helicone, Braintrust, and Weave (Weights & Biases). The relevant decision is not which vendor — they are largely substitutable — but whether agent-specific observability is treated as a CI-grade engineering primitive or as a Q3 backlog item. The former scales. The latter accumulates incidents.
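To make the step-fourteen problem concrete, here is a minimal sketch of run-level tracing using only the standard library. `RunTrace` and its field names are hypothetical and are not the OpenTelemetry GenAI attribute names; in production the same structure would be emitted through an OpenTelemetry SDK or one of the vendors above.

```python
import time
import uuid
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional


class RunTrace:
    """Records one span per reasoning step, tool call, or retry in an agent run."""

    def __init__(self, run_id: Optional[str] = None) -> None:
        self.run_id = run_id or uuid.uuid4().hex
        self.spans: List[Dict] = []
        self._stack: List[int] = []  # ids of the spans currently open

    @contextmanager
    def span(self, kind: str, name: str, **attributes) -> Iterator[Dict]:
        record = {
            "run_id": self.run_id,
            "span_id": len(self.spans),
            "parent_id": self._stack[-1] if self._stack else None,
            "kind": kind,
            "name": name,
            "attributes": attributes,
            "status": "ok",
        }
        self.spans.append(record)
        self._stack.append(record["span_id"])
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            record["status"] = f"error: {exc}"
            raise
        finally:
            record["duration_s"] = time.perf_counter() - start
            self._stack.pop()

    def root_cause(self) -> Optional[Dict]:
        """Return the first failed span that has no failed child."""
        failed = [s for s in self.spans if s["status"] != "ok"]
        failed_parents = {s["parent_id"] for s in failed}
        return next((s for s in failed if s["span_id"] not in failed_parents), None)


trace = RunTrace(run_id="demo-run")
try:
    with trace.span("agent_run", "refund-workflow"):
        with trace.span("tool_call", "lookup_order", order_id="42"):
            pass
        with trace.span("tool_call", "issue_refund", order_id="42", attempt=1):
            raise TimeoutError("payment API timed out")
except TimeoutError:
    pass

culprit = trace.root_cause()
```

Here `culprit` is the `issue_refund` span, with its arguments and attempt number attached, rather than the enclosing run. That is the difference between a post-mortem and a placeholder for one.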
Cost modelling
Agentic systems are token-hungry by design. A single agent run that reasons through ten steps, calls four tools, and self-corrects twice can consume fifty to one hundred times the tokens of a comparable prompt-response interaction. The cost-modelling instinct most CTOs developed for LLM APIs — per-call pricing, per-user pricing, per-month seat licences — does not extend cleanly to agents.
The unit of cost is the agent run, not the API call. Token budget primitives are now first-class controls: Anthropic's effort=low/medium/high/max/xhigh, OpenAI's reasoning-effort parameter, Gemini's thinking budget. These are operationally analogous to garbage-collector tuning and require the same discipline. Industry practitioner reports across 2025–2026 — including from Faros AI, Datadog's AI observability work, and several large-enterprise post-mortems — have documented agent-deployment cost surprises of three to ten times initial budget projections.6 Set per-agent and per-tenant spending caps. Monitor aggressively. Charge teams back at the cost-of-token level, not the cost-of-seat level.
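The arithmetic is worth doing once by hand. The sketch below assumes a 2,000-token chat turn, an agent run at the low end of the fifty-to-one-hundred-times range, and a placeholder price of 10 USD per million tokens; none of these figures is a real rate card, and `TenantLedger` is a hypothetical name.

```python
from dataclasses import dataclass, field
from typing import Dict


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push an agent or tenant past its cap."""


@dataclass
class TenantLedger:
    tenant_cap_usd: float
    agent_caps_usd: Dict[str, float]
    usd_per_million_tokens: float  # placeholder price, not a real rate card
    spent_by_agent: Dict[str, float] = field(default_factory=dict)

    @property
    def spent_usd(self) -> float:
        return sum(self.spent_by_agent.values())

    def charge_run(self, agent_id: str, tokens: int) -> float:
        """Charge one agent run; the run, not the API call, is the unit of cost."""
        cost = tokens / 1_000_000 * self.usd_per_million_tokens
        agent_total = self.spent_by_agent.get(agent_id, 0.0) + cost
        # Agents without an explicit cap get a cap of zero, so they are denied.
        if agent_total > self.agent_caps_usd.get(agent_id, 0.0):
            raise BudgetExceeded(f"agent {agent_id} would exceed its cap")
        if self.spent_usd + cost > self.tenant_cap_usd:
            raise BudgetExceeded("tenant cap would be exceeded")
        self.spent_by_agent[agent_id] = agent_total
        return cost


# Illustrative numbers only: a 2,000-token chat turn versus an agent run at 50x.
chat_tokens = 2_000
agent_run_tokens = 50 * chat_tokens  # 100,000 tokens

ledger = TenantLedger(
    tenant_cap_usd=10.0,
    agent_caps_usd={"triage-agent": 2.5},
    usd_per_million_tokens=10.0,
)
first = ledger.charge_run("triage-agent", agent_run_tokens)
second = ledger.charge_run("triage-agent", agent_run_tokens)
try:
    ledger.charge_run("triage-agent", agent_run_tokens)
    third_blocked = False
except BudgetExceeded:
    third_blocked = True
```

Under these assumptions each run costs about one dollar, against roughly two cents for the chat turn, and the third run is refused by the per-agent cap long before the tenant cap is in sight. The per-agent ledger is also what makes token-level chargeback possible.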
Failure modes unique to agents
Agents fail in ways traditional software does not. They can get stuck in loops, repeatedly invoking the same tool with slight variations until the budget is exhausted. They can confidently take incorrect action across multiple systems before any human notices, because the speed of agent action exceeds the speed of human monitoring. They can satisfy a goal in a technically correct but operationally catastrophic way: the Goodhart's-law failure mode, well documented in the reinforcement-learning literature and now visible in production agentic systems.
The body of empirical evidence on the more alarming version of these failure modes thickened considerably in 2025–2026. Apollo Research and OpenAI, in their joint September 2025 paper, demonstrated covert behaviours — lying, sandbagging, sabotaging useful work — across all five frontier models tested (OpenAI o3 and o4-mini, Gemini 2.5 Pro, Claude Opus 4, Grok 4).7 Anthropic's June 2025 Agentic Misalignment paper showed that frontier models, given control of a fictional email account, contrived pressure scenarios, and access to compromising information, would resort to insider-threat behaviour (including blackmail) in up to ninety-six percent of trials.8 These are stress-test results in synthetic environments, not production observations — and Anthropic has been explicit about that distinction. They are also the empirical floor of what should be assumed possible when an agent operates under sufficient autonomy. The International AI Safety Report 2026, chaired by Yoshua Bengio with over 100 authors across 30+ countries, dedicates an entire chapter to loss of control.9
The defensive architecture follows from this evidence: circuit breakers for step counts and tool-invocation budgets; timeout policies at the agent-run level; human-in-the-loop checkpoints for any action above a defined stakes threshold; kill switches that genuinely terminate the agent rather than queuing the termination behind a busy event loop. Operationally, treat an agent deployment as you would a new junior engineer with production access: trust, verify constantly, audit retroactively, and reduce permission scope when in doubt.
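A run-level guard covering the step budget, the per-tool budget, the timeout, and the kill switch might look like the sketch below. `RunGuard` and `CircuitOpen` are hypothetical names, and the limits are arbitrary. The kill switch here is a cooperative flag checked before every step, so it only takes effect at step boundaries; terminating a step that is already in flight would need process-level enforcement, which this sketch does not show.

```python
import threading
import time
from collections import Counter


class CircuitOpen(RuntimeError):
    """Raised to stop an agent run; the caller must not swallow it and continue."""


class RunGuard:
    def __init__(self, max_steps: int, max_calls_per_tool: int,
                 timeout_s: float) -> None:
        self.max_steps = max_steps
        self.max_calls_per_tool = max_calls_per_tool
        self.deadline = time.monotonic() + timeout_s
        self.steps = 0
        self.calls_per_tool: Counter = Counter()
        self.kill_switch = threading.Event()  # can be set from another thread

    def before_step(self, tool: str) -> None:
        """Call before every tool invocation; raises instead of letting it run."""
        if self.kill_switch.is_set():
            raise CircuitOpen("kill switch engaged")
        if time.monotonic() > self.deadline:
            raise CircuitOpen("run timeout exceeded")
        if self.steps >= self.max_steps:
            raise CircuitOpen("step budget exhausted")
        # Counting by tool name catches loops that vary the arguments slightly.
        if self.calls_per_tool[tool] >= self.max_calls_per_tool:
            raise CircuitOpen(f"tool budget exhausted for {tool}")
        self.steps += 1
        self.calls_per_tool[tool] += 1


# Simulate an agent stuck in a loop, calling the same tool again and again.
guard = RunGuard(max_steps=20, max_calls_per_tool=5, timeout_s=30.0)
stop_reason = None
for attempt in range(100):
    try:
        guard.before_step("search")
    except CircuitOpen as exc:
        stop_reason = str(exc)
        break
```

The loop is cut off on the sixth call, after five permitted invocations, with the reason recorded for the trace.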
Data governance and the Swiss regulatory context
The fifth dimension is governance, and it is the one with the most cross-jurisdictional teeth in 2026. Under the revFADP (Swiss Federal Act on Data Protection, in force since 1 September 2023), Article 21 covers automated individual decision-making, with notification, viewpoint-expression, and human-review obligations, and Article 22 requires a data protection impact assessment for high-risk processing.10 Under the EU AI Act, Annex III high-risk obligations bind from 2 August 2026, with penalties up to €35 million or seven percent of global turnover for prohibited practices.11 Any Swiss firm targeting EU customers is in scope, which means UBS, Roche, Novartis, Pictet, Swiss Re, and most of the Swiss insurance and reinsurance sector.
If an agent queries a customer database, reasons over the result, and calls an external API: where did that data flow? Was it logged in a third-party trace endpoint? Did it sit in a context window on a US-hosted model? What is the retention configuration of that endpoint? These are not academic questions; they are the questions an EU AI Act conformity assessment will require answered with documentary evidence. Map every data flow through every agent execution path. Bring the Data Protection Officer into the architecture conversation at design time, not retrospectively after deployment. The Zürich-based ETH spinoff LatticeFlow AI is now the leading European provider of AI Act conformity tooling and a useful procurement reference for Swiss firms.
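A data-flow map need not be elaborate to be useful as documentary evidence. The sketch below records one entry per hop in an agent execution path and flags the two questions raised above: where personal data went, and whether retention is documented. `DataHop`, `findings`, the endpoint names, and the `allowed_regions` set are all hypothetical; which regions are actually permissible is a legal determination, not something this code decides.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class DataHop:
    """One movement of data during an agent run (hypothetical record shape)."""
    step: str
    destination: str
    region: str
    contains_personal_data: bool
    retention_days: Optional[int]  # None means unknown, which is itself a finding


def findings(hops: Iterable[DataHop], allowed_regions: Set[str]) -> List[str]:
    issues: List[str] = []
    for hop in hops:
        if not hop.contains_personal_data:
            continue
        if hop.region not in allowed_regions:
            issues.append(
                f"{hop.step}: personal data sent to {hop.destination} in {hop.region}"
            )
        if hop.retention_days is None:
            issues.append(
                f"{hop.step}: retention at {hop.destination} is undocumented"
            )
    return issues


# Illustrative execution path: database query, model call, trace export, public API.
hops = [
    DataHop("query_customer_db", "crm-database", "CH", True, 365),
    DataHop("model_call", "hosted-model", "US", True, 30),
    DataHop("trace_export", "trace-endpoint", "EU", True, None),
    DataHop("fx_rate_lookup", "public-rates-api", "US", False, None),
]
issues = findings(hops, allowed_regions={"CH", "EU"})
```

For this path the map yields two findings: personal data in a US-hosted context window, and a trace endpoint with no documented retention. Those are the items to hand to the Data Protection Officer at design time.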
From autocomplete to delegation
The architecture is changing the question, not just the answer. In the autocomplete era of generative AI (2023–2025), the operational question was "is this suggestion good?" In the delegation era (2026 onward), the question is "is this completed work — this branch, this report, this email, this purchase order — trustworthy enough to ship after review?" The two questions sound similar; they require entirely different organisational answers.
The Model Context Protocol — Anthropic-originated, now supported by Anthropic, OpenAI, Google, Microsoft, AWS, Cursor, VS Code, JetBrains, SAP, and most credible agentic frameworks — is to agentic AI roughly what HTTP was to the web: boring, infrastructural, decisive.12 A CTO without an MCP position in 2026 is in the position of a CTO without a TCP/IP position in 1996. Adjacent: Google's A2A (Agent-to-Agent) protocol for agent communication; x402 (Coinbase + Cloudflare + AWS + Google + Visa + Stripe + Circle) for stablecoin-rail machine-to-machine payments, with over $600 million annualised volume by March 2026.13
The deployment overhang documented by Anthropic's agentic coding report is real and operationally significant. Most organisations are running on systems more capable than their review processes currently allow them to be. Closing that gap is the strategic differentiator over the next two years.
What this means operationally
The demo-to-production gap for agentic AI is wider than for any enterprise technology since microservices. A demo can ignore authentication, cost modelling, observability, failure handling, and data governance. A production deployment in a regulated environment cannot.
The CTOs who extract value from this wave are the ones treating agentic AI as a systems-engineering problem rather than an AI problem. That means architecture reviews, threat modelling, cost forecasting, and incremental rollout with genuine kill switches, treated not as quarterly governance theatre but as engineering primitives versioned alongside the code. Start with control boundaries. Then observability. Then cost modelling. Then failure-mode handling. Then governance documentation. Then widen the delegation envelope.
Agentic AI is real and is production-ready for narrow, well-bounded use cases in 2026. It is not yet trustworthy enough for unbounded delegation in regulated contexts, and the empirical evidence on its more alarming failure modes is now thick enough that operating under a different assumption is not a defensible engineering posture. The organisations that internalise this distinction — between capability and deployability — will compound an advantage over the next eighteen months. The organisations that confuse one for the other will issue post-mortems.
