Benchmark · 2026

The State of AI-Coding Spend 2026

AI coding went from a line item to a budget. This is what teams actually spend — with the model prices, tool prices, and savings math behind every number — and why nobody can yet tell you which dollars were worth it.

The headline numbers

$500–2k per dev / month on AI coding at heavy teams (reported)
40–85% cost cut from model routing in benchmarks; ~30% realistic
5–8× cheaper to run a capable small model vs the frontier one
84% of devs use AI tools, but trust in their accuracy fell from 40% to 29%
37% of enterprises run 5+ models in production (up from 29%)
~8 Claude model IDs retired in 12 months: a re-eval treadmill

Every number here is reported or modeled, which is the best anyone can do from the outside. Cerver measures your own team's verified spend from real tokens: measured, not surveyed.

1. The market: spend went from rounding error to line item

Enterprise generative-AI spend hit $37B in 2025, up 3.2× from $11.5B in 2024. Foundation-model API spend alone reached $12.5B. Coding is the single largest enterprise category — roughly $4B — and Anthropic, which leads it, now takes 40% of enterprise LLM spend (up from 12% in 2023) while OpenAI's share fell to 27%.

| Metric | Value | Source |
|---|---|---|
| Enterprise gen-AI spend, 2025 | $37B (3.2× YoY) | Menlo Ventures |
| Foundation-model API spend, 2025 | $12.5B (2.0× YoY) | Menlo Ventures |
| Enterprise LLM API spend, ~6 mo | $3.5B → $8.4B | Menlo Ventures |
| AI-coding category size | ~$4B | Menlo Ventures |
| Enterprise AI-coding-agent market | ~$9.8–11B annualized | Gartner |
| Expected enterprise AI-budget growth, next 12 mo | ~75% YoY | a16z CIO survey |
| Budget that's still "experimental" | 25% → 7% | a16z CIO survey |

Market-size scopes differ by source; figures are reported or analyst estimates as noted in the sources.

2. What teams actually pay — per seat, then per token

The subscription is the tip. Tools cluster around a $10–20 individual / $30–40 business-seat / $200 power-user pattern — but seats are increasingly a floor, not the bill, as the whole category shifts to usage-based pricing.

| Tool | Key tiers ($/mo) | Users | ARR / run-rate |
|---|---|---|---|
| GitHub Copilot | Pro $10 · Business $19 · Enterprise $39 | ~20M (4.7M paid) | |
| Cursor | Pro $20 · Business $40 · Enterprise custom | >1M paying | $2B (→ $6B fcst) |
| Claude Code | Pro $20 · Max $100/$200 · Team ~$100/seat | | ~$2.5B |
| OpenAI Codex | Plus $20 · Business $30 · Pro $200 | 5M weekly | |
| Windsurf | Pro $20 · Teams $40 · Max $200 | 350 ent. accts | $82M (acquired) |
| Devin (Cognition) | Core $20 + usage · Team $500 | | ~$1B est. |
| Zed | Pro $10 · Business $30 | | |

The 2026 structural shift: GitHub Copilot moved to usage-based "AI credits" on June 1, 2026 (1 credit = $0.01, billed per token), and Claude Code, Codex, and Devin all stack token/ACU usage on top of seats. Flat seat prices now understate real cost — at scale, per-developer spend commonly runs $100–200+/mo, and reported heavy-user figures reach $500–$2,000/dev/mo.
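
The seat-plus-usage arithmetic can be sketched in a few lines. This is a minimal illustration, not a billing implementation: the only figure taken from the text is the credit rate (1 credit = $0.01), while the function names and the example credit volume are hypothetical.

```python
CREDIT_USD = 0.01  # from the text: 1 Copilot AI credit = $0.01


def credits_to_usd(credits: float) -> float:
    """Convert metered usage credits to dollars at the stated rate."""
    return credits * CREDIT_USD


def monthly_cost_per_dev(seat_usd: float, credits_used: float) -> float:
    """The seat price is the floor; metered usage is added on top."""
    return seat_usd + credits_to_usd(credits_used)


# Hypothetical example: a $19 Business seat plus 15,000 credits of usage.
example = monthly_cost_per_dev(19.0, 15_000)
print(f"${example:.2f} per developer per month")
```

With these assumed inputs the usage component is roughly eight times the seat price, which is the sense in which the seat is a floor rather than the bill.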

3. Why the bill is unpredictable: the price spread is enormous

Agentic coding is token-hungry — every turn re-sends a growing context window plus tool output. But the bigger driver of cost variance is which model the tokens hit. Here are first-party list prices, mid-2026:

| Model | Input $/1M | Output $/1M | Cached input $/1M |
|---|---|---|---|
| Claude Opus 4.8 (frontier) | 5.00 | 25.00 | 0.50 |
| Claude Sonnet 4.6 | 3.00 | 15.00 | 0.30 |
| Claude Haiku 4.5 (cheap) | 1.00 | 5.00 | 0.10 |
| GPT-5.5 (frontier) | 5.00 | 30.00 | 0.50 |
| GPT-5 mini (cheap) | 0.75 | 4.50 | 0.075 |
| Gemini 3 Pro | 2.00 | 12.00 | 0.20 |
| Gemini 3 Flash-Lite | 0.25 | 1.50 | 0.025 |
| Grok 4.3 | 1.25 | 2.50 | ~0.20 |
| DeepSeek V4 (budget) | 0.14 | 0.28 | 0.003 |

First-party API list prices, standard tier, mid-2026. Cached-input = cache-read rate. Estimates noted in sources.

The leverage is in those gaps. Within one provider family, the capable cheap model is 5–8× cheaper on both input and output than the frontier model. Route all the way down to a budget model and the spread explodes:

| Routing move | Input × cheaper | Output × cheaper |
|---|---|---|
| Haiku 4.5 vs Opus 4.8 (same family) | 5× | 5× |
| GPT-5 mini vs GPT-5.5 (same family) | 6.7× | 6.7× |
| Gemini Flash-Lite vs GPT-5.5 | 20× | 20× |
| DeepSeek V4 vs GPT-5.5 (full lever) | ~36× | ~107× |

Same task, same token count — a 5× to 100× price difference, decided by a routing choice almost nobody is making deliberately. That's why a 10-dev team's monthly bill can credibly land anywhere from ~$2,600 to over $15,000.
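
The ratios above follow directly from the list prices, and the same prices turn a fixed token count into a per-task cost. The sketch below copies a subset of the price table; the function names and the example token counts are hypothetical, and it makes no attempt to reproduce the 10-dev bill range.

```python
# $ per 1M tokens as (input, output, cached input), copied from the price table.
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00, 0.50),
    "Claude Haiku 4.5": (1.00, 5.00, 0.10),
    "GPT-5.5": (5.00, 30.00, 0.50),
    "GPT-5 mini": (0.75, 4.50, 0.075),
    "Gemini 3 Flash-Lite": (0.25, 1.50, 0.025),
    "DeepSeek V4": (0.14, 0.28, 0.003),
}


def price_ratio(expensive: str, cheap: str) -> tuple[float, float]:
    """Return (input ratio, output ratio): how many times cheaper `cheap` is."""
    exp_in, exp_out, _ = PRICES[expensive]
    cheap_in, cheap_out, _ = PRICES[cheap]
    return exp_in / cheap_in, exp_out / cheap_out


def task_cost(model: str, input_tokens: int, output_tokens: int,
              cached_tokens: int = 0) -> float:
    """Dollar cost of one task. `cached_tokens` are cache reads, counted
    separately from `input_tokens` (which are billed at the full rate)."""
    p_in, p_out, p_cached = PRICES[model]
    total = input_tokens * p_in + output_tokens * p_out + cached_tokens * p_cached
    return total / 1_000_000


# Hypothetical task: 2M uncached input tokens and 200k output tokens.
for model in PRICES:
    print(f"{model:22s} ${task_cost(model, 2_000_000, 200_000):8.3f}")

in_ratio, out_ratio = price_ratio("GPT-5.5", "DeepSeek V4")
print(f"GPT-5.5 vs DeepSeek V4: {in_ratio:.0f}x input, {out_ratio:.0f}x output")
```

Because output is priced several times higher than input on most of these models, the blended ratio for a real task depends on its input/output mix, not only on the headline multiples.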

"Tiered model routing. Simple queries go to Haiku, complex reasoning goes to Opus. Most requests don't need the expensive model."— developer, r/AI_Agents, on cutting API spend ~40%

4. The good news: 30–40% of the spend is avoidable

Four levers, repeatedly validated by practitioners. The first does most of the work:

| Lever | Reported saving | Realistic / note |
|---|---|---|
| Model routing | 40–85% | up to 85% w/ ~95% quality (RouteLLM); ~30–40% typical |
| Semantic caching | 40–80% | 20–45% real hit rate; 18% exact + 47% near-dup queries |
| Prompt compression | 4–20× | tokens cut; ~1.5-pt accuracy drop (LLMLingua) |
| Prompt caching (built-in) | ~10× / token | cached input ≈ 10% of input price; 80% hit ≈ 5× cheaper session |
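
For the built-in prompt-caching row, a rough way to relate hit rate to input cost is a blended price. The sketch below assumes cached reads cost 10% of the input price (the ratio given in the table) and deliberately ignores cache-write surcharges and output tokens; the function name and the hit rates tried are illustrative.

```python
def effective_input_multiplier(hit_rate: float,
                               cached_price_fraction: float = 0.10) -> float:
    """Blended input price as a fraction of the uncached list price.

    hit_rate: share of input tokens served from cache (0 to 1).
    cached_price_fraction: cache-read price relative to the input price.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1.0 - hit_rate) + hit_rate * cached_price_fraction


for hit_rate in (0.5, 0.8, 0.9):
    multiplier = effective_input_multiplier(hit_rate)
    print(f"hit rate {hit_rate:.0%}: input costs {multiplier:.0%} of list price, "
          f"{1 / multiplier:.1f}x cheaper")
```

Under this simplified model an 80% hit rate makes input about 3.6× cheaper, and roughly an 89% hit rate is needed to reach 5×, so the table's session-level figure presumably rests on assumptions beyond the blended input price shown here.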

The honest number to plan on is ~30% — vendor maxima (85%, 80%) come from benchmark-favorable conditions; production cache-hit rates run 20–45%. But the point stands: the biggest lever is simply not paying flagship rates for routine work. That's a routing problem, and routing requires seeing spend at the task level — which almost no team can today.
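
One way to sanity-check the routing figures is a back-of-envelope formula: if a share r of spend moves to a model that is k times cheaper, the bill falls by r × (1 − 1/k). The sketch below assumes the routed work uses the same token counts on the cheaper model and that quality stays acceptable; the function name and scenarios are hypothetical.

```python
def routing_saving(routed_share: float, times_cheaper: float) -> float:
    """Fraction of the original bill saved when `routed_share` of spend
    moves to a model that is `times_cheaper` times cheaper."""
    if not 0.0 <= routed_share <= 1.0:
        raise ValueError("routed_share must be between 0 and 1")
    if times_cheaper < 1.0:
        raise ValueError("times_cheaper must be at least 1")
    return routed_share * (1.0 - 1.0 / times_cheaper)


# (share of spend routed, how many times cheaper the target model is)
scenarios = [(0.4, 5.0), (0.6, 6.7), (0.9, 20.0)]
for share, k in scenarios:
    print(f"route {share:.0%} of spend to a {k:g}x cheaper model: "
          f"save {routing_saving(share, k):.1%}")
```

Routing 40% of spend to a 5× cheaper model saves about 32%, in line with the ~30% planning figure, while an 85% saving needs roughly 90% of spend routed to a model 20× cheaper.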

The demand is real and tooled: LiteLLM has ~49k GitHub stars; OpenRouter routes ~25 trillion tokens/week (programming is now >50% of that volume).

5. The reliability tax: paying more, trusting less

Spend rose; trust fell. 84% of developers use AI tools, but trust in their accuracy dropped from ~40% to 29%; 46% actively distrust accuracy, and 66% cite "almost right but not quite" as their top frustration. The benchmarks don't help: a peer-reviewed study (ICSE 2026) found that among "solved" SWE-bench issues, 7.8% of patches pass tests but fail the developer's own suite, 29.6% behave differently from ground truth, and reported resolution rates are inflated by 6.2 points.

| Reliability / churn metric | Value | Source |
|---|---|---|
| Trust in AI accuracy | 40% → 29% | Stack Overflow 2025 |
| "Solved" SWE-bench patches that are wrong | ~30% diverge | arXiv 2503.15223 |
| Frontier-model release cadence (industry median) | ~11–17 days | Epoch AI / aggregator |
| Claude deprecation events (12 mo) | 6 events / ~8 IDs | Anthropic docs |
| Deprecation notice period | ≥60 days (Anthropic) | Anthropic docs |

With a new frontier model roughly every two weeks and ~8 Claude IDs retired in a year, teams are on a permanent, involuntary re-qualification treadmill — and because degradations are usually pushed at the runtime layer with no version bump, most teams can't even tell when quality moved.

The takeaway: the missing unit is the session

You don't govern AI spend with a ceiling. Uber reportedly tried one, capping engineers at $1,500/mo, but a cap cuts the valuable spend along with the waste. You govern it with attribution, and the right unit is the session: one task, one transcript, with its model, compute, and cost bound together.

Make every dollar explainable at the session level and three things become possible: route routine work to a model 5–8× cheaper, compare models on the same prompt instead of guessing, and see the cost of each session so you cut waste without capping the work that's working. The question shifts from "how do we spend less?" to "which spend is working?", which is what a team that wants more AI output should actually be asking. That's the bet behind Cerver.
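
A session-level record could look something like the sketch below. It is an illustration of the idea only, not Cerver's data model: the `SessionRecord` fields, the helper names, and the three example sessions are all hypothetical, and the prices are the Opus and Haiku rows from the price table.

```python
from collections import defaultdict
from dataclasses import dataclass

# $ per 1M tokens as (input, output, cached input), from the price table.
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00, 0.50),
    "Claude Haiku 4.5": (1.00, 5.00, 0.10),
}


@dataclass(frozen=True)
class SessionRecord:
    """One task, one transcript, with its model and token usage bound together."""
    session_id: str
    task: str
    model: str
    input_tokens: int       # uncached input tokens
    output_tokens: int
    cached_tokens: int = 0  # cache reads, billed at the cached-input rate


def session_cost(s: SessionRecord) -> float:
    """Dollar cost of a single session at list prices."""
    p_in, p_out, p_cached = PRICES[s.model]
    total = (s.input_tokens * p_in + s.output_tokens * p_out
             + s.cached_tokens * p_cached)
    return total / 1_000_000


def cost_by(sessions: list[SessionRecord], field: str) -> dict[str, float]:
    """Total cost grouped by any record field, e.g. 'model' or 'task'."""
    totals: dict[str, float] = defaultdict(float)
    for s in sessions:
        totals[getattr(s, field)] += session_cost(s)
    return dict(totals)


# Hypothetical sessions: the same routine task run on two models.
sessions = [
    SessionRecord("s1", "fix flaky test", "Claude Opus 4.8", 400_000, 20_000, 1_600_000),
    SessionRecord("s2", "rename variable", "Claude Opus 4.8", 100_000, 5_000),
    SessionRecord("s3", "rename variable", "Claude Haiku 4.5", 100_000, 5_000),
]
for s in sessions:
    print(f"{s.session_id} {s.task:16s} {s.model:18s} ${session_cost(s):.3f}")
print(cost_by(sessions, "model"))
```

Grouping by task rather than by model is what makes the routing question answerable: sessions s2 and s3 do the same work with the same token counts, so the only difference in cost is the model choice.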

See your number — or add it to the next edition.

Run the calculator on your team in ten seconds, or take the 4-minute survey — we'll send you the next benchmark plus where your spend lands vs. teams like yours.

Methodology & sources. Synthesizes first-party pricing pages (Anthropic, OpenAI, Google, xAI, DeepSeek), company announcements, peer-reviewed and arXiv studies, and developer/market surveys, gathered mid-2026. Model prices are standard-tier list prices and move often; the cheap-vs-frontier ratios are the durable takeaway. The Uber per-engineer figures originate with The Information via secondary reporting; treat them as reported, not confirmed. The 10-dev and cost-range figures are an illustrative model for cache-heavy agentic usage, not a measurement of a specific company. Savings ranges cite both vendor maxima (benchmark conditions) and realistic production figures; plan on the conservative end.

Primary sources: Menlo Ventures "State of Generative AI in the Enterprise" 2025; a16z "100 Enterprise CIOs"; Stack Overflow Developer Survey 2025; Gartner AI Coding Agent Market Guide; OpenRouter "State of AI" (arXiv 2601.10088); RouteLLM (LMSYS, ICLR 2025); LLMLingua (Microsoft Research); SWE-bench correctness study (arXiv 2503.15223, ICSE 2026); Anthropic model-deprecation docs; TechCrunch/CNBC/Sacra for tool ARR; Anthropic engineering postmortem (Sep 2025).