The State of AI-Coding Spend 2026

The headline numbers

$500–2k

per dev / month on AI coding at heavy teams (reported)

40–85%

cost cut from model routing in benchmarks; ~30% realistic

5–8×

cheaper to run a capable small model vs the frontier one

84%

of devs use AI tools — but trust fell 40% → 29%

37%

of enterprises run 5+ models in production (up from 29%)

~8

Claude model IDs retired in 12 months, a re-eval treadmill

Every number here is reported or modeled, the best anyone can do from the outside. Cerver measures your own team's verified spend from real tokens: measured, not surveyed.

1. The market: spend went from rounding error to line item

Enterprise generative-AI spend hit $37B in 2025, up 3.2× from $11.5B in 2024. Foundation-model API spend alone reached $12.5B. Coding is the single largest enterprise category — roughly $4B — and Anthropic, which leads it, now takes 40% of enterprise LLM spend (up from 12% in 2023) while OpenAI's share fell to 27%.

Metric	Value	Source
Enterprise gen-AI spend, 2025	$37B (3.2× YoY)	Menlo Ventures
Foundation-model API spend, 2025	$12.5B (2.0× YoY)	Menlo Ventures
Enterprise LLM API spend, ~6 mo	$3.5B → $8.4B	Menlo Ventures
AI-coding category size	~$4B	Menlo Ventures
Enterprise AI-coding-agent market	~$9.8–11B annualized	Gartner
Expected enterprise AI-budget growth, next 12 mo	~75% YoY	a16z CIO survey
Budget that's still "experimental"	25% → 7%	a16z CIO survey

Market-size scopes differ by source; figures are reported or analyst estimates as noted in the sources.

2. What teams actually pay — per seat, then per token

The subscription is only the tip of the iceberg. Tools cluster around a pattern of $10–20 for individuals, $30–40 per business seat, and $200 for power users, but seats are increasingly a floor rather than the whole bill as the category shifts to usage-based pricing.

Tool	Key tiers ($/mo)	Users	ARR / run-rate
GitHub Copilot	Pro $10 · Business $19 · Enterprise $39	~20M (4.7M paid)	—
Cursor	Pro $20 · Business $40 · Enterprise custom	>1M paying	$2B (→ $6B fcst)
Claude Code	Pro $20 · Max $100/$200 · Team ~$100/seat	—	~$2.5B
OpenAI Codex	Plus $20 · Business $30 · Pro $200	5M weekly	—
Windsurf	Pro $20 · Teams $40 · Max $200	350 ent. accts	$82M (acquired)
Devin (Cognition)	Core $20 +usage · Team $500	—	~$1B est.
Zed	Pro $10 · Business $30	—	—

The 2026 structural shift: GitHub Copilot moved to usage-based "AI credits" on June 1, 2026 (1 credit = $0.01, billed per token), and Claude Code, Codex, and Devin all stack token/ACU usage on top of seats. Flat seat prices now understate real cost — at scale, per-developer spend commonly runs $100–200+/mo, and reported heavy-user figures reach $500–$2,000/dev/mo.
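
As a rough illustration of how usage stacks on top of a seat, the sketch below adds credit consumption to a flat seat price. It uses the $0.01-per-credit rate quoted above; the seat price and the credit volume are hypothetical inputs, and real plans may include allowances or tiers that this ignores.

```python
CREDIT_USD = 0.01  # from the text: 1 credit = $0.01


def monthly_cost(seat_usd: float, credits_used: int) -> float:
    """Flat seat price plus usage billed in credits (no allowances modeled)."""
    return seat_usd + credits_used * CREDIT_USD


# Hypothetical developer: $39 seat, 15,000 credits of metered usage.
seat = 39.0
total = monthly_cost(seat, 15_000)
print(f"total ${total:.2f}/mo, seat share {seat / total:.0%}")
```

With these assumed inputs the seat is about a fifth of the bill, which is the sense in which the seat price is a floor.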

3. Why the bill is unpredictable: the price spread is enormous

Agentic coding is token-hungry — every turn re-sends a growing context window plus tool output. But the bigger driver of cost variance is which model the tokens hit. Here are first-party list prices, mid-2026:

Model	Input $/1M	Output $/1M	Cached input $/1M
Claude Opus 4.8 (frontier)	5.00	25.00	0.50
Claude Sonnet 4.6	3.00	15.00	0.30
Claude Haiku 4.5 (cheap)	1.00	5.00	0.10
GPT-5.5 (frontier)	5.00	30.00	0.50
GPT-5 mini (cheap)	0.75	4.50	0.075
Gemini 3 Pro	2.00	12.00	0.20
Gemini 3 Flash-Lite	0.25	1.50	0.025
Grok 4.3	1.25	2.50	~0.20
DeepSeek V4 (budget)	0.14	0.28	0.003

First-party API list prices, standard tier, mid-2026. Cached-input = cache-read rate. Estimates noted in sources.
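
To see why re-sending a growing context matters, here is a minimal sketch that totals tokens across one session and prices them at list rates from the table. The turn count, context sizes, and output sizes are illustrative assumptions, and the sketch ignores prompt caching.

```python
# List prices from the table, USD per 1M tokens: (input, output).
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "DeepSeek V4": (0.14, 0.28),
}


def session_tokens(turns, base_context, growth_per_turn, output_per_turn):
    """Total (input, output) tokens when every turn re-sends the full context."""
    total_in = sum(base_context + t * growth_per_turn for t in range(turns))
    total_out = turns * output_per_turn
    return total_in, total_out


def session_cost(model, tokens_in, tokens_out):
    """Session cost in USD at list price, no caching."""
    price_in, price_out = PRICES[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


# Assumed session: 30 turns, 10k starting context, +3k tokens per turn, 800 output tokens per turn.
tokens_in, tokens_out = session_tokens(30, 10_000, 3_000, 800)
print(f"input tokens: {tokens_in:,}  output tokens: {tokens_out:,}")
for model in PRICES:
    print(f"{model}: ${session_cost(model, tokens_in, tokens_out):.3f}")
```

Under these assumptions the session sends about 1.6M input tokens for 24k output tokens, and costs roughly $8.63 on Opus 4.8, $1.73 on Haiku 4.5, and $0.23 on DeepSeek V4.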

The leverage is in those gaps. Within one provider family, the capable cheap model is 5–8× cheaper on both input and output than the frontier model. Route all the way down to a budget model and the spread explodes:

Routing move	Input ×cheaper	Output ×cheaper
Haiku 4.5 vs Opus 4.8 (same family)	5×	5×
GPT-5 mini vs GPT-5.5 (same family)	6.7×	6.7×
Gemini Flash-Lite vs GPT-5.5	20×	20×
DeepSeek V4 vs GPT-5.5 (full lever)	~36×	~107×

Same task, same token count — a 5× to 100× price difference, decided by a routing choice almost nobody is making deliberately. That's why a 10-dev team's monthly bill can credibly land anywhere from ~$2,600 to over $15,000.
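
The multipliers in the routing table follow directly from the list prices. A short sketch that recomputes them (only the function name is introduced here; the prices are the ones listed above):

```python
# List prices from the pricing table, USD per 1M tokens: (input, output).
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "GPT-5.5": (5.00, 30.00),
    "GPT-5 mini": (0.75, 4.50),
    "Gemini 3 Flash-Lite": (0.25, 1.50),
    "DeepSeek V4": (0.14, 0.28),
}


def times_cheaper(cheap: str, expensive: str) -> tuple[float, float]:
    """Return (input ratio, output ratio) of the expensive price over the cheap price."""
    cheap_in, cheap_out = PRICES[cheap]
    exp_in, exp_out = PRICES[expensive]
    return exp_in / cheap_in, exp_out / cheap_out


MOVES = [
    ("Claude Haiku 4.5", "Claude Opus 4.8"),
    ("GPT-5 mini", "GPT-5.5"),
    ("Gemini 3 Flash-Lite", "GPT-5.5"),
    ("DeepSeek V4", "GPT-5.5"),
]
for cheap, expensive in MOVES:
    r_in, r_out = times_cheaper(cheap, expensive)
    print(f"{cheap} vs {expensive}: input {r_in:.1f}x, output {r_out:.1f}x")
```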

"Tiered model routing. Simple queries go to Haiku, complex reasoning goes to Opus. Most requests don't need the expensive model."— developer, r/AI_Agents, on cutting API spend ~40%

4. The good news: 30–40% of the spend is avoidable

Four levers, repeatedly validated by practitioners. The first does most of the work:

Lever	Reported saving	Realistic / note
Model routing	40–85%	up to 85% w/ ~95% quality (RouteLLM); ~30–40% typical
Semantic caching	40–80%	20–45% real hit rate; 18% exact + 47% near-dup queries
Prompt compression	4–20×	tokens cut; ~1.5-pt accuracy drop (LLMLingua)
Prompt caching (built-in)	~10× / token	cached input ≈ 10% of input price; 80% hit ≈ 3.6× cheaper input

The honest number to plan on is ~30%. Vendor maxima (85%, 80%) come from benchmark-favorable conditions, and production cache-hit rates run 20–45%. But the point stands: the biggest lever is simply not paying flagship rates for routine work. That's a routing problem, and routing requires seeing spend at the task level, which almost no team can do today.
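
The arithmetic behind these planning numbers is simple enough to sketch. The two functions below are illustrative helpers, not a vendor formula: the first assumes the routed work keeps acceptable quality and that the cheaper model uses the same token counts, and the second prices input tokens only, with cache reads at 10% of list price as in the table.

```python
def routing_saving(share_routed: float, times_cheaper: float) -> float:
    """Fraction of total spend saved when `share_routed` of spend moves to a
    model that is `times_cheaper` times cheaper for the same tokens."""
    return share_routed * (1 - 1 / times_cheaper)


def cached_input_multiplier(hit_rate: float, cache_read_ratio: float = 0.10) -> float:
    """Blended input price as a fraction of list price at a given cache-hit rate."""
    return (1 - hit_rate) + hit_rate * cache_read_ratio


# Routing to a same-family model that is 5x cheaper.
for share in (0.375, 0.5, 1.0):
    print(f"route {share:.1%} of spend: save {routing_saving(share, 5):.0%}")

# Prompt caching at an 80% hit rate.
blended = cached_input_multiplier(0.80)
print(f"blended input price: {blended:.2f} of list ({1 / blended:.1f}x cheaper)")
```

Under these assumptions, moving a bit over a third of spend to a 5× cheaper model is enough to reach a 30% saving, and moving half reaches 40%.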

Demand is real, and the tooling exists: LiteLLM has ~49k GitHub stars, and OpenRouter routes ~25 trillion tokens/week (programming is now >50% of that volume).
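
A minimal sketch of tiered routing in the spirit of the developer quote above: simple requests go to the cheap tier, and anything that looks complex goes to the frontier tier. The keyword list, the context threshold, and the tier labels are all hypothetical. The keyword heuristic is a crude stand-in for the learned routers that projects such as RouteLLM train.

```python
CHEAP_TIER = "cheap"        # e.g. a Haiku-class model (label only, not an API model ID)
FRONTIER_TIER = "frontier"  # e.g. an Opus-class model (label only, not an API model ID)

# Hypothetical signals that a request needs deeper reasoning.
COMPLEX_HINTS = ("refactor", "architecture", "debug", "race condition", "migrate")


def pick_tier(prompt: str, context_tokens: int, max_cheap_context: int = 20_000) -> str:
    """Choose a model tier from the prompt text and the size of its context."""
    if context_tokens > max_cheap_context:
        return FRONTIER_TIER  # large contexts are assumed to need the stronger model
    text = prompt.lower()
    if any(hint in text for hint in COMPLEX_HINTS):
        return FRONTIER_TIER
    return CHEAP_TIER


print(pick_tier("Rename this variable to snake_case", 2_000))
print(pick_tier("Debug the race condition in the scheduler", 2_000))
print(pick_tier("Summarize this file", 50_000))
```

Whatever the classifier, the design choice worth keeping is that the routing decision is explicit and logged, so its effect on cost and quality can be checked afterwards.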

5. The reliability tax: paying more, trusting less

Spend rose; trust fell. 84% of developers use AI tools, but trust in their accuracy dropped from ~40% to 29%; 46% actively distrust accuracy, and 66% cite "almost right but not quite" as their top frustration. The benchmarks don't help: a peer-reviewed study (ICSE 2026) found that among "solved" SWE-bench issues, 7.8% of patches pass the benchmark's tests but fail the developer's own test suite, 29.6% behave differently from the ground-truth patch, and reported resolution rates are inflated by 6.2 points.

Reliability / churn metric	Value	Source
Trust in AI accuracy	40% → 29%	Stack Overflow 2025
"Solved" SWE-bench patches that are wrong	~30% diverge	arXiv 2503.15223
Frontier-model release cadence (industry median)	~11–17 days	Epoch AI / aggregator
Claude deprecation events (12 mo)	6 events / ~8 IDs	Anthropic docs
Deprecation notice period	≥60 days (Anthropic)	Anthropic docs

With a new frontier model roughly every two weeks and ~8 Claude IDs retired in a year, teams are on a permanent, involuntary re-qualification treadmill — and because degradations are usually pushed at the runtime layer with no version bump, most teams can't even tell when quality moved.
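
One way to get ahead of the treadmill is to track announced retirement dates against the models a team actually uses. The sketch below flags any in-use model whose retirement falls inside a 60-day window, matching the notice period in the table. The model IDs and dates are made-up placeholders, not real deprecation data.

```python
from datetime import date


def needs_requalification(in_use, retirements, today, lead_days=60):
    """Return in-use model IDs retiring within `lead_days` of `today` (or already retired)."""
    flagged = []
    for model_id in in_use:
        retire_on = retirements.get(model_id)
        if retire_on is not None and (retire_on - today).days <= lead_days:
            flagged.append(model_id)
    return flagged


# Hypothetical schedule: placeholder IDs and dates for illustration only.
retirements = {
    "model-a": date(2026, 7, 15),
    "model-b": date(2026, 12, 1),
}
in_use = ["model-a", "model-b", "model-c"]
print(needs_requalification(in_use, retirements, today=date(2026, 6, 1)))
```

This only catches announced retirements. Silent quality changes with no version bump need a separate check, such as re-running a fixed evaluation set on a schedule.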

The takeaway: the missing unit is the session

You don't govern AI spend with a ceiling. Uber tried, capping engineers at $1,500/mo, but a cap cuts the valuable spend along with the waste. You govern it with attribution, and the right unit is the session: one task, one transcript, with its model, compute, and cost bound together.

Make every dollar explainable at the session level and three things become possible: route routine work to a model 5–8× cheaper, compare models on the same prompt instead of guessing, and see the cost of each session so you cut waste without capping the work that's working. The question shifts from "how do we spend less?" to "which spend is working?" — which is what a team that wants more AI output should actually be asking. That's the bet behind cerver.
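
To make the session unit concrete, here is a minimal sketch of a session record that binds task, model, tokens, and cost together, plus a roll-up by any field. The field names and the sample sessions are hypothetical, and this is not a description of how cerver stores data. Prices are the list rates from the pricing table.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    session_id: str
    task: str
    model: str
    input_tokens: int
    output_tokens: int
    price_in: float   # USD per 1M input tokens
    price_out: float  # USD per 1M output tokens

    @property
    def cost_usd(self) -> float:
        return (self.input_tokens * self.price_in
                + self.output_tokens * self.price_out) / 1_000_000


def cost_by(sessions, field: str) -> dict:
    """Sum session cost grouped by a Session attribute such as 'task' or 'model'."""
    totals = defaultdict(float)
    for s in sessions:
        totals[getattr(s, field)] += s.cost_usd
    return dict(totals)


# Hypothetical sessions: the same task run on two models, plus one routine task.
sessions = [
    Session("s1", "fix-flaky-test", "Claude Haiku 4.5", 400_000, 10_000, 1.00, 5.00),
    Session("s2", "fix-flaky-test", "Claude Opus 4.8", 400_000, 10_000, 5.00, 25.00),
    Session("s3", "rename-module", "Claude Opus 4.8", 200_000, 4_000, 5.00, 25.00),
]
for field in ("task", "model"):
    print(field, {k: round(v, 2) for k, v in cost_by(sessions, field).items()})
```

With records like these, the questions above become queries: which tasks ran on a frontier model, what they cost, and what the same prompt cost on a cheaper one.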