The headline numbers
1. The market: spend went from rounding error to line item
Enterprise generative-AI spend hit $37B in 2025, up 3.2× from $11.5B in 2024. Foundation-model API spend alone reached $12.5B. Coding is the single largest enterprise category — roughly $4B — and Anthropic, which leads it, now takes 40% of enterprise LLM spend (up from 12% in 2023) while OpenAI's share fell to 27%.
| Metric | Value | Source |
|---|---|---|
| Enterprise gen-AI spend, 2025 | $37B (3.2× YoY) | Menlo Ventures |
| Foundation-model API spend, 2025 | $12.5B (2.0× YoY) | Menlo Ventures |
| Enterprise LLM API spend, ~6 mo | $3.5B → $8.4B | Menlo Ventures |
| AI-coding category size | ~$4B | Menlo Ventures |
| Enterprise AI-coding-agent market | ~$9.8–11B annualized | Gartner |
| Expected enterprise AI-budget growth, next 12 mo | ~75% YoY | a16z CIO survey |
| Budget that's still "experimental" | 25% → 7% | a16z CIO survey |
Market-size scopes differ by source; figures are reported or analyst estimates as noted in the sources.
2. What teams actually pay — per seat, then per token
The subscription is the tip. Tools cluster around a $10–20 individual / $30–40 business-seat / $200 power-user pattern — but seats are increasingly a floor, not the bill, as the whole category shifts to usage-based pricing.
| Tool | Key tiers ($/mo) | Users | ARR / run-rate |
|---|---|---|---|
| GitHub Copilot | Pro $10 · Business $19 · Enterprise $39 | ~20M (4.7M paid) | — |
| Cursor | Pro $20 · Business $40 · Enterprise custom | >1M paying | $2B (→ $6B fcst) |
| Claude Code | Pro $20 · Max $100/$200 · Team ~$100/seat | — | ~$2.5B |
| OpenAI Codex | Plus $20 · Business $30 · Pro $200 | 5M weekly | — |
| Windsurf | Pro $20 · Teams $40 · Max $200 | 350 ent. accts | $82M (acquired) |
| Devin (Cognition) | Core $20 +usage · Team $500 | — | ~$1B est. |
| Zed | Pro $10 · Business $30 | — | — |
The 2026 structural shift: GitHub Copilot moved to usage-based "AI credits" on June 1, 2026 (1 credit = $0.01, billed per token), and Claude Code, Codex, and Devin all stack token/ACU usage on top of seats. Flat seat prices now understate real cost — at scale, per-developer spend commonly runs $100–200+/mo, and reported heavy-user figures reach $500–$2,000/dev/mo.
3. Why the bill is unpredictable: the price spread is enormous
Agentic coding is token-hungry — every turn re-sends a growing context window plus tool output. But the bigger driver of cost variance is which model the tokens hit. Here are first-party list prices, mid-2026:
| Model | Input $/1M | Output $/1M | Cached in $/1M |
|---|---|---|---|
| Claude Opus 4.8 (frontier) | 5.00 | 25.00 | 0.50 |
| Claude Sonnet 4.6 | 3.00 | 15.00 | 0.30 |
| Claude Haiku 4.5 (cheap) | 1.00 | 5.00 | 0.10 |
| GPT-5.5 (frontier) | 5.00 | 30.00 | 0.50 |
| GPT-5 mini (cheap) | 0.75 | 4.50 | 0.075 |
| Gemini 3 Pro | 2.00 | 12.00 | 0.20 |
| Gemini 3 Flash-Lite | 0.25 | 1.50 | 0.025 |
| Grok 4.3 | 1.25 | 2.50 | ~0.20 |
| DeepSeek V4 (budget) | 0.14 | 0.28 | 0.003 |
First-party API list prices, standard tier, mid-2026. Cached-input = cache-read rate. Estimates noted in sources.
The leverage is in those gaps. Within one provider family, the capable cheap model is 5–8× cheaper on both input and output than the frontier model. Route all the way down to a budget model and the spread explodes:
| Routing move | Input ×cheaper | Output ×cheaper |
|---|---|---|
| Haiku 4.5 vs Opus 4.8 (same family) | 5× | 5× |
| GPT-5 mini vs GPT-5.5 (same family) | 6.7× | 6.7× |
| Gemini Flash-Lite vs GPT-5.5 | 20× | 20× |
| DeepSeek V4 vs GPT-5.5 (full lever) | ~36× | ~107× |
Same task, same token count — a 5× to 100× price difference, decided by a routing choice almost nobody is making deliberately. That's why a 10-dev team's monthly bill can credibly land anywhere from ~$2,600 to over $15,000.
4. The good news: 30–40% of the spend is avoidable
Three levers, repeatedly validated by practitioners. The first does most of the work:
| Lever | Reported saving | Realistic / note |
|---|---|---|
| Model routing | 40–85% | up to 85% w/ ~95% quality (RouteLLM); ~30–40% typical |
| Semantic caching | 40–80% | 20–45% real hit rate; 18% exact + 47% near-dup queries |
| Prompt compression | 4–20× | tokens cut; ~1.5-pt accuracy drop (LLMLingua) |
| Prompt caching (built-in) | ~10× / token | cached input ≈ 10% of input price; 80% hit ≈ 5× cheaper session |
The honest number to plan on is ~30% — vendor maxima (85%, 80%) come from benchmark-favorable conditions; production cache-hit rates run 20–45%. But the point stands: the biggest lever is simply not paying flagship rates for routine work. That's a routing problem, and routing requires seeing spend at the task level — which almost no team can today.
The demand is real and tooled: LiteLLM has ~49k GitHub stars; OpenRouter routes ~25 trillion tokens/week (programming is now >50% of that volume).
5. The reliability tax: paying more, trusting less
Spend rose; trust fell. 84% of developers use AI tools, but trust in their accuracy dropped from ~40% to 29%; 46% actively distrust accuracy, and 66% cite "almost right but not quite" as their top frustration. The benchmarks don't help: a peer-reviewed study (ICSE 2026) found that among "solved" SWE-bench issues, 7.8% of patches pass tests but fail the developer's own suite, 29.6% behave differently from ground truth, and reported resolution rates are inflated by 6.2 points.
| Reliability / churn metric | Value | Source |
|---|---|---|
| Trust in AI accuracy | 40% → 29% | Stack Overflow 2025 |
| "Solved" SWE-bench patches that are wrong | ~30% diverge | arXiv 2503.15223 |
| Frontier-model release cadence (industry median) | ~11–17 days | Epoch AI / aggregator |
| Claude deprecation events (12 mo) | 6 events / ~8 IDs | Anthropic docs |
| Deprecation notice period | ≥60 days (Anthropic) | Anthropic docs |
With a new frontier model roughly every two weeks and ~8 Claude IDs retired in a year, teams are on a permanent, involuntary re-qualification treadmill — and because degradations are usually pushed at the runtime layer with no version bump, most teams can't even tell when quality moved.
The takeaway: the missing unit is the session
You don't govern AI spend with a ceiling — Uber tried, capping engineers at $1,500/mo, and it caps the valuable spend along with the waste. You govern it with attribution, and the right unit is the session: one task, one transcript, with its model, compute, and cost bound together.
Make every dollar explainable at the session level and three things become possible: route routine work to a model 5–8× cheaper, compare models on the same prompt instead of guessing, and see the cost of each session so you cut waste without capping the work that's working. The question shifts from "how do we spend less?" to "which spend is working?" — which is what a team that wants more AI output should actually be asking. That's the bet behind cerver.