Cost & ROI
The loops era has a cost problem. Open-source models are 40% of the answer.
Gemma 4's 12B runs on a 16GB laptop. Mistral keeps shipping startlingly good small models. Meanwhile your agents burn frontier tokens on turns a 7B could handle. The gap between those two facts is most of your bill.
Two things are true at once this season. Frontier pricing keeps climbing — Fable 5 launched at $10/$50 per million tokens, double Opus. And open-weight small models quietly crossed the only line that matters: good enough for the turn in front of them. Gemma 4 ships Apache-2.0 in sizes from 2B to 31B, with the 12B running on an ordinary 16GB laptop; Mistral's latest small model folds instruct, reasoning, and coding into one checkpoint with a 256k window. These are no longer toys.
Agents aren't requests. They're loops.
The cost structure of AI changed when the unit of work became the loop: plan → edit → run tests → read the error → retry. A single task is dozens or hundreds of turns, and every turn re-sends the growing context. Now look at what those turns actually are. A handful need real intelligence — the plan, the tricky diagnosis, the design call. The rest are mechanical: parse this traceback, rename that symbol, summarize the diff, decide whether the tests passed. Frontier models charge frontier prices for grep-shaped work.
Paying $50 per million output tokens to read a test log is like hiring a surgeon to take your temperature — on a retainer that bills per minute, in a hospital where temperatures get taken four hundred times a night.
The arithmetic of the inner loop
Take a 200-turn agentic task. Routed entirely to a frontier model, the mechanical middle of that loop dominates the bill. Route those turns to a small open model instead and the line item collapses: hosted open-weight inference runs one to two orders of magnitude cheaper — and on your own hardware, the marginal cost is electricity. A Gemma-4-12B on the laptop your developer already owns prices the inner loop at approximately zero. This is the same subscription-arbitrage logic that makes Claude Max capacity free at the margin, extended one ring further out: the cheapest token is the one served by silicon you already paid for.
The mix, with the third band filled in
We keep drawing the same picture: of 10,000 monthly sessions, roughly 25% earn the frontier model, 35% are fine on mid-tier, and 40% would not get worse on an open model. Until recently that third band was theoretical — the open models weren't reliable enough for unattended turns. Gemma 4 and the current Mistral generation are what changed. The 40% band now has inhabitants, and they cost approximately nothing.
What belongs in it: inner-loop mechanical turns, classification and triage, summarization, retry-and-check cycles, high-volume customer chat, and every cron that's been running on Opus since March because nobody looked.
The missing piece was never the model
It's the routing. Nobody hand-picks a model per turn — which is why everything runs on the expensive default. That's exactly what a routing policy is for: a rule like "inner-loop turns and chat go to the open tier; escalate after two failures; anything tagged production gets the frontier" — written once, enforced on every session. The session layer is model-agnostic by design, so as open models keep improving, upgrading your mix is a policy edit, not a migration. Hosted open-weight endpoints route today; local-first — your agents' inner loops running on your own machines' silicon — is where this is headed, and the economics make it inevitable.
The takeaway
Frontier models will keep winning the hard 25%. But the loops era multiplies the easy 75%, and open source just got good enough to take the biggest slice of it. The teams that wire that routing up this year will quietly run agent fleets at a third of their neighbors' cost — same output, smaller invoice, and a laptop doing what a $50/M model used to.
Make the 40% band real.
See your actual model mix on the dashboard, then set the policy that bends it — rules you write, or one line of auto.