The loops era has a cost problem. Open-source models are 40% of the answer.

Two things are true at once this season. Frontier pricing keeps climbing: Fable 5 launched at $10/$50 per million input/output tokens, double Opus's price. And open-weight small models quietly crossed the only line that matters: good enough for the turn in front of them. Gemma 4 ships under Apache-2.0 in sizes from 2B to 31B, with the 12B running on an ordinary 16GB laptop; Mistral's latest small model folds instruct, reasoning, and coding into one checkpoint with a 256k-token context window. These are no longer toys.

Agents aren't requests. They're loops.

The cost structure of AI changed when the unit of work became the loop: plan → edit → run tests → read the error → retry. A single task is dozens or hundreds of turns, and every turn re-sends the growing context. Now look at what those turns actually are. A handful need real intelligence — the plan, the tricky diagnosis, the design call. The rest are mechanical: parse this traceback, rename that symbol, summarize the diff, decide whether the tests passed. Frontier models charge frontier prices for grep-shaped work.

Paying $50 per million output tokens to read a test log is like hiring a surgeon to take your temperature — on a retainer that bills per minute, in a hospital where temperatures get taken four hundred times a night.

The arithmetic of the inner loop

Take a 200-turn agentic task. Routed entirely to a frontier model, the mechanical middle of that loop dominates the bill. Route those turns to a small open model instead and the line item collapses: hosted open-weight inference runs one to two orders of magnitude cheaper — and on your own hardware, the marginal cost is electricity. A Gemma-4-12B on the laptop your developer already owns prices the inner loop at approximately zero. This is the same subscription-arbitrage logic that makes Claude Max capacity free at the margin, extended one ring further out: the cheapest token is the one served by silicon you already paid for.
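To make that arithmetic concrete, here is a rough sketch in Python. Only the $10/$50 frontier price and the 200-turn task come from the text above. The per-turn token counts, the number of "hard" turns, and the hosted open-weight price (set here at 100x cheaper, the top of the "one to two orders of magnitude" range) are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


def turn_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single turn at the given price."""
    return (input_tokens * price.input_per_m
            + output_tokens * price.output_per_m) / 1_000_000


def task_cost(turns: int, hard_turns: int, hard: Price, easy: Price,
              input_tokens: int, output_tokens: int) -> float:
    """Cost of a task where `hard_turns` use `hard` and the rest use `easy`."""
    easy_turns = turns - hard_turns
    return (hard_turns * turn_cost(hard, input_tokens, output_tokens)
            + easy_turns * turn_cost(easy, input_tokens, output_tokens))


FRONTIER = Price(10.0, 50.0)     # from the article
OPEN_HOSTED = Price(0.10, 0.50)  # ASSUMPTION: 100x cheaper than frontier

TURNS = 200            # from the article
HARD_TURNS = 20        # ASSUMPTION: turns that need real intelligence
INPUT_TOKENS = 20_000  # ASSUMPTION: average context re-sent per turn
OUTPUT_TOKENS = 500    # ASSUMPTION: average tokens generated per turn

all_frontier = task_cost(TURNS, TURNS, FRONTIER, OPEN_HOSTED,
                         INPUT_TOKENS, OUTPUT_TOKENS)
routed = task_cost(TURNS, HARD_TURNS, FRONTIER, OPEN_HOSTED,
                   INPUT_TOKENS, OUTPUT_TOKENS)

print(f"all frontier: ${all_frontier:.2f}")
print(f"routed:       ${routed:.2f}")
```

Under these assumptions the all-frontier task costs roughly $45 and the routed one roughly $5, with nearly all of the remaining spend sitting in the 20 hard turns. Swap in your own token counts and the ratio moves, but the shape holds as long as most turns are mechanical.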

The mix, with the third band filled in

We keep drawing the same picture: of 10,000 monthly sessions, roughly 25% earn the frontier model, 35% are fine on mid-tier, and 40% would not get worse on an open model. Until recently that third band was theoretical — the open models weren't reliable enough for unattended turns. Gemma 4 and the current Mistral generation are what changed. The 40% band now has inhabitants, and they cost approximately nothing.

What belongs in it: inner-loop mechanical turns, classification and triage, summarization, retry-and-check cycles, high-volume customer chat, and every cron job that's been running on Opus since March because nobody looked.
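The 25/35/40 mix above can be turned into a blended-cost estimate. In this sketch the shares and the 10,000 sessions come from the text. The relative per-session costs are placeholder assumptions (frontier normalized to 1.0, mid-tier at 0.30, open at 0.02) and should be replaced with your own numbers.

```python
# Shares of monthly sessions per tier (from the article).
MIX = {"frontier": 0.25, "mid": 0.35, "open": 0.40}

# ASSUMPTION: per-session cost relative to an all-frontier session.
# These ratios are illustrative placeholders, not published prices.
RELATIVE_COST = {"frontier": 1.00, "mid": 0.30, "open": 0.02}


def sessions_per_tier(total_sessions: int, mix: dict) -> dict:
    """Split a session count across tiers according to the mix."""
    return {tier: round(total_sessions * share) for tier, share in mix.items()}


def blended_cost(mix: dict, relative_cost: dict) -> float:
    """Bill as a fraction of what an all-frontier fleet would pay."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1")
    return sum(share * relative_cost[tier] for tier, share in mix.items())


print(sessions_per_tier(10_000, MIX))
print(f"blended cost vs all-frontier: {blended_cost(MIX, RELATIVE_COST):.2f}")
```

With these placeholder ratios the blended bill lands at about 0.36 of the all-frontier bill. The result is most sensitive to the mid-tier ratio, since the open band contributes almost nothing either way.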

The missing piece was never the model

It's the routing. Nobody hand-picks a model per turn, which is why everything runs on the expensive default. That's exactly what a routing policy is for: a rule like "inner-loop turns and chat go to the open tier; escalate after two failures; anything tagged production gets the frontier," written once and enforced on every session. The session layer is model-agnostic by design, so as open models keep improving, upgrading your mix is a policy edit, not a migration. Routing to hosted open-weight endpoints works today; local-first, with your agents' inner loops running on your own machines, is where this is headed, and the economics make it inevitable.
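The example rule quoted above is small enough to sketch as a function. This is a hypothetical illustration, not any product's actual policy format: the `Turn` type, the turn-kind labels, and the tier names are invented for the example. Two behaviors are assumptions the rule leaves open: escalation moves one tier up for every two failures, and any turn the rule doesn't mention stays on the frontier default.

```python
from dataclasses import dataclass, field

# Tiers ordered from cheapest to most capable (hypothetical names).
TIERS = ["open", "mid", "frontier"]

# Turn kinds the quoted rule sends to the open tier (hypothetical labels).
OPEN_TIER_KINDS = {"inner_loop", "chat"}


@dataclass
class Turn:
    kind: str                               # e.g. "inner_loop", "chat", "plan"
    tags: set = field(default_factory=set)  # e.g. {"production"}
    failures: int = 0                       # failed attempts so far


def route(turn: Turn) -> str:
    """Pick a tier for one turn under the example policy."""
    # Anything tagged production gets the frontier, no exceptions.
    if "production" in turn.tags:
        return "frontier"

    # Inner-loop turns and chat start on the open tier.
    # ASSUMPTION: every other kind keeps the frontier default.
    tier = "open" if turn.kind in OPEN_TIER_KINDS else "frontier"

    # ASSUMPTION: escalate one tier for every two failures, capped at the top.
    steps = turn.failures // 2
    index = min(TIERS.index(tier) + steps, len(TIERS) - 1)
    return TIERS[index]


print(route(Turn("inner_loop")))                     # starts on the open tier
print(route(Turn("inner_loop", failures=2)))         # escalated one tier
print(route(Turn("chat", tags={"production"})))      # production override
```

Because the policy is a pure function of the turn, changing the mix as open models improve means editing `OPEN_TIER_KINDS` or the escalation step, which is the "policy edit, not a migration" point in practice.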

The takeaway

Frontier models will keep winning the hard 25%. But the loops era multiplies the easy 75%, and open source just got good enough to take the biggest slice of it. The teams that wire that routing up this year will quietly run agent fleets at a third of their neighbors' cost — same output, smaller invoice, and a laptop doing what a $50/M model used to.