Inference: The AI bottleneck you need to be planning for

Essay, 2026-07-01

Every model upgrade quietly changes your unit economics, and almost nobody in the business is watching.

Inference is becoming one of the biggest constraints in AI, and almost nobody outside the frontier labs is planning for it. As adoption spreads across organisations and teams, the gap I keep running into isn't capability. It's cost. People are building genuinely good workflows: agents that plan, execute, and hand off to sub-agents, running on schedules, day after day. Almost none of them are asking what those workflows actually cost to run at scale, or how that cost shifts every time the underlying model changes.

That's not a small oversight. It decides whether AI adoption in a business is durable or a slow-motion accident. You can be excellent at designing a workflow that produces genuinely useful output and still be building a bottleneck, because if nobody in the business understands what that output costs in tokens, the cost only becomes visible once it's already a problem.

Why the economics are about to bite

The frontier labs are heading toward IPOs. Enterprise pricing is being restructured. The VC subsidy that has made frontier model access feel almost free is not going to hold forever. When it lifts, the relationship most people have with AI changes, and not only in workflow design or ROI terms. It changes at the level of inference itself: the tokens going in, the tokens coming out, and whether anyone in the business actually understands that trade.

This is the part I don't think has landed yet. Teams have got comfortable measuring AI in terms of what it produces. Very few are measuring it in terms of what it costs to produce, and fewer still are watching how that cost moves with every model release.

The mistake I keep seeing

The pattern is people getting excited about the newest, most powerful model the moment it ships, and setting it as the default for everything. Take Fable: it can produce genuinely strong output and act as a stand-in for skills a team doesn't have yet. But that power comes at a cost. It's a slow model, hugely token-hungry, and if you set it as the default for everyday work, you're building a cost spiral into the organisation without meaning to.

We're seeing the same pattern with Sonnet 5. It offers near-Opus intelligence, but it's slower than its predecessor, Sonnet 4.6, it delegates tasks to sub-agents in the background, and it has reportedly been using significantly more tokens, and costing more to run, as a result. Each of those properties is invisible until you go looking for it, and each one multiplies the bill.

What good practice actually looks like

Every time a model updates, the right question isn't "is this the best one available?" It's "which of the available models is the right one for this specific job?" Once a workflow is built, test it against the full range: does a faster, lighter, flash-tier model produce comparable output for less? If the answer is no, can the thinking and planning stay on a stronger model while execution moves to something quicker and cheaper? Most workflows split cleanly along that line once you actually look for it.
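To make that test concrete, here is a minimal sketch of the selection step, assuming you have already run the same workflow on each candidate model and recorded a cost per run and a quality score from your own evaluation. The tier names, costs, and scores below are hypothetical placeholders, not measurements of any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str            # hypothetical tier label, not a real product name
    cost_per_run: float  # measured cost of one workflow run, any currency
    quality: float       # score from your own evaluation, 0.0 to 1.0


def cheapest_acceptable(candidates, min_quality):
    """Return the lowest-cost candidate that meets the quality bar, or None."""
    acceptable = [c for c in candidates if c.quality >= min_quality]
    if not acceptable:
        return None
    return min(acceptable, key=lambda c: c.cost_per_run)


# Illustrative numbers only: replace with results from your own test runs.
candidates = [
    Candidate("frontier", cost_per_run=0.40, quality=0.95),
    Candidate("mid", cost_per_run=0.08, quality=0.91),
    Candidate("flash", cost_per_run=0.01, quality=0.78),
]

choice = cheapest_acceptable(candidates, min_quality=0.90)
print(choice.name)  # mid
```

The point of the sketch is the order of the questions: set the quality bar first, then pick on cost. If no single model clears the bar cheaply, the same comparison can be run separately for the planning step and the execution step.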

Scheduled tasks are where this compounds fastest. A workflow that costs a few pence run once by a person becomes something else entirely when it's set to run on a schedule, multiplied across every hour, every day, without anyone reviewing the bill until it lands. That's the exact mechanism by which a single well-intentioned automation quietly becomes the most expensive thing in the business.
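The arithmetic is simple enough to sketch. The figures below are assumptions chosen for illustration (three pence per run, an hourly schedule, twenty similar automations), not data from any real deployment.

```python
def scheduled_cost_pence(cost_per_run_pence, runs_per_day, days, workflows=1):
    """Total cost in pence of running a workflow on a fixed schedule."""
    return cost_per_run_pence * runs_per_day * days * workflows


# Hypothetical: a workflow that costs 3p when a person runs it once.
one_workflow = scheduled_cost_pence(3, runs_per_day=24, days=365)
print(f"£{one_workflow / 100:.2f}")  # £262.80

# The same workflow pattern copied across twenty scheduled automations.
twenty_workflows = scheduled_cost_pence(3, runs_per_day=24, days=365, workflows=20)
print(f"£{twenty_workflows / 100:.2f}")  # £5256.00
```

Nothing in that calculation is surprising. The problem is that nobody runs it before switching the schedule on.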

None of that works if people don't understand what they're spending. Employees need to know how input tokens work, how output tokens work, and how scheduled or agentic tasks magnify cost far faster than a single one-off prompt does. They need to see, with every model release, what the new pricing actually does to the business, not assume that better intelligence automatically means better value.
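As a starting point for that conversation, here is a rough sketch of how a per-call cost is usually built up, assuming per-million-token pricing with separate input and output rates. The prices, token counts, and the twelve-step agent run are all hypothetical. The assumption that each agent step re-sends a large context as input is a simplification, and real behaviour varies by tool and provider.

```python
def call_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one model call, given prices per million tokens."""
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000


# Hypothetical prices: 3.00 per million input tokens, 15.00 per million output.
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

# A single one-off prompt: short question, short answer.
single = call_cost(2_000, 500, INPUT_PRICE, OUTPUT_PRICE)
print(f"{single:.4f}")  # 0.0135

# An agentic task, assumed here to take 12 steps, each re-sending
# 20,000 tokens of accumulated context and producing 1,000 tokens of output.
steps = 12
agentic = steps * call_cost(20_000, 1_000, INPUT_PRICE, OUTPUT_PRICE)
print(f"{agentic:.4f}")  # 0.9000
```

Under these assumed numbers the agentic task costs dozens of times more than the one-off prompt, and most of that comes from input tokens being sent again at every step. Changing the model changes both prices and, often, the number of steps.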

Inference is where AI economics gets real. The organisations paying attention to it now are the ones that will still be running these workflows profitably in two years. The rest are going to find out the hard way, one invoice at a time.
