The Loop Tax: Why AI Agents Are So Expensive

David Crowe · June 26, 2026 · 5 min read

cost agents optimization governance

These figures are from our June 2026 snapshot (210,840 calls). The refreshed July corpus (285,814 calls) shows the same pattern more starkly — 97% of spend in the orchestration loop; see the updated teardown.

The usual explanation for why AI agents are expensive is “frontier models cost more.” It’s not wrong, but it’s not the real driver. We metered 210,840 governed tool calls across our own workspaces, and the cost didn’t come from model prices. It came from a structural feature of how agents work — the loop. In our data, the orchestration loop is 89% of the spend. Here’s the mechanic, and the levers that actually move it.

Leaf calls vs loop calls

A leaf call is what most people picture when they think “LLM call”: one prompt in, one answer out. Summarize this. Classify that. Predictable cost, predictable latency.

An agent is not a leaf. An agent is a loop, and that loop is the engine of its harness: the model picks a tool, reads the result, picks the next tool, reads that result, and so on until it’s done. The catch is what the model reads each time around. On every turn it re-ingests the growing context (the system prompt, the tool definitions, every prior tool result, the running transcript) before it decides the next step.

So a 12-turn agent isn’t 12 cheap calls. It’s 12 calls where the input grows each turn, and you pay for the whole accumulated context every turn. That re-reading is the loop tax: the same tokens, billed again and again as the transcript snowballs. It’s why the cost of “run an agent” scales with the number of turns, not the number of useful answers, and why it grows faster than linearly: each added turn re-reads everything that came before it.
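To make the re-read arithmetic concrete, here is a minimal sketch. The token counts (a 3,000-token stable prefix and 1,500 tokens added per turn) are illustrative assumptions, not figures from our data, and the function name is hypothetical.

```python
def loop_input_tokens(turns: int, prefix_tokens: int, added_per_turn: int) -> int:
    """Total input tokens billed across a loop whose context grows every turn."""
    total = 0
    context = prefix_tokens  # system prompt + tool definitions (assumed size)
    for _ in range(turns):
        total += context           # the whole accumulated context is billed as input
        context += added_per_turn  # tool result and model output join the transcript
    return total


# Illustrative assumptions only.
PREFIX_TOKENS = 3_000
ADDED_PER_TURN = 1_500

billed_12 = loop_input_tokens(12, PREFIX_TOKENS, ADDED_PER_TURN)
billed_24 = loop_input_tokens(24, PREFIX_TOKENS, ADDED_PER_TURN)
print(billed_12)  # 135000
print(billed_24)  # 486000
```

Under these assumptions, doubling the turns from 12 to 24 multiplies the billed input by 3.6, not 2.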

We tag this distinction on every governed call as callKind — loop (the model orchestrating tools, turn by turn) vs leaf (a single sub-task). When you split a real workload that way, the loop dominates: 89% of spend, and the orchestration step (chat.completion) averages 10.3 seconds — the loop is both the slow part and the expensive part. The single useful answer at the end is a rounding error next to the cost of getting there.
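A loop-vs-leaf split like this is a simple aggregation once each call carries its tag. The sketch below assumes each call is a dict with a `callKind` field (the tag described above) and a hypothetical `estimatedCostUsd` field; the sample records are made up for illustration and are not from our dataset.

```python
from collections import defaultdict


def spend_share_by_kind(calls):
    """Return each callKind's share of total estimated spend (0.0 to 1.0)."""
    totals = defaultdict(float)
    for call in calls:
        totals[call["callKind"]] += call["estimatedCostUsd"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {kind: cost / grand_total for kind, cost in totals.items()}


# Made-up records, for illustration only.
sample_calls = [
    {"callKind": "loop", "estimatedCostUsd": 0.50},
    {"callKind": "loop", "estimatedCostUsd": 0.25},
    {"callKind": "leaf", "estimatedCostUsd": 0.25},
]

shares = spend_share_by_kind(sample_calls)
print(shares)  # {'loop': 0.75, 'leaf': 0.25}
```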

Why a rate limit won’t save you

The instinct, when the bill spikes, is to add a rate limit. It doesn’t help, because a rate limiter caps call volume, not cost per call. A 12-turn loop on a frontier model is comfortably “in budget” by call count and still expensive, because each of those turns carries the full re-read tax. Volume-based controls are blind to the one thing that drives agent cost: how big each call got and which model answered it. (We wrote about that failure mode in “stop your agent from burning your API budget.”)
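The gap between the two kinds of control can be sketched in a few lines. Everything here is an assumption for illustration: the `Call` type, the limits, and the $10 per million input tokens price are hypothetical, and output tokens are ignored.

```python
from dataclasses import dataclass


@dataclass
class Call:
    input_tokens: int
    usd_per_million_input: float  # hypothetical price


def within_rate_limit(calls, max_calls: int) -> bool:
    """Volume-based control: only counts calls."""
    return len(calls) <= max_calls


def run_cost_usd(calls) -> float:
    """Cost-based view: size of each call times the price of its model."""
    return sum(c.input_tokens * c.usd_per_million_input / 1_000_000 for c in calls)


# A 12-turn loop whose context grows by 1,500 tokens per turn (assumed numbers).
loop_calls = [Call(3_000 + 1_500 * turn, 10.0) for turn in range(12)]

print(within_rate_limit(loop_calls, max_calls=60))  # True: 12 calls is well under 60
print(run_cost_usd(loop_calls) <= 0.50)             # False: the run costs about $1.35
```

The same run passes the call-count check and fails the cost check, which is the blind spot described above.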

Three levers that actually move the loop tax

1. Route the loop, not just the leaves

The biggest single lever in our data: which model runs the loop. In the teardown, one frontier model was 80% of the bill across 7.6% of the calls — roughly 114× the per-call cost of the cheap model that did most of the work. Most orchestration turns (“which tool next?”) don’t need a frontier model; they need a fast, cheap one. Put the loop on Flash or Haiku, reserve the expensive model for the leaf calls that genuinely need the reasoning, and the bill drops by an order of magnitude without the agent getting noticeably worse.
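As a rough sketch of that routing rule: orchestration turns go to the cheap model, and only leaf calls flagged as needing deep reasoning go to the frontier model. The model names are placeholders, and the `needs_deep_reasoning` flag is a hypothetical input; how you decide it is up to your own harness.

```python
CHEAP_MODEL = "cheap-fast-model"    # placeholder, e.g. a Flash- or Haiku-class model
FRONTIER_MODEL = "frontier-model"   # placeholder for the expensive model


def pick_model(call_kind: str, needs_deep_reasoning: bool = False) -> str:
    """Choose a model from the call's callKind tag (hypothetical routing policy)."""
    if call_kind == "loop":
        # "Which tool next?" turns: fast and cheap is enough.
        return CHEAP_MODEL
    if call_kind == "leaf":
        # Reserve the frontier model for leaves that genuinely need the reasoning.
        return FRONTIER_MODEL if needs_deep_reasoning else CHEAP_MODEL
    raise ValueError(f"unknown callKind: {call_kind!r}")


print(pick_model("loop"))                             # cheap-fast-model
print(pick_model("leaf", needs_deep_reasoning=True))  # frontier-model
```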

2. Cache the stable prefix

The loop tax is worst on the part of the context that doesn’t change — the system prompt and tool definitions are identical on turn 12 and turn 1, yet you re-pay for them every turn. Prompt caching turns that repeated prefix from full price into cache price (cache reads are roughly a tenth of the cost). The catch is knowing whether it’s actually working: ACP meters the cached portion of every call, so “is caching paying off on this agent?” is a number you can read, not a setting you hope is on.
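A back-of-the-envelope sketch of what caching the prefix buys, under loud assumptions: only the stable prefix is cached, cache reads cost a tenth of the normal input price, the first turn pays full price, and any cache-write surcharge is ignored. The token counts and the $10 per million price are illustrative, not measured.

```python
def loop_input_cost_usd(
    turns: int,
    prefix_tokens: int,
    added_per_turn: int,
    usd_per_million: float,
    cache_prefix: bool = False,
    cache_read_factor: float = 0.1,  # assumed: cache reads at ~1/10 of input price
) -> float:
    """Input cost of a loop, optionally billing the repeated prefix at cache price."""
    cost = 0.0
    for turn in range(turns):
        dynamic_tokens = added_per_turn * turn  # transcript accumulated so far
        # The first turn pays full price for the prefix; later turns read it from cache.
        prefix_rate = cache_read_factor if (cache_prefix and turn > 0) else 1.0
        billable_tokens = prefix_tokens * prefix_rate + dynamic_tokens
        cost += billable_tokens * usd_per_million / 1_000_000
    return cost


uncached = loop_input_cost_usd(12, 3_000, 1_500, 10.0)
cached = loop_input_cost_usd(12, 3_000, 1_500, 10.0, cache_prefix=True)
print(round(uncached, 3))  # 1.35
print(round(cached, 3))    # 1.053
```

In this sketch the saving is modest because the growing transcript, not the prefix, dominates by turn 12. The saving grows with the size of the stable prefix relative to what each turn adds.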

3. Cap the loop, not just the spend

A runaway loop — the agent that gets stuck retrying, or wanders off-task — is the tail that turns a $0.04 run into a $4 one. The fix is a ceiling on the loop itself: turn/depth limits and a per-agent budget that halts the chain (and its sub-agents) when it’s breached, not a monthly cap you discover you blew after the fact. Budget enforcement that understands delegation depth stops the runaway at turn 30 instead of turn 300.
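A minimal sketch of a ceiling on the loop itself, assuming a hypothetical `step` callable that runs one turn and returns whether the agent is done plus that turn's cost. It does not model sub-agents or delegation depth, and it is not ACP's implementation.

```python
def run_loop(step, max_turns: int, max_usd: float):
    """Run turns until done, the budget is breached, or the turn cap is hit.

    `step(turn)` is a hypothetical callable returning (done, cost_usd).
    Returns (status, turns_used, spent_usd).
    """
    spent = 0.0
    for turn in range(1, max_turns + 1):
        done, cost_usd = step(turn)
        spent += cost_usd
        if done:
            return ("done", turn, spent)
        if spent >= max_usd:
            # Halt mid-run instead of discovering the overspend on the invoice.
            return ("halted_budget", turn, spent)
    return ("halted_turns", max_turns, spent)


def stuck_agent(turn):
    """Simulated runaway: never finishes, costs $0.25 per turn (made-up number)."""
    return (False, 0.25)


print(run_loop(stuck_agent, max_turns=30, max_usd=1.0))    # ('halted_budget', 4, 1.0)
print(run_loop(stuck_agent, max_turns=30, max_usd=100.0))  # ('halted_turns', 30, 7.5)
```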

Every one of those levers assumes you can see the split — loop vs leaf, which model, which agent, how many turns. A monthly invoice from your model provider can’t tell you whether your spend is the orchestration or the work, or which agent’s loop is the problem. That’s the gap a control plane closes: every governed call carries its model, its callKind, its estimated cost, and the agent it belongs to, so the loop tax is something you can measure and attribute — and therefore cut. That discipline is AI agent cost monitoring.

If you want to see your own loop-vs-leaf split, sign in: the cost X-ray is live in the dashboard. Or start with “what an agentic control plane is” for how the metering works.

Cost figures are a 2026-06 snapshot from our own internal workspaces, not customer data.
