Skip to content
Agentic Control Plane

One step is 90% of our agent's model bill — on purpose

David Crowe · 4 min read
cost model-routing agents agentic-control-plane optimization

We run an agent that builds other agents. You describe what you want in a sentence; it parses the intent, picks a framework, picks the tools, validates the plan, writes the system prompt, and emits a runnable manifest. Six steps. Three of them call a language model.

If you run all three on your best model, you’re paying for top-tier reasoning on work that doesn’t need it — and most agents do exactly this, because picking a model per step requires knowing what each step costs, and most teams don’t.

We do, because Agentic Control Plane meters every tool call. Every LLM call lands in the log with the model, the token counts, and the dollar cost on the row. So we can ask a question most people can’t answer about their own agents: where does the money actually go?

90% of the spend is one step

Mining our own dogfood workspaces over a recent window — 201,377 governed tool calls, of which 886 were the LLM calls that cost real money — the spend wasn’t spread evenly. It wasn’t close to evenly:

Model LLM calls Spend Share
claude-sonnet-4-5 137 $25.80 90.5%
gemini-2.5-pro 106 $1.20 4.2%
gemini-2.5-flash 454 $0.83 2.9%
claude-haiku-4-5 189 $0.66 2.3%
Total 886 $28.49  

Numbers are from our own workspaces — no customer data. The totals are small because the volume is small; the shape is the point.

15% of the calls are 90% of the bill. Every Sonnet call traces back to one step: compose_prompt, where the agent writes the system prompt for the agent it’s building. That’s not a bug to fix. It’s the design working.

Why one step earns the expensive model

The models aren’t close in price. Our builder runs two tiers — a cheap model for the mechanical steps, a frontier model for the one that writes:

Tier Model Input / 1M Output / 1M
fast claude-haiku-4-5 $0.80 $4.00
deep claude-sonnet-4-5 $3.00 $15.00

That’s about a 4× gap between the two tiers, inside Anthropic’s own ladder — and if you reach down to a budget model like Gemini Flash, the cheap-vs-frontier spread runs 10–16× as of early 2026. Even at the conservative end, the only cost question that matters is which step runs which tier. Get six steps onto the right one and the bill takes care of itself.

So we route by what the step is actually doing:

  • parse_intent, pick_tools — extraction and structure. “Turn this sentence into fields.” “Choose from this list.” A cheap model nails these; a frontier model is paying Sonnet rates to fill in a form. → Haiku.
  • compose_prompt — prose. This is the one step where model quality shows up directly in the output: a better model writes a tighter, more capable system prompt, and the quality of every agent the builder ships rides on it. → Sonnet.

That’s why Sonnet is 90% of the bill on 15% of the calls. The expensive model runs exactly once per build, on the one step that converts spend into quality. The other five run cheap.

Here’s the part that’s easy to miss: we could only draw that line because the cost was attributed per call. A monthly invoice tells you the agent cost $28. A per-model total tells you Sonnet was most of it. Neither tells you that Sonnet on compose_prompt is worth it and Sonnet on pick_tools would be waste — and that sentence is the entire optimization. Without per-step cost, “route the cheap work to cheap models” is advice you can’t actually act on, because you can’t see which work is cheap.

What this is worth

Run the counterfactual. If parse_intent and pick_tools ran on Sonnet instead of Haiku — the lazy default of “just use the good model everywhere” — those two steps would cost roughly 4× more each, for output a cheap model already gets right. Push them to Flash and the gap is wider still. Across a fleet of agents each making thousands of runs, “which tier does this step belong on” is the difference between a model bill you control and one you explain after the fact.

And the same per-call data that lets you tune deliberately also catches the routing breaking undeliberately — a step quietly running the wrong model when it shouldn’t. That happened to us too, on the run paths nobody watches; the logs teardown is that story.

The takeaway isn’t “use cheap models.” It’s that “not every step needs your most expensive model” is only useful if you can see, per step, what each one costs and what it buys. That’s the cost pillar of ACP: every tool call priced and attributed to the agent, the user, and the model behind it — with hard caps that halt a run instead of explaining the bill afterward.

See exactly where your agents are spending →

Get the next post
Agentic governance, AgentGovBench updates, the occasional incident post-mortem. One email per post. No marketing fluff.
Share: Twitter LinkedIn
Related posts

← back to blog