Agentic Control Plane

Control and Optimize Your Agents, Down to Each Tool Call

David Crowe · 11 min read
cost agents optimization architecture

You ship an agent. You see one answer come back.

Inside that one answer, the agent made 14 model calls and 9 tool calls, re-read 80,000 tokens of its own context, ran on three different models, and took anywhere from 30 seconds to two minutes depending on the run. It cost 1.2¢ this time. Last Tuesday the same task cost 2.2¢. You don’t know why, because all you ever see is the answer.

That’s the problem with agents: the unit you ship is the agent, but the unit that costs money, breaks, and varies is the tool call. An agent isn’t a function with a price tag. It’s a loop that makes a different sequence of decisions every time it runs. To improve it, you have to see inside the loop — and then act on what you see.

This is a walkthrough of doing exactly that: seeing each action an agent takes, controlling it, and optimizing it. The same four moves apply whether you built the agent in CrewAI, LangGraph, the OpenAI SDK, or a coding harness like Claude Code — because they all bottom out in the same atom: a governed tool call.

1. An agent is a distribution, not a number

The first mistake is treating an agent’s cost or latency as a single value. “It costs about a cent.” It doesn’t. It costs a distribution.

Run the same agent ten times and you get ten different costs, ten different runtimes, ten different paths through its tools. Average them and you’ve thrown away the only number that matters: the tail. An agent that costs 1¢ on a typical run and 12¢ on a bad one isn’t a 2¢-average agent. It’s an agent with a 12× tail that will surprise you at 3am when twenty of those bad runs land at once.

So the first view isn’t a number, it’s the spread:

  • Cost per run — mean, and the p95 next to it. The gap between them is your exposure.
  • Context read — how many tokens the loop re-read. This is almost always the real cost driver, because an agentic loop re-sends its growing context on every turn. More turns, more re-reading, more bill.
  • Runtime, model calls, tool calls — each as a distribution, each with its own tail.

When you look at an agent this way, the question changes from “what does it cost?” to “how predictable is it, and where does the spread come from?” That’s a question you can actually act on.
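A minimal sketch of that first view, assuming you already have one cost figure per run. The function name, the nearest-rank p95, and the sample costs are illustrative choices, not part of any particular product:

```python
import math
import statistics

def summarize_spread(costs_cents):
    """Summarize per-run costs (in cents) as a spread rather than one number."""
    ordered = sorted(costs_cents)
    mean = statistics.fmean(ordered)
    # Nearest-rank p95: the smallest observed cost with at least 95% of runs at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    p95 = ordered[rank - 1]
    # Coefficient of variation: standard deviation relative to the mean.
    cv = statistics.pstdev(ordered) / mean if mean else 0.0
    return {
        "mean": mean,
        "p95": p95,
        "tail_ratio": p95 / mean if mean else 0.0,
        "cv": cv,
    }

# Hypothetical costs: nine typical 1¢ runs and one 12¢ bad run.
runs = [1.0] * 9 + [12.0]
summary = summarize_spread(runs)
print(summary)
```

On this sample the mean lands near 2¢, a value that no individual run actually cost, which is why the p95 and the CV belong next to it.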

[Figure: Cost per run · San Diego Volunteer Scout · 19 runs · flagged erratic · CV 93% · mean 0.81¢, range free to 2.17¢]
Nineteen real runs of one agent. The mean is 0.81¢, but runs range from free to 2.17¢, a coefficient of variation of 93% that the dashboard flags as "erratic." Several runs cost almost nothing; a handful cost 2¢ or more. The average hides both ends, and the band, not the mean, is what you're trying to shrink.

2. Find what’s actually expensive

Once you accept the agent is a distribution, you go hunting for the driver. Three things concentrate almost all the cost and almost all the surprise:

The loop tax. Most of an agent’s bill is not the useful work — it’s the orchestration loop re-reading its own context to decide what to do next. We’ve written about this at length: a long agentic run can spend 70–80% of its tokens just re-reading. If your agent’s cost is high, the first question isn’t “which tool is expensive” — it’s “how many times is the loop going around, and how much is it re-reading each time.”
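The arithmetic behind the loop tax can be sketched with a simplified model: every turn re-sends the whole context, and every turn appends a fixed number of new tokens. Real loops are lumpier, and the token counts below are hypothetical:

```python
def loop_tax(base_tokens, new_tokens_per_turn, turns):
    """Return (total tokens read, fraction of those that were re-reads)."""
    total_read = 0
    context = base_tokens
    for _ in range(turns):
        total_read += context            # the whole context is sent again this turn
        context += new_tokens_per_turn   # model output and tool results get appended
    # Tokens seen for the first time: the base prompt, plus what each earlier turn appended.
    first_read = base_tokens + new_tokens_per_turn * (turns - 1)
    reread = total_read - first_read
    return total_read, reread / total_read

# Hypothetical loop: 2,000-token starting context, 1,000 new tokens per turn, 11 turns.
total, reread_fraction = loop_tax(2_000, 1_000, 11)
print(total, round(reread_fraction, 3))
```

Because the re-read term grows with the square of the turn count, cutting turns reduces the bill faster than shrinking any single prompt.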

Cost & variance by step · San Diego Volunteer Scout · 19 runs
Step                 Kind   Flags                   Calls/run   Cost mean   Cost share   Latency mean   Latency share
llm.agent_run        loop   cost driver, variance   11          1.40¢       100%         4.4s           12%
chat.completion      tool   slowest                 7.8         free        -            37s            82%
apitools.webSearch   tool   -                       15          free        -            646ms          3%
log.record           tool   -                       5.1         free        -            101ms          0%
memory.search        tool   -                       1.5         free        -            129ms          0%
llm.eval_judge       leaf   -                       1.0         0.020¢      0%           6.7s           0%
The real "cost & variance by step" view for this agent's 19 runs. The punchline you can't get from a total: cost and time live in different rows. llm.agent_run (the model inferences) holds 100% of the cost but only 12% of the wall-clock time. chat.completion holds 82% of the time and costs nothing, because it's the agent's turn span, not a step: the loop iteration that wraps everything else. Sort by cost and you find the loop's re-reading; sort by latency and you find the loop's length. Two levers, two rows, invisible in any single number.

Why is one step both free and the slowest thing in the run? Here’s where the X-ray earns its keep: open chat.completion and it has no model and no tokens of its own. It isn’t a peer model call at all — it’s the turn span, the wrapper around one whole iteration of the agent loop. There are 193 of these spans over 793 llm.agent_run inferences beneath them — about four model calls per turn, plus the tool calls. So its 37 seconds isn’t one slow generation; it’s the entire loop iteration, end to end. The actual inferences inside it run about 4 seconds each.

That reframes the whole problem. “The latency is in chat.completion” really means the latency is the loop — this agent takes eleven model turns and fifteen web searches per run, most of them in sequence. You don’t fix that with a faster model; you fix it with a shorter loop: fewer turns, tool calls fired in parallel instead of one at a time, context cached so the agent converges sooner. The cost is the other lever entirely — every one of those turns re-reads about 7,800 tokens of context, and that re-reading, not the tools, is the whole bill. Two levers, two rows. The number that says “4¢ and 37 seconds” hides both; the X-ray is what splits them so you pull the right one.
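To illustrate the "tool calls fired in parallel" lever, here is a small sketch using Python's asyncio. The fake_web_search coroutine is a hypothetical stand-in for a network-bound tool; a real agent framework would expose its own way of issuing concurrent tool calls:

```python
import asyncio
import time

async def fake_web_search(query, delay_s=0.05):
    """Hypothetical stand-in for a network-bound tool call."""
    await asyncio.sleep(delay_s)
    return f"results for {query}"

async def run_sequential(queries):
    # One call at a time: total latency is roughly the sum of the delays.
    return [await fake_web_search(q) for q in queries]

async def run_parallel(queries):
    # All calls in flight together: total latency is roughly the slowest call.
    return await asyncio.gather(*(fake_web_search(q) for q in queries))

def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

queries = [f"q{i}" for i in range(5)]
seq_results, seq_s = timed(run_sequential(queries))
par_results, par_s = timed(run_parallel(queries))
print(f"sequential {seq_s:.2f}s, parallel {par_s:.2f}s")
```

This only helps when the calls are independent of each other. Searches whose queries depend on earlier results still have to run in order.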

The model. An agent usually runs on more than one model — a big one for orchestration, a small one for sub-tasks, maybe a third for a specific step. The model carrying a small share of calls but most of the cost is the one to route off the hot path. You can’t see that until you break cost down by model, per agent.
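A per-model breakdown can be sketched as a simple group-by over call records. The record shape and the model names here are assumptions for illustration, and the numbers are made up:

```python
from collections import defaultdict

def breakdown_by_model(calls):
    """Return each model's share of calls and share of cost."""
    totals = defaultdict(lambda: {"calls": 0, "cost": 0.0})
    for call in calls:
        totals[call["model"]]["calls"] += 1
        totals[call["model"]]["cost"] += call["cost_cents"]
    n_calls = sum(t["calls"] for t in totals.values())
    total_cost = sum(t["cost"] for t in totals.values())
    return {
        model: {
            "call_share": t["calls"] / n_calls,
            "cost_share": t["cost"] / total_cost if total_cost else 0.0,
        }
        for model, t in totals.items()
    }

# Hypothetical agent: 2 calls on a large model, 23 on a small one.
calls = (
    [{"model": "big-model", "cost_cents": 3.0}] * 2
    + [{"model": "small-model", "cost_cents": 0.2}] * 23
)
shares = breakdown_by_model(calls)
for model, s in shares.items():
    print(model, f"{s['call_share']:.0%} of calls, {s['cost_share']:.0%} of cost")
```

A model whose cost share is far above its call share is the candidate to route off the hot path.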

The one step. Cost is never evenly spread across tools. One step — usually a model call inside the loop — concentrates the spend. Find it and you’ve found 80% of your optimization in one row. The same is true for latency: one step is the slowest, and it’s usually not the one you’d guess.

The move here is to stop looking at the agent’s total and start looking at its drivers: cost driver (biggest bill), variance driver (what makes it unpredictable), and the slowest step. Three rows. That’s where the money and the latency live.
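One way to pick out those three rows, sketched under the assumption that each run is recorded as a mapping from step name to a (cost, latency) pair. The step names and figures are hypothetical:

```python
import statistics

def find_drivers(runs):
    """runs: list of dicts mapping step name -> (cost_cents, latency_s) for one run."""
    steps = sorted({name for run in runs for name in run})
    # A step that did not fire in a run counts as zero cost and zero latency for that run.
    cost = {s: [run.get(s, (0.0, 0.0))[0] for run in runs] for s in steps}
    latency = {s: [run.get(s, (0.0, 0.0))[1] for run in runs] for s in steps}
    return {
        "cost_driver": max(steps, key=lambda s: sum(cost[s])),
        "variance_driver": max(steps, key=lambda s: statistics.pstdev(cost[s])),
        "slowest_step": max(steps, key=lambda s: statistics.fmean(latency[s])),
    }

# Hypothetical runs of a three-step agent.
runs = [
    {"plan": (1.0, 4.0), "search": (0.0, 30.0), "summarize": (0.2, 2.0)},
    {"plan": (1.2, 4.5), "search": (0.0, 40.0), "summarize": (1.5, 2.5)},
    {"plan": (1.1, 4.2), "search": (0.0, 35.0), "summarize": (0.1, 2.2)},
]
drivers = find_drivers(runs)
print(drivers)
```

In this sample the three drivers are three different steps, which is the situation a single total cannot show.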

3. Understand how it actually runs

Here’s the part most tooling misses. An agent doesn’t run the same way twice. Knowing the average cost of a step tells you nothing about when it fires, how often, or what comes after it.

So look at the agent as a flow: every run laid over every other run, step by step. Where do runs branch? Where do they converge? Which path is the cheap one and which is the expensive one? When you shade that flow by outcome, you can see where runs fail. Shade it by cost and you see where the money pools. Shade it by latency and you see where time goes. Same flow, three different stories.

This is the difference between observing an agent and understanding it. The flow tells you the agent has, say, four distinct paths across its runs — and that one of them is twice as expensive and fails half the time. Now you have something concrete: find the good path and standardize on it. Collapse the variance. An agent that always takes its cheapest successful path is a cheaper, more reliable agent, and you got there by looking at the flow, not the average.
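The path comparison can be sketched by grouping runs on their ordered step sequence. The run record shape below is an assumption, and the sample data is invented to mirror the example in the text:

```python
from collections import defaultdict
import statistics

def summarize_paths(runs):
    """Group runs by their ordered step sequence and summarize each path."""
    groups = defaultdict(list)
    for run in runs:
        groups[tuple(run["steps"])].append(run)
    return {
        path: {
            "runs": len(members),
            "mean_cost": statistics.fmean([r["cost_cents"] for r in members]),
            "failure_rate": sum(not r["ok"] for r in members) / len(members),
        }
        for path, members in groups.items()
    }

def best_path(summary):
    """Cheapest path among those that never failed in the sample, or None."""
    reliable = {p: s for p, s in summary.items() if s["failure_rate"] == 0}
    if not reliable:
        return None
    return min(reliable, key=lambda p: reliable[p]["mean_cost"])

# Hypothetical runs: a short path, and a longer one that costs twice as much.
runs = [
    {"steps": ["search", "summarize"], "cost_cents": 1.0, "ok": True},
    {"steps": ["search", "summarize"], "cost_cents": 1.0, "ok": True},
    {"steps": ["search", "search", "fetch", "summarize"], "cost_cents": 2.0, "ok": False},
    {"steps": ["search", "search", "fetch", "summarize"], "cost_cents": 2.0, "ok": True},
]
paths = summarize_paths(runs)
print(best_path(paths))
```

Exact-sequence grouping is the simplest possible definition of a path. With long or noisy loops you would likely need to collapse repeated steps first, or almost every run becomes its own path.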

4. Control it

Seeing is half of it. The other half is being able to change the thing without redeploying code.

This is where governance stops being a compliance checkbox and becomes an optimization tool. The controls that matter, per agent:

  • The model. If the variability view shows the orchestration loop running on an expensive model when a cheaper one would do, you change it — right there, not in a config file three repos away.
  • The budget. Set a per-run cap at the p95 you saw in the distribution. Now a runaway loop can’t cost you 12¢; it gets cut off at your number.
  • The tools. An agent should be able to call exactly the tools it needs and nothing else. The list of tools it’s allowed to call, sitting next to the list of tools it actually called, tells you immediately where to tighten — a tool that’s allowed but never used is attack surface you can remove; a tool it’s reaching for that isn’t on the list is a gap. Allow or deny each one. Removed tools leave its reach entirely.
  • The kill switch. When an agent misbehaves, you stop it — instantly, before the next run, without a deploy.
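The tool, budget, and kill-switch controls above can be sketched as a single check that runs before every tool call. This is an illustration under assumed names (AgentControls, ToolCallDenied, authorize), not the API of any specific control plane:

```python
from dataclasses import dataclass

class ToolCallDenied(Exception):
    """Raised when a control blocks a tool call."""

@dataclass
class AgentControls:
    allowed_tools: set
    budget_cents: float        # per-run cap, e.g. the observed p95
    killed: bool = False       # kill switch
    spent_cents: float = 0.0

    def authorize(self, tool, estimated_cost_cents=0.0):
        """Check every control before the call runs; record spend if it passes."""
        if self.killed:
            raise ToolCallDenied("agent is stopped")
        if tool not in self.allowed_tools:
            raise ToolCallDenied(f"tool not allowed: {tool}")
        if self.spent_cents + estimated_cost_cents > self.budget_cents:
            raise ToolCallDenied("run budget exceeded")
        self.spent_cents += estimated_cost_cents

# Hypothetical agent allowed one tool, capped at 2¢ per run.
controls = AgentControls(allowed_tools={"web_search"}, budget_cents=2.0)
controls.authorize("web_search", estimated_cost_cents=1.5)   # passes
try:
    controls.authorize("web_search", estimated_cost_cents=1.0)
except ToolCallDenied as err:
    print(err)   # the second call would push the run past its cap
```

Because the controls are plain data checked at call time, changing the cap or flipping the kill switch takes effect on the next call without touching the agent’s code.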

The principle underneath all of this: a per-agent control composes under a workspace policy. The platform owner sets the guardrails for the whole fleet; the builder tunes one agent inside them. Same governance model, two altitudes.
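One plausible reading of "composes under" is that the effective policy is the tighter of the two at every point: the intersection of the allowlists and the lower of the caps. The sketch below assumes that reading. The Policy fields, tool names, and model names are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    allowed_tools: frozenset
    allowed_models: frozenset
    max_cost_cents: float

def effective_policy(workspace, agent):
    """An agent can narrow the workspace guardrails but never widen them."""
    return Policy(
        allowed_tools=workspace.allowed_tools & agent.allowed_tools,
        allowed_models=workspace.allowed_models & agent.allowed_models,
        max_cost_cents=min(workspace.max_cost_cents, agent.max_cost_cents),
    )

# Hypothetical fleet-wide guardrails and one builder's per-agent settings.
workspace = Policy(
    allowed_tools=frozenset({"web_search", "memory_search", "log_record"}),
    allowed_models=frozenset({"big-model", "small-model"}),
    max_cost_cents=10.0,
)
agent = Policy(
    allowed_tools=frozenset({"web_search", "shell"}),   # "shell" is outside the guardrails
    allowed_models=frozenset({"small-model"}),
    max_cost_cents=2.0,
)
effective = effective_policy(workspace, agent)
print(effective)
```

Here the agent’s request for "shell" silently drops out because the workspace never allowed it, while the agent’s tighter 2¢ cap wins over the workspace’s 10¢.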

5. Optimize it

Now the loop closes. Everything you saw in steps 1–3 turns into a change in step 4, and you measure the result back in step 1:

  • The variability view showed a 12× cost tail → you set a budget cap at p95 and route the loop’s model down a tier. The tail collapses.
  • The model breakdown showed Sonnet carrying 8% of calls and 60% of cost → you move that step to a cheaper model and watch mean cost drop.
  • The flow showed one expensive, flaky path → you prune the tool that sends it there, or tighten the prompt, and the agent converges on the cheap path.
  • The allowlist showed three tools never used in a month → you remove them, shrinking both cost and attack surface.
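The allowlist comparison in the last bullet comes down to two set differences. A small sketch, with hypothetical tool names and an assumed audit_allowlist helper:

```python
def audit_allowlist(allowed, called):
    """Compare the tools an agent may call with the tools it actually called."""
    allowed, called = set(allowed), set(called)
    return {
        # Allowed but never used: candidates for removal.
        "unused": sorted(allowed - called),
        # Attempted but not allowed: gaps to review.
        "not_allowed": sorted(called - allowed),
    }

# Hypothetical month of activity for one agent.
allowed = ["web_search", "memory_search", "send_email", "shell", "file_write"]
called = ["web_search", "memory_search", "calendar_lookup"]
audit = audit_allowlist(allowed, called)
print(audit)
```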

None of these are guesses. Each one is a specific row in a specific view, turned into a specific control, measured by the same distribution you started with. That’s what “optimize” actually means for an agent: not a vibe, a loop. See the spread, find the driver, change the lever, watch the spread move.

The atom under all of it

Every one of these moves — the distribution, the drivers, the flow, the model swap, the budget, the allowlist — is built on one thing: a governed tool call. Capture each action the agent takes, with its cost, latency, model, and outcome, and everything else composes off it. Cost is the sum of the calls. The path is the order of the calls. The policy is what’s allowed at each call. Control the tool call and you control the agent.
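As a rough illustration of how everything composes off one record type, here is a sketch of a captured tool call and the three derivations named above. The field names are assumptions, not a documented schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ToolCall:
    run_id: str
    seq: int                 # position of the call within its run
    tool: str
    model: Optional[str]     # None for calls that involve no model
    cost_cents: float
    latency_s: float
    ok: bool

def run_cost(calls):
    """Cost is the sum of the calls."""
    return sum(c.cost_cents for c in calls)

def run_path(calls):
    """The path is the order of the calls."""
    return tuple(c.tool for c in sorted(calls, key=lambda c: c.seq))

def violations(calls, allowed_tools):
    """The policy is what is allowed at each call."""
    return [c for c in calls if c.tool not in allowed_tools]

# Hypothetical run, recorded out of order.
calls = [
    ToolCall("run-1", 2, "web_search", None, 0.0, 0.6, True),
    ToolCall("run-1", 1, "plan", "small-model", 0.5, 4.0, True),
    ToolCall("run-1", 3, "summarize", "small-model", 0.25, 3.5, True),
]
print(run_cost(calls), run_path(calls))
```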

That’s the whole idea behind a control plane for agents: not a dashboard you glance at, but a loop you run — see each action, control it, optimize it, repeat — across every agent and every framework you ship.

It’s free for individuals — one install, no code changes, and every tool call is governed. See your first governed tool call in about thirty seconds →
