Control and Optimize Your Agents, Down to Each Tool Call
You ship an agent. You see one answer come back.
Inside that one answer, the agent made 14 model calls and 9 tool calls, re-read 80,000 tokens of its own context, ran on three different models, and took anywhere from 30 seconds to two minutes depending on the run. It cost 1.2¢ this time. Last Tuesday the same task cost 2.2¢. You don’t know why, because all you ever see is the answer.
That’s the problem with agents: the unit you ship is the agent, but the unit that costs money, breaks, and varies is the tool call. An agent isn’t a function with a price tag. It’s a loop that makes a different sequence of decisions every time it runs. To improve it, you have to see inside the loop — and then act on what you see.
This is a walkthrough of doing exactly that: seeing each action an agent takes, controlling it, and optimizing it. The same four moves apply whether you built the agent in CrewAI, LangGraph, the OpenAI SDK, or a coding harness like Claude Code — because they all bottom out in the same atom: a governed tool call.
1. An agent is a distribution, not a number
The first mistake is treating an agent’s cost or latency as a single value. “It costs about a cent.” It doesn’t. It costs a distribution.
Run the same agent ten times and you get ten different costs, ten different runtimes, ten different paths through its tools. Average them and you’ve thrown away the only number that matters: the tail. An agent that costs 1¢ on a typical run and 12¢ on a bad one isn’t a 2¢-average agent. It’s an agent with a 12× tail that will surprise you at 3am when twenty of those bad runs land at once.
So the first view isn’t a number, it’s the spread:
- Cost per run — mean, and the p95 next to it. The gap between them is your exposure.
- Context read — how many tokens the loop re-read. This is almost always the real cost driver, because an agentic loop re-sends its growing context on every turn. More turns, more re-reading, more bill.
- Runtime, model calls, tool calls — each as a distribution, each with its own tail.
When you look at an agent this way, the question changes from “what does it cost?” to “how predictable is it, and where does the spread come from?” That’s a question you can actually act on.
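To make the spread concrete, here is a minimal sketch that turns a list of per-run costs into a mean, a p95, and the ratio between them. The run costs are made-up numbers, and `percentile` and `spread` are illustrative helpers, not part of any particular tool.

```python
import math
import statistics


def percentile(values, q):
    """Nearest-rank percentile: the smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def spread(costs_cents):
    """Summarize a cost distribution as mean, p95, and the tail ratio between them."""
    mean = statistics.mean(costs_cents)
    p95 = percentile(costs_cents, 95)
    return {"mean": mean, "p95": p95, "tail_ratio": p95 / mean}


# Hypothetical per-run costs in cents: mostly about 1 cent, with one bad run.
runs = [1.0, 1.1, 0.9, 1.2, 1.0, 1.3, 1.1, 1.0, 2.2, 12.0]
summary = spread(runs)
print(summary)
```

The same function works unchanged for runtime, model calls, or tool calls per run. The point is to keep the p95 next to the mean instead of reporting the mean alone.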
2. Find what’s actually expensive
Once you accept the agent is a distribution, you go hunting for the driver. Three things concentrate almost all the cost and almost all the surprise:
The loop tax. Most of an agent’s bill is not the useful work — it’s the orchestration loop re-reading its own context to decide what to do next. We’ve written about this at length: a long agentic run can spend 70–80% of its tokens just re-reading. If your agent’s cost is high, the first question isn’t “which tool is expensive” — it’s “how many times is the loop going around, and how much is it re-reading each time.”
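The arithmetic behind the loop tax is easy to sketch. The model below assumes the simplest case: the loop re-sends its whole context on every turn, and each turn adds a fixed number of new tokens. The starting size, growth per turn, and turn count are illustrative assumptions, not measurements.

```python
def loop_tax(base_tokens, tokens_added_per_turn, turns):
    """Return (total_input_tokens, re_read_tokens, re_read_share) for a loop
    that re-sends its entire context on every turn."""
    total = 0
    context = base_tokens
    for _ in range(turns):
        total += context                  # the whole context is sent again this turn
        context += tokens_added_per_turn  # tool results and replies grow the context
    # Tokens the model sees for the first time: the base plus what each later turn added.
    unique = base_tokens + tokens_added_per_turn * (turns - 1)
    re_read = total - unique
    return total, re_read, re_read / total


# Assumed numbers: 2,000-token starting context, 1,000 new tokens per turn, 8 turns.
total, re_read, share = loop_tax(2_000, 1_000, 8)
print(f"{total} input tokens, {re_read} re-read ({share:.0%})")
```

Under these assumptions the re-read share grows with every extra turn, which is why turn count is the first thing to check.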
| Step | Type | Flags | Calls/run | Cost mean | Cost share | Latency mean | Latency share |
|---|---|---|---|---|---|---|---|
| llm.agent_run | loop | cost driver, variance | 11 | 1.40¢ | 100% | 4.4s | 12% |
| chat.completion | tool | slowest | 7.8 | free | — | 37s | 82% |
| apitools.webSearch | tool | | 15 | free | — | 646ms | 3% |
| log.record | tool | | 5.1 | free | — | 101ms | 0% |
| memory.search | tool | | 1.5 | free | — | 129ms | 0% |
| llm.eval_judge | leaf | | 1.0 | 0.020¢ | 0% | 6.7s | 0% |
llm.agent_run, the model inferences, holds 100% of the cost but only 12% of the wall-clock. chat.completion holds 82% of the time and costs nothing, because it’s the agent’s turn span, not a step: the loop iteration that wraps everything else. Sort by cost and you find the loop’s re-reading; sort by latency and you find the loop’s length. Two levers, two rows, invisible in any single number.

Why is one step both free and the slowest thing in the run? Here’s where the X-ray earns its keep: open chat.completion and it has no model and no tokens of its own. It isn’t a peer model call at all. It’s the turn span, the wrapper around one whole iteration of the agent loop. There are 193 of these spans over 793 llm.agent_run inferences beneath them: about four model calls per turn, plus the tool calls. So its 37 seconds isn’t one slow generation; it’s the entire loop iteration, end to end. The actual inferences inside it run about 4 seconds each.
That reframes the whole problem. “The latency is in chat.completion” really means the latency is the loop — this agent takes eleven model turns and fifteen web searches per run, most of them in sequence. You don’t fix that with a faster model; you fix it with a shorter loop: fewer turns, tool calls fired in parallel instead of one at a time, context cached so the agent converges sooner. The cost is the other lever entirely — every one of those turns re-reads about 7,800 tokens of context, and that re-reading, not the tools, is the whole bill. Two levers, two rows. The number that says “4¢ and 37 seconds” hides both; the X-ray is what splits them so you pull the right one.
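A per-step breakdown like this can be sketched as a simple aggregation over captured spans. The record shape (`step`, `cost_cents`, `latency_s`) and the sample values are assumptions for illustration. Note that a wrapper span such as chat.completion overlaps the spans inside it, so latency shares here are shares of summed span time, not of wall-clock.

```python
from collections import defaultdict


def step_shares(spans):
    """Aggregate spans by step name into cost share and latency share."""
    cost = defaultdict(float)
    latency = defaultdict(float)
    for span in spans:
        cost[span["step"]] += span["cost_cents"]
        latency[span["step"]] += span["latency_s"]
    total_cost = sum(cost.values()) or 1.0        # avoid dividing by zero
    total_latency = sum(latency.values()) or 1.0
    return {
        step: {
            "cost_share": cost[step] / total_cost,
            "latency_share": latency[step] / total_latency,
        }
        for step in cost
    }


# Hypothetical spans from one run; the step names follow the table, the values are invented.
spans = [
    {"step": "chat.completion", "cost_cents": 0.0, "latency_s": 37.0},
    {"step": "chat.completion", "cost_cents": 0.0, "latency_s": 35.0},
    {"step": "llm.agent_run", "cost_cents": 0.47, "latency_s": 4.4},
    {"step": "llm.agent_run", "cost_cents": 0.46, "latency_s": 4.1},
    {"step": "llm.agent_run", "cost_cents": 0.47, "latency_s": 4.6},
    {"step": "apitools.webSearch", "cost_cents": 0.0, "latency_s": 0.65},
    {"step": "apitools.webSearch", "cost_cents": 0.0, "latency_s": 0.64},
]

shares = step_shares(spans)
cost_driver = max(shares, key=lambda s: shares[s]["cost_share"])
slowest = max(shares, key=lambda s: shares[s]["latency_share"])
print("cost driver:", cost_driver, "| slowest:", slowest)
```

Sorting the same aggregate two ways is the whole trick: one sort surfaces the re-reading, the other surfaces the loop length.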
The model. An agent usually runs on more than one model — a big one for orchestration, a small one for sub-tasks, maybe a third for a specific step. The model carrying a small share of calls but most of the cost is the one to route off the hot path. You can’t see that until you break cost down by model, per agent.
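As a rough sketch, the per-model breakdown compares each model's share of calls with its share of cost. The model names and per-call costs below are hypothetical placeholders chosen to show the pattern.

```python
from collections import defaultdict


def model_breakdown(calls):
    """Return one row per model with its share of calls and share of cost,
    sorted so the most expensive model comes first."""
    n_calls = defaultdict(int)
    cost = defaultdict(float)
    for call in calls:
        n_calls[call["model"]] += 1
        cost[call["model"]] += call["cost_cents"]
    total_calls = sum(n_calls.values())
    total_cost = sum(cost.values())
    rows = [
        {
            "model": model,
            "call_share": n_calls[model] / total_calls,
            "cost_share": cost[model] / total_cost if total_cost else 0.0,
        }
        for model in n_calls
    ]
    return sorted(rows, key=lambda row: row["cost_share"], reverse=True)


# Hypothetical calls for one agent: a few expensive calls, many cheap ones.
calls = (
    [{"model": "big-model", "cost_cents": 0.6}] * 2
    + [{"model": "small-model", "cost_cents": 0.8 / 23}] * 23
)
rows = model_breakdown(calls)
for row in rows:
    print(f'{row["model"]}: {row["call_share"]:.0%} of calls, {row["cost_share"]:.0%} of cost')
```

A model whose cost share is far above its call share is the candidate to route off the hot path.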
The one step. Cost is never evenly spread across tools. One step — usually a model call inside the loop — concentrates the spend. Find it and you’ve found 80% of your optimization in one row. The same is true for latency: one step is the slowest, and it’s usually not the one you’d guess.
The move here is to stop looking at the agent’s total and start looking at its drivers: cost driver (biggest bill), variance driver (what makes it unpredictable), and the slowest step. Three rows. That’s where the money and the latency live.
3. Understand how it actually runs
Here’s the part most tooling misses. An agent doesn’t run the same way twice. Knowing the average cost of a step tells you nothing about when it fires, how often, or what comes after it.
So look at the agent as a flow: every run laid over every other run, step by step. Where do runs branch? Where do they converge? Which path is the cheap one and which is the expensive one? When you shade that flow by outcome, you can see where runs fail. Shade it by cost and you see where the money pools. Shade it by latency and you see where time goes. Same flow, three different stories.
This is the difference between observing an agent and understanding it. The flow tells you the agent has, say, four distinct paths across its runs — and that one of them is twice as expensive and fails half the time. Now you have something concrete: find the good path and standardize on it. Collapse the variance. An agent that always takes its cheapest successful path is a cheaper, more reliable agent, and you got there by looking at the flow, not the average.
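One way to sketch the flow view is to group runs by the exact sequence of steps they took, then compare cost and failure rate per path. The run records, step names, and the rule for "cheapest successful path" (zero failures, lowest mean cost) are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize_paths(runs):
    """Group runs by their ordered step sequence; return one summary per
    distinct path, cheapest first."""
    groups = defaultdict(list)
    for run in runs:
        groups[tuple(run["steps"])].append(run)
    summary = [
        {
            "path": path,
            "runs": len(members),
            "mean_cost": mean(r["cost_cents"] for r in members),
            "failure_rate": sum(not r["ok"] for r in members) / len(members),
        }
        for path, members in groups.items()
    ]
    return sorted(summary, key=lambda p: p["mean_cost"])


def cheapest_successful(summary):
    """First path (by mean cost) that never failed, or None if every path failed at least once."""
    return next((p for p in summary if p["failure_rate"] == 0), None)


# Hypothetical runs of the same agent.
runs = [
    {"steps": ["search", "answer"], "cost_cents": 1.0, "ok": True},
    {"steps": ["search", "answer"], "cost_cents": 1.2, "ok": True},
    {"steps": ["search", "search", "fetch", "answer"], "cost_cents": 2.0, "ok": False},
    {"steps": ["search", "search", "fetch", "answer"], "cost_cents": 2.4, "ok": True},
]

paths = summarize_paths(runs)
best = cheapest_successful(paths)
print("standardize on:", " > ".join(best["path"]))
```

Real runs rarely match step for step, so a production version would likely normalize paths first (for example, collapsing repeated calls); exact-match grouping is the simplest starting point.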
4. Control it
Seeing is half of it. The other half is being able to change the thing without redeploying code.
This is where governance stops being a compliance checkbox and becomes an optimization tool. The controls that matter, per agent:
- The model. If the variability view shows the orchestration loop running on an expensive model when a cheaper one would do, you change it — right there, not in a config file three repos away.
- The budget. Set a per-run cap at the p95 you saw in the distribution. Now a runaway loop can’t cost you 12¢; it gets cut off at your number.
- The tools. An agent should be able to call exactly the tools it needs and nothing else. The list of tools it’s allowed to call, sitting next to the list of tools it actually called, tells you immediately where to tighten — a tool that’s allowed but never used is attack surface you can remove; a tool it’s reaching for that isn’t on the list is a gap. Allow or deny each one. Removed tools leave its reach entirely.
- The kill switch. When an agent misbehaves, you stop it — instantly, before the next run, without a deploy.
The principle underneath all of this: a per-agent control composes under a workspace policy. The platform owner sets the guardrails for the whole fleet; the builder tunes one agent inside them. Same governance model, two altitudes.
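The controls above, and the way they compose, can be sketched in a few lines. This is an illustrative model under stated assumptions, not any product's API: `Policy`, `compose`, `check_call`, and `audit_tools` are hypothetical names, and the composition rule assumed here is that an agent's policy can only narrow the workspace's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    allowed_tools: frozenset
    budget_cents: float
    killed: bool = False


def compose(workspace, agent):
    """Effective policy: the agent can only tighten what the workspace allows."""
    return Policy(
        allowed_tools=workspace.allowed_tools & agent.allowed_tools,
        budget_cents=min(workspace.budget_cents, agent.budget_cents),
        killed=workspace.killed or agent.killed,
    )


def check_call(policy, tool, spent_cents, next_cost_cents):
    """Decide one proposed tool call; returns (allowed, reason)."""
    if policy.killed:
        return False, "kill switch"
    if tool not in policy.allowed_tools:
        return False, "tool not allowed"
    if spent_cents + next_cost_cents > policy.budget_cents:
        return False, "budget exceeded"
    return True, "ok"


def audit_tools(allowed, used):
    """Compare the allowlist with what was actually called."""
    return {"unused": sorted(allowed - used), "gaps": sorted(used - allowed)}


# Hypothetical policies: the workspace sets the guardrails, the agent tunes inside them.
workspace = Policy(frozenset({"apitools.webSearch", "memory.search", "log.record"}), 5.0)
agent = Policy(frozenset({"apitools.webSearch", "memory.search", "shell.exec"}), 2.2)
effective = compose(workspace, agent)

print(check_call(effective, "apitools.webSearch", 1.0, 0.2))
print(check_call(effective, "shell.exec", 1.0, 0.2))
print(check_call(effective, "memory.search", 2.0, 0.3))
print(audit_tools(set(effective.allowed_tools), {"apitools.webSearch", "log.record"}))
```

Here the per-run cap of 2.2¢ stands in for a p95 read off the distribution. Because the check runs per call rather than per deploy, changing the policy takes effect on the next tool call.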
5. Optimize it
Now the loop closes. Everything you saw in steps 1–3 turns into a change in step 4, and you measure the result back in step 1:
- The variability view showed a 12× cost tail → you set a budget cap at p95 and route the loop’s model down a tier. The tail collapses.
- The model breakdown showed Sonnet carrying 8% of calls and 60% of cost → you move that step to a cheaper model and watch mean cost drop.
- The flow showed one expensive, flaky path → you prune the tool that sends it there, or tighten the prompt, and the agent converges on the cheap path.
- The allowlist showed three tools never used in a month → you remove them, shrinking both cost and attack surface.
None of these are guesses. Each one is a specific row in a specific view, turned into a specific control, measured by the same distribution you started with. That’s what “optimize” actually means for an agent: not a vibe, a loop. See the spread, find the driver, change the lever, watch the spread move.
The atom under all of it
Every one of these moves — the distribution, the drivers, the flow, the model swap, the budget, the allowlist — is built on one thing: a governed tool call. Capture each action the agent takes, with its cost, latency, model, and outcome, and everything else composes off it. Cost is the sum of the calls. The path is the order of the calls. The policy is what’s allowed at each call. Control the tool call and you control the agent.
That’s the whole idea behind a control plane for agents: not a dashboard you glance at, but a loop you run — see each action, control it, optimize it, repeat — across every agent and every framework you ship.
It’s free for individuals: one install, no code changes, and every tool call is governed. See your first governed tool call in about thirty seconds.