Agentic Control Plane

Session X-Ray: Debugging a Single Agent Run, Call by Call

David Crowe · 4 min read
cost agents debugging architecture

In the companion to this post, we looked at an agent as a distribution — every run laid over every other, so you can see the spread, the tail, and the path it usually takes. That’s the view you use to decide which runs are worth opening.

This is about opening one.

One run came back at 4.0¢ and took 11 seconds. An identical request cost 1.2¢ and took 4 seconds. You don’t redesign the agent off an average; you open the expensive run and find the step that did it. That’s the Session X-ray: one execution, every call in the order it happened, with the cost and latency of each, and the inputs and outputs of any step you want to inspect.

The whole run, in order

The first thing the Session X-ray gives you is the waterfall — every model call and tool call the run made, top to bottom, with how long each took and what it cost:

One run · 7 calls · 4.0¢ · 11.2s
llm.plan · 0.9¢
web.search
arxiv.fetch
db.query
llm.synthesize · 2.1¢
llm.verify · 0.4¢
notify.email
Bar width is wall-clock; the number is cost. One step, llm.synthesize, is both the longest bar and the biggest charge. The three searches are nearly free. You found most of the problem in one row.

This is the thing the answer hides from you. From the outside, the run is a black box that took 11 seconds. Inside, 64% of that time and over half the cost is one synthesis call re-reading everything the three searches returned. The searches you’d have suspected — the “tool calls,” the external stuff — are rounding error.
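Those shares follow directly from the run's header numbers. Here is a quick arithmetic check that uses only figures quoted in this post; the constant and function names are illustrative and not part of any product API.

```python
# Figures quoted in this post for the expensive run.
TOTAL_COST_CENTS = 4.0
TOTAL_SECONDS = 11.2
SYNTH_COST_CENTS = 2.1   # llm.synthesize, from the waterfall
SYNTH_TIME_SHARE = 0.64  # "64% of that time"


def cost_share(step_cents: float, total_cents: float) -> float:
    """Fraction of the run's bill attributable to one step."""
    return step_cents / total_cents


def seconds_for_share(share: float, total_seconds: float) -> float:
    """Wall-clock seconds implied by a share of the run's duration."""
    return share * total_seconds


synth_cost_share = cost_share(SYNTH_COST_CENTS, TOTAL_COST_CENTS)
synth_seconds = seconds_for_share(SYNTH_TIME_SHARE, TOTAL_SECONDS)

print(f"llm.synthesize: {synth_cost_share:.1%} of cost, "
      f"about {synth_seconds:.1f}s of {TOTAL_SECONDS}s")
```

So the synthesis call accounts for about 52.5% of the bill and roughly 7.2 of the 11.2 seconds.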

Three questions, three reads

Once the whole run is in front of you, debugging comes down to three questions, and each one is a column you sort or filter by:

What cost the most? Sort by cost. It’s almost always a model call inside the loop, not a tool. Here it’s llm.synthesize at 2.1¢ — because it’s the step that ingests the most context. That’s your loop tax made concrete: the bill isn’t the work, it’s the re-reading before the work.

What took the longest? Sort by latency. Sometimes it’s the same step; often it isn’t. A slow run can be one model call that stalled, or a tool waiting on a flaky upstream. The waterfall tells you which, instantly, instead of guessing.

What failed or got blocked? A run that errored has a step where it errored. A run that got a tool denied by policy has the deny, in place, with the reason. You don’t reconstruct the failure from logs scattered across three services — you see it where it happened, in the sequence that led to it.
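The three reads above can be sketched as two sorts and a filter over an ordered list of calls. This is a minimal sketch, not the product's API. The Call type, the outcome labels, and the helper names are assumptions. The step names and the two model-call costs echo this post; every other number, and both failures, are placeholders invented for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Call(NamedTuple):
    name: str
    cost_cents: float
    latency_s: float
    outcome: str                  # assumed labels: "ok", "error", "denied"
    reason: Optional[str] = None  # why it errored or was denied, if it was


# Hypothetical run, listed in execution order. Placeholder numbers.
run = [
    Call("llm.plan", 0.9, 1.0, "ok"),
    Call("web.search", 0.0, 0.5, "ok"),
    Call("db.query", 0.0, 3.0, "error", "upstream timeout"),
    Call("llm.synthesize", 2.1, 2.5, "ok"),
    Call("notify.email", 0.0, 0.1, "denied", "policy: external email blocked"),
]


def by_cost(calls: List[Call]) -> List[Call]:
    """What cost the most? Most expensive call first."""
    return sorted(calls, key=lambda c: c.cost_cents, reverse=True)


def by_latency(calls: List[Call]) -> List[Call]:
    """What took the longest? Slowest call first."""
    return sorted(calls, key=lambda c: c.latency_s, reverse=True)


def failed_or_blocked(calls: List[Call]) -> List[Tuple[int, Call]]:
    """What failed or got blocked? Keeps each call's position in the run."""
    return [(i, c) for i, c in enumerate(calls) if c.outcome != "ok"]


print("costliest:", by_cost(run)[0].name)
print("slowest:", by_latency(run)[0].name)
for position, call in failed_or_blocked(run):
    print(f"step {position}: {call.name} {call.outcome} ({call.reason})")
```

In this made-up run the costliest step and the slowest step are different calls, which is why the two sorts are worth doing separately.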

Then you read the step

Finding the expensive step is half of it. The other half is understanding why — and for that you open the call itself. The Session X-ray lets you inspect the redacted input and output of any step: what llm.synthesize was actually handed, and what it produced.

That’s where the fix reveals itself. Maybe synthesis is being handed all three full search payloads when it only needed the top result from each — so you trim what feeds it. Maybe it’s running on a frontier model when the task is extraction — so you route that one step down a tier. Maybe the same context is re-sent on every loop turn unchanged — so you cache the stable prefix. You can’t make any of those calls from a cost total. You make them from the call’s actual inputs.
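One of those checks, whether the same context is re-sent unchanged on every loop turn, is easy to sketch once you have each turn's input in hand. This is a rough illustration under stated assumptions: the inputs are already tokenized into lists of strings (real tokenization depends on the model), and shared_prefix_len and resent_tokens are hypothetical helper names, not part of any tool.

```python
from typing import Sequence


def shared_prefix_len(turns: Sequence[Sequence[str]]) -> int:
    """Number of leading tokens that are identical in every turn's input."""
    if not turns:
        return 0
    length = 0
    # zip(*turns) walks the turns position by position and stops at the
    # shortest one, so the prefix can never run past any turn's end.
    for column in zip(*turns):
        if any(token != column[0] for token in column):
            break
        length += 1
    return length


def resent_tokens(turns: Sequence[Sequence[str]]) -> int:
    """Prefix tokens sent again, unchanged, on every turn after the first."""
    return shared_prefix_len(turns) * max(len(turns) - 1, 0)


# Toy inputs standing in for three loop turns of the same step.
turns = [
    ["SYSTEM", "DOCS", "question-1"],
    ["SYSTEM", "DOCS", "question-1", "answer-1"],
    ["SYSTEM", "DOCS", "question-2"],
]

print(shared_prefix_len(turns), "stable prefix tokens")
print(resent_tokens(turns), "tokens re-sent unchanged")
```

A long shared prefix relative to the total input suggests prefix caching is worth a look. A short one suggests trimming or re-routing the step is the better lever.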

Why “per call” is the unit

Notice what every one of those moves has in common: it’s a change to one call, justified by one call’s data. Not “the agent is expensive” — llm.synthesize on this run was handed 9,000 tokens it didn’t need. Not “it’s slow sometimes” — this step stalled for 7 seconds on attempt two. The Session X-ray works because it never abstracts away the atom. A run is its calls, in order, each with a cost, a latency, an input, an output, and an outcome. Capture that and debugging stops being archaeology.
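As a data structure, "a run is its calls" might look like the sketch below. The type names, field names, and outcome values are assumptions made for illustration, not the product's schema. The point it shows is that run-level totals are derived from the calls and never stored on their own.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Outcome(Enum):
    OK = "ok"
    ERROR = "error"
    DENIED = "denied"  # for example, a tool call blocked by policy


@dataclass(frozen=True)
class CallRecord:
    """One call: the atom that debugging works from."""
    name: str
    cost_cents: float
    latency_s: float
    redacted_input: str
    redacted_output: str
    outcome: Outcome


@dataclass
class Run:
    """A run is nothing more than its calls, kept in execution order."""
    calls: List[CallRecord] = field(default_factory=list)

    def record(self, call: CallRecord) -> None:
        self.calls.append(call)

    @property
    def cost_cents(self) -> float:
        return sum(c.cost_cents for c in self.calls)

    @property
    def latency_s(self) -> float:
        # Assumes calls run one after another; parallel calls would need
        # start and end timestamps instead of a plain sum.
        return sum(c.latency_s for c in self.calls)


# Placeholder values, chosen only to make the example easy to follow.
run = Run()
run.record(CallRecord("llm.plan", 0.5, 1.0, "<redacted>", "<redacted>", Outcome.OK))
run.record(CallRecord("llm.synthesize", 1.5, 2.5, "<redacted>", "<redacted>", Outcome.OK))

print(f"{len(run.calls)} calls, {run.cost_cents}¢, {run.latency_s}s")
```

Because the totals are computed from the records, any number you see at the run level can always be traced back to the specific calls that produced it.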

The Agent X-ray tells you which run to open. The Session X-ray tells you what to change inside it. Together they’re the loop: see the distribution, open the outlier, read the call, change one thing, watch the distribution move.

It’s free for individuals — one install, no code changes, and every call is captured. See your first run in about thirty seconds →
