Agentic Control Plane

What 28,000 agent tool calls look like

David Crowe · 6 min read
observability · replay · governance · agents

Most “AI observability” posts are someone walking you through a Datadog dashboard. We wanted to write a different one. We run Agentic Control Plane — a gateway that sits between AI clients and the tools they call. Because of where we sit, we see one thing almost nobody else does: the same human using every client, calling every tool, with every policy decision attached to every row.

Over the last 70 days, across a handful of dogfood workspaces, we’ve logged 28,256 of those events. Today we’re shipping a feature built directly from the patterns we saw in that data: Session Replay. Before we get to that, here’s what 70 days of watching agentic work in production actually looks like.

All numbers below are from our own workspaces. No customer data. The richest tenant is literally “my laptop + a few side projects” — the interesting thing is how much signal that single human generated across how many clients.

1. Agentic work is plural by default

The loudest fact in the data: 14 distinct MCP client applications showed up in the stream from a single human. Claude Code, Claude Desktop, ChatGPT, Codex, Cursor, Lovable, our OpenClaw plugin, our own chat UI, the dashboard itself, Test Harness, plus a heartbeat service that phones home. In volume:

Client              Events
Claude Code         11,882
Claude Code plugin     916
Chat UI                328
Agent runs             157
ChatGPT                 22
Claude Desktop          14
OpenClaw                 8
Lovable                  7
Codex                    3

Obvious implication: any “observability for agents” product that only sees one client is looking at one fish in a reef. The unit of analysis is the human doing work, not the app they happened to open.

2. The tool mix is not “MCP tools”. It’s actual cognitive work.

Here’s the top of the tool histogram:

Tool                         Calls
chat.completion (LLM proxy)  11,067
slack.listChannels            1,111
Read.src (Claude Code)        1,089
WebSearch                       981
Bash                            976
Edit.src                        888
Read                            849
slack.searchMessages            750
Grep                            717
Bash.npm                        616
Bash.ls                         555
apitools.fetchUrl               483
TaskUpdate                      441

Two things jump out:

  • The coding-agent tools (Read, Edit, Bash, Grep, TaskUpdate) are in there next to the SaaS tools (slack.*, google.*, notion.*, github.*). One log stream covers “I edited a TypeScript file” and “I searched Slack for last week’s launch checklist.” That’s the actual shape of daily AI-assisted work.
  • chat.completion is the #1 entry. If you’re only observing “tool use”, you’re ignoring the majority of the surface. The LLM call is the tool call in most sessions.
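That histogram is just a group-by over one unified stream. A minimal sketch, assuming each gateway event is a dict with a `tool` field (an illustrative shape, not ACP's actual schema):

```python
from collections import Counter

# Illustrative event rows; field names are assumptions, not ACP's schema.
events = [
    {"client": "Claude Code", "tool": "chat.completion"},
    {"client": "Claude Code", "tool": "Bash"},
    {"client": "Chat UI", "tool": "chat.completion"},
    {"client": "Claude Desktop", "tool": "slack.listChannels"},
]

# The histogram only means something because the stream covers every client.
histogram = Counter(e["tool"] for e in events)

for tool, calls in histogram.most_common():
    print(f"{tool:22} {calls}")
```

The same one-liner over 28,000 real rows produces the table above.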

3. The governance signals actually fire

We built PII detection, prompt-injection detection, and a policy decision engine into the gateway. The question we wanted to answer by shipping it: would anyone ever see a hit in practice? Here’s the distribution:

  • 230 events with PII detections (0.8% of all events), 260 findings in total: date_of_birth: 108, email: 86, ip_address: 47, phone: 10, ssn: 7, credit_card: 2.
  • 64 prompt-injection detections.
  • 12,890 policy-decision rows — allow/deny with a reason string, mostly from Claude Code’s pre-tool-use hook.

That credit_card: 2 is the interesting one, because those two rows represent actual moments where a Claude Code session was about to send a card number into a tool and the gateway flagged it. Without the gateway, nobody would have known. That’s the audit story in a single datum.
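For a sense of what a credit_card hit involves mechanically, here is a sketch of the standard technique: regex candidates filtered through a Luhn checksum to cut false positives. This illustrates the general approach, not ACP's actual detector:

```python
import re

def luhn_ok(digits: str) -> bool:
    """Luhn checksum; rejects most random digit runs that merely look card-shaped."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# Candidate card numbers: 13-19 digits, optionally space/dash separated.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def scan_for_card(payload: str) -> list[str]:
    hits = []
    for m in CARD_RE.finditer(payload):
        digits = re.sub(r"[ -]", "", m.group())
        if 13 <= len(digits) <= 19 and luhn_ok(digits):
            hits.append(digits)
    return hits
```

A pre-tool-use hook that runs this over the outbound payload, then attaches the finding to the event row, is the shape of the check that caught those two rows.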

4. Errors aren’t rare — they’re a product

9.4% of events failed. 2,662 errors across 70 days. Some are transient (rate limits, flaky APIs). Many are systematic — the same model failing the same tool the same way. Because each failed row carries the client, the tool, the input shape, and the error message, “what are the top ways agents break” becomes a query, not a hunch.
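In schema terms, "top ways agents break" is a count over (client, tool, error) triples. A toy version, with made-up rows:

```python
from collections import Counter

# Illustrative failed-event rows; field names are assumptions, not ACP's schema.
failures = [
    {"client": "Claude Code", "tool": "Bash", "error": "rate_limited"},
    {"client": "Claude Code", "tool": "Bash", "error": "rate_limited"},
    {"client": "Chat UI", "tool": "slack.searchMessages", "error": "invalid_cursor"},
]

# The query, not the hunch: which (client, tool, error) triple fires most?
breakage = Counter((f["client"], f["tool"], f["error"]) for f in failures)
top = breakage.most_common(1)[0]
```

Because each row already carries those three fields, the failure atlas is one aggregation away.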

We’ll publish the failure atlas in a follow-up post. The preview: coding agents and SaaS agents fail differently, and the highest error rate in our data isn’t where anyone would predict.

5. Sessions are the missing unit

Here is the thing we actually shipped today. When you look at 28,000 rows in a flat table, they’re just noise. When you group them by sessionId, something obvious emerges: an agent session is a story with a beginning, a middle, an end, a cast of tools, a set of decisions, and usually at least one interesting moment. In aggregate, the average Claude Code session in our data:

  • ~30–40 tool calls
  • spans 5–20 minutes
  • touches 4–8 distinct tools
  • hits at least one deny or error about 45% of the time

That last number is the one that surprised us, and it’s the motivation for the feature.
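Those session stats fall out of a single group-by on sessionId. A hedged sketch, with hypothetical field names (`ts`, `decision`, `error` are our stand-ins for whatever the real row carries):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical flat event rows; field names are illustrative, not ACP's schema.
events = [
    {"sessionId": "s1", "tool": "Read",      "ts": "2025-01-01T10:00:00", "decision": "allow", "error": None},
    {"sessionId": "s1", "tool": "Edit.src",  "ts": "2025-01-01T10:04:00", "decision": "allow", "error": None},
    {"sessionId": "s1", "tool": "Bash",      "ts": "2025-01-01T10:07:00", "decision": "deny",  "error": None},
    {"sessionId": "s2", "tool": "WebSearch", "ts": "2025-01-01T11:00:00", "decision": "allow", "error": "timeout"},
]

def summarise(rows):
    sessions = defaultdict(list)
    for r in rows:
        sessions[r["sessionId"]].append(r)
    out = {}
    for sid, rs in sessions.items():
        ts = sorted(datetime.fromisoformat(r["ts"]) for r in rs)
        out[sid] = {
            "tool_calls": len(rs),
            "distinct_tools": len({r["tool"] for r in rs}),
            "minutes": (ts[-1] - ts[0]).total_seconds() / 60,
            # "at least one deny or error" is the 45% stat from the post
            "has_issue": any(r["decision"] == "deny" or r["error"] for r in rs),
        }
    return out

summary = summarise(events)
```

The "only sessions with issues" filter in the new view is exactly the `has_issue` flag above.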

Shipping today: Session Replay

Starting now, every ACP workspace gets a new Sessions view. Two screens:

  • Sessions list — every agent session from the last 24h / 7d / 30d, grouped by sessionId, summarised with client, working directory, tool count, duration, allow/deny split, PII hits, injection hits, error count. Filter to “only sessions with issues” in one click.
  • Session replay — click any session. You get a scrubbable timeline of every tool call, with the policy decision, the PII/injection findings, the input/output preview (if your policy retains them), and a one-key “next issue” jump that takes you to the next deny / error / PII / injection in the session.

This is a feature you could only build if your logs have: identity, tool, decision, reason, content scan, client, session, and working directory — in one row, across every client. We’ve had that schema for a while. What we were missing was a view that made it obvious.
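To make "one row" concrete, here is roughly what that schema would have to look like as a type. Field names are our guesses from the list above, not ACP's published schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GatewayEvent:
    """Sketch of the single-row schema the replay view depends on (hypothetical names)."""
    identity: str             # the human behind the call
    client: str               # Claude Code, ChatGPT, Cursor, ...
    session_id: str           # the grouping key for replay
    tool: str                 # e.g. "slack.searchMessages"
    decision: str             # "allow" or "deny"
    reason: str               # why the policy engine decided that
    pii_findings: tuple       # e.g. ("credit_card",)
    injection: bool           # prompt-injection detector hit
    cwd: Optional[str] = None # working directory, when the client reports one
```

Every field lives on the same row, so any replay feature is a projection rather than a join across systems.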

Why this matters beyond “a nice dashboard”

Two reasons.

1. Replay is how you actually debug agents. A stack trace tells you where a program crashed. A transcript tells you what a human said. Neither captures what an agent did. A session timeline with policy decisions is the artifact auditors, engineers, and security teams have all been asking us for, in slightly different words.

2. Replay is how you justify a decision. When your compliance officer asks “prove no Claude Code session exfiltrated a credit card last quarter”, you do not want to answer with logs. You want to answer with a filter: pii=credit_card, client=Claude Code, range=90d. That query works now.
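That compliance answer is a filter over the same rows. A minimal sketch, again with assumed field names:

```python
def audit_query(rows, pii=None, client=None):
    """A filter, not a log grep: the compliance answer drops out of the schema."""
    return [
        r for r in rows
        if (pii is None or pii in r["pii_findings"])
        and (client is None or r["client"] == client)
    ]

# Illustrative rows; in practice these would be range-limited to 90 days first.
rows = [
    {"client": "Claude Code", "pii_findings": ["credit_card"]},
    {"client": "Claude Code", "pii_findings": []},
    {"client": "ChatGPT",     "pii_findings": ["email"]},
]
hits = audit_query(rows, pii="credit_card", client="Claude Code")
```

An empty result over the quarter is the proof; a non-empty one is the exact session to replay.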

Try it

If you already have an ACP workspace, /sessions is live in your dashboard today. Need one? Sign in and the onboarding flow provisions one in about 30 seconds.

If you don’t run agents through a gateway yet and you’re trying to figure out whether it’s worth it — this post is the honest answer to that question. 70 days of being able to query our own AI work is worth more than 70 days of screenshots and chat transcripts. The asymmetry gets bigger every week.


One caveat: every number in this post came from our own internal workspaces, not customer data. Governance starts at home.
