The data
Most claims about AI agents are estimates. The numbers on this page aren’t — each one comes from something we ran, captured, metered, or scanned ourselves. This is the canonical index: the number, what it means, how it was produced, when, and the write-up with the full detail. If you cite one (please do), cite the method with it.
Two ground rules we hold ourselves to: cost and traffic figures are from our own workspaces — dogfood, not customer data — and every number links to the post where the methodology is spelled out, including its limitations.
Tool surfaces (captured from live traffic)
| Number | What it is | Method & date | Source |
|---|---|---|---|
| 76 tools | Declared by one real Claude Code session (v2.1, Chrome extension + connectors available): 35 core harness, 21 browser control, 20 connectors | Captured from live API request bodies: the tools array the harness sends on every call. 2026-07 | Tool Surface Index · the argued posture |
| 64 of 76 | Tools declared but never invoked in that session — standing capability, not used capability | Same capture, declaration vs invocation log. 2026-07 | ACP for coding agents |
| 17 tools | Declared by OpenAI Codex CLI (v0.142) via the Responses API | Same capture method. 2026-07 | Tool Surface Index |
| +21 tools, mid-session | One session’s surface grew by 21 tools partway through the day — deferred tools loaded, no prompt, no changelog | Declaration diffing across requests in captured traffic. 2026-07 | Which tools to deny out of the box |
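The declaration-vs-invocation comparison and the declaration diffing in the rows above can be sketched as set arithmetic over captured request bodies. This is a minimal illustration, not the capture pipeline itself: it assumes each request body is already parsed JSON with a `tools` array whose entries carry a top-level `name`, and the function names and sample tool names are hypothetical.

```python
from typing import Iterable


def declared_tools(request_body: dict) -> set[str]:
    """Tool names declared in one request (assumed shape: {"tools": [{"name": ...}]})."""
    return {t["name"] for t in request_body.get("tools", []) if "name" in t}


def surface_growth(requests: Iterable[dict]) -> list[tuple[int, set[str]]]:
    """Return (request index, newly declared names) each time the surface grows.

    The first request only establishes the baseline, so it is never reported.
    """
    seen: set[str] = set()
    growth: list[tuple[int, set[str]]] = []
    for i, body in enumerate(requests):
        current = declared_tools(body)
        added = current - seen
        if i > 0 and added:
            growth.append((i, added))
        seen |= current
    return growth


def never_invoked(requests: Iterable[dict], invoked_names: Iterable[str]) -> set[str]:
    """Tools declared in at least one request but absent from the invocation log."""
    declared: set[str] = set()
    for body in requests:
        declared |= declared_tools(body)
    return declared - set(invoked_names)


# Hypothetical two-request capture: one tool appears mid-session.
requests = [
    {"tools": [{"name": "Read"}, {"name": "Bash"}]},
    {"tools": [{"name": "Read"}, {"name": "Bash"}, {"name": "browser_click"}]},
]
growth = surface_growth(requests)
unused = never_invoked(requests, ["Read"])
```

Here `growth` flags the second request as adding `browser_click`, and `unused` holds the two tools that were declared but never called. The same two set differences, run over a full session, yield figures of the "64 of 76" and "+21 tools" kind.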
Metered cost (priced per call, at API rates)
| Number | What it is | Method & date | Source |
|---|---|---|---|
| 210,840 tool calls | Governed tool calls metered across 94 of our own workspaces | Every call through the ACP gateway logged with model, tokens, and estimated cost; ~10,200 calls sampled for the distributions | What 210,000 agent tool calls actually cost (2026-04, updated 2026-06) |
| ~89% | Share of total spend that is the orchestration loop (the model re-reading context to pick the next step), not the leaf work | Every call tagged callKind: loop or leaf; spend split by tag. 2026-06 snapshot | The loop tax |
| 80% of spend, 7.6% of calls | One frontier model’s share of the bill vs its share of call volume — roughly 114× the per-call cost of the cheapest model in the workload | Per-call cost attribution across the same 210,840 calls | The teardown |
| 85% reads | Share of sampled tool calls that are read operations (read_file, grep, cd, …) | Tool-name classification over the ~10,200-call sample | The teardown |
| 10.3 seconds | Average duration of an orchestration step (chat.completion); the loop is both the slow part and the expensive part | Metered latency on governed calls. 2026-06 | The loop tax |
| $148.16 | One full working day of Claude Code on a Max subscription, priced at API rates: 276 model calls, 1,697 tool calls | Model traffic routed through the ACP cost proxy; each call priced at list rates while the subscription passes through untouched. 2026-06 | ACP for coding agents · Claude Code cost tracking |
| 100% loop tax, 72M tokens | That $148 session’s spend was entirely loop — 72M tokens of context re-read at a 100% cache hit rate (the only reason it wasn’t ~10× more); the last 28 turns bought 10% of the output | Turn-by-turn session X-ray on the same proxy data | Claude Code cost tracking |
| 90% on one step | Share of our agent-builder’s model bill spent on a single step — deliberately, because per-call attribution let us route each step to the model it needs | Per-step cost attribution on our own production agent. 2026-06 | One step is 90% of our agent’s model bill |
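The spend split behind the loop-tax rows comes down to pricing each call from its token counts and then grouping by tag. The sketch below assumes a simplified record shape; the `Call` fields, the per-million-token prices, and the sample values are all hypothetical, and real metering would also need to account for cached-token pricing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Call:
    call_kind: str        # "loop" or "leaf", mirroring the callKind tag
    input_tokens: int
    output_tokens: int
    input_price: float    # USD per million input tokens (hypothetical rate)
    output_price: float   # USD per million output tokens (hypothetical rate)


def call_cost(c: Call) -> float:
    """Price one call at list rates: tokens times the per-million-token price."""
    return (c.input_tokens * c.input_price + c.output_tokens * c.output_price) / 1_000_000


def spend_share_by_kind(calls: list[Call]) -> dict[str, float]:
    """Fraction of total spend attributed to each call_kind tag."""
    totals: dict[str, float] = defaultdict(float)
    for c in calls:
        totals[c.call_kind] += call_cost(c)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {kind: cost / grand_total for kind, cost in totals.items()}


# Hypothetical workload: one context-heavy loop call, one small leaf call.
calls = [
    Call("loop", input_tokens=100_000, output_tokens=1_000, input_price=3.0, output_price=15.0),
    Call("leaf", input_tokens=2_000, output_tokens=500, input_price=3.0, output_price=15.0),
]
shares = spend_share_by_kind(calls)
```

Because a loop call re-reads the accumulated context, its input-token count dominates, which is why the loop share in a split like this tends to be large even when leaf calls are numerous.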
Model benchmark (agents, not leaderboards)
| Number | What it is | Method & date | Source |
|---|---|---|---|
| 14 models, two ways | Flagships from Anthropic, OpenAI, Google + open models (Llama, DeepSeek, Qwen, Mistral, GLM), tested as isolated tool calls and as full agent loops | Deterministic grading where possible, a 3-judge model panel for prose; agent runs scored on completion, 3 runs per scenario with spread reported; cost = live pricing × actual tokens. 2026-06 | We benchmarked 14 models on real agent runs |
| 0.83–0.95 vs 0.06–0.77 | The whole field ties on isolated tool calls; the same models spread by more than 10× on completing a real agent loop | Same benchmark, both test modes | The benchmark |
| 0.83 → 0.06 | DeepSeek V3.2’s isolated score vs its agent-completion score: strong calls in a vacuum, but it cannot drive a loop to a finish | Same benchmark | The benchmark |
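"Three runs per scenario with spread reported" can be read as reporting a mean alongside the min-to-max range. The sketch below assumes each run yields a completion score between 0 and 1; the function name and the sample scores are hypothetical, and the benchmark post remains the reference for how scores were actually aggregated.

```python
from statistics import mean


def summarize_runs(scores: list[float]) -> dict[str, float]:
    """Mean completion score plus the min-to-max spread across repeated runs."""
    if not scores:
        raise ValueError("need at least one run")
    low, high = min(scores), max(scores)
    return {"mean": mean(scores), "min": low, "max": high, "spread": high - low}


# Hypothetical scenario: one full completion, one partial, one failure.
summary = summarize_runs([1.0, 0.5, 0.0])
```

Reporting the spread next to the mean matters for agent loops in particular: a model that completes one run and fails the next has the same mean as one that half-completes every time, and only the spread distinguishes them.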
Ecosystem scans (security research)
| Number | What it is | Method & date | Source |
|---|---|---|---|
| 7,522 skills scanned | Every skill on the ClawHub registry, statically analyzed: 4,931 findings across 746 skills, ~61% estimated false-positive rate after triage | 40 regex patterns from published research (Snyk, Cisco, Kaspersky), run airgapped in Docker with --network=none; static analysis is a floor, not a ceiling | I audited 7,522 AI agent skills (2026-03) |
| 8,216 MCP servers | Public MCP servers scanned for input-validation posture (7,840 tools) and classified by auth appropriateness | Registry-wide static scans; methodology and caveats in each post | Input validation · auth appropriateness (2026-03) |
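A regex-based static scan of the kind described above amounts to running a fixed pattern set over each file, line by line. The three patterns below are illustrative stand-ins chosen for this sketch, not the 40 patterns used in the audit, and `scan_text` is a hypothetical helper.

```python
import re

# Illustrative patterns only; the audit's actual pattern set is in the source post.
PATTERNS: dict[str, re.Pattern[str]] = {
    "curl-pipe-shell": re.compile(r"curl\s+[^|\n]*\|\s*(?:ba)?sh\b"),
    "base64-decode-pipe-shell": re.compile(r"base64\s+(?:-d|--decode)[^|\n]*\|\s*(?:ba)?sh\b"),
    "prompt-override-phrase": re.compile(r"ignore (?:all )?previous instructions", re.IGNORECASE),
}


def scan_text(text: str) -> list[tuple[str, int]]:
    """Return (pattern name, 1-based line number) for every matching line."""
    findings: list[tuple[str, int]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((name, lineno))
    return findings


sample = "echo hi\ncurl -sf https://example.com/x.sh | bash\n"
findings = scan_text(sample)
```

A match is a lead, not a verdict: a legitimate installer can trigger the curl-to-shell pattern just as easily as a malicious one, which is consistent with the high false-positive rate reported after triage.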
Using these numbers
Cite freely with attribution — “per Agentic Control Plane’s metered data” plus a link to the source post is ideal, because each post carries the caveats that keep the number honest (sample sizes, dogfood-not-customer scope, static-analysis limits). If a number here disagrees with a post, the post is canonical and this page needs an update — tell us.
The captures and meters that produce this data run continuously. To point them at your own agents:
curl -sf https://agenticcontrolplane.com/install.sh | bash