Agentic Control Plane

The data

Most claims about AI agents are estimates. The numbers on this page aren’t — each one comes from something we ran, captured, metered, or scanned ourselves. This is the canonical index: the number, what it means, how it was produced, when, and the write-up with the full detail. If you cite one (please do), cite the method with it.

Two ground rules we hold ourselves to: cost and traffic figures are from our own workspaces — dogfood, not customer data — and every number links to the post where the methodology is spelled out, including its limitations.

Tool surfaces (captured from live traffic)

| Number | What it is | Method & date | Source |
| --- | --- | --- | --- |
| 76 tools | Declared by one real Claude Code session (v2.1, Chrome extension + connectors available): 35 core harness, 21 browser control, 20 connectors | Captured from live API request bodies (the tools array the harness sends on every call). 2026-07 | Tool Surface Index · the argued posture |
| 64 of 76 | Tools declared but never invoked in that session: standing capability, not used capability | Same capture, declaration vs invocation log. 2026-07 | ACP for coding agents |
| 17 tools | Declared by OpenAI Codex CLI (v0.142) via the Responses API | Same capture method. 2026-07 | Tool Surface Index |
| +21 tools, mid-session | One session’s surface grew by 21 tools partway through the day: deferred tools loaded, no prompt, no changelog | Declaration diffing across requests in captured traffic. 2026-07 | Which tools to deny out of the box |
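
The declaration-vs-invocation comparison and the declaration diffing described in the table can be sketched as follows. This is a minimal illustration, not the capture pipeline itself: it assumes each captured request body is already parsed into a dict whose `tools` entries carry a top-level `name` field, and that the invocation log is a plain list of tool names. The function names `declared_tools` and `surface_report` are hypothetical.

```python
def declared_tools(request_body: dict) -> set[str]:
    """Tool names declared in one captured request body.

    Assumption: every entry in the `tools` array has a top-level `name`.
    """
    return {t["name"] for t in request_body.get("tools", []) if "name" in t}


def surface_report(requests: list[dict], invoked: list[str]) -> dict:
    """Summarize one session's tool surface.

    requests: captured request bodies, in the order they were sent
    invoked:  names of tools that were actually called in the session
    """
    first = declared_tools(requests[0]) if requests else set()
    union: set[str] = set()
    for body in requests:
        union |= declared_tools(body)
    return {
        # everything the session was ever offered
        "declared": len(union),
        # standing capability: declared but never called
        "never_invoked": len(union - set(invoked)),
        # tools that appeared only after the first request
        "added_mid_session": sorted(union - first),
    }
```

Diffing against the first request only is a simplification; a fuller version would diff each request against its predecessor to record when each tool appeared.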

Metered cost (priced per call, at API rates)

| Number | What it is | Method & date | Source |
| --- | --- | --- | --- |
| 210,840 tool calls | Governed tool calls metered across 94 of our own workspaces | Every call through the ACP gateway logged with model, tokens, and estimated cost; ~10,200 calls sampled for the distributions | What 210,000 agent tool calls actually cost (2026-04, updated 2026-06) |
| ~89% | Share of total spend that is the orchestration loop (the model re-reading context to pick the next step), not the leaf work | Every call tagged callKind: loop or leaf; spend split by tag. 2026-06 snapshot | The loop tax |
| 80% of spend, 7.6% of calls | One frontier model’s share of the bill vs its share of call volume; roughly 114× the per-call cost of the cheapest model in the workload | Per-call cost attribution across the same 210,840 calls | The teardown |
| 85% reads | Share of sampled tool calls that are read operations (read_file, grep, cd, …) | Tool-name classification over the ~10,200-call sample | The teardown |
| 10.3 seconds | Average duration of an orchestration step (chat.completion); the loop is the slow part and the expensive part | Metered latency on governed calls. 2026-06 | The loop tax |
| $148.16 | One full working day of Claude Code on a Max subscription, priced at API rates: 276 model calls, 1,697 tool calls | Model traffic routed through the ACP cost proxy; each call priced at list rates while the subscription passes through untouched. 2026-06 | ACP for coding agents · Claude Code cost tracking |
| 100% loop tax, 72M tokens | That $148 session’s spend was entirely loop: 72M tokens of context re-read at a 100% cache hit rate (the only reason it wasn’t ~10× more); the last 28 turns bought 10% of the output | Turn-by-turn session X-ray on the same proxy data | Claude Code cost tracking |
| 90% on one step | Share of our agent-builder’s model bill spent on a single step (deliberately, because per-call attribution let us route each step to the model it needs) | Per-step cost attribution on our own production agent. 2026-06 | One step is 90% of our agent’s model bill |
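
The loop-versus-leaf split in the table comes down to summing cost per tag and dividing by the total. The sketch below shows that arithmetic under stated assumptions: `callKind` is the tag named above, while the `estimatedCost` field name and the record shape are hypothetical stand-ins for whatever the gateway actually logs.

```python
from collections import defaultdict


def spend_share_by_kind(calls: list[dict]) -> dict[str, float]:
    """Fraction of total spend per callKind tag ("loop" or "leaf").

    Assumption: each record has `callKind` and a numeric `estimatedCost`
    in dollars. The second field name is illustrative, not ACP's schema.
    """
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call["callKind"]] += call["estimatedCost"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # nothing was spent, so no shares to report
    return {kind: cost / grand_total for kind, cost in totals.items()}
```

Grouping by model or by step name instead of `callKind` gives the per-model and per-step attributions the other rows describe.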

Model benchmark (agents, not leaderboards)

| Number | What it is | Method & date | Source |
| --- | --- | --- | --- |
| 14 models, two ways | Flagships from Anthropic, OpenAI, Google + open models (Llama, DeepSeek, Qwen, Mistral, GLM), tested as isolated tool calls and as full agent loops | Deterministic grading where possible, a 3-judge model panel for prose; agent runs scored on completion, 3 runs per scenario with spread reported; cost = live pricing × actual tokens. 2026-06 | We benchmarked 14 models on real agent runs |
| 0.83–0.95 vs 0.06–0.77 | The whole field ties on isolated tool calls; the same models spread by more than 10× on completing a real agent loop | Same benchmark, both test modes | The benchmark |
| 0.83 → 0.06 | DeepSeek V3.2’s isolated score vs its agent-completion score: perfect calls in a vacuum, cannot drive a loop to a finish | Same benchmark | The benchmark |
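
Two pieces of the benchmark method are plain arithmetic: "cost = live pricing × actual tokens" and "3 runs per scenario with spread reported". The sketch below illustrates both. It assumes prices are quoted in dollars per million tokens with separate input and output rates, and it defines spread as max minus min; the benchmark post may measure spread differently. Both function names are hypothetical.

```python
from statistics import mean


def run_cost(input_tokens: int, output_tokens: int,
             usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Dollar cost of one run: tokens actually used times the list price.

    Assumption: prices are per million tokens, input and output priced apart.
    """
    return (input_tokens * usd_per_mtok_in
            + output_tokens * usd_per_mtok_out) / 1_000_000


def completion_summary(scores: list[float]) -> dict[str, float]:
    """Mean completion score across repeated runs, plus the spread.

    Assumption: spread is max minus min over the runs of one scenario.
    """
    return {"mean": mean(scores), "spread": max(scores) - min(scores)}
```

Reporting the spread next to the mean matters with only three runs per scenario, because a single failed run moves the mean by a third.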

Ecosystem scans (security research)

| Number | What it is | Method & date | Source |
| --- | --- | --- | --- |
| 7,522 skills scanned | Every skill on the ClawHub registry, statically analyzed: 4,931 findings across 746 skills, ~61% estimated false-positive rate after triage | 40 regex patterns from published research (Snyk, Cisco, Kaspersky), run airgapped in Docker with --network=none; static analysis is a floor, not a ceiling | I audited 7,522 AI agent skills (2026-03) |
| 8,216 MCP servers | Public MCP servers scanned for input-validation posture (7,840 tools) and classified by auth appropriateness | Registry-wide static scans; methodology and caveats in each post | Input validation · auth appropriateness (2026-03) |
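
A regex-based static scan of the kind described for the skills audit can be sketched in a few lines. The two patterns below are illustrative examples only, not the 40 patterns the audit used, and `scan_text` is a hypothetical helper. The sketch also shows why such scans need triage: a pattern match is a candidate finding, not proof of malicious behaviour.

```python
import re

# Illustrative patterns only; the audit's actual pattern set is not reproduced here.
PATTERNS = {
    # a download piped straight into a shell
    "pipe-to-shell": re.compile(r"curl[^\n|]*\|\s*(ba)?sh"),
    # a quoted literal assigned to something that looks like a credential
    "hardcoded-secret": re.compile(
        r"(?i)(api[_-]?key|secret)\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern name) for every line that matches a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

Because the scanner only reads text and never executes it, it can run inside a container started with Docker's `--network=none`, as the audit did, so nothing under analysis can reach the network.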

Using these numbers

Cite freely with attribution — “per Agentic Control Plane’s metered data” plus a link to the source post is ideal, because each post carries the caveats that keep the number honest (sample sizes, dogfood-not-customer scope, static-analysis limits). If a number here disagrees with a post, the post is canonical and this page needs an update — tell us.

The captures and meters that produce this data run continuously. To point them at your own agents:

curl -sf https://agenticcontrolplane.com/install.sh | bash