We Benchmarked 14 Models on Real Agent Runs
Every agent you build has to pick a model, and most builders make that call on vibes: “use one good model for everything” or “always use the best one.” Both are wrong, because a single agent run isn’t one kind of work. Choosing a tool, pulling fields from a web page, writing a query, summarizing, and holding a multi-step plan together are different jobs — not equally hard, not equally costly to do well.
So we stopped guessing and measured it. We tested 14 models — the current flagships from Anthropic, OpenAI, and Google plus the strongest open models (Llama, DeepSeek, Qwen, Mistral, GLM) — two ways: as isolated tool calls across seven categories, and as real agents running a multi-step job end to end. The result is more useful than “model X is best,” and one finding should change how you route.
Two ways to test a model
Isolated microtests measure one decision in a vacuum: given a question and six tools, pick the right one; given a schema, write correct SQL; given messy HTML, extract clean JSON. We grade each 0–1 — deterministic checks where we can, a panel of three judge models for the prose ones. Cost is realized dollars from live pricing × actual tokens, not a sticker price.
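To make "realized dollars" concrete: the cost of a call is its actual token counts multiplied by the prices in effect when it ran. The sketch below assumes prices are quoted per million tokens with separate input and output rates. The `Price` and `realized_cost` names and the numbers in the usage lines are hypothetical, not our pricing data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Per-million-token prices in USD (hypothetical structure)."""
    input_per_mtok: float
    output_per_mtok: float


def realized_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Dollars actually spent on one call: live price times actual tokens."""
    return (
        input_tokens * price.input_per_mtok
        + output_tokens * price.output_per_mtok
    ) / 1_000_000


# Usage with made-up numbers: 1,200 input tokens and 300 output tokens.
example_price = Price(input_per_mtok=1.0, output_per_mtok=5.0)
example_cost = realized_cost(example_price, input_tokens=1200, output_tokens=300)
```

Summing this per call, rather than multiplying a list price by an estimated token count, is what separates a realized cost from a sticker price.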
The agent eval measures whether a model can actually run an agent: a real multi-turn loop where it must search, read what came back, decide what’s next, and finish the job. Same model — but now every step depends on the last. We run each scenario three times and report the spread, because a single lucky trajectory is a data point, not a ranking.
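One way to turn repeated runs into a "score ± spread" figure is sketched below. It assumes the spread is a population standard deviation across the repeats, which is an illustrative choice; swap in half-range or another statistic if that fits your reporting better. The function name and the scores are made up.

```python
from statistics import mean, pstdev


def summarize_runs(scores: list[float]) -> tuple[float, float]:
    """Return (mean, spread) for repeated runs of one scenario.

    Spread here is the population standard deviation, an assumption
    made for illustration.
    """
    if not scores:
        raise ValueError("need at least one run")
    return mean(scores), pstdev(scores)


# Made-up completion scores for three repeats of one scenario.
steady = summarize_runs([0.8, 0.7, 0.8])
erratic = summarize_runs([1.0, 0.0, 0.3])
```

With only three repeats the spread is a rough signal, but it is enough to tell a steady model from one whose best run is a fluke.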
Finding 1: isolated skill tells you almost nothing about agent skill
On the microtests, almost everyone is fine — across all seven categories every model landed between 0.83 and 0.95. Picking a tool, building a payload, summarizing: capable models tie, and where they tie the cheapest competent one wins.
Then we ran the same models as agents. Watch what happens:
| Model | Isolated microtest avg | Agent completion |
|---|---|---|
| Claude Opus 4.8 | 0.95 | 0.77 ±0.08 |
| Claude Sonnet 4.6 | 0.95 | 0.66 ±0.32 |
| Claude Haiku 4.5 | 0.94 | 0.65 ±0.22 |
| GPT-5.2 | 0.94 | 0.52 ±0.37 |
| GPT-4o | 0.92 | 0.50 ±0.24 |
| Qwen 2.5 72B | 0.94 | 0.48 ±0.42 |
| GPT-5 mini | 0.95 | 0.47 ±0.34 |
| Llama 3.3 70B | 0.92 | 0.43 ±0.37 |
| GLM 5.2 | 0.95 | 0.43 ±0.43 |
| GPT-4o mini | 0.92 | 0.42 ±0.36 |
| Gemini 2.5 Pro | 0.85 | 0.35 ±0.30 |
| Gemini 2.5 Flash | 0.94 | 0.22 ±0.15 |
| DeepSeek V3.2 | 0.83 | 0.06 ±0.06 |
The microtest column is nearly flat. The agent column ranges from 0.77 to 0.06. DeepSeek scores 0.83 on isolated calls and 0.06 running an agent: it can write a perfect query in a vacuum and then almost never drives a loop to a finish. Qwen and GLM ace the microtests (0.94, 0.95) and complete fewer than half their agent runs. How a model does on a single tool call barely predicts whether it can run an agent, so if you pick your loop model off a function-calling leaderboard that tests isolated calls, you’ll be ranking on the wrong skill.
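The gap between the two columns can be checked directly from the table above. This sketch copies the two score columns into a dictionary and computes each column's width (best minus worst) along with the per-model drop from isolated score to agent completion. The variable and function names are illustrative.

```python
# (microtest average, agent completion) copied from the table above.
SCORES = {
    "Claude Opus 4.8": (0.95, 0.77),
    "Claude Sonnet 4.6": (0.95, 0.66),
    "Claude Haiku 4.5": (0.94, 0.65),
    "GPT-5.2": (0.94, 0.52),
    "GPT-4o": (0.92, 0.50),
    "Qwen 2.5 72B": (0.94, 0.48),
    "GPT-5 mini": (0.95, 0.47),
    "Llama 3.3 70B": (0.92, 0.43),
    "GLM 5.2": (0.95, 0.43),
    "GPT-4o mini": (0.92, 0.42),
    "Gemini 2.5 Pro": (0.85, 0.35),
    "Gemini 2.5 Flash": (0.94, 0.22),
    "DeepSeek V3.2": (0.83, 0.06),
}


def column_range(values: list[float]) -> float:
    """Width of a score column: best minus worst."""
    return max(values) - min(values)


micro_range = column_range([micro for micro, _ in SCORES.values()])
agent_range = column_range([agent for _, agent in SCORES.values()])

# Per-model drop from isolated score to agent completion, largest first.
drops = sorted(
    ((name, micro - agent) for name, (micro, agent) in SCORES.items()),
    key=lambda item: item[1],
    reverse=True,
)
```

The microtest column spans 0.12 while the agent column spans 0.71, and the largest single drop belongs to DeepSeek V3.2.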
Finding 2: cheap models aren’t worse — they’re erratic
Look at the ± column. Opus completes at 0.77 ±0.08: steady. The cheap models swing violently: GLM ±0.43, Qwen ±0.42, Llama ±0.37. A model that scores near 0 on a task in one run and 0.90 on the next isn’t a cheaper option. It’s a coin flip, and a coin-flip orchestrator is worse than a reliably mediocre one, because you can’t build on it. This is the part a single run hides: GLM looked like a perfect 1.00 until we ran it three times and saw 0.43 ±0.43. Repetition turned a fluke into the real story.
So the rule, now with evidence: keep one coherent, consistent model on the loop, and route the cheap leaf work (extraction, summarization, a single call) to whichever model is cheapest-competent. Best-value loop driver in our set: Claude Haiku (0.65 completion at a fraction of frontier cost). Most reliable: Opus, for the genuinely hard, long-horizon agents. The priciest model is rarely the best buy for the loop, and never for the leaves.
This isn’t a contrarian take; it’s where the field is going (the Berkeley Function-Calling Leaderboard documents the same multi-turn cliff, and production agent stacks increasingly run one loop model plus cheaper swappable slots). Our contribution is measuring it on real agent runs, with repeats.
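The routing rule above (one fixed loop model, cheapest-competent for leaf work) can be sketched in a few lines. Everything here is an assumption for illustration: the `Candidate` fields, the step labels in `LOOP_STEPS`, the 0.9 competence threshold, and the placeholder model names and costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    leaf_score: float     # isolated microtest score for this kind of task, 0-1
    cost_per_call: float  # realized dollars per call


LOOP_STEPS = {"plan", "orchestrate"}  # hypothetical step labels


def pick_model(
    step: str,
    loop_model: str,
    candidates: list[Candidate],
    min_leaf_score: float = 0.9,
) -> str:
    """Keep one model on the loop; send leaf work to the cheapest competent one."""
    if step in LOOP_STEPS:
        return loop_model
    competent = [c for c in candidates if c.leaf_score >= min_leaf_score]
    if not competent:
        # Nothing clears the bar, so fall back to the loop model.
        return loop_model
    return min(competent, key=lambda c: c.cost_per_call).name


# Placeholder models with made-up scores and costs.
candidates = [
    Candidate("model-a", leaf_score=0.95, cost_per_call=0.020),
    Candidate("model-b", leaf_score=0.92, cost_per_call=0.002),
    Candidate("model-c", leaf_score=0.70, cost_per_call=0.001),
]
leaf_choice = pick_model("extract", "model-a", candidates)
loop_choice = pick_model("plan", "model-a", candidates)
```

Here the leaf step goes to `model-b`: it is the cheapest model that clears the threshold, while the cheaper `model-c` is excluded for falling below it. The loop step always stays on `model-a`.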
Finding 3: production agrees — the loop is where the money is
A controlled benchmark is one thing, so we checked it against the real corpus: 210,000+ tool calls metered through our gateway. Production spend confirms the benchmark’s premise: one frontier model on the orchestration loop accounted for ~80% of the bill while making under 8% of the calls, and the cheap model did the bulk of the work for a few percent of the cost. The loop is where the money is, which is exactly why the loop model is the one decision worth getting right, and the benchmark says it shouldn’t default to the most expensive option. (Full breakdown in the cost teardown and the loop tax.)
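If you meter your own calls, the same "share of calls versus share of bill" comparison is a short aggregation. This sketch assumes each record is a `(model_name, cost_in_dollars)` pair; the record shape, the function name, and the sample log are hypothetical and are not drawn from our corpus.

```python
from collections import defaultdict


def shares_by_model(calls: list[tuple[str, float]]) -> dict[str, tuple[float, float]]:
    """Map each model to (share of calls, share of spend)."""
    counts: dict[str, int] = defaultdict(int)
    spend: dict[str, float] = defaultdict(float)
    for model, cost in calls:
        counts[model] += 1
        spend[model] += cost

    total_calls = len(calls)
    total_spend = sum(spend.values())
    return {
        model: (
            counts[model] / total_calls,
            spend[model] / total_spend if total_spend else 0.0,
        )
        for model in counts
    }


# Made-up log: one expensive loop call and nine cheap leaf calls.
sample_calls = [("loop-model", 0.40)] + [("leaf-model", 0.01)] * 9
shares = shares_by_model(sample_calls)
```

In this toy log the loop model makes 10% of the calls and takes over 80% of the spend, the same shape as the production pattern described above.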
What we don’t trust yet
We’d rather you take this as a strong v0 than gospel:
- Only two agent scenarios. “Run a real agent” should mean a dozen archetypes — a scout, a coder, a multi-agent crew — not two. The harness is built; the eval set needs to grow before any single ranking is final.
- Simulated tools. The agent eval uses mocked tool backends for reproducibility. The real prize is scoring models on actual outcomes. Because every real tool call’s cost and result are recorded, that’s the next version: an outcome-grounded benchmark on the live corpus.
- Judges aren’t humans. A panel of three beats one grader, but the completion and prose scores still want a human spot-check before anyone bets on a single row.
The headline is the part we’d stand behind: isolated skill ≠ agent skill, cheap models are erratic on the loop, and the expensive model is rarely the best buy. For the loop, pick coherent. For the leaves, pick cheap. And measure it for your own workload — which you can only do if you can see and price every model and tool call your agents make.