Our agent got quietly worse, and nothing errored
An agent that throws an exception is the easy case. You get a stack trace, an alert, a red line on a dashboard, and you go fix it. The hard case is the agent that keeps running, returns on time, throws nothing — and is quietly worse than it was last week. Same inputs, flabbier outputs. Nothing to grep for. Nothing wakes you up.
We run an agent that builds other agents, and it had exactly this problem. We found it by reading its own logs.
The one step the whole product rides on
The builder writes a system prompt for each agent it ships. That step — compose_prompt — is where model quality turns into product quality: a stronger model writes a tighter, more capable prompt, and every agent that comes out of the builder inherits it. So we pinned that step to Claude Sonnet on purpose.
For a stretch, agents coming out of the builder were a notch worse than we knew they could be. Vaguer prompts, weaker instructions, the occasional missed constraint. Nothing errored. Nothing alerted. It was the kind of regression you normally only catch when a human notices the output “feels off” — which is to say, the kind most teams never catch at all.
The log says what the source can’t
Reading the agent’s source tells you nothing here, because the source is correct: the agent’s profile says model: claude-sonnet-4-5. To see the problem you have to look at what the call actually did, not what it was configured to do.
Because Agentic Control Plane sits between the agent and every tool it calls, every call — including every model call — lands in an append-only log with the real model on the row. Pulling the audit for compose_prompt, the rows didn’t match the config:
{
"ts": "2026-06-14T09:42:18.221Z",
"agentName": "Agent Builder",
"agentRunId": "ar_91be…",
"agentTier": "background",
"tool": "chat.completion",
"ok": true,
"latencyMs": 1840,
"proxy": {
"model": "gemini-2.5-flash", // ← configured for claude-sonnet-4-5
"provider": "gemini",
"finishReason": "stop"
}
}
ok: true. The call succeeded. It returned a perfectly valid system prompt — just a weaker one, written by a cheaper model. A model downgrade doesn’t fail. It looks exactly like working. That is what makes it invisible to everything except a log that records which model actually ran.
The regression only existed where nobody was watching
The builder runs in more than one place. There’s the interactive path — you, in the builder, watching it work — and then the paths that do the real volume: it runs on a schedule, it runs when something triggers it, it runs when someone replies to one of its emails. Nobody watches those.
Each of those run paths had grown its own model-resolution logic, and instead of honoring the agent’s configured model, they checked it against a list meant for interactive chat. When the configured model — Sonnet — wasn’t on that list, the path silently fell back to the cheap default. The interactive path we stared at during development was fine. The unattended paths, the ones doing the actual work, had been quietly downgraded.
So the agent looked great every time we tested it by hand and was worse every time it ran on its own. The only artifact that could tell those two apart was the log, because the log records the model per call, per run path, per agent — not the model you intended, the one that ran.
The fix, and the habit worth keeping
The fix was to route every path — interactive, scheduled, triggered, reply — through one resolver that starts from the agent’s configured model and only falls back if that model’s provider has no key. One source of truth instead of four divergent ones. After the change, compose_prompt runs Sonnet everywhere, and the log says so.
But the fix is less interesting than the practice it points to: your agent’s audit log is its performance record. The wrong model is one silent regression; there are others that also never throw —
- a tool call that gets denied by policy, which the agent quietly retries every run instead of routing around it
- a step that fails and gets swallowed by a try/catch, degrading the result without surfacing
- latency creep on one step that drags out the whole run
None of these raise an exception. None page anyone. They just make the agent a little worse, indefinitely. You find them the same way we found the downgrade — by reading the record of what the agent actually did, call by call, with the decision, the model, and the latency on every row.
The question worth asking about your own agents: if one of them got quietly worse tomorrow — no crash, no alert — would anything tell you? That’s the logs pillar of ACP: every tool call your agents make, what ran, what it returned, what it cost, on one readable trail. Ours had the answer before we knew to ask.
See every tool call your agents make →