The Cost of Running Agents Without Budget Controls

2026-01-02

so you deploy an agent to production. it works great. then one tuesday morning, you wake up to a slack notification that your LLM spend hit $8,000 before breakfast. you didn’t sell anything extra. you didn’t scale to 10x traffic. one agent just… went sideways.

this is scarier than it sounds because agents don’t make one call. they make dozens. maybe hundreds. each reasoning step, each API retry, each tool invocation—that’s another call. that’s another cost.

how agents burn money fast

a bug creeps in. maybe your agent tries to call an external API that’s temporarily down, doesn’t catch the error properly, and retries in a tight loop. every retry is another reasoning step, which means another call to your LLM provider. say that’s 100 calls per second. in 30 seconds that’s 3,000 calls, and at around $0.67 each (a big context on a big model), congratulations, you’ve spent $2,000.

or the agent hallucinates. it thinks it needs to do something, tries it, fails, and the error message tells it to try again with more context. so it rebuilds the context window, adds more detail, tries again. each iteration the context gets longer, and you pay for every input token again on every call. so each retry costs more than the last one, and the total bill grows roughly with the square of the number of retries.
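to put rough numbers on that, here’s a small sketch. the price and token counts are made up for illustration (they are not any provider’s real pricing), and `retry_costs` is a hypothetical helper, not part of any library.

```python
# assumed price, for illustration only (not real provider pricing)
PRICE_PER_1K_INPUT_TOKENS = 0.01  # dollars


def retry_costs(start_tokens: int, added_per_retry: int, attempts: int) -> list[float]:
    """Input cost of each attempt when every retry appends more context."""
    costs = []
    tokens = start_tokens
    for _ in range(attempts):
        costs.append(tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS)
        tokens += added_per_retry  # the agent "adds more detail" and tries again
    return costs


costs = retry_costs(start_tokens=4_000, added_per_retry=4_000, attempts=10)
total = sum(costs)

print(f"first attempt: ${costs[0]:.2f}")
print(f"last attempt:  ${costs[-1]:.2f}")
print(f"all attempts:  ${total:.2f}")
```

with these assumed numbers the tenth attempt costs 10x the first, and the ten attempts together cost 55x the first. the per-call growth is linear, but the running total is what hurts.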

then there’s the meta-agent problem: your agent spawns sub-agents to parallelize work. one of those sub-agents spawns more sub-agents. nobody’s watching the tree, and suddenly you’ve got 1,000 concurrent agents all making calls because nobody told them to stop.

i know someone who saw this happen. they spent their entire quarterly LLM budget in 43 minutes. it’s the kind of story that gets told at team standup with this uneasy laugh that everyone recognizes.

why rate limiting isn’t enough

okay, so you slap a rate limiter on it. “max 100 requests per second per user,” you say confidently.

that helps. sort of. but rate limiting counts requests, not cost. a single call to a large model with a massive context window can cost $1 or more. one expensive call isn’t blocked by a rate limiter that’s designed around request volume. you’re protecting yourself against spamming, not against bankruptcy.
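a quick back-of-the-envelope version, using the two numbers above (100 requests per second, $1 per call). `worst_case_hourly_spend` is just a hypothetical helper for the arithmetic, and it assumes every allowed request is as expensive as it can be.

```python
def worst_case_hourly_spend(max_requests_per_second: float, max_cost_per_call: float) -> float:
    """Upper bound on hourly spend when the only control is a request-rate limit."""
    seconds_per_hour = 3600
    return max_requests_per_second * seconds_per_hour * max_cost_per_call


ceiling = worst_case_hourly_spend(max_requests_per_second=100, max_cost_per_call=1.00)
print(f"worst case per user per hour: ${ceiling:,.0f}")
```

that comes out to $360,000 per user per hour. the rate limiter is doing its job the whole time; it’s just bounding the wrong quantity.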

you need something that understands money.

what actually works: budget-aware controls

this is where things get practical. you need three things working together:

pre-flight budget checks. before your agent even makes a call, you ask: “do i have enough budget left in my allocation?” if not, the call doesn’t happen. no surprises. no accidentally burning through next quarter’s budget.
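a minimal sketch of a pre-flight check, assuming you can estimate a call’s cost from its token counts. everything here (`Budget`, `estimate_cost`, `guarded_call`, the prices) is hypothetical and written for illustration; it is not the API of any particular library.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, TypeVar

T = TypeVar("T")


class BudgetExceeded(Exception):
    """Raised when a call's estimated cost does not fit in the remaining budget."""


@dataclass
class Budget:
    limit: Decimal
    spent: Decimal = Decimal("0")

    @property
    def remaining(self) -> Decimal:
        return self.limit - self.spent


def estimate_cost(input_tokens: int, max_output_tokens: int,
                  input_price_per_1k: Decimal, output_price_per_1k: Decimal) -> Decimal:
    """Upper-bound estimate: assumes the model uses its full output allowance."""
    return (input_tokens * input_price_per_1k
            + max_output_tokens * output_price_per_1k) / 1000


def guarded_call(budget: Budget, estimated: Decimal, call: Callable[[], T]) -> T:
    """Run `call` only if its estimated cost fits in the remaining budget."""
    if estimated > budget.remaining:
        raise BudgetExceeded(f"need ${estimated}, only ${budget.remaining} left")
    result = call()
    # a real system would reconcile this with the provider's reported usage
    budget.spent += estimated
    return result


budget = Budget(limit=Decimal("1.00"))
cost = estimate_cost(10_000, 1_000, Decimal("0.01"), Decimal("0.03"))  # assumed prices
reply = guarded_call(budget, cost, lambda: "stubbed model reply")
```

the point of the design is the ordering: the check runs before `call()`, so a rejected call costs nothing. `Decimal` is used so repeated additions don’t drift the way floats can.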

sliding-window rate limits that are aware of cost. not just “requests per second,” but dollars per time window, scoped per user, per tenant, per tool. you can say “this user gets $10 per hour” and actually enforce it. when they’re at $9.87, the next expensive call doesn’t go through.
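here’s one way a cost-aware sliding window could look, as a sketch. `SlidingCostWindow` is a hypothetical class, amounts are in integer cents to avoid float rounding, and it keeps everything in memory (a real deployment would likely need shared storage).

```python
import time
from collections import defaultdict, deque
from typing import Hashable, Optional


class SlidingCostWindow:
    """Caps spend per key (user, tenant, tool, ...) over a rolling time window."""

    def __init__(self, limit_cents: int, window_seconds: float):
        self.limit_cents = limit_cents
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> deque of (timestamp, cents)

    def _spent(self, key: Hashable, now: float) -> int:
        events = self._events[key]
        # drop spend that has aged out of the window
        while events and events[0][0] <= now - self.window_seconds:
            events.popleft()
        return sum(cents for _, cents in events)

    def try_spend(self, key: Hashable, cents: int, now: Optional[float] = None) -> bool:
        """Record the spend and return True if it fits, else return False."""
        if now is None:
            now = time.monotonic()
        if self._spent(key, now) + cents > self.limit_cents:
            return False
        self._events[key].append((now, cents))
        return True


# "this user gets $10 per hour"
window = SlidingCostWindow(limit_cents=1000, window_seconds=3600)
alice = ("user", "alice")
window.try_spend(alice, 987, now=0)                 # alice is now at $9.87
expensive_ok = window.try_spend(alice, 50, now=10)  # $0.50 would push her past $10
cheap_ok = window.try_spend(alice, 10, now=20)      # $0.10 still fits
```

in this run `expensive_ok` is `False` and `cheap_ok` is `True`. using a tuple as the key is one simple way to cover several dimensions: check `("user", ...)`, `("tenant", ...)` and `("tool", ...)` and only proceed if all of them allow the spend.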

agent guard that detects loops. repeated patterns mean trouble. if your agent keeps trying the same thing over and over with slight variations, that’s a signal. you can catch it before the loop eats your budget.
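a rough sketch of what loop detection can look like. this is not how any particular product does it; `LoopGuard` is a hypothetical class that compares each action string against recent ones with the standard library’s `difflib.SequenceMatcher`, and the thresholds are guesses you’d tune.

```python
from collections import deque
from difflib import SequenceMatcher


class LoopGuard:
    """Flags an action when it closely resembles several recent actions."""

    def __init__(self, history: int = 20, similarity: float = 0.9, max_repeats: int = 3):
        self.recent = deque(maxlen=history)  # last N action strings
        self.similarity = similarity         # 0..1, how close counts as "the same thing"
        self.max_repeats = max_repeats       # similar past actions tolerated before flagging

    def record(self, action: str) -> bool:
        """Store the action; return True if it looks like part of a loop."""
        similar = sum(
            1 for past in self.recent
            if SequenceMatcher(None, past, action).ratio() >= self.similarity
        )
        self.recent.append(action)
        return similar >= self.max_repeats


guard = LoopGuard(max_repeats=3)
flags = [guard.record(f"fetch_invoice(id=123, retry={n})") for n in range(1, 5)]
```

here the first three attempts pass and the fourth is flagged, because it’s the same call with a slightly different `retry` value each time. what you do with the flag (pause the agent, page a human, cut the budget) is a separate decision.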

GatewayStack’s limitabl module was built exactly for this. pre-flight checks happen before the expensive call. sliding windows work across all your dimensions—user, tenant, tool, model. and agent guard watches for the patterns that usually indicate something went wrong.

the key insight

rate limiting is about request volume. budget control is about actual spend. a single expensive model call can cost more than a thousand cheap ones. if you’re serious about running agents safely, you need to track estimated cost before each call, not just count requests after they’ve happened.

the cost of a runaway agent without controls? somewhere between “really annoying” and “career-limiting.” the cost of adding budget controls? basically nothing.

so here’s my question for you: if you’re running agents right now, do you actually know what the maximum possible spend could be in any given hour? and if you don’t know that number, should you be running them?
