Agentic Control Plane

Stop your AI agent from burning through your API budget — in three steps

David Crowe · 5 min read
governance defense-in-depth rate-limiting budget-caps

A solo developer posted on the Cursor forum that their Cursor agent fell into a loop when context was summarized — burning $135 of credits in a week before they noticed. A Codex CLI issue titled “Sub-agent costs are too high and too opaque” documents users running $350 over their Pro plan in a single week. A consultant in Australia woke up to an $18,000 GCP bill after a leaked Gemini key fielded 60,000 unauthorized requests, blowing past a $1,400 spending cap that was supposedly hard.

The pattern: an agent gets stuck, or a credential leaks and someone else starts using it, and the bill arrives later. The provider’s “spending cap” turns out to be advisory, or to apply at a granularity (monthly, per-account) that doesn’t catch a runaway loop running for hours.

If you’re running agents and you don’t have an external rate limit, your only stopgap is “I notice in time” — and devs reliably don’t.

Why the provider’s caps don’t save you

Most AI providers and cloud platforms have some form of spending limit. They share a few weaknesses:

  • They evaluate at coarse intervals. Stripe-style usage billing settles every minute or hour, not per call. A loop can produce thousands of calls between settlement boundaries.
  • They apply at the account level, not the agent level. If you have ten agents, one cap covers all of them collectively, so a single runaway agent can burn the budget for the rest.
  • They don’t distinguish tiers. A $5/hour interactive Cursor session is fine. A $5/hour cron-scheduled background agent is a problem. The provider’s cap doesn’t know the difference.
  • They’re “advisory” more often than not. A Google Cloud budget is famously a notification, not a hard limit: it sends alerts but does not stop usage by itself, so an attacker can run past it for hours before the account is suspended.
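To make the first point concrete, here is a back-of-the-envelope sketch in Python. It assumes a loop firing calls at a steady rate with a fixed cost per call, and a cap that is only checked at each settlement boundary. The rate, cost, and interval are illustrative numbers, not figures from any provider.

```python
def worst_case_overshoot(calls_per_minute: float, cost_per_call: float,
                         check_interval_minutes: float) -> tuple[float, float]:
    """Calls and dollars a steady loop adds between two cap checks.

    Worst case: the cap is crossed just after a check, so the loop runs
    a full interval before the next check can notice.
    """
    calls = calls_per_minute * check_interval_minutes
    return calls, calls * cost_per_call

# Illustrative numbers only: one call per second at $0.01 each,
# against a cap that is evaluated once an hour.
calls, dollars = worst_case_overshoot(60, 0.01, 60)
print(f"{calls:.0f} calls, ${dollars:.2f} past the cap")  # 3600 calls, $36.00 past the cap
```

A gate that checks before every call has no such interval: it can deny the first call over the limit instead of settling up an hour later.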

Your defense isn’t the provider’s cap. Your defense is a control plane between your agent and the provider’s API.

Three steps that put a budget gate between your agent and the bill

Step 1 — Install the hook

For Cursor, Claude Code, or Codex CLI:

curl -sf https://agenticcontrolplane.com/install.sh | bash

Every tool call the agent makes goes through ACP’s hook, including the LLM round-trips for non-proxied calls and the underlying API calls for tool dispatch. The hook tracks per-agent, per-tier call counts and token spend in a local in-memory window, and ACP’s gateway aggregates those counts fleet-wide.
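ACP’s internals aren’t documented in this post, so what follows is only a minimal Python sketch of the kind of bookkeeping described: a local, in-memory tracker keyed by (agent, tier) that records each call’s timestamp and token count, plus a function that sums several local trackers into a fleet-wide view. The names `UsageWindow` and `fleet_totals`, and the agent names in the usage lines, are hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field

@dataclass
class UsageWindow:
    """Hypothetical local tracker: per-(agent, tier) events kept in memory."""
    window_seconds: float
    # (agent, tier) -> deque of (timestamp, tokens), one entry per call
    events: dict = field(default_factory=lambda: defaultdict(deque))

    def record(self, agent: str, tier: str, tokens: int, now: float) -> None:
        self.events[(agent, tier)].append((now, tokens))

    def totals(self, agent: str, tier: str, now: float) -> tuple[int, int]:
        """Return (calls, tokens) inside the window ending at `now`."""
        q = self.events[(agent, tier)]
        # Drop events that have aged out of the window.
        while q and q[0][0] <= now - self.window_seconds:
            q.popleft()
        return len(q), sum(t for _, t in q)

def fleet_totals(windows: list[UsageWindow], now: float) -> dict:
    """Aggregate (calls, tokens) per tier across local trackers,
    as a central gateway might."""
    agg = defaultdict(lambda: [0, 0])
    for w in windows:
        for agent, tier in list(w.events):
            calls, tokens = w.totals(agent, tier, now)
            agg[tier][0] += calls
            agg[tier][1] += tokens
    return {tier: tuple(v) for tier, v in agg.items()}

# Two machines, each with its own local tracker and a 60-second window.
laptop = UsageWindow(window_seconds=60)
ci_box = UsageWindow(window_seconds=60)
laptop.record("refactor-bot", "background", tokens=1200, now=0.0)
laptop.record("refactor-bot", "background", tokens=800, now=30.0)
ci_box.record("nightly-triage", "background", tokens=500, now=45.0)
print(laptop.totals("refactor-bot", "background", now=70.0))  # (1, 800): the t=0 call aged out
print(fleet_totals([laptop, ci_box], now=70.0))  # {'background': (2, 1300)}
```

Keeping the window local means the per-call check needs no network round-trip; only the aggregation step has to talk to a central service.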

Step 2 — Set per-tier rate limits and call budgets

In your dashboard (cloud.agenticcontrolplane.com) → Policies:

{
  "mode": "enforce",
  "defaults": {
    "background": {
      "rateLimit": { "calls": 30, "window": "1m" },
      "tokenBudget": { "max": 500000, "window": "1d" },
      "callBudget":  { "max": 200,    "window": "1h" }
    },
    "interactive": {
      "rateLimit": { "calls": 120, "window": "1m" },
      "tokenBudget": { "max": 2000000, "window": "1d" }
    },
    "subagent": {
      "rateLimit": { "calls": 20, "window": "1m" },
      "callBudget": { "max": 50,  "window": "10m" },
      "comment": "subagents are spawned by other agents — tightest limits"
    }
  }
}
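The policy uses compact window strings (“1m”, “10m”, “1h”, “1d”). A minimal Python sketch of how a client might turn one tier’s entry into concrete limits follows. The unit table (s/m/h/d), the rule that `rateLimit` uses `calls` while budgets use `max`, and the helper names are assumptions inferred from the example, not a documented ACP schema.

```python
import json
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}  # assumed units

def parse_window(spec: str) -> int:
    """Convert a window string like '10m' into seconds."""
    m = re.fullmatch(r"(\d+)([smhd])", spec)
    if not m:
        raise ValueError(f"unrecognized window: {spec!r}")
    return int(m.group(1)) * UNIT_SECONDS[m.group(2)]

def tier_limits(policy: dict, tier: str) -> dict:
    """Return {limit_name: (max, window_seconds)} for one tier,
    skipping non-limit keys such as 'comment'."""
    limits = {}
    for name, spec in policy["defaults"][tier].items():
        if not isinstance(spec, dict):
            continue
        maximum = spec.get("calls", spec.get("max"))  # rateLimit uses "calls"
        limits[name] = (maximum, parse_window(spec["window"]))
    return limits

policy = json.loads("""
{"mode": "enforce",
 "defaults": {"subagent": {
   "rateLimit": {"calls": 20, "window": "1m"},
   "callBudget": {"max": 50, "window": "10m"},
   "comment": "subagents are spawned by other agents"}}}
""")
print(tier_limits(policy, "subagent"))
# {'rateLimit': (20, 60), 'callBudget': (50, 600)}
```

Rejecting unknown window strings loudly is deliberate: a typo like “1hr” silently parsed as something else would quietly loosen a limit.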

The semantics: a background-tier agent has sliding-window caps of 30 calls per minute, 200 calls per hour, and 500K tokens per day. If a loop produces a 31st call inside a one-minute window, that 31st call is denied. The agent sees a tool_error: rate_limited and adapts (or aborts).
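A sliding-window log is one common way to get the “31st call is denied” behavior. This Python sketch keeps the timestamps of allowed calls and rejects a call when the last `window_seconds` already hold `max_calls`. It is a generic technique, not necessarily how ACP implements it, and the `"rate_limited"` return value only mirrors the error name above.

```python
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `max_calls` within any `window_seconds` span."""

    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window = window_seconds
        self.allowed = deque()  # timestamps of calls that were let through

    def check(self, now: float) -> str:
        # Forget calls that are older than the window.
        while self.allowed and self.allowed[0] <= now - self.window:
            self.allowed.popleft()
        if len(self.allowed) >= self.max_calls:
            return "rate_limited"  # denied calls are not recorded
        self.allowed.append(now)
        return "ok"

limiter = SlidingWindowLimiter(max_calls=30, window_seconds=60)
results = [limiter.check(now=i * 0.5) for i in range(31)]  # 31 calls in 15 s
print(results[29], results[30])  # ok rate_limited
print(limiter.check(now=60.0))   # ok: the call at t=0 has left the window
```

Denied calls are not added to the log, so a loop that keeps retrying does not extend its own lockout; it gets capacity back as soon as old calls slide out of the window.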

For the subagent tier specifically, you can be tighter — subagents are spawned by other agents and are the most common source of runaway fan-out.
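Because each tier stacks several windows, which limit binds depends on the time horizon. This Python sketch computes each tier’s long-run ceiling in calls per hour: every limit of `max` calls per `window` seconds allows at most `max / window` calls per second on a sustained basis, and the tightest one wins. The call limits are copied from the policy above; token budgets are left out.

```python
# Call limits from the policy above, as (max_calls, window_seconds).
CALL_LIMITS = {
    "background":  [(30, 60), (200, 3600)],
    "interactive": [(120, 60)],
    "subagent":    [(20, 60), (50, 600)],
}

def sustained_calls_per_hour(limits: list[tuple[int, int]]) -> float:
    """Long-run ceiling: each limit allows max/window calls per second;
    the tightest one wins."""
    return min(m * 3600 / w for m, w in limits)

for tier, limits in CALL_LIMITS.items():
    print(f"{tier:12} {sustained_calls_per_hour(limits):6.0f} calls/hour")
```

With these numbers the subagent tier is the tightest per minute (20 versus 30), but over a sustained hour its 10-minute budget admits up to 300 calls versus 200 for the background tier. If subagents should be the tightest tier at every horizon, adding an hourly callBudget below 200 to the subagent tier would do it under this model.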

Step 3 — Set a hard fleet-wide budget alarm

For the “$18,000 GCP bill” scenario, you don’t just want per-agent rate limits — you want a fleet-wide budget threshold that triggers a hard kill switch:

{
  "fleetBudget": {
    "monthlySpendCap": "$500",
    "alertThresholds": ["$100", "$250", "$400"],
    "onBreach": {
      "action": "deny_all_tool_calls",
      "scope": "tenant",
      "until": "manual_reset"
    }
  }
}

When cumulative spend across all agents in your tenant reaches the $500 monthly cap, every tool call is denied until you manually reset. Alerts fire at $100, $250, and $400 so you know it’s coming.
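The fleet-cap behavior is straightforward to model. This Python sketch tracks cumulative tenant spend in integer cents (to avoid floating-point drift), fires each alert threshold once, and switches to deny-all on breach until an explicit reset. The class, its methods, and the `alert` callback are hypothetical, and the sketch ignores the monthly rollover.

```python
class FleetBudget:
    """Hypothetical tenant-wide spend cap with one-shot alert thresholds."""

    def __init__(self, cap_cents: int, alert_cents: list[int], alert=print):
        self.cap = cap_cents
        self.pending_alerts = sorted(alert_cents)
        self.alert = alert
        self.spent = 0
        self.breached = False

    def allow_call(self) -> bool:
        # Once breached, every tool call is denied until manual_reset().
        return not self.breached

    def record_spend(self, cents: int) -> None:
        self.spent += cents
        # Fire every threshold crossed by this charge, each exactly once.
        while self.pending_alerts and self.spent >= self.pending_alerts[0]:
            self.alert(f"fleet spend reached ${self.pending_alerts.pop(0) / 100:.0f}")
        if self.spent >= self.cap:
            self.breached = True

    def manual_reset(self) -> None:
        self.breached = False

fired = []
budget = FleetBudget(50_000, [10_000, 25_000, 40_000], alert=fired.append)
for _ in range(6):
    if budget.allow_call():
        budget.record_spend(9_000)  # each step stands in for $90 of calls
print(fired)  # alerts at $100, $250, $400
print(budget.spent, budget.allow_call())  # 54000 False
```

The step that crosses the cap is still charged, because its cost is only known after it runs; here spend ends at $540, not $500. Stopping exactly at the cap would need an up-front cost estimate for each call.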

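This is the layer that catches the GCP-style “leaked key fields 60,000 requests” scenario. Even if an attacker has a working credential for the gateway, they can’t spend more than the fleet cap; a raw provider key used directly against the provider never passes through the hook, so only the provider’s own limits apply to it.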

(Free fourth step) — Spend audit + alerts

Every governed call writes spend to your activity log: agent identity, tier, token count, estimated cost. The dashboard groups by agent, tier, and time window, so you can see at a glance that “agent X has used 80% of its daily token budget by 11am,” which is concerning. Set up alerts for sudden spend deltas (for example, 5x the rolling baseline) so the runaway loop wakes you up at hour one, not hour twelve.
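The “5x the rolling baseline” alert can be sketched as a comparison between the latest interval’s spend and the mean of the preceding intervals. In this Python sketch the hourly interval, the six-interval baseline, and the rule that no alert fires until the baseline is full are assumptions; only the 5x factor comes from the text.

```python
from collections import deque

class SpendSpikeDetector:
    """Flag an interval whose spend exceeds `factor` x the rolling mean."""

    def __init__(self, factor: float = 5.0, baseline_len: int = 6):
        self.factor = factor
        self.history = deque(maxlen=baseline_len)  # recent per-interval spend

    def observe(self, spend: float) -> bool:
        """Record one interval's spend; return True if it is a spike."""
        full = len(self.history) == self.history.maxlen
        baseline = sum(self.history) / len(self.history) if self.history else 0.0
        spike = full and baseline > 0 and spend > self.factor * baseline
        self.history.append(spend)
        return spike

detector = SpendSpikeDetector()
hourly_spend = [1.2, 0.9, 1.1, 1.0, 0.8, 1.0, 7.5]  # illustrative dollars per hour
flags = [detector.observe(s) for s in hourly_spend]
print(flags)  # only the last hour is flagged
```

Flagged intervals are folded into the baseline, so a sustained runaway alerts once rather than every hour. Excluding flagged intervals from the history is the alternative if you want the alert to keep firing.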

The total time investment

  • One curl command (Step 1): ~30 seconds
  • Per-tier rate limit + budget config (Step 2): ~3 minutes
  • Fleet-wide budget cap (Step 3): ~2 minutes

Five to six minutes from blank slate to “an autonomous agent in this environment cannot exceed N calls per minute, M tokens per day, or $X total spend without explicit override.”

The asymmetry between five minutes of setup and an $18,000 surprise bill is large. The harder part is convincing yourself that the runaway loop is eventually going to happen. The forums and GitHub issues say it already has, in environments that look a lot like yours.
