Agentic Control Plane

How to Rate-Limit an MCP Server (Per-User, Per-Tool, Per-Agent)

David Crowe · 7 min read
mcp rate-limiting runaway-agents per-user governance

MCP servers have a runaway-agent problem. An agent loops. The loop calls your MCP tool 3,000 times in five minutes. Your database catches fire. Your LLM bill spikes. Your ops team pages you at 2am. By the time you correlate the logs, the damage is done.

The fix sounds simple — “add rate limits” — but MCP’s transport model makes the obvious approach useless. Every call from ChatGPT, Claude Desktop, or Cursor to your MCP server arrives with the same service API key. Your server can’t tell Alice’s 10 calls from Bob’s 2,990. Per-IP rate limits don’t help because the LLM runtime is one IP. Per-API-key limits treat the entire LLM as one caller.

What you actually need is three axes of rate limiting: per user, per tool, per agent. This post walks through the why, the implementation, and the specific shape of the problem each axis solves.

Why the default fails

Here’s a typical MCP server:

from mcp.server import Server
from mcp.types import Tool, TextContent

server = Server("crm-mcp")

@server.list_tools()
async def list_tools() -> list[Tool]:
    return [Tool(name="search_customers", inputSchema={...}, description="...")]

@server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "search_customers":
        return [TextContent(type="text", text=crm.search(arguments["query"]))]
    raise ValueError(f"unknown tool: {name}")

No rate limits. The LLM runtime holds one credential. Your server processes whatever arrives, as fast as it arrives. If a prompt loop produces 1,000 tool calls, you execute 1,000 tool calls.

Adding a naïve @rate_limit(per_minute=100) decorator is tempting. It fails four ways:

  1. It sees the LLM runtime’s connection, not the user. If Alice is making 10 calls/min and Bob is making 90, the decorator can’t tell them apart. Bob’s next call fails but so does Alice’s.
  2. It can’t distinguish tools. search_customers might be cheap. generate_report might cost $5 and take 30 seconds. A single per-server limit treats them as equals.
  3. It doesn’t understand agent identity. An Explore subagent in Claude Code scanning your codebase has very different rate-limit needs from an interactive user typing queries.
  4. It runs inside your server process. If your server is replicated across 3 pods, each pod’s counter is independent — total rate is 3× the limit. Distributed rate limiting requires a shared store.
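
Failure mode 4 is easy to demonstrate. This toy sketch (class and variable names are illustrative) simulates three replicas, each with its own in-process fixed-window counter, and round-robins calls across them — the cluster admits 3× the intended limit:

```python
class LocalCounter:
    """An in-process counter, one per pod — nothing is shared."""

    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0

    def allow(self) -> bool:
        if self.count < self.limit:
            self.count += 1
            return True
        return False

# Three replicas, each believing the limit is 100 calls per window.
pods = [LocalCounter(limit=100) for _ in range(3)]

# A load balancer round-robins 600 calls across the pods.
admitted = sum(pods[i % 3].allow() for i in range(600))

# Each pod independently admits 100 calls, so 300 get through —
# three times the intended cluster-wide limit of 100.
```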

You need rate limits at the three axes MCP traffic actually varies along: user, tool, agent.

The three axes

User axis

“Alice can call search_customers 100 times/hour; Bob can call it 10 times/hour.”

Requires the MCP server to know who the user is. MCP’s transport doesn’t carry user identity by default — it carries the LLM runtime’s shared credential. The fix is a governance layer that verifies the user’s JWT (passed through the transport as an auth header) and attributes every tool call to the verified user identity.

Tool axis

“generate_report is capped at 5 invocations/minute across all users combined, because it hits an upstream LLM API with a global quota.”

Requires per-tool counters, not per-server. Different tools have different blast radius and different upstream cost shapes — rate-limit them independently.

Agent axis

“The Claude Code Explore subagent gets 500 tool calls/hour. An interactive Claude Desktop user gets 50 tool calls/hour. A scheduled workflow agent gets 10,000 tool calls/hour.”

Requires detecting and naming the agent making each call. client + tier + name identifies the agent type. A subagent on a 10-file sweep has legitimately higher volume than a human chatting; you want different limits, not a shared one that either throttles legitimate sweeps or fails to catch runaways.
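
A tiny illustrative helper for building and splitting that identifier — the `client::tier::name` shape matches the agent-type policy key format ACP uses:

```python
def agent_key(client: str, tier: str, name: str) -> str:
    """Compose the agent-type identifier, e.g.
    ("Claude Code", "subagent", "Explore") -> "Claude Code::subagent::Explore"."""
    return f"{client}::{tier}::{name}"

def parse_agent_key(key: str) -> tuple[str, str, str]:
    """Split an agent-type identifier back into its three parts."""
    client, tier, name = key.split("::")
    return client, tier, name
```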

Implementing all three — the governance gateway pattern

You have two options: build this into your MCP server, or put a governance gateway in front of it.

Option A — build it yourself

  1. Extract the user’s JWT from the MCP transport headers.
  2. Validate the JWT (fetch JWKS from your IdP, verify RS256 signature, check audience/issuer/expiry).
  3. Extract sub for the user axis, client / tier / name from a custom header for the agent axis, tool name from the MCP tools/call request for the tool axis.
  4. Use a shared store (Redis, Memcached) to maintain counters across server replicas.
  5. Apply the three-axis rate limit, returning 429-equivalent MCP errors when any limit is exceeded.
  6. Emit a structured audit log so you can see which axis threw.

This is about 400–800 lines of code, adds a Redis dependency, and gets rewritten every time you add a new tool, a new agent type, or a new IdP.
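
The core of steps 4–5 can be sketched in Python. This is a minimal, hypothetical fixed-window implementation: the JWT validation and transport plumbing (steps 1–3) are elided, and an in-memory dict stands in for the shared store — in production each increment would be an atomic Redis `INCR` with an `EXPIRE` so counters are shared across replicas:

```python
import time
from collections import defaultdict

WINDOWS = {"1m": 60, "1h": 3600}

class ThreeAxisLimiter:
    """Fixed-window counters over the (user, tool, agent) axes."""

    def __init__(self, limits):
        # limits maps (axis, key) -> (max_calls, window), e.g.
        # {("user", "alice@acme.com"): (100, "1h")}
        self.limits = limits
        self.counters = defaultdict(int)  # stand-in for Redis

    def check(self, user: str, tool: str, agent: str, now=None) -> dict:
        now = time.time() if now is None else now
        axes = (("user", user), ("tool", tool), ("agent", agent))
        # First pass: deny if any configured axis is over its limit.
        for axis, key in axes:
            rule = self.limits.get((axis, key))
            if rule is None:
                continue  # no limit configured on this axis
            limit, window = rule
            bucket = int(now // WINDOWS[window])
            if self.counters[(axis, key, window, bucket)] >= limit:
                return {"decision": "rate_limited", "axis": axis,
                        "limit": limit, "window": window}
        # Second pass: the call is allowed, so count it on every axis.
        for axis, key in axes:
            rule = self.limits.get((axis, key))
            if rule is None:
                continue
            _, window = rule
            bucket = int(now // WINDOWS[window])
            self.counters[(axis, key, window, bucket)] += 1
        return {"decision": "allow"}
```

The two-pass shape matters: a denied call must not consume quota on the axes that still had headroom, or a runaway on one axis starves every other.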

Option B — route through ACP

Agentic Control Plane (ACP) implements exactly this as a managed gateway. Your MCP server sits behind it. ACP enforces the three-axis rate limits before forwarding to your server.

# Register your MCP server with ACP.
curl -X POST https://api.agenticcontrolplane.com/yourslug/admin/mcpServers \
  -H "Authorization: Bearer $ACP_API_KEY" \
  -d '{"name": "crm", "url": "https://crm-mcp.yourco.com", "scopes": ["crm.read"]}'

# Set per-tool workspace rate limits.
curl -X PUT https://api.agenticcontrolplane.com/yourslug/admin/workspacePolicy \
  -H "Authorization: Bearer $ACP_API_KEY" \
  -d '{
    "tools": {
      "search_customers": { "api": { "rateLimit": 100, "window": "1h" } },
      "generate_report":  { "api": { "rateLimit": 5,   "window": "1m" } }
    }
  }'

# Set per-user overrides (user axis).
curl -X PUT https://api.agenticcontrolplane.com/yourslug/admin/userPolicies/alice@acme.com \
  -H "Authorization: Bearer $ACP_API_KEY" \
  -d '{ "defaults": { "api": { "rateLimit": 200, "window": "1h" } } }'

# Set per-agent-type limits (agent axis).
# Key format: "${client}::${tier}::${name}"
curl -X PUT "https://api.agenticcontrolplane.com/yourslug/admin/agentTypePolicies/Claude%20Code%3A%3Asubagent%3A%3AExplore" \
  -H "Authorization: Bearer $ACP_API_KEY" \
  -d '{ "defaults": { "subagent": { "rateLimit": 500, "window": "1h" } } }'

Every MCP call now passes through ACP’s rate-limit engine, which intersects the three axes and returns the most-restrictive decision. If Alice is hitting her user-axis limit, her next call is denied with a rate_limited response and a structured audit row showing which axis fired.

What denials look like

When an agent hits a limit, ACP returns an MCP-compatible error response:

{
  "error": {
    "code": "rate_limited",
    "message": "Tool search_customers rate limit exceeded for user alice@acme.com.",
    "data": {
      "axis": "user",
      "limit": 100,
      "window": "1h",
      "retry_after_seconds": 847
    }
  }
}

The axis field is the critical detail — it tells you which limit fired, so you can tune the right one. If axis: "tool" is firing for everyone, raise the tool cap. If axis: "user" is firing for one noisy customer, raise just their override.
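
On the client side, a small helper (hypothetical — the function name and error handling are illustrative) can turn that structured denial into a retry decision using retry_after_seconds:

```python
def seconds_until_retry(response: dict, max_wait: int = 900) -> int:
    """Return how long to back off before retrying a denied MCP call.

    Raises if the response is not a rate_limited denial, or if the
    server asks for a longer wait than the caller will tolerate.
    """
    err = response.get("error", {})
    if err.get("code") != "rate_limited":
        raise RuntimeError(err.get("message", "unexpected MCP error"))
    data = err.get("data", {})
    wait = data.get("retry_after_seconds", max_wait)
    if wait > max_wait:
        raise RuntimeError(
            f"{data.get('axis', 'unknown')}-axis limit exceeded; "
            f"retry in {wait}s exceeds max_wait of {max_wait}s")
    return wait
```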

The runaway loop scenario, resolved

Back to the opening problem: the agent loops and fires 3,000 tool calls.

With three-axis limits configured, what happens:

  • Call 1 through Call N, where N = the tool or user limit — all succeed.
  • Call N+1 — user-axis limit exceeded. ACP returns rate_limited.
  • The agent, seeing the error, either (a) breaks out of the loop, (b) propagates the error, or (c) keeps hammering. In all three cases your database is untouched past call N.
  • Your audit log shows: user=alice tool=search_customers decision=rate_limited axis=user count=N+50.
  • You page yourself, calmly, tomorrow morning.

Without three-axis limits, you page yourself at 2am and write an incident retrospective instead.

What this does not do

Rate limits are one of five governance layers. They handle too much, too fast. They don’t handle:

  • Too much, too expensive — that’s the budget layer. Budget propagation through delegation chains is an ADCS concern.
  • Wrong user, right tool — that’s the permission layer (per-user allow/deny rules).
  • Right user, wrong data — that’s PII detection on tool arguments and outputs.
  • Right call, wrong hop — that’s scope intersection in multi-agent delegation.

ACP ships all five. The three-axis governance page walks through how they compose.
