AI control plane: a buyer's guide
“AI control plane” is the term every governance vendor wants to own this year. It’s also a term most of them are using to describe products that solve a fraction of the problem — a policy engine that doesn’t see runtime calls, an observability dashboard that doesn’t enforce, an LLM gateway that classifies tokens but not tools.
If you’re evaluating one of these for production agent traffic, the difference between vendors isn’t marginal. It determines whether you have an enforceable policy layer or a logging shim with a nicer UI.
This is the buyer’s guide for that question: what an AI control plane is, what it isn’t, the four vendor categories competing for the name, the questions that separate them, and a 14-day evaluation framework you can run on real traffic before you sign anything.
What an AI control plane actually is
An AI control plane is the trust, identity, and authorization layer between the model’s decision to call a tool and the tool actually being called. Three properties make it a control plane and not just a logger:
- Inline. It’s in the request path. The agent cannot bypass it without disabling itself.
- Decisive. It returns allow / deny / modified-input on every call. Not “we’ll alert you later.”
- Identity-aware. It evaluates each call against the verified identity of the human (or system) on whose behalf the agent is running, not against a shared API key.
Tools that don’t have all three — observability platforms, LLM-traffic dashboards, post-hoc anomaly detectors — can be useful, but they’re not control planes. They are the X-ray; the control plane is the airport screening.
The four vendor categories
Vendors marketing in this space tend to fall into one of four shapes. The shape determines which problems they actually solve.
1. LLM gateways
What they do: Sit between your application and one or more LLM providers. Route, cache, fallback, log token usage, sometimes attach a moderation layer.
Examples: Portkey, Helicone, OpenRouter, LiteLLM, Cloudflare AI Gateway.
What they govern: Which model gets called, how much it costs, whether the prompt contains forbidden content.
What they don’t govern: Which tools the agent calls afterward. Once the model returns “use tool X with input Y,” the gateway is no longer in the path — the agent calls the tool directly.
When this is enough: You’re running a chat assistant that doesn’t have tool access, or whose tool access is to a single internal API you already gate elsewhere.
When it isn’t: Any agentic deployment where the model can call shell, MCP servers, file system, browser, payment APIs, or anything you wouldn’t hand a stranger the keys to.
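To see the blind spot concretely, here is the request path as a toy sketch (the function names and directive shape are invented for illustration, not any gateway’s API). The gateway observes the model exchange; the tool execution that follows happens outside it:

```python
def model_call_via_gateway(prompt: str, gateway_log: list) -> dict:
    # The gateway is in this path: it can route, cache, and log the exchange.
    gateway_log.append(prompt)
    # Invented directive shape for illustration.
    return {"type": "tool_use", "tool": "Bash",
            "input": {"cmd": "rm -rf /tmp/scratch"}}

def agent_step(prompt: str, gateway_log: list) -> str:
    directive = model_call_via_gateway(prompt, gateway_log)
    if directive["type"] == "tool_use":
        # This execution never touches the gateway: no allow/deny, no log.
        return f"executed {directive['tool']}: {directive['input']['cmd']}"
    return directive.get("text", "")

log = []
agent_step("clean up the scratch dir", log)
# The gateway logged one model call and zero tool calls.
```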
2. Policy engines (OPA, Cedar, Casbin)
What they do: Express and evaluate authorization rules as code or data. Receive a query (subject, action, resource), return allow/deny.
Examples: Open Policy Agent, AWS Cedar, Casbin, Styra.
What they govern: Whatever you call them about. They are libraries, not products. The control plane uses a policy engine; it isn’t one.
What they don’t govern: Anything you don’t ask. They have no opinion on what events to evaluate, how to capture identity, what to log, or how to handle remediation. You build that wrapper.
When this is enough: You’re a platform team building bespoke governance for one specific system.
When it isn’t: You want a turnkey deployment that ships with a tool taxonomy, an audit schema, an identity model, and a UI for non-engineers to edit policy.
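As a sketch of the wrapper work a bare policy engine leaves to you, here is a minimal client for OPA’s HTTP data API. The endpoint path (`agents/allow`) and the input shape are assumptions about your policy package, not a standard:

```python
import json
import urllib.request

OPA_URL = "http://localhost:8181/v1/data/agents/allow"  # assumed policy package

def build_query(user: str, action: str, resource: str) -> bytes:
    # The engine answers (subject, action, resource); deciding *when* to ask,
    # with which identity, and what to log is the wrapper you must build.
    return json.dumps(
        {"input": {"user": user, "action": action, "resource": resource}}
    ).encode()

def is_allowed(user: str, action: str, resource: str) -> bool:
    req = urllib.request.Request(
        OPA_URL,
        data=build_query(user, action, resource),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        body = json.load(resp)
    # An absent result means no policy matched: fail closed.
    return body.get("result") is True
```

Everything around this function — event capture, identity propagation, audit, remediation — is the product you are deciding whether to buy.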
3. AI observability platforms
What they do: Trace LLM calls across services, surface latency and cost, visualize prompt/response patterns, sometimes flag drift or jailbreaks.
Examples: LangSmith, Langfuse, Phoenix (Arize), Weights & Biases.
What they govern: Nothing inline. They observe.
What they don’t govern: Anything that requires a decision in the request path. They are excellent for understanding what your agents are doing; they are not the layer that stops them from doing it.
When this is enough: You have a separate enforcement layer and need diagnostic visibility on top.
When it isn’t: You’re shopping for a single vendor to satisfy a “control plane” line item on an architecture diagram.
4. Agentic control planes
What they do: All three properties above. Inline interception of tool calls (via SDK adapter, hook, or proxy), identity-attributed policy evaluation, audit trail, output redaction, and a dashboard for non-engineers to operate the policy.
Examples: Agentic Control Plane (this site), Cloudflare’s emerging agent governance, internal builds at scale shops.
What they govern: Every tool call from every governed agent, with the verified identity of the user attached.
What they don’t: Things outside the agent’s tool surface — your underlying APIs still need their own auth, your data warehouse still needs row-level policies. The control plane is an addition to, not a replacement for, application-level controls.
When this is the right category: Your agents have meaningful tool access, you have or will have a compliance requirement, you don’t want to build the audit-trail and policy-engine integration in-house, and you have more than one agent runtime (Claude Code + Codex CLI + an internal CrewAI agent) to govern uniformly.
The dimensions that actually differ
Within the agentic-control-plane category, vendors differ on six dimensions that matter operationally.
Integration footprint
How does the control plane intercept calls?
- SDK adapter — `import` a wrapper, get governed calls. Lowest cognitive load; only works for SDKs you’ve adopted.
- Hook integration — register a hook in the agent runtime’s config (Claude Code, Codex CLI, Cursor). Lowest code change; only works for runtimes that have a hook surface.
- Proxy — point the agent at a `baseURL` you control. Works for anything; introduces a network hop.
- MCP gateway — point MCP-aware agents at a governed MCP endpoint. Works for the growing MCP ecosystem; doesn’t cover non-MCP tools.
A real control plane supports at least three of these. If a vendor only supports one, your runtime choice is constrained by their integration model.
Tool taxonomy granularity
Does “shell” count as a tool, or does the system distinguish `Bash.curl` from `Bash.rm` from `Bash.kubectl`?
A flat taxonomy (“shell allowed/denied”) is unworkable for any policy more nuanced than “all or nothing.” Insist on sub-command classification for shells, file-class classification for file ops, and server-attribution for MCP calls.
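A sub-command classifier is small to sketch; the hard part is coverage (pipes, subshells, aliases, interpreters). This toy version assumes commands arrive as raw strings:

```python
import shlex

def classify_shell(command: str) -> str:
    """Map a raw shell string to a taxonomy label like 'Bash.curl'."""
    tokens = shlex.split(command)
    # Skip leading environment assignments (FOO=bar cmd ...).
    while tokens and "=" in tokens[0] and not tokens[0].startswith("-"):
        tokens.pop(0)
    if not tokens:
        return "Bash.empty"
    binary = tokens[0].rsplit("/", 1)[-1]  # /usr/bin/curl -> curl
    return f"Bash.{binary}"

classify_shell("curl https://example.com")            # "Bash.curl"
classify_shell("/bin/rm -rf build/")                  # "Bash.rm"
classify_shell("KUBECONFIG=/tmp/kc kubectl get pods") # "Bash.kubectl"
```

Ask the vendor what their classifier does with `bash -c "..."`, pipelines, and `xargs`; those are where flat taxonomies and naive classifiers both fall over.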
Identity model
Does the control plane support:
- Per-user OAuth/OIDC JWTs as the agent’s identity?
- Multi-IdP (Auth0 + Okta + Google + your own)?
- Service-to-service identity (workload identity, machine accounts)?
- Tier separation (interactive / subagent / background / API)?
A single shared API key is the failure mode the control plane exists to fix. If the vendor’s identity story is “you give us an API key” — that’s the same shape as the problem.
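A sketch of what “JWT as the agent’s identity” means in the request path. Note this decodes claims without verifying the signature; a real deployment must verify against the IdP’s JWKS (e.g. with a JWT library) before trusting any claim:

```python
import base64
import json

def unverified_claims(jwt: str) -> dict:
    """Decode the JWT payload. Demo only: does NOT check the signature."""
    payload_b64 = jwt.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def policy_input(jwt: str, tool: str, tier: str) -> dict:
    claims = unverified_claims(jwt)  # verify the signature first in production!
    return {
        "subject": claims["sub"],         # the human, not a shared API key
        "issuer": claims["iss"],          # lets policy route per IdP
        "roles": claims.get("roles", []),
        "tool": tool,
        "tier": tier,                     # interactive / subagent / background / api
    }
```

Every policy decision and audit row can then carry `subject` and `issuer`, which is exactly what a shared API key cannot give you.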
Policy expressiveness
Can policies express:
- Tier-aware rules (deny `Bash.kubectl` in background, ask in interactive)?
- Resource-aware rules (allow `Read` on `/repo/**`, deny on `/repo/.env`)?
- Identity-aware rules (`role:viewer` can `Read`, `role:admin` can `Write`)?
- Output-aware rules (redact PII patterns regardless of tool)?
Policies that are only “allow this tool” or “deny this tool” are insufficient for any real deployment.
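The first three rule shapes can all be expressed as data and evaluated by a small first-match-wins matcher. The field names, the `ask` verdict, and the default-deny fallback here are illustrative choices, not a vendor DSL:

```python
from fnmatch import fnmatch

POLICIES = [  # first match wins; field names are illustrative
    {"tool": "Bash.kubectl", "tier": "background", "action": "deny"},
    {"tool": "Bash.kubectl", "action": "ask"},
    {"tool": "Read", "resource": "/repo/.env", "action": "deny"},
    {"tool": "Read", "resource": "/repo/**", "action": "allow"},
    {"tool": "Write", "role": "admin", "action": "allow"},
]

def evaluate(tool, tier=None, resource=None, roles=()):
    for rule in POLICIES:
        if rule["tool"] != tool:
            continue
        if "tier" in rule and rule["tier"] != tier:
            continue
        if "resource" in rule and not fnmatch(resource or "", rule["resource"]):
            continue
        if "role" in rule and rule["role"] not in roles:
            continue
        return rule["action"]
    return "deny"  # default-deny when nothing matches

evaluate("Bash.kubectl", tier="background")    # "deny"
evaluate("Bash.kubectl", tier="interactive")   # "ask"
evaluate("Read", resource="/repo/src/app.py")  # "allow"
evaluate("Read", resource="/repo/.env")        # "deny"
```

Output-aware rules need a second evaluation point, after the tool returns, which is another thing to verify the vendor actually has.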
Audit trail depth
The ten-question CISO checklist is the exact list. If the vendor can’t answer it fluently in a demo, the audit story is hand-waved.
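For orientation, one plausible shape for an audit row: who, at what tier, called what, with what verdict, under which policy version. The schema is illustrative; compare it field by field with what the vendor actually shows you:

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class AuditRow:
    ts: str            # ISO-8601 UTC timestamp of the decision
    user_id: str       # verified identity behind the agent
    tier: str          # interactive / subagent / background / api
    tool: str          # taxonomy label, e.g. "Bash.kubectl"
    decision: str      # allow / deny / modify
    policy_id: str     # the rule (and version) that produced the verdict
    input_sha256: str  # hash of the input: evidence without leaking contents

def row_for(ts, user_id, tier, tool, decision, policy_id, tool_input) -> dict:
    digest = hashlib.sha256(
        json.dumps(tool_input, sort_keys=True).encode()
    ).hexdigest()
    return asdict(AuditRow(ts, user_id, tier, tool, decision, policy_id, digest))
```

A vendor whose rows lack the identity or the policy version cannot answer “who did this, and under which rule” — the two questions auditors always ask first.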
Operational maturity
Less glamorous, more important:
- Can you self-host? On-prem? Air-gapped?
- Is there a usable on-call story? SLOs?
- Data residency: where do logs live?
- Is the policy bundle versioned and rollback-able?
- Pricing: per-seat, per-call, per-MAU, per-policy-evaluation?
Pricing is downstream of architecture. A per-call price model means the vendor wants you to think hard about every call you govern. A per-seat price model means they want broad deployment. Pick the model that aligns with how you’d want governance to work.
RFP-style questions that separate vendors
Five questions you can drop into a vendor evaluation and get useful answers from:
- “Show me the same agent, governed and ungoverned, side by side. Demo a destructive call. Demo the audit row that resulted.” — exposes whether their integration is real or aspirational.
- “Walk me through your tool taxonomy. How do you distinguish `Bash.curl` from `Bash.rm`? What happens for tools you’ve never seen before — like a custom MCP server we’ll bring?” — exposes how generalizable their classifier is.
- “How does the control plane learn the end-user’s identity in a server-deployed agent? Can you show me the request path from JWT verification to policy evaluation?” — exposes whether identity is real or pretended.
- “If your service is unreachable, what happens to my agents? Fail-open, fail-closed, or graceful degradation?” — exposes the dependency profile. None of these are wrong; surprises are wrong.
- “Show me the policy file. Not a screenshot — the actual bytes a customer would commit to git or paste into your console.” — exposes whether policy is editable, versionable, and grep-able.
Don’t accept slide answers. The right answer is “let me show you” and a demo.
A 14-day evaluation framework
If the vendor demo passes, run a real evaluation. Two weeks, with these milestones:
Days 1–2: Wire up. Install the vendor’s integration in one agent runtime — your stickiest one. The cost-of-installation reveals more than any docs page. Time it. Note where you got stuck.
Days 3–5: Baseline. Run your normal agent workload, control plane in observe-only mode. The audit trail accumulates. Now you know: what tools your agents actually call, in what proportion, with what input distributions. This data alone is worth the trial.
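The baseline data is easy to exploit once you have it: aggregate the observe-mode audit rows into a tool-call distribution and a cost projection. A sketch, assuming rows are dicts with a `tool` field:

```python
from collections import Counter

def baseline_report(audit_rows, price_per_call=0.0):
    calls = Counter(row["tool"] for row in audit_rows)
    total = sum(calls.values())
    return {
        "total_calls": total,
        # Proportion of traffic per tool, most common first.
        "distribution": {tool: n / total for tool, n in calls.most_common()},
        # Plug in the vendor's per-call price for the pricing conversation.
        "projected_cost": total * price_per_call,
    }

rows = [{"tool": "Read"}] * 3 + [{"tool": "Bash.curl"}]
baseline_report(rows, price_per_call=0.002)
# 4 calls, Read is 75% of traffic, projected cost = 4 × per-call price
```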
Days 6–8: Express your real policies. Take three policies you’ve wanted to enforce — deny `Bash.kubectl` in background, redact API keys in tool outputs, require interactive tier for payment APIs — and write them. How long does it take? How many vendor-doc tabs do you have open? Can a non-engineer on your team understand them?
Days 9–11: Switch to enforcement. Turn the policies on. Watch what breaks. The legitimate agent workflow that hits a deny is a real signal — sometimes about the policy, sometimes about the workflow. Tune.
Days 12–13: Stress test. Disconnect the control plane (firewall it off). Confirm your agents fail-closed (the right answer for a control plane) — and that the failure mode is operationally tolerable. If they fail-open, that’s a different design and a different conversation with your CISO.
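The fail-closed behavior this test probes is simple to state in code: an unreachable control plane becomes a deny, never a silent allow. A minimal sketch:

```python
def decide_or_fail_closed(call_control_plane, tool_call):
    try:
        return call_control_plane(tool_call)  # normal inline path
    except (ConnectionError, TimeoutError):
        # The plane is down: deny and surface the outage. Failing open here
        # would mean governance disappears exactly when no one is watching.
        return {"action": "deny", "reason": "control plane unreachable"}

def firewalled(_call):
    raise ConnectionError("stress test: plane firewalled off")

decide_or_fail_closed(firewalled, {"tool": "Bash.rm"})
# {"action": "deny", "reason": "control plane unreachable"}
```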
Day 14: Audit-trail walkthrough. Hand the audit trail to someone who plays the auditor role. Have them ask the ten questions. Score the answers.
At the end, you have:
- Real installation cost (hours).
- Real ongoing cost (the per-call price × your actual call volume from days 3–5).
- A list of policies you wrote, evidence they enforce, evidence the audit trail captures the enforcement.
- A failure-mode test result.
That’s enough to make the buy/build/wait decision on actual data.
When to build instead of buy
A control plane is buildable if all five are true:
- You have one agent runtime (not three).
- You have one identity provider (not three).
- You have one policy backend in mind, and a team that already operates it.
- Your compliance scope is narrow enough that a custom audit schema is acceptable to your auditors.
- Engineering cost of build < commercial cost of buy over 18 months, including the maintenance burden of agent-runtime upgrades.
Most teams who attempt this discover failure modes that look small in the design doc and large in production: hooks change between agent-runtime versions, MCP server schemas drift, identity bridges proliferate, the audit schema needs versioning, and the policy DSL slowly becomes a programming language. Three engineers, six months. None of that work is a competitive moat for your business.
The build-vs-buy math has shifted in the last twelve months because the agent-runtime ecosystem expanded and the integration surface grew faster than any one team can keep up with. If you’re entering the market today, the question is which vendor — not whether to buy one.
What to walk away with
A control plane is the inline, decisive, identity-aware enforcement layer between agent decisions and tool actions. Vendors marketed as “control planes” mostly aren’t. The four-category taxonomy and the dimensions-that-differ checklist let you cut through positioning. The 14-day evaluation framework gives you data instead of vibes.
If you want to skip to evaluation: install governed Claude Code, Codex CLI, or Cursor in three minutes, run real traffic, and read your own audit trail at the end of the week.
Get started → · Reference architecture → · Comparison: ACP vs alternatives → · Pricing →