Agentic Control Plane
Benchmark series

AgentGovBench

An open, NIST-mapped benchmark for AI agent governance. 48 scenarios across 8 categories test what every governance layer must enforce: identity propagation, per-user policy, delegation provenance, scope inheritance, rate-limit cascade, audit completeness, fail-mode discipline, cross-tenant isolation.

We benchmarked seven frameworks — CrewAI, LangChain/LangGraph, Claude Code, OpenAI Agents SDK, Anthropic Agent SDK, OpenAI Codex CLI, and Cursor — both natively and with ACP. Every result is reproducible against your own deployment.

Posts in this series

Read in order, or skip to the framework or topic that matters most.

  1.
    How we think about testing AI agent governance
    We published AgentGovBench — an open, NIST-mapped benchmark for AI agent governance — and ran it against ACP. Here's what broke, what we shipped, and how to run it against your own deployment.
    April 20, 2026
  2.
    CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
    We ran CrewAI through 48 governance scenarios — twice. Out of the box CrewAI scores at the vanilla floor. Wrapped in @governed it jumps to 40/48. Here's where the gap is and what it means for production CrewAI deployments.
    April 20, 2026
  3.
    CrewAI's task handoffs lose the audit trail — here's the gap and the fix
    When CrewAI's Hierarchical Process delegates from a manager to a worker, the worker's tool calls don't carry the chain. Even with @governed wrapping every tool, the audit log shows the worker as a top-level agent. Here's why, and what we're shipping.
    April 20, 2026
  4.
    LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
    We ran LangChain/LangGraph through 48 governance scenarios — twice. Same story as CrewAI: out of the box LangGraph is at the vanilla floor. Wrapped in @governed it jumps. Here's the per-category breakdown.
    April 20, 2026
  5.
    LangGraph's StateGraph checkpoints don't replay through governance
    When LangGraph resumes from a checkpoint, no governance pipeline re-runs against the replayed state. Policy changes between original run and replay are silently ignored. Here's the failure mode, the specific scenarios, and what to do.
    April 20, 2026
  6.
    Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
    Third framework in the series. Claude Code with no PreToolUse hook is at the vanilla floor. With ACP's hook installed, every Bash, Edit, and MCP call is governed. Here's the per-category breakdown — and the documented escape hatch.
    April 20, 2026
  7.
    Claude Code's --dangerously-skip-permissions disables every governance hook
    When a user runs Claude Code with --dangerously-skip-permissions, every PreToolUse and PostToolUse hook is silently disabled — including ACP's. The audit log goes silent. Here's the gap, what it means, and how to detect it server-side.
    April 20, 2026
  8.
    Decorator, proxy, hook — three patterns for agent governance, three different scorecards
    Why CrewAI + ACP scores 40/48 but Claude Code + ACP scores 43/48 with the same backend. The integration pattern shapes what governance can see — here are the three patterns and where each one wins or loses.
    April 20, 2026
  9.
    OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
    Fourth framework in the series — and the first proxy-pattern integration. OpenAI Agents SDK sits at the vanilla floor without governance; with ACP's proxy it reaches 45/48 — closest to pure ACP, because the proxy sees the full request payload.
    April 20, 2026
  10.
    Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
    Fifth framework, highest ACP-paired score so far. The Anthropic Agent SDK's TypeScript governHandlers wrapper hits 46/48 — one above the OpenAI Agents SDK proxy, three above the hook integrations, and six above the decorator patterns. Here's why and what to take away.
    April 20, 2026
  11.
    Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
    Sixth framework. OpenAI's Codex CLI uses the same hook protocol as Claude Code — and scores identically (43/48) when ACP's hook is installed. Plus one meaningful governance differentiator: Codex's auto mode keeps hooks firing.
    April 20, 2026
  12.
    Full scorecard: seven frameworks, 48 scenarios, one open benchmark
    We benchmarked seven AI agent frameworks — CrewAI, LangChain/LangGraph, Claude Code, OpenAI Agents SDK, Anthropic Agent SDK, Cursor, OpenAI Codex CLI — both natively and with ACP. Three integration patterns. Three score tiers. Here's the complete picture.
    April 20, 2026
  13.
    How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
    AgentGovBench scenarios cite specific NIST AI RMF 1.0 controls — MAP, MEASURE, MANAGE, GOVERN. Here's the full mapping for procurement teams that need to cite controls, not vendor claims.
    April 20, 2026
  14.
    Reproduce AgentGovBench on your stack — full setup guide
    Step-by-step guide to running the AgentGovBench scorecard against your own ACP deployment. Required env, Firebase setup, common issues, and how to interpret the results.
    April 20, 2026
  15.
    Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
    Seventh framework. Cursor's MCP integration only governs MCP-exposed tools. Internal IDE tools (Edit/Read/Bash) bypass governance entirely. The 37/48 ceiling reflects a structural limit no SDK improvement can close.
    April 20, 2026

Why this series exists

Every AI governance vendor has a feature matrix. Whether the per-user policy actually enforces correctly when the user spawns ten subagents in parallel — that's a test, not a feature. A buyer who trusts feature matrices is buying marketing; a buyer who wants guarantees needs a benchmark.

AgentGovBench takes the same scenarios that procurement teams actually need (mapped to NIST AI RMF 1.0) and runs them against any governance layer you can write a runner for. The scenarios don't know what ACP is. ACP is one runner among several — vanilla (no governance), audit_only (logging without enforcement), acp (full enforcement), and now per-framework runners for every popular AI client and SDK.
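One way to picture that runner contract is as a small interface: every runner executes the same scenario and reports an outcome the scorer can grade. The names below are illustrative, not the benchmark's actual API — a minimal sketch of the idea only.

```python
from typing import Protocol


class Runner(Protocol):
    """A governance layer under test. The scenarios never reference a
    specific product; each runner adapts one layer to the same contract."""

    name: str

    def execute(self, scenario: dict) -> dict:
        """Run one scenario and return an outcome for the scorer."""
        ...


class VanillaRunner:
    """No governance: every call is allowed and nothing is audited.
    A runner like this is what establishes the 13/48 floor."""

    name = "vanilla"

    def execute(self, scenario: dict) -> dict:
        return {"scenario": scenario.get("id"), "allowed": True, "audit_events": []}
```

An `acp` runner would implement the same `execute` signature but route each call through enforcement, so the scorer can compare outcomes without knowing which layer produced them.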

The series unpacks each runner’s score, the structural reasons each framework scores where it does, and the specific failure modes worth knowing about — whether or not you use ACP.

Reproduce the scorecard on your stack

Clone the repo, point at your ACP instance, run the benchmark. Same runner, same scenarios, same numbers. If you see different results, that’s either version drift or a gap we haven’t found.
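In outline, a reproduction run looks something like the following. The repo URL, environment variable names, and CLI entry point here are illustrative placeholders; the real ones live in the AgentGovBench setup guide linked above.

```shell
# Illustrative only -- substitute the actual repo URL and variable names
# from the AgentGovBench setup guide.
git clone https://example.com/agentgovbench.git
cd agentgovbench

# Point the runner at your own ACP deployment.
export ACP_BASE_URL="https://acp.your-domain.example"
export ACP_API_KEY="<your key>"

# Run all 48 scenarios against the ACP runner and print the scorecard.
python -m agentgovbench run --runner acp --scenarios all
```

Because the scenarios and scorer are fixed, any divergence from the published numbers narrows to two causes: version drift in your stack, or a real enforcement gap.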