Agentic Control Plane
Benchmark series

AgentGovBench

An open, NIST-mapped benchmark for AI agent governance. 48 scenarios across 8 categories test what every governance layer must enforce: identity propagation, per-user policy, delegation provenance, scope inheritance, rate-limit cascade, audit completeness, fail-mode discipline, cross-tenant isolation.

We benchmarked CrewAI, LangChain/LangGraph, Claude Code, OpenAI Agents SDK, Anthropic Agent SDK, Cursor, and Codex CLI, both natively and with ACP. Every result is reproducible against your own deployment.

Posts in this series

Read in order, or skip to the framework or topic that matters most.

  1. How we think about testing AI agent governance
     AgentGovBench is an open, NIST-mapped benchmark for AI agent governance. We ran it against ACP. What broke, what shipped, how to run it on your deployment.
  2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
     CrewAI run through 48 governance scenarios, twice. Vanilla: floor. Wrapped in @governed: 40/48. Where the gap sits and what it means in production.
  3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
     CrewAI's Hierarchical Process delegates manager-to-worker without carrying the chain. Even with @governed, audit logs show worker as top-level. The fix.
  4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
     LangChain/LangGraph run through 48 governance scenarios, twice. Same as CrewAI: vanilla floor, jumps once wrapped in @governed. Per-category breakdown.
  5. LangGraph's StateGraph checkpoints don't replay through governance
     LangGraph checkpoint replays skip the governance pipeline — policy changes between original run and replay are silently ignored. The failure mode and fix.
  6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
     Third framework. Claude Code sits at the vanilla floor with no PreToolUse hook; with ACP installed, every Bash, Edit, and MCP call is governed.
  7. Claude Code's --dangerously-skip-permissions disables every governance hook
     Claude Code's --dangerously-skip-permissions silently disables every PreToolUse and PostToolUse hook, including ACP's. How to detect it server-side.
  8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
     Why CrewAI + ACP scores 40/48 but Claude Code + ACP scores 43/48 on the same backend. Three integration patterns, three scorecards — where each wins.
  9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
     Fourth framework, first proxy-pattern integration. OpenAI Agents SDK scores 45/48 with ACP — closest to pure ACP because the proxy sees the full payload.
  10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
      We scored Anthropic's Agent SDK against 48 governance requirements — hooks, audit logging, identity, policy enforcement. Vanilla hits 13/48. Here's the gap.
  11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
      Sixth framework. OpenAI's Codex CLI shares Claude Code's hook protocol, scores 43/48 with ACP. Differentiator: auto mode keeps hooks firing.
  12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
      Seven frameworks benchmarked: CrewAI, LangGraph, Claude Code, OpenAI Agents SDK, Anthropic Agent SDK, Cursor, Codex CLI. Native vs ACP. Three score tiers.
  13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
      AgentGovBench scenarios cite specific NIST AI RMF 1.0 controls — MAP, MEASURE, MANAGE, GOVERN. The full mapping for procurement teams citing controls.
  14. Reproduce AgentGovBench on your stack — full setup guide
      Step-by-step guide to running the AgentGovBench scorecard against your own ACP deployment: required env, Firebase setup, common issues, reading results.
  15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
      Seventh framework. Cursor's MCP integration only governs MCP-exposed tools — internal Edit/Read/Bash bypass entirely. A structural 37/48 ceiling.
  16. Recommended governance deployment patterns — pick the one that scores highest for your stack
      AgentGovBench scores across seven frameworks, translated into a customer-facing recommendation for deploying governed AI agents by stack, score, and reach.
  18. Seven agent frameworks, one backend, governance diverges on 9 of 48 tests
      Seven frameworks, one backend, 48 governance scenarios. Scores ranged 37-46. Variance is architectural: where a framework lets you observe tool calls.

Why this series exists

Every AI governance vendor has a feature matrix. Whether a per-user policy is actually enforced when the user spawns ten subagents in parallel: that's a test, not a feature. A buyer who trusts feature matrices is buying marketing; a buyer who wants guarantees needs a benchmark.

AgentGovBench takes the same scenarios that procurement teams actually need (mapped to NIST AI RMF 1.0) and runs them against any governance layer you can write a runner for. The scenarios don't know what ACP is. ACP is one runner among several: vanilla (no governance), audit_only (logging without enforcement), acp (full enforcement), and now per-framework runners for each of the seven clients and SDKs covered in this series.
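To make the runner idea concrete, here is a minimal Python sketch of scenarios that know nothing about the governance layer they are scored against. It is an illustration under stated assumptions, not AgentGovBench's actual code: every name in it (ToolCall, Outcome, Scenario, the three runner classes, score) is hypothetical, the two scenarios are toy stand-ins for the real 48, and the toy scores say nothing about the published results.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolCall:
    user: str
    tool: str


@dataclass(frozen=True)
class Outcome:
    allowed: bool
    audit_entries: tuple = ()


class VanillaRunner:
    """Hypothetical runner: no governance, so nothing is blocked or logged."""
    name = "vanilla"

    def execute(self, call: ToolCall, policy: dict) -> Outcome:
        return Outcome(allowed=True)


class AuditOnlyRunner:
    """Hypothetical runner: logs every call but never blocks one."""
    name = "audit_only"

    def execute(self, call: ToolCall, policy: dict) -> Outcome:
        return Outcome(allowed=True, audit_entries=(f"{call.user}:{call.tool}",))


class EnforcingRunner:
    """Hypothetical runner: applies the per-user deny list and logs the decision."""
    name = "enforcing"

    def execute(self, call: ToolCall, policy: dict) -> Outcome:
        allowed = call.tool not in policy.get(call.user, set())
        verdict = "allow" if allowed else "deny"
        return Outcome(allowed, (f"{call.user}:{call.tool}:{verdict}",))


@dataclass(frozen=True)
class Scenario:
    name: str
    call: ToolCall
    policy: dict  # user -> set of denied tool names
    check: Callable[[Outcome], bool]  # inspects the outcome only, never the runner


SCENARIOS = [
    Scenario(
        "per_user_policy_blocks_denied_tool",
        ToolCall("alice", "shell"),
        {"alice": {"shell"}},
        lambda outcome: not outcome.allowed,
    ),
    Scenario(
        "audit_records_every_call",
        ToolCall("bob", "search"),
        {},
        lambda outcome: len(outcome.audit_entries) == 1,
    ),
]


def score(runner, scenarios=SCENARIOS) -> int:
    """Count the scenarios whose check passes for this runner."""
    return sum(1 for s in scenarios if s.check(runner.execute(s.call, s.policy)))


for runner in (VanillaRunner(), AuditOnlyRunner(), EnforcingRunner()):
    print(f"{runner.name}: {score(runner)}/{len(SCENARIOS)}")
```

Because each check inspects only the outcome, adding a runner for another framework means implementing one execute method; the scenarios themselves stay unchanged.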

The series unpacks each runner’s score, the structural reasons each framework scores where it does, and the specific failure modes worth knowing about — whether or not you use ACP.

Reproduce the scorecard on your stack.

Clone the repo, point at your ACP instance, run the benchmark. Same runner, same scenarios, same numbers. If you see different results, that’s either version drift or a gap we haven’t found.