Agentic Control Plane
Benchmark series

AgentGovBench

An open, NIST-mapped benchmark for AI agent governance. 48 scenarios across 8 categories test what every governance layer must enforce: identity propagation, per-user policy, delegation provenance, scope inheritance, rate-limit cascade, audit completeness, fail-mode discipline, cross-tenant isolation.

We benchmarked seven frameworks — CrewAI, LangChain/LangGraph, Claude Code, OpenAI Agents SDK, Anthropic Agent SDK, OpenAI Codex CLI, and Cursor — both natively and with ACP. Every result is reproducible against your own deployment.

Posts in this series

Read in order, or skip to the framework or topic that matters most.

  1.
    How we think about testing AI agent governance
    We published AgentGovBench — an open, NIST-mapped benchmark for AI agent governance — and ran it against ACP. Here's what broke, what we shipped, and how to run it against your own deployment.
    April 20, 2026
  2.
    CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
    We ran CrewAI through 48 governance scenarios — twice. Out of the box CrewAI scores at the vanilla floor. Wrapped in @governed it jumps to 40/48. Here's where the gap is and what it means for production CrewAI deployments.
    April 20, 2026
  3.
    CrewAI's task handoffs lose the audit trail — here's the gap and the fix
    When CrewAI's Hierarchical Process delegates from a manager to a worker, the worker's tool calls don't carry the chain. Even with @governed wrapping every tool, the audit log shows the worker as a top-level agent. Here's why, and what we're shipping.
    April 20, 2026
  4.
    LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
    We ran LangChain/LangGraph through 48 governance scenarios — twice. Same story as CrewAI: out of the box LangGraph is at the vanilla floor. Wrapped in @governed it jumps. Here's the per-category breakdown.
    April 20, 2026
  5.
    LangGraph's StateGraph checkpoints don't replay through governance
    When LangGraph resumes from a checkpoint, no governance pipeline re-runs against the replayed state. Policy changes between original run and replay are silently ignored. Here's the failure mode, the specific scenarios, and what to do.
    April 20, 2026
  6.
    Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
    Third framework in the series. Claude Code with no PreToolUse hook is at the vanilla floor. With ACP's hook installed, every Bash, Edit, and MCP call is governed. Here's the per-category breakdown — and the documented escape hatch.
    April 20, 2026
  7.
    Claude Code's --dangerously-skip-permissions disables every governance hook
    When a user runs Claude Code with --dangerously-skip-permissions, every PreToolUse and PostToolUse hook is silently disabled — including ACP's. The audit log goes silent. Here's the gap, what it means, and how to detect it server-side.
    April 20, 2026
  8.
    Decorator, proxy, hook — three patterns for agent governance, three different scorecards
    Why CrewAI + ACP scores 40/48 but Claude Code + ACP scores 43/48 with the same backend. The integration pattern shapes what governance can see — here are the three patterns and where each one wins or loses.
    April 20, 2026
  9.
    OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
    Fourth framework in the series — and the first proxy-pattern integration. OpenAI Agents SDK sits at the vanilla floor without governance; with ACP's proxy it reaches 45/48 — closest to pure ACP, because the proxy sees the full request payload.
    April 20, 2026
  10.
    Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
    Fifth framework, highest ACP-paired score so far. The Anthropic Agent SDK's TypeScript governHandlers wrapper hits 46/48 — one above the OpenAI Agents SDK proxy, three above the hook integrations, and six above the decorator patterns. Here's why and what to take away.
    April 20, 2026
  11.
    Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
    Sixth framework. OpenAI's Codex CLI uses the same hook protocol as Claude Code — and scores identically (43/48) when ACP's hook is installed. Plus one meaningful governance differentiator: Codex's auto mode keeps hooks firing.
    April 20, 2026
  12.
    Full scorecard: seven frameworks, 48 scenarios, one open benchmark
    We benchmarked seven AI agent frameworks — CrewAI, LangChain/LangGraph, Claude Code, OpenAI Agents SDK, Anthropic Agent SDK, Cursor, OpenAI Codex CLI — both natively and with ACP. Three integration patterns. Three score tiers. Here's the complete picture.
    April 20, 2026
  13.
    How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
    AgentGovBench scenarios cite specific NIST AI RMF 1.0 controls — MAP, MEASURE, MANAGE, GOVERN. Here's the full mapping for procurement teams that need to cite controls, not vendor claims.
    April 20, 2026
  14.
    Reproduce AgentGovBench on your stack — full setup guide
    Step-by-step guide to running the AgentGovBench scorecard against your own ACP deployment. Required env, Firebase setup, common issues, and how to interpret the results.
    April 20, 2026
  15.
    Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
    Seventh framework. Cursor's MCP integration only governs MCP-exposed tools. Internal IDE tools (Edit/Read/Bash) bypass governance entirely. The 37/48 ceiling reflects a structural limit no SDK improvement can close.
    April 20, 2026

Why this series exists

Every AI governance vendor has a feature matrix. Whether the per-user policy actually enforces correctly when the user spawns ten subagents in parallel — that's a test, not a feature. A buyer who trusts feature matrices is buying marketing; a buyer who wants guarantees needs a benchmark.

AgentGovBench takes the same scenarios that procurement teams actually need (mapped to NIST AI RMF 1.0) and runs them against any governance layer you can write a runner for. The scenarios don't know what ACP is. ACP is one runner among several — vanilla (no governance), audit_only (logging without enforcement), acp (full enforcement), and now per-framework runners for every popular AI client and SDK.
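One way to picture that runner contract is as a small interface: every runner executes the same scenario and reports an outcome the scorer can grade. The names below are illustrative, not the benchmark's actual API — a minimal sketch of the idea only.

```python
from typing import Protocol


class Runner(Protocol):
    """A governance layer under test. The scenarios never reference a
    specific product; each runner adapts one layer to the same contract."""

    name: str

    def execute(self, scenario: dict) -> dict:
        """Run one scenario and return an outcome for the scorer."""
        ...


class VanillaRunner:
    """No governance: every call is allowed and nothing is audited.
    A runner like this is what establishes the 13/48 floor."""

    name = "vanilla"

    def execute(self, scenario: dict) -> dict:
        return {"scenario": scenario.get("id"), "allowed": True, "audit_events": []}
```

An `acp` runner would implement the same `execute` signature but route each call through enforcement, so the scorer can compare outcomes without knowing which layer produced them.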

The series unpacks each runner’s score, the structural reasons each framework scores where it does, and the specific failure modes worth knowing about — whether or not you use ACP.

Reproduce the scorecard on your stack

Clone the repo, point at your ACP instance, run the benchmark. Same runner, same scenarios, same numbers. If you see different results, that’s either version drift or a gap we haven’t found.
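In outline, a reproduction run looks something like the following. The repo URL, environment variable names, and CLI entry point here are illustrative placeholders; the real ones live in the AgentGovBench setup guide linked above.

```shell
# Illustrative only -- substitute the actual repo URL and variable names
# from the AgentGovBench setup guide.
git clone https://example.com/agentgovbench.git
cd agentgovbench

# Point the runner at your own ACP deployment.
export ACP_BASE_URL="https://acp.your-domain.example"
export ACP_API_KEY="<your key>"

# Run all 48 scenarios against the ACP runner and print the scorecard.
python -m agentgovbench run --runner acp --scenarios all
```

Because the scenarios and scorer are fixed, any divergence from the published numbers narrows to two causes: version drift in your stack, or a real enforcement gap.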