Full scorecard: seven frameworks, 48 scenarios, one open benchmark
tl;dr
Two weeks of benchmark work, seven AI agent frameworks, 48 scenarios per framework, two configurations each (native and ACP-paired). All deterministic, all reproducible, all live against the production ACP gateway.
The bottom line:
| Framework | Pattern | Native | + ACP |
|---|---|---|---|
| CrewAI | Decorator (Python) | 13/48 | 40/48 |
| LangChain / LangGraph | Decorator (Python) | 13/48 | 40/48 |
| Cursor | MCP | 13/48 | 37/48 |
| Claude Code | Hook (CLI) | 13/48 | 43/48 |
| OpenAI Codex CLI | Hook (CLI) | 13/48 | 43/48 |
| OpenAI Agents SDK | Proxy | 13/48 | 45/48 |
| Anthropic Agent SDK | TS handler-wrapper | 13/48 | 46/48 ⭐ |
Two findings worth pulling out:
- Every framework natively scores 13/48, the same as `vanilla` (no governance at all). Not a runner artifact: every framework's bare default emits no audit and enforces no policy.
- Pattern shape determines the score, not the framework or the gateway. Decorator → ~40, MCP → varies, Hook → 43, Proxy → 45, TS-wrapper → 46. The same `/govern/tool-use` endpoint sits behind all of them.
Full per-category breakdown, methodology check, and reproducibility instructions below.
The complete table
Reading guide: every cell is <passed>/<total> for that category. The 48-scenario total is the sum across 8 categories × 6 scenarios each.
| Category | vanilla | audit-only | crewai_acp | langgraph_acp | cursor_acp | claude_code_acp | codex_acp | openai_agents_acp | anthropic_agent_sdk_acp | acp (direct) |
|---|---|---|---|---|---|---|---|---|---|---|
| Audit completeness | 1/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 |
| Cross-tenant isolation | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 |
| Delegation provenance | 0/6 | 5/6 | 2/6 | 2/6 | 4/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 |
| Fail-mode discipline | 3/6 | 4/6 | 6/6 | 6/6 | 3/6 | 4/6 | 4/6 | 6/6 | 6/6 | 6/6 |
| Identity propagation | 0/6 | 6/6 | 6/6 | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 |
| Per-user policy enforcement | 1/6 | 1/6 | 6/6 | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 |
| Rate-limit cascade | 3/6 | 3/6 | 6/6 | 6/6 | 4/6 | 5/6 | 5/6 | 5/6 | 6/6 | 5/6 |
| Scope inheritance | 1/6 | 1/6 | 4/6 | 4/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 |
| Total | 13/48 | 29/48 | 40/48 | 40/48 | 37/48 | 43/48 | 43/48 | 45/48 | 46/48 | 45/48 |
What the patterns reveal
The same governance backend produces meaningfully different scores depending on the integration pattern. This is the single most important finding from the series.
Decorator pattern (~40/48): CrewAI, LangChain/LangGraph wrap individual tool functions. Identity, audit, per-user policy, rate limits all clean. Loses on delegation_provenance (2/6) and scope_inheritance (4/6) because the decorator doesn’t see framework orchestration (Hierarchical Process handoffs in CrewAI, StateGraph state mutations in LangGraph). Both fixes are in the SDK 0.2.0 roadmap.
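A minimal sketch of the decorator shape, with a hypothetical `govern` callable standing in for the real ACP SDK client (every name here is illustrative, not the SDK's actual API):

```python
import functools

def governed(tool_name, govern):
    """Check each call with the governance gateway before running the tool.

    `govern` stands in for the client that would POST to /govern/tool-use
    and return a decision dict. All names here are illustrative.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            decision = govern(tool_name, kwargs)
            if decision.get("action") == "deny":
                raise PermissionError(f"{tool_name}: {decision.get('reason', 'denied')}")
            return fn(*args, **kwargs)  # the decorator sees this call...
        return wrapper
    return decorator

# ...but it never sees a Hierarchical Process handoff or a StateGraph state
# mutation, which is exactly where the 2/6 and 4/6 losses come from.

allow_all = lambda tool, args: {"action": "allow"}

@governed("search_web", govern=allow_all)
def search_web(query: str) -> str:
    return f"results for {query}"

print(search_web(query="acp"))  # results for acp
```

The wrapper only ever fires at the individual function boundary, which is why decorator runners are clean on identity and policy but blind to orchestration.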
MCP pattern (37/48): Cursor governs only the tools exposed through the MCP server. Internal IDE tools (Edit, Read, Bash) bypass MCP entirely, a structural gap of the MCP integration shape, not an ACP gap.
Hook pattern (43/48): Claude Code, Codex CLI. Best on delegation_provenance and scope_inheritance because the host’s hook payload natively carries chain context. Loses one scenario on fail_open_honored because hooks are fail-closed by design (Anthropic and OpenAI both chose safety over availability for their CLIs).
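The hook shape in miniature: the host pipes a JSON payload (tool name, input, chain context) to a script on stdin and reads its exit code, where 0 allows and a nonzero code blocks (Claude Code uses exit 2 for PreToolUse blocks). The local rule below is a stand-in for the real gateway call:

```python
import json
import sys

def decide(payload: dict) -> int:
    """Return 0 to allow the tool call, 2 to block it.

    In the real integration this is where the hook would POST the payload,
    including the chain context the host provides, to /govern/tool-use.
    The hardcoded rule below is only a stand-in.
    """
    tool = payload.get("tool_name", "")
    command = payload.get("tool_input", {}).get("command", "")
    if tool == "Bash" and "rm -rf" in command:
        return 2  # block
    return 0      # allow

# Wired up as a hook, the host pipes the payload in:
#   sys.exit(decide(json.load(sys.stdin)))
print(decide({"tool_name": "Bash", "tool_input": {"command": "rm -rf /tmp/x"}}))  # 2
```

Because the payload already carries chain context, hooks win on provenance; because a crashed hook has no exit code to inspect, the host fails closed, which costs the one fail-open scenario.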
Proxy pattern (45/48): OpenAI Agents SDK. The proxy sits at the natural request-serialization boundary; sees the full HTTP envelope. Same fail-closed design choice as the hook pattern.
TS handler-wrapper (46/48): Anthropic Agent SDK. Highest score. Single-agent loop, native dispatch boundary, both fail modes honored. The cleanest integration shape for governance because there’s the least framework abstraction between the wrapper and the actual tool execution.
The vanilla floor — every framework scores 13/48
Worth pausing on this. Seven frameworks, every single one at 13/48 native. That’s identical to the vanilla runner (no governance at all).
This isn’t suspicious — it’s the actual structural reality. No framework’s tool dispatch emits structured audit data without explicit callback wiring. No framework’s tool dispatch enforces per-user policy or rate limits without explicit governance integration.
The categories that vanilla “passes” (4/6 on cross-tenant, 3/6 on rate-limit, etc.) pass because those scenarios assert benign baseline behavior — a tool call that should be allowed is allowed. The categories vanilla fails (0/6 on identity propagation, 0/6 on delegation provenance) require enforcement that no framework provides natively.
If you’re building agents in any of these frameworks and you’ve heard “framework X has audit,” that audit is conditional on you wiring a callback handler. Out of the box: vanilla.
The 16-scenario gap — where governance products earn their keep
The jump from vanilla (13/48) to audit-only (29/48) is what a logging library can buy you: every call captured with attribution, provenance, and a timestamp. Nothing denied, nothing rate-limited.
The jump from audit-only (29/48) to ACP at 40-46/48 is enforcement: sixteen scenarios that ask "did the system actually stop the bad thing, not just record it?" That is the meaningful business value of a governance product over a logging library.
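The distinction in miniature (illustrative code, not the benchmark's):

```python
def audit_only_gateway(call, log):
    """Records everything, denies nothing: the 13 -> 29 jump."""
    log.append(call)
    return {"action": "allow"}

def enforcing_gateway(call, log, policy):
    """Records everything AND can deny: the 29 -> 40+ jump."""
    log.append(call)
    if not policy(call):
        return {"action": "deny", "reason": "policy"}
    return {"action": "allow"}

log = []
no_deletes = lambda call: "delete" not in call["tool"]
print(audit_only_gateway({"tool": "delete_user"}, log))             # allowed, merely logged
print(enforcing_gateway({"tool": "delete_user"}, log, no_deletes))  # denied
```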
The variance among ACP-paired runners (40 to 46) is the integration tax — how much context gets lost in the wrapper-to-gateway path. Best case: the wrapper sits at the native dispatch boundary (TS handler-wrapper, 46/48). Worst case: the wrapper sees individual tools but misses framework orchestration (decorator pattern, 40/48). Both are real, both are honest.
Three documented declinations
ACP doesn’t pass three scenarios. They’re documented in every result file and every per-framework post:
- `cross_tenant_isolation.03_user_scope_does_not_leak` and `.05_admin_cannot_cross` — gateway code is shipped, awaiting the Cloud Run flip to multi-tenant deployment mode
- `scope_inheritance.04_task_narrowing` — SDK-side intent-aware enforcement, not a gateway concern by design
Per-framework declinations layer on top:
- `fail_open_honored` declined for Claude Code, Codex CLI, and OpenAI Agents SDK (all fail-closed by design)
- `per_user_policy_enforcement.03` declined for runner-side reasons
These are in the scorecard JSON. Not hidden, not glossed over. A benchmark only earns credibility by publishing its own gaps.
How to verify any of this on your stack
```bash
git clone https://github.com/agentic-control-plane/agentgovbench
cd agentgovbench
python -m venv .venv && source .venv/bin/activate
pip install -e .

# Install the framework you want to benchmark
pip install crewai acp-crewai   # or whichever
# or for TS frameworks: npm install in a separate workspace

# Point at YOUR ACP project (Firebase service account, see README)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/firebase-creds.json
export AGB_TENANT_ID=...
export FIREBASE_WEB_API_KEY=...

# Run
python -m benchmark.cli run --runner crewai_acp --out my-result.json
```
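To sanity-check a run, you can tally the per-category numbers from the result file yourself. The schema sketched here (a `scenarios` list with `category` and `passed` fields) is an assumption; check the repo's README for the real one:

```python
import json
from collections import Counter

def summarize(result: dict) -> dict:
    """Tally passed/total per category from a parsed result file.

    Assumes a `scenarios` list of {"category": ..., "passed": ...} records;
    the real schema may differ -- see the agentgovbench README.
    """
    passed, total = Counter(), Counter()
    for s in result.get("scenarios", []):
        total[s["category"]] += 1
        passed[s["category"]] += int(bool(s["passed"]))
    return {c: f"{passed[c]}/{total[c]}" for c in sorted(total)}

# With a real run: summarize(json.load(open("my-result.json")))
sample = {"scenarios": [
    {"category": "identity_propagation", "passed": True},
    {"category": "identity_propagation", "passed": False},
    {"category": "audit_completeness", "passed": True},
]}
print(summarize(sample))  # {'audit_completeness': '1/1', 'identity_propagation': '1/2'}
```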
If you see different numbers than ours, that’s either version drift on your ACP deployment or a gap we haven’t found. Either way, please file an issue at the agentgovbench repo.
What’s next
- NIST AI RMF 1.0 mapping post — how the 48 scenarios map to specific control families, for procurement teams
- `acp-crewai@0.2.0` + `acp-langchain@0.2.0` — closing the chain-context gap. Decorator-pattern scores should rise from 40 → 44.
- First competitor runner — Guardrails AI, Credo AI, or NeMo Guardrails. PRs welcome.
- Cursor scorecard deep-dive — the MCP integration boundary deserves its own post.
This series will continue at the cadence of one major drop every 2-3 days for the next month. The benchmark page at /benchmark is the canonical scorecard — bookmark it.
Receipts:
- agentgovbench repo — all runners + results
- /benchmark page — live, sortable, always current
- Methodology post — how the benchmark is designed
- Decorator vs proxy vs hook
- All per-framework scorecards: CrewAI · LangGraph · Claude Code · Codex CLI · OpenAI Agents SDK · Anthropic Agent SDK
- 1. How we think about testing AI agent governance
- 2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
- 3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
- 4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
- 5. LangGraph's StateGraph checkpoints don't replay through governance
- 6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
- 7. Claude Code's --dangerously-skip-permissions disables every governance hook
- 8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
- 9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
- 10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
- 11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
- 12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark · you are here
- 13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
- 14. Reproduce AgentGovBench on your stack — full setup guide
- 15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
- 16. Recommended governance deployment patterns — pick the one that scores highest for your stack
- 17. What our benchmark told us about our own product — six fixes we're shipping