Agentic Control Plane
Benchmark series · Part 12 of 17
AgentGovBench →

Full scorecard: seven frameworks, 48 scenarios, one open benchmark

David Crowe · 6 min read
Tags: benchmark · governance · scorecard · agentgovbench · comparison

tl;dr

Two weeks of benchmark work, seven AI agent frameworks, 48 scenarios per framework, two configurations each (native and ACP-paired). All deterministic, all reproducible, all live against the production ACP gateway.

The bottom line:

| Framework | Pattern | Native | + ACP |
| --- | --- | --- | --- |
| CrewAI | Decorator (Python) | 13/48 | 40/48 |
| LangChain / LangGraph | Decorator (Python) | 13/48 | 40/48 |
| Cursor | MCP | 13/48 | 37/48 |
| Claude Code | Hook (CLI) | 13/48 | 43/48 |
| OpenAI Codex CLI | Hook (CLI) | 13/48 | 43/48 |
| OpenAI Agents SDK | Proxy | 13/48 | 45/48 |
| Anthropic Agent SDK | Handler-wrapper (TS) | 13/48 | 46/48 |

Two findings worth pulling out:

  1. Every framework natively scores 13/48 — same as vanilla (no governance at all). Not a runner artifact: every framework’s bare default emits no audit and enforces no policy.
  2. Pattern shape determines the score, not the framework or the gateway. Decorator → 40, MCP → 37, Hook → 43, Proxy → 45, TS handler-wrapper → 46. The same /govern/tool-use endpoint sits behind all of them.

Full per-category breakdown, methodology check, and reproducibility instructions below.

The complete table

Reading guide: every cell is <passed>/<total> for that category. The 48-scenario total is the sum across 8 categories × 6 scenarios each.

| Category | vanilla | audit-only | crewai_acp | langgraph_acp | cursor_acp | claude_code_acp | codex_acp | openai_agents_acp | anthropic_agent_sdk_acp | acp (direct) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Audit completeness | 1/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 |
| Cross-tenant isolation | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 |
| Delegation provenance | 0/6 | 5/6 | 2/6 | 2/6 | 4/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 |
| Fail-mode discipline | 3/6 | 4/6 | 6/6 | 6/6 | 3/6 | 4/6 | 4/6 | 6/6 | 6/6 | 6/6 |
| Identity propagation | 0/6 | 6/6 | 6/6 | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 |
| Per-user policy enforcement | 1/6 | 1/6 | 6/6 | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 |
| Rate-limit cascade | 3/6 | 3/6 | 6/6 | 6/6 | 4/6 | 5/6 | 5/6 | 5/6 | 6/6 | 5/6 |
| Scope inheritance | 1/6 | 1/6 | 4/6 | 4/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 |
| Total | 13/48 | 29/48 | 40/48 | 40/48 | 37/48 | 43/48 | 43/48 | 45/48 | 46/48 | 45/48 |

What the patterns reveal

The same governance backend produces meaningfully different scores depending on the integration pattern. This is the single most important finding from the series.

Decorator pattern (40/48): CrewAI and LangChain/LangGraph wrap individual tool functions. Identity, audit, per-user policy, and rate limits are all clean. It loses on delegation_provenance (2/6) and scope_inheritance (4/6) because the decorator doesn't see framework orchestration (Hierarchical Process handoffs in CrewAI, StateGraph state mutations in LangGraph). Both fixes are on the SDK 0.2.0 roadmap.
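A minimal sketch of the decorator shape, with hypothetical names: `govern_tool_use` stands in for a POST to the /govern/tool-use endpoint (stubbed in-process here so the snippet runs standalone), and `governed` is an illustrative wrapper, not the SDK's actual API.

```python
import functools

def govern_tool_use(tool_name, user_id, args):
    # Stub: a real client would call the ACP gateway and return its verdict.
    # Hypothetical deny rule for illustration only.
    denied = {("delete_records", "user_readonly")}
    return {"allowed": (tool_name, user_id) not in denied}

def governed(tool_name):
    """Wrap a single tool function with a governance check before dispatch."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(user_id, **kwargs):
            verdict = govern_tool_use(tool_name, user_id, kwargs)
            if not verdict["allowed"]:
                raise PermissionError(f"{tool_name} denied for {user_id}")
            return fn(user_id, **kwargs)
        return inner
    return wrap

@governed("delete_records")
def delete_records(user_id, table):
    return f"{user_id} deleted rows from {table}"
```

The shape also explains the score: the check fires only when a wrapped function is called, so anything the framework does between tool calls (a handoff, a state mutation) passes by unseen.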

MCP pattern (37/48): Cursor only governs tools exposed through the MCP server. Internal IDE tools (Edit, Read, Bash) bypass MCP entirely — structural gap of the MCP integration shape, not an ACP gap.

Hook pattern (43/48): Claude Code, Codex CLI. Best on delegation_provenance and scope_inheritance because the host’s hook payload natively carries chain context. Loses one scenario on fail_open_honored because hooks are fail-closed by design (Anthropic and OpenAI both chose safety over availability for their CLIs).
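To make the fail-closed point concrete, here is a toy decision function in the shape of what a hook script might compute. The payload field names (delegation_chain, gateway_verdict) are assumptions for illustration, not the hosts' actual schemas.

```python
import json

def hook_decision(payload):
    """Toy hook logic: deny unless the gateway returned a verdict (fail-closed),
    and surface the delegation chain the host already provides."""
    chain = payload.get("delegation_chain", [])
    verdict = payload.get("gateway_verdict")
    if verdict is None:
        # No answer from the gateway: the safe default is to block.
        return {"allow": False, "reason": "fail-closed: no gateway verdict"}
    return {"allow": bool(verdict.get("allowed")), "chain_depth": len(chain)}

# Example payload as the host might serialize it (field names assumed).
raw = '{"tool": "Bash", "delegation_chain": ["planner", "coder"], "gateway_verdict": {"allowed": true}}'
print(hook_decision(json.loads(raw)))
```

The chain context arrives for free in the payload, which is why this pattern does well on provenance, and the "no verdict means no" branch is the single scenario it gives up.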

Proxy pattern (45/48): OpenAI Agents SDK. The proxy sits at the natural request-serialization boundary; sees the full HTTP envelope. Same fail-closed design choice as the hook pattern.

TS handler-wrapper (46/48): Anthropic Agent SDK. Highest score. Single-agent loop, native dispatch boundary, both fail modes honored. The cleanest integration shape for governance because there’s the least framework abstraction between the wrapper and the actual tool execution.

The vanilla floor — every framework scores 13/48

Worth pausing on this. Seven frameworks, every single one at 13/48 native. That’s identical to the vanilla runner (no governance at all).

This isn’t suspicious — it’s the actual structural reality. No framework’s tool dispatch emits structured audit data without explicit callback wiring. No framework’s tool dispatch enforces per-user policy or rate limits without explicit governance integration.

The categories that vanilla “passes” (4/6 on cross-tenant, 3/6 on rate-limit, etc.) pass because those scenarios assert benign baseline behavior — a tool call that should be allowed is allowed. The categories vanilla fails (0/6 on identity propagation, 0/6 on delegation provenance) require enforcement that no framework provides natively.

If you’re building agents in any of these frameworks and you’ve heard “framework X has audit,” that audit is conditional on you wiring a callback handler. Out of the box: vanilla.
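What "wiring a callback handler" amounts to, modeled loosely on callback interfaces like LangChain's BaseCallbackHandler (which does expose on_tool_start/on_tool_end), but written framework-free so it stands alone:

```python
import time

class AuditHandler:
    """Collects structured audit events. Nothing is recorded unless this
    handler is explicitly registered with the framework's tool dispatch."""
    def __init__(self):
        self.events = []

    def on_tool_start(self, tool_name, inputs, user_id=None):
        self.events.append({"event": "tool_start", "tool": tool_name,
                            "user": user_id, "inputs": inputs,
                            "ts": time.time()})

    def on_tool_end(self, tool_name, output):
        self.events.append({"event": "tool_end", "tool": tool_name})

handler = AuditHandler()
handler.on_tool_start("search", {"q": "quarterly report"}, user_id="alice")
handler.on_tool_end("search", "3 results")
```

Delete the two explicit handler calls and the audit trail is empty, which is the vanilla floor in miniature.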

The 16-scenario gap — where governance products earn their keep

The jump from vanilla (13/48) to audit_only (29/48) is what a logging library can buy you: every call captured with attribution, provenance, timestamp. Nothing denied. Nothing rate-limited.

The jump from audit_only (29/48) to ACP at 40-46/48 is enforcement. Sixteen scenarios that ask “did the system actually stop the bad thing, not just record it?” That’s the meaningful business value of a governance product over a logging library.

The variance among ACP-paired runners (40 to 46) is the integration tax — how much context gets lost in the wrapper-to-gateway path. Best case: the wrapper sits at the native dispatch boundary (TS handler-wrapper, 46/48). Worst case: the wrapper sees individual tools but misses framework orchestration (decorator pattern, 40/48). Both are real, both are honest.

Three documented declinations

ACP doesn’t pass three scenarios. They’re documented in every result file and every per-framework post:

  • cross_tenant_isolation.03_user_scope_does_not_leak and .05_admin_cannot_cross — gateway code is shipped, awaiting Cloud Run flip to multi-tenant deployment mode
  • scope_inheritance.04_task_narrowing — SDK-side intent-aware enforcement, not a gateway concern by design

Per-framework declinations layer on top:

  • fail_open_honored declined for Claude Code, Codex CLI, OpenAI Agents SDK (all fail-closed by design)
  • per_user_policy_enforcement.03 declined for runner-side reasons

These are in the scorecard JSON. Not hidden, not glossed over. A benchmark only earns credibility by publishing its own gaps.

How to verify any of this on your stack

git clone https://github.com/agentic-control-plane/agentgovbench
cd agentgovbench
python -m venv .venv && source .venv/bin/activate
pip install -e .

# Install the framework you want to benchmark
pip install crewai acp-crewai            # or whichever
# or for TS frameworks: npm install in a separate workspace

# Point at YOUR ACP project (Firebase service account, see README)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/firebase-creds.json
export AGB_TENANT_ID=...
export FIREBASE_WEB_API_KEY=...

# Run
python -m benchmark.cli run --runner crewai_acp --out my-result.json

If you see different numbers than ours, that’s either version drift on your ACP deployment or a gap we haven’t found. Either way, please file an issue at the agentgovbench repo.
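If you want to eyeball a result file before comparing numbers, something like the snippet below works. Note the `categories → passed/total` schema here is an assumption inferred from the scorecard table, not the repo's documented format; check the README for the real one.

```python
import json

def summarize(result):
    """Print per-category and total passed/total from an assumed result schema."""
    cats = result["categories"]
    passed = sum(c["passed"] for c in cats.values())
    total = sum(c["total"] for c in cats.values())
    lines = [f"{name}: {c['passed']}/{c['total']}" for name, c in sorted(cats.items())]
    lines.append(f"total: {passed}/{total}")
    return "\n".join(lines)

# In practice: result = json.load(open("my-result.json")); shown inline here.
example = json.loads("""{"categories": {
  "audit_completeness": {"passed": 6, "total": 6},
  "scope_inheritance":  {"passed": 4, "total": 6}}}""")
print(summarize(example))  # last line: total: 10/12
```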

What’s next

  • NIST AI RMF 1.0 mapping post — how the 48 scenarios map to specific control families, for procurement teams
  • acp-crewai@0.2.0 + acp-langchain@0.2.0 — closing the chain-context gap. Decorator-pattern scores should rise from 40 → 44.
  • First competitor runner — Guardrails AI, Credo AI, or NeMo Guardrails. PRs welcome.
  • Cursor scorecard deep-dive — the MCP integration boundary deserves its own post.

This series will continue at the cadence of one major drop every 2-3 days for the next month. The benchmark page at /benchmark is the canonical scorecard — bookmark it.



More in AgentGovBench
  1. How we think about testing AI agent governance
  2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
  3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
  4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
  5. LangGraph's StateGraph checkpoints don't replay through governance
  6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
  7. Claude Code's --dangerously-skip-permissions disables every governance hook
  8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
  9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
  10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
  11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
  12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark · you are here
  13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
  14. Reproduce AgentGovBench on your stack — full setup guide
  15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
  16. Recommended governance deployment patterns — pick the one that scores highest for your stack
  17. What our benchmark told us about our own product — six fixes we're shipping