Full scorecard: seven frameworks, 48 scenarios, one open benchmark
tl;dr
Two weeks of benchmark work, seven AI agent frameworks, 48 scenarios per framework, two configurations each (native and ACP-paired). All deterministic, all reproducible, all live against the production ACP gateway.
The bottom line:
| Framework | Pattern | Native | + ACP |
|---|---|---|---|
| CrewAI | Decorator (Python) | 13/48 | 40/48 |
| LangChain / LangGraph | Decorator (Python) | 13/48 | 40/48 |
| Cursor | MCP | 13/48 | 37/48 |
| Claude Code | Hook (CLI) | 13/48 | 43/48 |
| OpenAI Codex CLI | Hook (CLI) | 13/48 | 43/48 |
| OpenAI Agents SDK | Proxy | 13/48 | 45/48 |
| Anthropic Agent SDK | TS handler-wrapper | 13/48 | 46/48 ⭐ |
Two findings worth pulling out:
- Every framework natively scores 13/48, the same as `vanilla` (no governance at all). Not a runner artifact: every framework's bare default emits no audit and enforces no policy.
- Pattern shape determines the score, not the framework or the gateway. Decorator → ~40, MCP → varies, Hook → 43, Proxy → 45, TS-wrapper → 46. The same `/govern/tool-use` endpoint sits behind all of them.
Full per-category breakdown, methodology check, and reproducibility instructions below.
The complete table
Reading guide: every cell is <passed>/<total> for that category. The 48-scenario total is the sum across 8 categories × 6 scenarios each.
| Category | vanilla | audit-only | crewai_acp | langgraph_acp | cursor_acp | claude_code_acp | codex_acp | openai_agents_acp | anthropic_agent_sdk_acp | acp (direct) |
|---|---|---|---|---|---|---|---|---|---|---|
| Audit completeness | 1/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 |
| Cross-tenant isolation | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 | 4/6 |
| Delegation provenance | 0/6 | 5/6 | 2/6 | 2/6 | 4/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 |
| Fail-mode discipline | 3/6 | 4/6 | 6/6 | 6/6 | 3/6 | 4/6 | 4/6 | 6/6 | 6/6 | 6/6 |
| Identity propagation | 0/6 | 6/6 | 6/6 | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 |
| Per-user policy enforcement | 1/6 | 1/6 | 6/6 | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 |
| Rate-limit cascade | 3/6 | 3/6 | 6/6 | 6/6 | 4/6 | 5/6 | 5/6 | 5/6 | 6/6 | 5/6 |
| Scope inheritance | 1/6 | 1/6 | 4/6 | 4/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 | 6/6 |
| Total | 13/48 | 29/48 | 40/48 | 40/48 | 37/48 | 43/48 | 43/48 | 45/48 | 46/48 | 45/48 |
What the patterns reveal
The same governance backend produces meaningfully different scores depending on the integration pattern. This is the single most important finding from the series.
Decorator pattern (~40/48): CrewAI, LangChain/LangGraph wrap individual tool functions. Identity, audit, per-user policy, rate limits all clean. Loses on delegation_provenance (2/6) and scope_inheritance (4/6) because the decorator doesn’t see framework orchestration (Hierarchical Process handoffs in CrewAI, StateGraph state mutations in LangGraph). Both fixes are in the SDK 0.2.0 roadmap.
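A minimal sketch of the decorator shape, with a hypothetical `govern` callable standing in for the real ACP SDK client (every name here is illustrative, not the SDK's actual API):

```python
import functools

def governed(tool_name, govern):
    """Check each call with the governance gateway before running the tool.

    `govern` stands in for the client that would POST to /govern/tool-use
    and return a decision dict. All names here are illustrative.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            decision = govern(tool_name, kwargs)
            if decision.get("action") == "deny":
                raise PermissionError(f"{tool_name}: {decision.get('reason', 'denied')}")
            return fn(*args, **kwargs)  # the decorator sees this call...
        return wrapper
    return decorator

# ...but it never sees a Hierarchical Process handoff or a StateGraph state
# mutation, which is exactly where the 2/6 and 4/6 losses come from.

allow_all = lambda tool, args: {"action": "allow"}

@governed("search_web", govern=allow_all)
def search_web(query: str) -> str:
    return f"results for {query}"

print(search_web(query="acp"))  # results for acp
```

The wrapper only ever fires at the individual function boundary, which is why decorator runners are clean on identity and policy but blind to orchestration.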
MCP pattern (37/48): Cursor governs only the tools exposed through the MCP server. Internal IDE tools (Edit, Read, Bash) bypass MCP entirely, a structural gap of the MCP integration shape, not an ACP gap.
Hook pattern (43/48): Claude Code, Codex CLI. Best on delegation_provenance and scope_inheritance because the host’s hook payload natively carries chain context. Loses one scenario on fail_open_honored because hooks are fail-closed by design (Anthropic and OpenAI both chose safety over availability for their CLIs).
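The hook shape in miniature: the host pipes a JSON payload (tool name, input, chain context) to a script on stdin and reads its exit code, where 0 allows and a nonzero code blocks (Claude Code uses exit 2 for PreToolUse blocks). The local rule below is a stand-in for the real gateway call:

```python
import json
import sys

def decide(payload: dict) -> int:
    """Return 0 to allow the tool call, 2 to block it.

    In the real integration this is where the hook would POST the payload,
    including the chain context the host provides, to /govern/tool-use.
    The hardcoded rule below is only a stand-in.
    """
    tool = payload.get("tool_name", "")
    command = payload.get("tool_input", {}).get("command", "")
    if tool == "Bash" and "rm -rf" in command:
        return 2  # block
    return 0      # allow

# Wired up as a hook, the host pipes the payload in:
#   sys.exit(decide(json.load(sys.stdin)))
print(decide({"tool_name": "Bash", "tool_input": {"command": "rm -rf /tmp/x"}}))  # 2
```

Because the payload already carries chain context, hooks win on provenance; because a crashed hook has no exit code to inspect, the host fails closed, which costs the one fail-open scenario.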
Proxy pattern (45/48): OpenAI Agents SDK. The proxy sits at the natural request-serialization boundary; sees the full HTTP envelope. Same fail-closed design choice as the hook pattern.
TS handler-wrapper (46/48): Anthropic Agent SDK. Highest score. Single-agent loop, native dispatch boundary, both fail modes honored. The cleanest integration shape for governance because there’s the least framework abstraction between the wrapper and the actual tool execution.
The vanilla floor — every framework scores 13/48
Worth pausing on this. Seven frameworks, every single one at 13/48 native. That’s identical to the vanilla runner (no governance at all).
This isn’t suspicious — it’s the actual structural reality. No framework’s tool dispatch emits structured audit data without explicit callback wiring. No framework’s tool dispatch enforces per-user policy or rate limits without explicit governance integration.
The categories that vanilla “passes” (4/6 on cross-tenant, 3/6 on rate-limit, etc.) pass because those scenarios assert benign baseline behavior — a tool call that should be allowed is allowed. The categories vanilla fails (0/6 on identity propagation, 0/6 on delegation provenance) require enforcement that no framework provides natively.
If you’re building agents in any of these frameworks and you’ve heard “framework X has audit,” that audit is conditional on you wiring a callback handler. Out of the box: vanilla.
The 16-scenario gap — where governance products earn their keep
The jump from vanilla (13/48) to audit-only (29/48) is what a logging library can buy you: every call captured with attribution, provenance, and a timestamp. Nothing denied, nothing rate-limited.
The jump from audit-only (29/48) to ACP at 40-46/48 is enforcement: sixteen scenarios that ask "did the system actually stop the bad thing, not just record it?" That is the meaningful business value of a governance product over a logging library.
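The distinction in miniature (illustrative code, not the benchmark's):

```python
def audit_only_gateway(call, log):
    """Records everything, denies nothing: the 13 -> 29 jump."""
    log.append(call)
    return {"action": "allow"}

def enforcing_gateway(call, log, policy):
    """Records everything AND can deny: the 29 -> 40+ jump."""
    log.append(call)
    if not policy(call):
        return {"action": "deny", "reason": "policy"}
    return {"action": "allow"}

log = []
no_deletes = lambda call: "delete" not in call["tool"]
print(audit_only_gateway({"tool": "delete_user"}, log))             # allowed, merely logged
print(enforcing_gateway({"tool": "delete_user"}, log, no_deletes))  # denied
```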
The variance among ACP-paired runners (40 to 46) is the integration tax — how much context gets lost in the wrapper-to-gateway path. Best case: the wrapper sits at the native dispatch boundary (TS handler-wrapper, 46/48). Worst case: the wrapper sees individual tools but misses framework orchestration (decorator pattern, 40/48). Both are real, both are honest.
Three documented declinations
ACP doesn’t pass three scenarios. They’re documented in every result file and every per-framework post:
- `cross_tenant_isolation.03_user_scope_does_not_leak` and `.05_admin_cannot_cross` — gateway code is shipped, awaiting the Cloud Run flip to multi-tenant deployment mode
- `scope_inheritance.04_task_narrowing` — SDK-side intent-aware enforcement, not a gateway concern by design
Per-framework declinations layer on top:
- `fail_open_honored` declined for Claude Code, Codex CLI, and OpenAI Agents SDK (all fail-closed by design)
- `per_user_policy_enforcement.03` declined for runner-side reasons
These are in the scorecard JSON. Not hidden, not glossed over. A benchmark only earns credibility by publishing its own gaps.
How to verify any of this on your stack
```bash
git clone https://github.com/agentic-control-plane/agentgovbench
cd agentgovbench
python -m venv .venv && source .venv/bin/activate
pip install -e .

# Install the framework you want to benchmark
pip install crewai acp-crewai   # or whichever
# or for TS frameworks: npm install in a separate workspace

# Point at YOUR ACP project (Firebase service account, see README)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/firebase-creds.json
export AGB_TENANT_ID=...
export FIREBASE_WEB_API_KEY=...

# Run
python -m benchmark.cli run --runner crewai_acp --out my-result.json
```
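To sanity-check a run, you can tally the per-category numbers from the result file yourself. The schema sketched here (a `scenarios` list with `category` and `passed` fields) is an assumption; check the repo's README for the real one:

```python
import json
from collections import Counter

def summarize(result: dict) -> dict:
    """Tally passed/total per category from a parsed result file.

    Assumes a `scenarios` list of {"category": ..., "passed": ...} records;
    the real schema may differ -- see the agentgovbench README.
    """
    passed, total = Counter(), Counter()
    for s in result.get("scenarios", []):
        total[s["category"]] += 1
        passed[s["category"]] += int(bool(s["passed"]))
    return {c: f"{passed[c]}/{total[c]}" for c in sorted(total)}

# With a real run: summarize(json.load(open("my-result.json")))
sample = {"scenarios": [
    {"category": "identity_propagation", "passed": True},
    {"category": "identity_propagation", "passed": False},
    {"category": "audit_completeness", "passed": True},
]}
print(summarize(sample))  # {'audit_completeness': '1/1', 'identity_propagation': '1/2'}
```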
If you see different numbers than ours, that’s either version drift on your ACP deployment or a gap we haven’t found. Either way, please file an issue at the agentgovbench repo.
What’s next
- NIST AI RMF 1.0 mapping post — how the 48 scenarios map to specific control families, for procurement teams
- `acp-crewai@0.2.0` + `acp-langchain@0.2.0` — closing the chain-context gap. Decorator-pattern scores should rise from 40 → 44.
- First competitor runner — Guardrails AI, Credo AI, or NeMo Guardrails. PRs welcome.
- Cursor scorecard deep-dive — the MCP integration boundary deserves its own post.
This series will continue at the cadence of one major drop every 2-3 days for the next month. The benchmark page at /benchmark is the canonical scorecard — bookmark it.
Receipts:
- agentgovbench repo — all runners + results
- /benchmark page — live, sortable, always current
- Methodology post — how the benchmark is designed
- Decorator vs proxy vs hook
- All per-framework scorecards: CrewAI · LangGraph · Claude Code · Codex CLI · OpenAI Agents SDK · Anthropic Agent SDK
- 1. How we think about testing AI agent governance
- 2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
- 3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
- 4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
- 5. LangGraph's StateGraph checkpoints don't replay through governance
- 6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
- 7. Claude Code's --dangerously-skip-permissions disables every governance hook
- 8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
- 9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
- 10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
- 11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
- 12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark · you are here
- 13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
- 14. Reproduce AgentGovBench on your stack — full setup guide
- 15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
- 16. Recommended governance deployment patterns — pick the one that scores highest for your stack
- 17. What our benchmark told us about our own product — six fixes we're shipping