Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
tl;dr
Third framework in the series. Same shape, different pattern — Claude Code uses a hook rather than a decorator.
| Configuration | Score |
|---|---|
| Claude Code (no PreToolUse hook) | 13/48 — same vanilla floor as CrewAI/LangGraph native |
| Claude Code + ACP hook | 43/48 |
Claude Code shipping without a hook isn’t surprising — it has interactive permission prompts that are user-friendly but produce no audit data and enforce no policy beyond “did the user approve this individual call.” For a security/compliance audience, interactive prompts are noise, not signal.
The ACP hook (~/.acp/govern.mjs, installed by curl -sf https://agenticcontrolplane.com/install.sh | bash) intercepts every PreToolUse and PostToolUse event. Every Bash, Edit, Write, WebFetch, and MCP tool call passes through /govern/tool-use → workspace policy → audit.
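For orientation, this is roughly what the installer wires into Claude Code's settings. The matcher and command are illustrative — install.sh may generate something slightly different — but the shape (a `hooks` map keyed by event, each entry pairing a matcher with commands) follows Claude Code's hook configuration:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "*",
        "hooks": [{ "type": "command", "command": "node ~/.acp/govern.mjs" }]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "*",
        "hooks": [{ "type": "command", "command": "node ~/.acp/govern.mjs" }]
      }
    ]
  }
}
```

The same script handles both events; it reads the event payload on stdin and decides per call.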
Two things distinguish this from the decorator-pattern frameworks (CrewAI, LangGraph):

- **The hook is fail-closed by default.** When the gateway is unreachable, Claude Code's PreToolUse hook denies the call — the opposite of `acp-crewai` and `acp-langchain`, which fail open. Anthropic chose safety over availability for the agent shell. This means `fail_mode_discipline.02_fail_open_honored` is documented as a declined scenario for `claude_code_acp`.
- **`--dangerously-skip-permissions` disables ALL hooks**, including ACP's. There's no server-side detection — the hook simply doesn't fire, and the audit goes silent. This is by design (Anthropic ships an escape hatch), but it's a known governance gap that any deployment should mitigate at the OS or shell level. We cover it in tomorrow's deep dive.
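The fail-closed behavior is simple to sketch. This is a minimal illustration, not the real govern.mjs — the gateway URL and field names are assumptions, and only the decision logic matters: a network failure maps to a deny, never an allow.

```javascript
// Sketch of the fail-closed decision in a PreToolUse hook.
// Helper names and the gateway URL are illustrative, not the
// actual ~/.acp/govern.mjs implementation.

// Map the gateway's verdict — or its absence — to a hook decision.
// `null` means the gateway was unreachable: deny, don't allow.
function decide(gatewayVerdict) {
  if (gatewayVerdict === null) {
    return {
      permissionDecision: "deny",
      reason: "ACP gateway unreachable (fail-closed)",
    };
  }
  return gatewayVerdict.allow
    ? { permissionDecision: "allow" }
    : { permissionDecision: "deny", reason: gatewayVerdict.reason };
}

// Call the gateway; collapse any network error to `null` so the
// fail-closed branch in decide() is the only place that handles it.
async function askGateway(event) {
  try {
    const res = await fetch("http://localhost:8787/govern/tool-use", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(event),
    });
    return await res.json();
  } catch {
    return null; // unreachable → fail closed
  }
}
```

A fail-open variant would differ in exactly one line: the `null` branch would return an allow. That one-line choice is what `fail_mode_discipline.02_fail_open_honored` probes.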
Per-category breakdown
| Category | Native | + ACP | Note |
|---|---|---|---|
| Audit completeness | 1/6 | 6/6 | Every tool call → /govern/tool-use → audit row. |
| Cross-tenant isolation | 4/6 | 4/6 | Two declined (single-tenant deployment mode). |
| Delegation provenance | 0/6 | 6/6 | Best in class. Claude Code’s hook payload carries chain context that decorator-pattern SDKs don’t. |
| Fail-mode discipline | 3/6 | 4/6 | Hook is fail-closed; fail_open_honored declined as a design choice. |
| Identity propagation | 0/6 | 6/6 | ~/.acp/credentials carries the user’s bearer token. |
| Per-user policy enforcement | 1/6 | 6/6 | Workspace policy applied per call. |
| Rate-limit cascade | 3/6 | 5/6 | Per-user budget enforced via the bearer token identity. |
| Scope inheritance | 1/6 | 6/6 | Best in class — same hook-payload advantage as delegation provenance. |
| Total | 13/48 | 43/48 | |
Why Claude Code + ACP scores higher than CrewAI/LangGraph + ACP
Worth pausing on: Claude Code + ACP scores 43/48 while CrewAI + ACP and LangGraph + ACP both score 40/48. Same governance backend, same gateway. So why the 3-scenario gap?
The hook protocol carries richer per-call context than the @governed decorator does today. Specifically:
- Claude Code’s hook payload includes the full subagent chain (when the agent tool spawns subagents like Explore or Plan).
- Claude Code’s permission_mode field maps cleanly to ACP agent tiers.
- Claude Code emits one event per tool call; there’s no SDK abstraction layer that loses context.
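The bullets above are concrete in the hook payload itself. An illustrative PreToolUse event follows — `hook_event_name`, `tool_name`, `tool_input`, and `permission_mode` are real hook-input fields, while the subagent-chain field name here is hypothetical (the exact schema varies):

```json
{
  "hook_event_name": "PreToolUse",
  "session_id": "…",
  "permission_mode": "default",
  "tool_name": "Bash",
  "tool_input": { "command": "npm test" },
  "agent_chain": ["main", "Explore"]
}
```

A decorator, by contrast, sees only the wrapped function's arguments — there is no place in that call signature for a subagent chain or a permission mode to ride along.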
The decorator-pattern SDKs (acp-crewai, acp-langchain) wrap individual tool functions but don’t see the framework’s internal task-routing or state-mutation events. We covered this gap in the CrewAI deep dive and the LangGraph one. Both SDKs are getting fixes in 0.2.0 — when those land, decorator scores will close most of the gap to Claude Code.
This per-framework variation is exactly what a credible benchmark should produce. Different patterns score differently. Same gateway, different exposure.
What this means for Claude Code on your team
Two situations:
Solo dev. Claude Code’s interactive prompts are fine for personal use. You see every call and approve it. ACP adds value if you want a structured audit log you can query later, or if you’re using subagents that bypass the prompt for routine calls.
Team deployment. Claude Code without a hook gives security/IT teams nothing — no audit log they can ingest, no policy they can enforce, no per-user attribution. The install.sh is one command that installs the hook for everyone in the team. From that moment, every Bash, Edit, and MCP call from every team member’s Claude Code is logged and governable.
Critically: the hook is idempotent and adds to rather than replaces existing Claude Code hooks. Teams already using Claude Code’s hook system for other purposes don’t lose them.
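The additive, idempotent merge can be sketched like this — a hypothetical helper, assuming hooks live in a settings object under `hooks.PreToolUse`; the real install.sh may edit the file differently:

```javascript
// Idempotent, additive hook install sketch (hypothetical helper;
// not the actual install.sh logic).
function installAcpHook(settings) {
  const hook = { type: "command", command: "node ~/.acp/govern.mjs" };
  const entry = { matcher: "*", hooks: [hook] };
  settings.hooks ??= {};
  settings.hooks.PreToolUse ??= [];
  // Idempotent: a second run finds the hook and does nothing.
  const already = settings.hooks.PreToolUse.some((e) =>
    e.hooks?.some((h) => h.command === hook.command),
  );
  // Additive: push alongside existing entries, never overwrite them.
  if (!already) settings.hooks.PreToolUse.push(entry);
  return settings;
}
```

Running it twice leaves one ACP entry; running it against settings that already contain other hooks leaves those hooks in place.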
What’s next
Tomorrow: the --dangerously-skip-permissions deep dive. The escape hatch, what it bypasses, and what teams should do about it (server-side detection, OS-level wrappers, shell aliases, audit-gap alerting).
This week: OpenAI Agents SDK + Anthropic Agent SDK scorecards. Both, with ACP, score nearly identically to pure ACP because they're proxy-based — a different shape from the hook and decorator patterns we've covered.
Receipts:
- agentgovbench repo
- claude_code_native runner
- claude_code_acp runner
- claude-code-native-v0.1.json results
- claude-code-acp-v0.1.json results
- /integrations/claude-code install guide
- Yesterday’s LangGraph posts
- Methodology post
- 1. How we think about testing AI agent governance
- 2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
- 3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
- 4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
- 5. LangGraph's StateGraph checkpoints don't replay through governance
- 6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48. · you are here
- 7. Claude Code's --dangerously-skip-permissions disables every governance hook
- 8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
- 9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
- 10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
- 11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
- 12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
- 13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
- 14. Reproduce AgentGovBench on your stack — full setup guide
- 15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.