Agentic Control Plane
Benchmark series · Part 6 of 15

Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.

David Crowe · 4 min read
claude-code anthropic benchmark governance agentgovbench

tl;dr

Third framework in the series. Same shape, different pattern — Claude Code uses a hook rather than a decorator.

| Configuration | Score |
| --- | --- |
| Claude Code (no PreToolUse hook) | 13/48 — same vanilla floor as CrewAI/LangGraph native |
| Claude Code + ACP hook | 43/48 |

Claude Code shipping without a hook isn’t surprising — it has interactive permission prompts that are user-friendly but produce no audit data and enforce no policy beyond “did the user approve this individual call.” For a security/compliance audience, interactive prompts are noise, not signal.

The ACP hook (`~/.acp/govern.mjs`, installed by `curl -sf https://agenticcontrolplane.com/install.sh | bash`) intercepts every PreToolUse and PostToolUse event. Every Bash, Edit, Write, WebFetch, and MCP tool call passes through `/govern/tool-use` → workspace policy → audit.
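A minimal sketch of what that interception might look like. The `/govern/tool-use` path, the bearer-token identity, and the fail-closed behavior are from this post; the function signature and event field names are assumptions, not the actual `govern.mjs`:

```javascript
// Illustrative only -- not the real ~/.acp/govern.mjs. Forwards each tool
// event to the ACP gateway, which applies workspace policy and writes the
// audit row; the hook just enforces the verdict.
async function governToolUse(event, { gatewayUrl, token, fetchImpl = fetch }) {
  try {
    const res = await fetchImpl(`${gatewayUrl}/govern/tool-use`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`, // per-user identity (~/.acp/credentials)
      },
      body: JSON.stringify({
        phase: event.phase,     // "PreToolUse" or "PostToolUse"
        tool: event.tool_name,  // "Bash", "Edit", "Write", "WebFetch", MCP tools
        input: event.tool_input,
      }),
    });
    const verdict = await res.json(); // workspace policy decision
    return verdict.allow ? "allow" : "deny";
  } catch {
    return "deny"; // gateway unreachable: the hook fails closed
  }
}
```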

Two things distinguish the hook from the decorator-pattern frameworks (CrewAI, LangGraph):

  1. Hook is fail-closed by default. When the gateway is unreachable, Claude Code’s PreToolUse hook denies the call — opposite of acp-crewai and acp-langchain which fail-open. Anthropic chose safety over availability for the agent shell. This means fail_mode_discipline.02_fail_open_honored is documented as a declined scenario for claude_code_acp.

  2. `--dangerously-skip-permissions` disables ALL hooks, including ACP's. There's no server-side detection — the hook simply doesn't fire, and the audit trail goes silent. This is by design (Anthropic ships an escape hatch), but it's a known governance gap that any deployment should mitigate at the OS or shell level. We'll cover this in tomorrow's deep dive.
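The two failure disciplines in point 1 reduce to a single question: what does the interceptor do when the gateway call throws? A sketch under that assumption (function and option names are illustrative, not any SDK's real API):

```javascript
// "closed" models Claude Code's PreToolUse hook; "open" models the
// acp-crewai / acp-langchain decorators. Names are hypothetical.
async function governed(callGateway, { failMode }) {
  try {
    const verdict = await callGateway();
    return verdict.allow ? "allow" : "deny";
  } catch {
    // Gateway unreachable: the hook denies, the decorators allow.
    return failMode === "closed" ? "deny" : "allow";
  }
}
```

With the gateway down, the same tool call is blocked under the hook and waved through under the decorators — safety over availability versus the reverse.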

Per-category breakdown

| Category | Native | + ACP | Note |
| --- | --- | --- | --- |
| Audit completeness | 1/6 | 6/6 | Every tool call → /govern/tool-use → audit row. |
| Cross-tenant isolation | 4/6 | 4/6 | Two declined (single-tenant deployment mode). |
| Delegation provenance | 0/6 | 6/6 | Best in class. Claude Code's hook payload carries chain context that decorator-pattern SDKs don't. |
| Fail-mode discipline | 3/6 | 4/6 | Hook is fail-closed; fail_open_honored declined as a design choice. |
| Identity propagation | 0/6 | 6/6 | ~/.acp/credentials carries the user's bearer token. |
| Per-user policy enforcement | 1/6 | 6/6 | Workspace policy applied per call. |
| Rate-limit cascade | 3/6 | 5/6 | Per-user budget enforced via the bearer-token identity. |
| Scope inheritance | 1/6 | 6/6 | Best in class — same hook-payload advantage as delegation provenance. |
| **Total** | **13/48** | **43/48** | |

Why Claude Code + ACP scores higher than CrewAI/LangGraph + ACP

Worth pausing on: Claude Code + ACP scores 43/48 while CrewAI + ACP and LangGraph + ACP both score 40/48. Same governance backend, same gateway. So why the 3-scenario gap?

The hook protocol carries richer per-call context than the @governed decorator does today. Specifically:

  • Claude Code’s hook payload includes the full subagent chain (when the agent tool spawns subagents like Explore or Plan).
  • Claude Code’s permission_mode field maps cleanly to ACP agent tiers.
  • Claude Code emits one event per tool call; there’s no SDK abstraction layer that loses context.
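The difference in context is easy to see side by side. These payload shapes are purely illustrative — field names are assumptions, not the documented Claude Code hook schema or the acp-* SDK wire format:

```javascript
// What the hook sees per call (illustrative fields):
const hookEvent = {
  tool_name: "Bash",
  tool_input: { command: "npm test" },
  permission_mode: "default",          // maps to an ACP agent tier
  subagent_chain: ["main", "Explore"], // full chain when subagents spawn
};

// What a decorator around one tool function sees (illustrative fields):
const decoratorEvent = {
  tool: "run_tests",
  args: { command: "npm test" },
  // no permission mode, no subagent chain: the decorator wraps the
  // function, not the framework's task routing
};

// Delegation-provenance and scope-inheritance scenarios can only pass
// when chain context is present in the event:
function hasProvenance(event) {
  return Array.isArray(event.subagent_chain) && event.subagent_chain.length > 0;
}
```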

The decorator-pattern SDKs (acp-crewai, acp-langchain) wrap individual tool functions but don’t see the framework’s internal task-routing or state-mutation events. We covered this gap in the CrewAI deep dive and the LangGraph one. Both SDKs are getting fixes in 0.2.0 — when those land, decorator scores will close most of the gap to Claude Code.

This per-framework variation is exactly what a credible benchmark should produce. Different patterns score differently. Same gateway, different exposure.

What this means for Claude Code in your team

Two situations:

Solo dev. Claude Code’s interactive prompts are fine for personal use. You see every call and approve it. ACP adds value if you want a structured audit log you can query later, or if you’re using subagents that bypass the prompt for routine calls.

Team deployment. Claude Code without a hook gives security/IT teams nothing — no audit log they can ingest, no policy they can enforce, no per-user attribution. The install.sh is a single command that installs the hook for everyone on the team. From that moment, every Bash, Edit, and MCP call from every team member's Claude Code is logged and governable.

Critically: the hook install is idempotent and adds to rather than replaces existing Claude Code hooks. Teams already using Claude Code's hook system for other purposes keep their existing hooks.
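A hypothetical sketch of what "idempotent and additive" means for the install step, assuming a settings object with a `hooks.PreToolUse` array (the real install.sh and Claude Code's settings schema may differ):

```javascript
// Illustrative install logic: register the ACP hook exactly once,
// never clobbering hooks that are already configured.
function installAcpHook(settings, hookCommand = "node ~/.acp/govern.mjs") {
  const hooks = settings.hooks ?? (settings.hooks = {});
  const pre = hooks.PreToolUse ?? (hooks.PreToolUse = []);
  // Idempotent: running the installer twice adds the hook once.
  if (!pre.some((h) => h.command === hookCommand)) {
    pre.push({ command: hookCommand }); // additive: existing hooks survive
  }
  return settings;
}
```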

What’s next

Tomorrow: the --dangerously-skip-permissions deep dive. The escape hatch, what it bypasses, and what teams should do about it (server-side detection, OS-level wrappers, shell aliases, audit-gap alerting).
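Of those mitigations, audit-gap alerting is the one you can sketch from the audit log alone. Purely illustrative, assuming you can list active Claude Code users (e.g. from SSO or device inventory) and query ACP audit rows per user:

```javascript
// Hypothetical audit-gap check: a user who is active in the window but
// has zero audit rows may be running with --dangerously-skip-permissions
// (hooks disabled, audit silent). All names are assumptions.
function findAuditGaps(activeUsers, auditEvents, windowMs, now = Date.now()) {
  const seen = new Set(
    auditEvents
      .filter((e) => now - e.timestamp <= windowMs)
      .map((e) => e.user)
  );
  return activeUsers.filter((u) => !seen.has(u));
}
```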

This week: OpenAI Agents SDK and Anthropic Agent SDK scorecards. Both score close to pure ACP when governed because they're proxy-based — a different shape from the hook and decorator patterns we've covered so far.



More in AgentGovBench
  1. How we think about testing AI agent governance
  2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
  3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
  4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
  5. LangGraph's StateGraph checkpoints don't replay through governance
  6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48. · you are here
  7. Claude Code's --dangerously-skip-permissions disables every governance hook
  8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
  9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
  10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
  11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
  12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
  13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
  14. Reproduce AgentGovBench on your stack — full setup guide
  15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
