Agentic Control Plane
Benchmark series · Part 10 of 17

Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.

David Crowe · 4 min read
Tags: anthropic · anthropic-agent-sdk · benchmark · governance · agentgovbench

tl;dr

Fifth framework. Highest ACP-paired score we’ve seen.

| Configuration | Score |
| --- | --- |
| Anthropic Agent SDK (no governance wrapper) | 13/48 — vanilla floor (fifth framework confirmation) |
| Anthropic Agent SDK + ACP (governHandlers) | 46/48 ⭐ best in class |

One point ahead of the OpenAI Agents SDK proxy (45/48). Three ahead of the decorator-pattern frameworks (40/48). Same gateway behind all three configurations.

Why the higher score? The TypeScript governHandlers wrapper sits closer to the request envelope than the Python decorator does, and single-agent Claude tool-use loops are the most tightly scoped pattern: there is less framework orchestration to lose context across.

Why Anthropic Agent SDK native scores at the floor

The Anthropic Agent SDK (and the Claude Agent SDK) ship as TypeScript libraries for tool-use loops around Claude. Out of the box you get:

  • Tool definitions with handler functions
  • Message lifecycle management
  • Optional thinking/extended-thinking blocks
  • Session and state primitives in newer versions

What you don’t get without explicit wiring:

  • Per-end-user identity propagation (one ANTHROPIC_API_KEY per process)
  • A workspace policy concept
  • Per-tool scope enforcement
  • SIEM-ingestible audit log
  • Rate-limit cascade discipline at the application layer
  • Fail-mode discipline for an external governance plane

Score: 13/48. Identical floor to every other framework’s bare default. The pattern holds across CrewAI, LangGraph, Claude Code, OpenAI Agents SDK, Codex CLI, Cursor, and now Anthropic Agent SDK.

What ACP adds

@agenticcontrolplane/governance-anthropic exports two functions:

```typescript
import express from "express";
import Anthropic from "@anthropic-ai/sdk";
import { governHandlers, withContext } from "@agenticcontrolplane/governance-anthropic";

const app = express();
const anthropic = new Anthropic(); // used inside the tool-use loop

// Wrap every tool handler so each invocation passes through governance.
const handlers = governHandlers({
  web_search: async ({ query }) => doSearch(query),
  send_email: async ({ to, subject, body }) => sendMail(to, subject, body),
});

app.post("/run", async (req, res) => {
  const userToken = req.headers.authorization?.replace("Bearer ", "");
  // Bind the end-user's JWT for the duration of this request.
  await withContext({ userToken }, async () => {
    // run the Anthropic tool-use loop with the wrapped handlers
    // ...
  });
});
```

governHandlers wraps every handler in the map. withContext binds the end-user JWT for the duration of the request. Same /govern/tool-use endpoint as every other ACP integration.
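The wrapped handler map plugs into the standard Messages tool-use loop: read the tool_use blocks from a response, invoke the matching handler, and send back tool_result blocks. Here is a minimal sketch of that dispatch step — the block shapes mirror the Anthropic Messages API content blocks, but the `dispatchToolUse` helper and the surrounding loop are illustrative, not part of the ACP package:

```typescript
// Shapes mirroring Anthropic Messages API content blocks.
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: unknown };
type ToolResultBlock = { type: "tool_result"; tool_use_id: string; content: string };

type HandlerMap = Record<string, (input: any) => Promise<unknown>>;

// Run every tool_use block in a response through the (governed) handler map
// and produce the tool_result blocks for the follow-up user message.
async function dispatchToolUse(
  blocks: ToolUseBlock[],
  handlers: HandlerMap,
): Promise<ToolResultBlock[]> {
  return Promise.all(
    blocks.map(async (block) => {
      const handler = handlers[block.name];
      if (!handler) throw new Error(`no handler for tool ${block.name}`);
      const result = await handler(block.input);
      return {
        type: "tool_result" as const,
        tool_use_id: block.id,
        content: JSON.stringify(result),
      };
    }),
  );
}
```

Inside the /run route above, the loop would call the Messages API, pass any tool_use blocks through this dispatch, append the results, and repeat until the model stops requesting tools. Because the handlers are the governed versions, every iteration of the loop is checked and logged.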

Full integration guide →

Per-category breakdown

| Category | Native | + ACP | Note |
| --- | --- | --- | --- |
| Audit completeness | 1/6 | 6/6 | Every handler invocation logged. |
| Cross-tenant isolation | 4/6 | 4/6 | Two declined (single-tenant deployment mode). |
| Delegation provenance | 0/6 | 6/6 | Best in class — single-agent loop has no orchestration to lose. |
| Fail-mode discipline | 3/6 | 6/6 | Both fail-open and fail-closed honored. |
| Identity propagation | 0/6 | 6/6 | withContext binds the JWT per request. |
| Per-user policy enforcement | 1/6 | 6/6 | Allow/deny/redact per call. |
| Rate-limit cascade | 3/6 | 6/6 | Per-user budget enforced. |
| Scope inheritance | 1/6 | 6/6 | Best in class — same root cause as delegation provenance. |
| **Total** | **13/48** | **46/48** | |

Why this is the highest score

Three structural reasons the Anthropic Agent SDK + ACP wins:

1. Single-agent loop pattern. Most uses of the Anthropic Agent SDK are single-agent — one Claude instance processing tool-use loops. There’s no inter-agent handoff context to lose. CrewAI and LangGraph have orchestration layers (Hierarchical Process, StateGraph supervisors) that the decorator can’t see; Anthropic SDK doesn’t.

2. The TypeScript wrapper sits at the native dispatch boundary. governHandlers wraps the handler map that the Anthropic SDK calls into directly. There’s no SDK abstraction between the wrapper and the actual tool execution.

3. Both fail modes honored. Unlike Claude Code’s fail-closed-only hook, the Anthropic Agent SDK wrapper can implement both fail-open and fail-closed paths because it’s library code, not a CLI’s permission system. It picks up the extra fail_open_honored scenario that hook patterns can’t.
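The fail-mode difference is easy to sketch as library code: when the governance plane is unreachable, a fail-open policy lets the tool call proceed without a verdict, while a fail-closed policy rejects it. Everything in this sketch — the `GovernFn` signature, the `withFailMode` helper — is hypothetical plumbing for illustration, not the ACP package’s actual API:

```typescript
type FailMode = "open" | "closed";

// Hypothetical governance check; in practice this would call the
// /govern/tool-use endpoint. It throws if the plane is unreachable.
type GovernFn = (tool: string, input: unknown) => Promise<{ allowed: boolean }>;

// Wrap a single handler with a chosen fail mode. Because this is library
// code (not a CLI permission hook), both paths are available.
function withFailMode<T>(
  handler: (input: T) => Promise<unknown>,
  govern: GovernFn,
  tool: string,
  mode: FailMode,
) {
  return async (input: T) => {
    let allowed: boolean;
    try {
      allowed = (await govern(tool, input)).allowed;
    } catch {
      // Governance plane unreachable: the fail mode decides.
      if (mode === "closed") throw new Error(`${tool} blocked (fail-closed)`);
      allowed = true; // fail-open: proceed without a verdict
    }
    if (!allowed) throw new Error(`${tool} denied by policy`);
    return handler(input);
  };
}
```

A hook-based integration only ever sees the host’s deny path, so it can take the `closed` branch but never the `open` one; a handler-map wrapper can take either per tool.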

When to pick Anthropic Agent SDK + ACP

If you’re building a single-agent Claude tool-use loop in TypeScript and want governance, this is the integration shape that scores highest. Specifically:

  • Customer-service bots powered by Claude
  • Document-processing agents with a few tools
  • Per-request workflows where one Claude instance handles the request end-to-end

If you’re building multi-agent systems with delegation, you’ll be in CrewAI or LangGraph territory and the decorator pattern’s 40/48 is the realistic ceiling until SDK 0.2.0 closes the chain-context gap.

What this means for the bigger picture

We’ve now scored five ACP-paired frameworks. The pattern is clear:

| Pattern | Score range | Why |
| --- | --- | --- |
| Decorator (Python multi-agent) | 40/48 | Loses framework orchestration context |
| Hook (CLI) | 43/48 | Host’s payload carries chain context; fail-closed only |
| Proxy (HTTP) | 45/48 | Sits at the request serialization boundary |
| TS handler-map wrapper (single-agent) | 46/48 | Native dispatch boundary; both fail modes |

The variance is real, structural, and consistent across runs. The pattern shape determines the score, not the underlying gateway.


More in AgentGovBench
  1. How we think about testing AI agent governance
  2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
  3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
  4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
  5. LangGraph's StateGraph checkpoints don't replay through governance
  6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
  7. Claude Code's --dangerously-skip-permissions disables every governance hook
  8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
  9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
  10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework. · you are here
  11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
  12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
  13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
  14. Reproduce AgentGovBench on your stack — full setup guide
  15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
  16. Recommended governance deployment patterns — pick the one that scores highest for your stack
  17. What our benchmark told us about our own product — six fixes we're shipping