Agentic Control Plane
Benchmark series · Part 16 of 17
AgentGovBench →

Recommended governance deployment patterns — pick the one that scores highest for your stack

David Crowe · 4 min read
governance · deployment · recommendation · agentgovbench · architecture

tl;dr

After benchmarking seven AI agent frameworks against the same governance backend, one finding stands out:

The integration pattern matters more than the framework brand. CrewAI deployments and LangGraph deployments both score the same 40/48. An Anthropic Agent SDK deployment scores 46/48: six scenarios of additional enforcement, just from a better-shaped wrapper.

This post is the customer-facing recommendation that follows: if you’re picking how to deploy a governed agent, here’s what the data says you should choose.

The recommendation table

| You're building | Pick | Score | Why |
|---|---|---|---|
| Single-agent Claude tool-use loop in TypeScript | Anthropic Agent SDK + governHandlers | 46/48 | Native dispatch boundary, both fail modes honored |
| Multi-agent system with OpenAI/compatible models | OpenAI Agents SDK + ACP base_url swap | 45/48 | Proxy sits at request serialization boundary |
| Code-editing agent in your terminal | Claude Code or Codex CLI + ACP hook | 43/48 | Hook payload carries chain context |
| Multi-agent framework in Python (CrewAI/LangGraph) | Decorator (@governed) | 40/48 | Will rise to ~44 when SDK 0.2.0 ships chain context propagation |
| IDE-driven agent (Cursor) | MCP integration + server-side mitigations | 37/48 | MCP can't reach internal IDE tools: structural ceiling |

Underlying score data: /benchmark and the full scorecard post.
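To make the decorator row concrete, here is a minimal Python sketch of the pattern's shape. Everything in it — the `governed` decorator name, the `check_policy` call, the static policy table, the deny behavior — is a hypothetical illustration of how a wrapper like ACP's `@governed` could work, not the SDK's actual API.

```python
import functools

# Hypothetical stand-in for the governance backend's policy store;
# a real deployment would make a network call instead.
POLICY = {"send_slack_message": "allow", "delete_repo": "deny"}

class PolicyDenied(Exception):
    """Raised when the governance layer rejects a tool call."""

def check_policy(tool_name, agent_id):
    # Default-deny: unknown tools are blocked.
    return POLICY.get(tool_name, "deny")

def governed(tool_name, agent_id="agent-1"):
    """Wrap a tool function so every invocation is policy-checked."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            verdict = check_policy(tool_name, agent_id)
            if verdict != "allow":
                raise PolicyDenied(f"{agent_id} -> {tool_name}: {verdict}")
            result = fn(*args, **kwargs)
            # An audit record (tool, args, result, agent) would be
            # emitted here before returning.
            return result
        return wrapper
    return decorate

@governed("send_slack_message")
def send_slack_message(channel, text):
    return f"posted to {channel}"
```

The sketch also shows why this pattern sits at 40/48 today: the wrapper sees only the function boundary, so a sub-agent spawned elsewhere in the process carries no parent identity — the chain-context gap the 0.2.0 SDK is meant to close.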

Pick by what you’re trying to govern

Governance over external API calls (Slack, GitHub, Stripe, your own backend, customer data) — every pattern here covers this well, so pick based on UX preference. Every score in the 40–46 range covers the categories that matter for external-tool governance.

Governance over agent delegation chains (multi-agent systems where one agent spawns another) — proxy or hook pattern wins. Anthropic Agent SDK + governHandlers (46) or OpenAI Agents SDK + base_url swap (45) capture chain context cleanly. Decorator pattern is closing the gap in 0.2.0.
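A sketch of why the proxy pattern captures chains cleanly: every request already serializes through the client's `base_url`, so pointing it at the gateway and attaching identity headers is the whole integration. The proxy URL and header names below are assumptions for illustration, not ACP's actual wire contract.

```python
# Hypothetical gateway address and header names -- illustrative only.
ACP_PROXY = "https://acp-gateway.internal/v1"

def governed_client_config(agent_id, parent_chain=()):
    """Build kwargs for an OpenAI-compatible client so all traffic
    routes through the governance proxy with per-agent attribution."""
    return {
        "base_url": ACP_PROXY,
        "default_headers": {
            "x-acp-agent-id": agent_id,
            # The delegation chain travels with every request, so a
            # spawned sub-agent stays attributable to its parent.
            "x-acp-chain": "/".join((*parent_chain, agent_id)),
        },
    }

# A planner spawning a worker simply extends the chain:
planner = governed_client_config("planner")
worker = governed_client_config("worker", parent_chain=("planner",))
# With the openai package this would be e.g. OpenAI(**worker);
# not imported here to keep the sketch self-contained.
```

The same header-based attribution is what makes this pattern a fit for multi-tenant SaaS later in this post: the tenant boundary lives at the network layer, out of the agent's reach.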

Governance over code editing / file ops in an IDE — none of the integration patterns fully reach here. IDE tools dispatch through the IDE’s engine without serializing through any protocol an external governance layer can intercept. You need server-side mitigations (git hooks, branch protection, CI policies, network-layer enforcement) for these. ACP can’t help directly.
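Because MCP never sees IDE-internal edits, enforcement has to move to where the changes land. Here is a minimal sketch of one server-side mitigation of the kind listed above: a pre-receive-style check that rejects pushes touching protected paths. The glob list and function shape are illustrative, not a prescribed ACP feature.

```python
from fnmatch import fnmatch

# Illustrative protected-path globs; real policy would come from config.
PROTECTED = ["infra/prod/*", ".github/workflows/*", "secrets/*"]

def check_push(changed_paths):
    """Return (path, pattern) violations for a pushed changeset.
    Intended to run from a git pre-receive hook or a CI policy job."""
    violations = []
    for path in changed_paths:
        for pattern in PROTECTED:
            if fnmatch(path, pattern):
                violations.append((path, pattern))
    return violations
```

The point of putting the check here is that it fires regardless of which IDE tool produced the edit, MCP-visible or internal.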

Governance over LLM cost / token usage — orthogonal. ACP governs tool calls and actions, not LLM spend. Pair with Portkey, LiteLLM, or your provider’s per-key budgets for cost attribution.

Pick by the trust profile of who’s using it

Internal team, sophisticated users — any pattern is fine. Hook patterns (Claude Code, Codex CLI) give the smoothest interactive UX. Trust-but-verify model.

Internal team, non-technical users — proxy or TS handler-wrapper. The user can’t accidentally bypass network-layer governance the way they could disable a hook.

External customers in a multi-tenant SaaS — proxy pattern (45). Network-layer enforcement is hardest for customers to bypass, and per-agent attribution via headers is built for multi-tenancy.

Compliance-heavy environments (regulated industries) — proxy or TS handler-wrapper. Both score 45-46 and align with audit completeness, identity propagation, and per-user policy enforcement at near-ceiling. The compliance auditor citation story is cleanest with these.

Coding agents on developer machines — hook pattern with the --dangerously-skip-permissions mitigations. Combine with shell aliases, MDM/EDR alerts, and audit-silence anomaly detection on the server side.
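The audit-silence idea is worth making concrete: a governed agent with a live session but no audit events is a stronger bypass signal than any single denied call, because disabling a hook silences the trail rather than tripping it. A minimal server-side sketch, with the threshold and event shape as assumptions:

```python
from datetime import datetime, timedelta

SILENCE_THRESHOLD = timedelta(minutes=30)  # assumption: tune per team

def silent_agents(active_sessions, audit_events, now):
    """Flag agents with an active session but no audit event within
    the threshold -- a hint that hooks were disabled client-side."""
    last_seen = {}
    for event in audit_events:
        agent, ts = event["agent_id"], event["ts"]
        if agent not in last_seen or ts > last_seen[agent]:
            last_seen[agent] = ts
    flagged = []
    for agent in active_sessions:
        seen = last_seen.get(agent)
        if seen is None or now - seen > SILENCE_THRESHOLD:
            flagged.append(agent)
    return sorted(flagged)
```

Run on a schedule against the session store and audit log, this catches the `--dangerously-skip-permissions` case the hook itself cannot report.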

When to choose Cursor anyway

Cursor + ACP scores 37/48 for structural reasons — Cursor's internal tools never touch MCP. But Cursor is still a great choice for many teams if you understand the trade-off:

  • The score reflects the IDE primitive operations (Edit, Read, Bash) that bypass MCP. MCP-exposed tools (the external services you connect Cursor to) score the same as proxy-pattern integrations.
  • For external-service governance, Cursor + ACP MCP gives you full audit + policy + rate limits.
  • For internal-tool governance, you need server-side mitigations regardless of which IDE you pick — this isn’t unique to Cursor.

So: pick Cursor if you love the IDE, accept the 37/48 with eyes open, and add network-layer / git-layer policies for what MCP can’t reach. The same caveat applies to any IDE that has internal tool primitives outside an external dispatch protocol.

Migration path if you’re already deployed

If you’re running:

  • CrewAI in production → stay. The 40/48 today rises to ~44 in acp-crewai@0.2.0 (shipping soon). For now: prefer Sequential to Hierarchical Process, avoid checkpoint-heavy patterns.
  • LangGraph with checkpoint-heavy state → review the StateGraph governance gap post. Workarounds exist; SDK fix in 0.2.0.
  • Claude Code with --dangerously-skip-permissions use → ship the shell alias mitigation today. Anomaly alerting in the dashboard ships in two weeks.
  • Cursor with ambient tool access → audit which tools are MCP vs internal. Add server-side enforcement for the internal-tool surface. ACP’s MCP server only governs what MCP carries.

What this whole exercise tells us

Pattern shape determines the governance ceiling. Same gateway, seven frameworks, four patterns, four score tiers (37, 40, 43, 45-46). If governance score matters for your deployment, pick the pattern that scores highest for your shape of agent — single-agent loops use TS handler-wrapper, multi-agent systems use proxy, CLIs use hooks, IDE tools use MCP + server-side mitigations.

The framework is downstream of the pattern. Pick the pattern first.



More in AgentGovBench
  1. How we think about testing AI agent governance
  2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
  3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
  4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
  5. LangGraph's StateGraph checkpoints don't replay through governance
  6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
  7. Claude Code's --dangerously-skip-permissions disables every governance hook
  8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
  9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
  10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
  11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
  12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
  13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
  14. Reproduce AgentGovBench on your stack — full setup guide
  15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
  16. Recommended governance deployment patterns — pick the one that scores highest for your stack · you are here
  17. What our benchmark told us about our own product — six fixes we're shipping