What our benchmark told us about our own product — six fixes we're shipping
tl;dr
We built AgentGovBench to benchmark governance products. We ran it against ours. The benchmark told us six things to fix. This post is the dated roadmap for each.
Vendors who ship benchmarks and then act on the gaps the benchmark surfaces are the ones procurement should trust. Vendors who ship benchmarks and ignore the gaps are running marketing campaigns.
The six items, in order of leverage:
- acp-crewai@0.2.0 + acp-langchain@0.2.0: chain context propagation. +4 to both decorator-pattern scores. Shipping next week.
- Multi-tenant deployment mode flip. +2 to every ACP-paired score. Shipping in 2-3 weeks (deployment migration).
- Task-scoped narrowing in the gateway. +1 across all runners. Q3 2026.
- acp-langchain checkpoint replay helper. Niche but real. Ships with 0.2.0.
- Audit-silence anomaly detection in the dashboard. Mitigates --dangerously-skip-permissions. Shipping in 2 weeks.
- MCP server-side enforcement guide for IDEs. Not a code fix — a deployment pattern doc. Shipping with this post (see below).
#1 — Chain context propagation in decorator SDKs
The gap: acp-crewai and acp-langchain both score 40/48, with 5 scenarios short of ACP’s 45/48 ceiling. The miss is concentrated in delegation_provenance (2/6) and scope_inheritance (4/6).
Root cause: The @governed decorator wraps individual tool functions. It doesn’t see CrewAI’s task handoffs or LangGraph’s StateGraph node-to-node transitions. So when a worker agent calls a tool, the gateway sees the call as originating from a top-level agent rather than as the third hop in a chain.
The fix: Extend GovernanceContext with agent_chain: list[str]. Make install_crew_hooks(crew) (CrewAI) and a new install_langgraph_hooks(graph) (LangGraph) update the active chain as agents change. The @governed wrapper picks up the updated context on the next call.
Estimated impact: +4 scenarios across both SDKs (delegation_provenance: 2 → 6; the remaining scope_inheritance point belongs to fix #3). Both decorator-pattern scores rise from 40 → 44.
Status: PR draft open. Shipping acp-crewai@0.2.0 + acp-langchain@0.2.0 next week. Score re-published in results/crewai-acp-v0.2.json and results/langgraph-acp-v0.2.json.
#2 — Multi-tenant deployment mode flip
The gap: Two of ACP’s three documented declinations (cross_tenant_isolation.03_user_scope_does_not_leak and .05_admin_cannot_cross) are blocked on the gateway running in single-tenant mode.
Root cause: The gateway code to honor path-based tenant routing has shipped (commit a920e5a: resolveHookIdentity prefers the URL slug when set and verifies membership). But the deployed Cloud Run service runs with the TENANT_ID env var set, so every request resolves to the default tenant. Flipping to multi-tenant mode is a deployment config change with real blast radius for existing customers.
The fix: Two-phase migration:
- Phase 1 (now): all new tenants are provisioned with explicit slug routing. Existing tenants keep working via the env-default fallback.
- Phase 2 (target: 2-3 weeks): drop the TENANT_ID env var, require an explicit slug for all requests, and communicate the breaking change to existing tenants with a 2-week migration window.
Estimated impact: +2 scenarios on every ACP-paired runner (cross_tenant_isolation: 4/6 → 6/6). Brings best-case ACP score from 46 → 48.
Status: Phase 1 in progress. Phase 2 deployment ETA early May.
#3 — Task-scoped narrowing in the gateway
The gap: scope_inheritance.04_task_narrowing is declined for ACP because we enforce task-scoped narrowing in the SDK layer (where intent-aware enforcement belongs by design), but the benchmark scenario asserts a gateway-only check.
Root cause: When an orchestrator declares delegated_scopes for a subagent, the subagent shouldn’t be able to pivot to tools requiring scopes outside that set — even if the underlying user has them. We enforce this client-side via the SDK’s runner layer. The gateway doesn’t (and arguably shouldn’t) know what “delegation” means semantically.
The fix: Add a gateway-side mode where delegation chain + declared scopes propagate via headers, and the gateway intersects scope-against-required-scope per-call. Optional, opt-in per workspace. Doesn’t replace SDK enforcement; layers on top for defense in depth.
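Per call, the gateway check reduces to a set comparison. A sketch with hypothetical header names (the actual propagation format is exactly what the design doc will pin down):

```python
def check_delegated_call(headers: dict[str, str],
                         required_scopes: set[str]) -> bool:
    """Gateway-side narrowing: a subagent in a declared delegation chain
    may only call tools whose required scopes fall inside the declared set.

    Header names are illustrative placeholders, not the shipped protocol.
    """
    chain = headers.get("X-ACP-Agent-Chain", "")
    if not chain:
        # No delegation declared: fall through to the normal
        # user-scope checks; this mode is additive, not a replacement.
        return True
    declared = set(headers.get("X-ACP-Delegated-Scopes", "").split())
    return required_scopes <= declared

# A worker delegated only repo:read can read...
ok = check_delegated_call(
    {"X-ACP-Agent-Chain": "root>worker",
     "X-ACP-Delegated-Scopes": "repo:read"},
    {"repo:read"},
)
# ...but cannot pivot to a write tool, even if the underlying user could.
blocked = check_delegated_call(
    {"X-ACP-Agent-Chain": "root>worker",
     "X-ACP-Delegated-Scopes": "repo:read"},
    {"repo:write"},
)
```

Note the deliberate division of labor: the gateway only intersects sets; what "delegation" means semantically stays in the SDK layer.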
Estimated impact: +1 scenario on every ACP-paired runner. Decorator scores: 40 → 45 with #1 + #3.
Status: Design doc in progress. Q3 2026 target.
#4 — LangGraph checkpoint replay helper
The gap: LangGraph’s StateGraph checkpoint replay loses governance context — when a graph resumes from a checkpoint, no governance pipeline re-runs against the replayed state. Specific to LangGraph but real.
The fix: acp-langchain@0.2.0 adds governed_state_resume(config) — a helper that walks a checkpoint state, identifies tool outputs by trace ID, and re-evaluates each through the gateway against current policy. Returns a list of stale tool outputs that would no longer be allowed. Application decides whether to abort, redact, or proceed with annotation.
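Shape-wise, the helper is a filter over checkpointed tool records. A simplified sketch: the real helper takes a LangGraph config and resolves records by trace ID through the gateway, while `reevaluate` here stands in for that round trip, and the record format is illustrative.

```python
def governed_state_resume(checkpoint_records, reevaluate):
    """Return the tool outputs in a replayed checkpoint that current
    policy would no longer allow.

    checkpoint_records: list of {"trace_id": ..., "tool": ..., "output": ...}
    reevaluate(tool, trace_id): gateway round trip; True if still allowed.
    The caller decides whether to abort, redact, or proceed with annotation.
    """
    return [r for r in checkpoint_records
            if not reevaluate(r["tool"], r["trace_id"])]

records = [
    {"trace_id": "t1", "tool": "search_docs", "output": "..."},
    {"trace_id": "t2", "tool": "deploy_service", "output": "..."},
]
# Suppose policy tightened between checkpoint and resume: deploys now denied.
stale = governed_state_resume(
    records, lambda tool, trace_id: tool != "deploy_service"
)
```

Here `stale` holds only the deploy record, which the resuming application can then redact or use as grounds to abort the replay.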
Estimated impact: Closes the LangGraph-specific staleness gap for teams using checkpoint-heavy patterns. Doesn’t affect the headline scorecard number directly but materially changes posture for long-lived LangGraph sessions.
Status: Shipping in acp-langchain@0.2.0 next week.
#5 — Audit-silence anomaly detection
The gap: Claude Code’s --dangerously-skip-permissions disables every PreToolUse hook including ACP’s. Server-side detection is hard — ACP can’t see hooks that don’t fire.
The fix: Track expected per-user call volume (rolling 7-day baseline of business-hours activity). Alert when a user with non-trivial historical activity goes silent for >2 hours during business hours. Surface as audit_silence_anomaly in the dashboard’s Activity → Anomalies tab.
This isn’t deterministic. False positives on PTO, focused-work blocks, holidays. But it’s the best signal a server-side governance plane can provide for a client-side bypass.
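The alert predicate itself is small; the hard part is the baseline. A sketch with illustrative thresholds, business hours hardcoded to 9-17 on weekdays here, where the real dashboard would make all of this configurable:

```python
from datetime import datetime, timedelta

def audit_silence_anomaly(last_seen: datetime, now: datetime,
                          baseline_calls_per_day: float,
                          threshold: timedelta = timedelta(hours=2),
                          min_baseline: float = 10) -> bool:
    """Flag a user with non-trivial historical activity who has gone
    silent for longer than `threshold` during business hours."""
    if baseline_calls_per_day < min_baseline:
        return False  # too little history to expect steady activity
    if now.weekday() >= 5 or not 9 <= now.hour < 17:
        return False  # outside business hours: silence is expected
    return (now - last_seen) > threshold
```

The two early returns encode the false-positive mitigations from above: users with thin baselines and off-hours quiet never fire, so the alert only triggers when an active user falls silent mid-workday.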
Estimated impact: Doesn’t change any scorecard scenario. Adds a real production capability for a known unfixable client-side gap.
Status: Implementation in progress. Shipping in 2 weeks.
#6 — MCP server-side enforcement for IDE deployments
The gap: Cursor’s internal tools (Edit, Read, Bash) bypass MCP entirely — and the same is true of any IDE that has primitive operations outside its MCP layer.
The fix isn’t a code fix. It’s a deployment-pattern guide:
- Network-layer enforcement. Block production endpoints from developer machines except through proxied paths ACP fronts. Internal-tool calls that would touch production are caught at the network, not at the IDE.
- Git-layer enforcement. Branch protection, required reviews on production paths, CI checks on agent-authored commits. Catches what client-side governance can’t reach.
- Endpoint-level policy. MDM/EDR rules that observe IDE process invocations and alert on suspicious patterns.
Why no code fix: ACP can’t intercept what the IDE doesn’t expose. The only way to fully close this gap would be for Cursor itself to expose a hook protocol (similar to Claude Code’s PreToolUse) for internal tools. We’ve filed that feedback with Cursor.
Status: Deployment guide ships with this post — see the recommendations. Live discussions with Cursor on the hook-protocol ask.
What this means for the scorecard
If everything ships as planned, the next major scorecard refresh (target: 6 weeks out) will look like:
| Framework | Today | After fixes #1-#5 |
|---|---|---|
| CrewAI | 40/48 | 47/48 |
| LangGraph | 40/48 | 47/48 |
| Cursor | 37/48 | 39/48 (no fix for internal-tools gap) |
| Claude Code | 43/48 | 45/48 |
| Codex CLI | 43/48 | 45/48 |
| OpenAI Agents SDK | 45/48 | 47/48 |
| Anthropic Agent SDK | 46/48 | 48/48 ⭐ |
| ACP (direct) | 45/48 | 47/48 |
Six fixes, ranked. Three ship within two weeks. Six weeks to a perfect score on the cleanest path. The Cursor gap is structural — ACP can document it and build deployment patterns around it, but can't close it without IDE cooperation.
This is what shipping a benchmark and then acting on it looks like. We’ll publish a “what the benchmark told us — fix #N landed” follow-up for each item as it ships, with the new score posted publicly.
Receipts:
- Full scorecard post
- Recommended deployment patterns
- agentgovbench repo
- Roadmap tracking issues — public, with dates
Next in series: “AgentGovBench v0.3 — what’s coming” (held-out scenarios, two new categories, vendor-contributed runners).
- 1. How we think about testing AI agent governance
- 2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
- 3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
- 4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
- 5. LangGraph's StateGraph checkpoints don't replay through governance
- 6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
- 7. Claude Code's --dangerously-skip-permissions disables every governance hook
- 8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
- 9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
- 10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
- 11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
- 12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
- 13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
- 14. Reproduce AgentGovBench on your stack — full setup guide
- 15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
- 16. Recommended governance deployment patterns — pick the one that scores highest for your stack
- 17. What our benchmark told us about our own product — six fixes we're shipping · you are here