What our benchmark told us about our own product — six fixes we're shipping
tl;dr
We built AgentGovBench to benchmark governance products. We ran it against ours. The benchmark told us six things to fix. This post is the dated roadmap for each.
Vendors who ship benchmarks and then act on the gaps the benchmark surfaces are the ones procurement should trust. Vendors who ship benchmarks and ignore the gaps are running marketing campaigns.
The six items, in order of leverage:
- acp-crewai@0.2.0 + acp-langchain@0.2.0: chain context propagation. +4 to both decorator-pattern scores. Shipping next week.
- Multi-tenant deployment mode flip. +2 to every ACP-paired score. Shipping in 2-3 weeks (deployment migration).
- Task-scoped narrowing in the gateway. +1 across all runners. Q3 2026.
- acp-langchain checkpoint replay helper. Niche but real. Ships with 0.2.0.
- Audit-silence anomaly detection in the dashboard. Mitigates --dangerously-skip-permissions. Shipping in 2 weeks.
- MCP server-side enforcement guide for IDEs. Not a code fix — a deployment pattern doc. Shipping with this post (see below).
#1 — Chain context propagation in decorator SDKs
The gap: acp-crewai and acp-langchain both score 40/48, with 5 scenarios short of ACP’s 45/48 ceiling. The miss is concentrated in delegation_provenance (2/6) and scope_inheritance (4/6).
Root cause: The @governed decorator wraps individual tool functions. It doesn’t see CrewAI’s task handoffs or LangGraph’s StateGraph node-to-node transitions. So when a worker agent calls a tool, the gateway sees the call as originating from a top-level agent rather than as the third hop in a chain.
The fix: Extend GovernanceContext with agent_chain: list[str]. Make install_crew_hooks(crew) (CrewAI) and a new install_langgraph_hooks(graph) (LangGraph) update the active chain as agents change. The @governed wrapper picks up the updated context on the next call.
Estimated impact: +4 scenarios across both SDKs (delegation_provenance: 2 → 6; the remaining scope_inheritance point belongs to fix #3). Both decorator-pattern scores rise from 40 → 44.
Status: PR draft open. Shipping acp-crewai@0.2.0 + acp-langchain@0.2.0 next week. Score re-published in results/crewai-acp-v0.2.json and results/langgraph-acp-v0.2.json.
#2 — Multi-tenant deployment mode flip
The gap: Two of ACP’s three documented declinations (cross_tenant_isolation.03_user_scope_does_not_leak and .05_admin_cannot_cross) are blocked on the gateway running in single-tenant mode.
Root cause: The gateway code to honor path-based tenant routing has shipped (commit a920e5a: resolveHookIdentity prefers the URL slug when set and verifies membership). But the deployed Cloud Run service runs with the TENANT_ID env var set, so every request resolves to the default tenant. Flipping to multi-tenant mode is a deployment config change with real blast radius for existing customers.
The fix: Two-phase migration:
- Phase 1 (now): all new tenants are provisioned with explicit slug routing. Existing tenants keep working via the env-default fallback.
- Phase 2 (target: 2-3 weeks): drop the TENANT_ID env var, require an explicit slug for all requests, and communicate the breaking change to existing tenants with a 2-week migration window.
Estimated impact: +2 scenarios on every ACP-paired runner (cross_tenant_isolation: 4/6 → 6/6). Brings best-case ACP score from 46 → 48.
Status: Phase 1 in progress. Phase 2 deployment ETA early May.
#3 — Task-scoped narrowing in the gateway
The gap: scope_inheritance.04_task_narrowing is declined for ACP because we enforce task-scoped narrowing in the SDK layer (where intent-aware enforcement belongs by design), but the benchmark scenario asserts a gateway-only check.
Root cause: When an orchestrator declares delegated_scopes for a subagent, the subagent shouldn’t be able to pivot to tools requiring scopes outside that set — even if the underlying user has them. We enforce this client-side via the SDK’s runner layer. The gateway doesn’t (and arguably shouldn’t) know what “delegation” means semantically.
The fix: Add a gateway-side mode where delegation chain + declared scopes propagate via headers, and the gateway intersects scope-against-required-scope per-call. Optional, opt-in per workspace. Doesn’t replace SDK enforcement; layers on top for defense in depth.
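Per call, the gateway check reduces to a set comparison. A sketch with hypothetical header names (the actual propagation format is exactly what the design doc will pin down):

```python
def check_delegated_call(headers: dict[str, str],
                         required_scopes: set[str]) -> bool:
    """Gateway-side narrowing: a subagent in a declared delegation chain
    may only call tools whose required scopes fall inside the declared set.

    Header names are illustrative placeholders, not the shipped protocol.
    """
    chain = headers.get("X-ACP-Agent-Chain", "")
    if not chain:
        # No delegation declared: fall through to the normal
        # user-scope checks; this mode is additive, not a replacement.
        return True
    declared = set(headers.get("X-ACP-Delegated-Scopes", "").split())
    return required_scopes <= declared

# A worker delegated only repo:read can read...
ok = check_delegated_call(
    {"X-ACP-Agent-Chain": "root>worker",
     "X-ACP-Delegated-Scopes": "repo:read"},
    {"repo:read"},
)
# ...but cannot pivot to a write tool, even if the underlying user could.
blocked = check_delegated_call(
    {"X-ACP-Agent-Chain": "root>worker",
     "X-ACP-Delegated-Scopes": "repo:read"},
    {"repo:write"},
)
```

Note the deliberate division of labor: the gateway only intersects sets; what "delegation" means semantically stays in the SDK layer.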
Estimated impact: +1 scenario on every ACP-paired runner. Decorator scores: 40 → 45 with #1 + #3.
Status: Design doc in progress. Q3 2026 target.
#4 — LangGraph checkpoint replay helper
The gap: LangGraph’s StateGraph checkpoint replay loses governance context — when a graph resumes from a checkpoint, no governance pipeline re-runs against the replayed state. Specific to LangGraph but real.
The fix: acp-langchain@0.2.0 adds governed_state_resume(config) — a helper that walks a checkpoint state, identifies tool outputs by trace ID, and re-evaluates each through the gateway against current policy. Returns a list of stale tool outputs that would no longer be allowed. Application decides whether to abort, redact, or proceed with annotation.
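Shape-wise, the helper is a filter over checkpointed tool records. A simplified sketch: the real helper takes a LangGraph config and resolves records by trace ID through the gateway, while `reevaluate` here stands in for that round trip, and the record format is illustrative.

```python
def governed_state_resume(checkpoint_records, reevaluate):
    """Return the tool outputs in a replayed checkpoint that current
    policy would no longer allow.

    checkpoint_records: list of {"trace_id": ..., "tool": ..., "output": ...}
    reevaluate(tool, trace_id): gateway round trip; True if still allowed.
    The caller decides whether to abort, redact, or proceed with annotation.
    """
    return [r for r in checkpoint_records
            if not reevaluate(r["tool"], r["trace_id"])]

records = [
    {"trace_id": "t1", "tool": "search_docs", "output": "..."},
    {"trace_id": "t2", "tool": "deploy_service", "output": "..."},
]
# Suppose policy tightened between checkpoint and resume: deploys now denied.
stale = governed_state_resume(
    records, lambda tool, trace_id: tool != "deploy_service"
)
```

Here `stale` holds only the deploy record, which the resuming application can then redact or use as grounds to abort the replay.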
Estimated impact: Closes the LangGraph-specific staleness gap for teams using checkpoint-heavy patterns. Doesn’t affect the headline scorecard number directly but materially changes posture for long-lived LangGraph sessions.
Status: Shipping in acp-langchain@0.2.0 next week.
#5 — Audit-silence anomaly detection
The gap: Claude Code’s --dangerously-skip-permissions disables every PreToolUse hook including ACP’s. Server-side detection is hard — ACP can’t see hooks that don’t fire.
The fix: Track expected per-user call volume (rolling 7-day baseline of business-hours activity). Alert when a user with non-trivial historical activity goes silent for >2 hours during business hours. Surface as audit_silence_anomaly in the dashboard’s Activity → Anomalies tab.
This isn’t deterministic. False positives on PTO, focused-work blocks, holidays. But it’s the best signal a server-side governance plane can provide for a client-side bypass.
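The alert predicate itself is small; the hard part is the baseline. A sketch with illustrative thresholds, business hours hardcoded to 9-17 on weekdays here, where the real dashboard would make all of this configurable:

```python
from datetime import datetime, timedelta

def audit_silence_anomaly(last_seen: datetime, now: datetime,
                          baseline_calls_per_day: float,
                          threshold: timedelta = timedelta(hours=2),
                          min_baseline: float = 10) -> bool:
    """Flag a user with non-trivial historical activity who has gone
    silent for longer than `threshold` during business hours."""
    if baseline_calls_per_day < min_baseline:
        return False  # too little history to expect steady activity
    if now.weekday() >= 5 or not 9 <= now.hour < 17:
        return False  # outside business hours: silence is expected
    return (now - last_seen) > threshold
```

The two early returns encode the false-positive mitigations from above: users with thin baselines and off-hours quiet never fire, so the alert only triggers when an active user falls silent mid-workday.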
Estimated impact: Doesn’t change any scorecard scenario. Adds a real production capability for a known unfixable client-side gap.
Status: Implementation in progress. Shipping in 2 weeks.
#6 — MCP server-side enforcement for IDE deployments
The gap: Cursor’s internal tools (Edit, Read, Bash) bypass MCP entirely — and the same is true of any IDE that has primitive operations outside its MCP layer.
The fix isn’t a code fix. It’s a deployment-pattern guide:
- Network-layer enforcement. Block production endpoints from developer machines except through proxied paths ACP fronts. Internal-tool calls that would touch production are caught at the network, not at the IDE.
- Git-layer enforcement. Branch protection, required reviews on production paths, CI checks on agent-authored commits. Catches what client-side governance can’t reach.
- Endpoint-level policy. MDM/EDR rules that observe IDE process invocations and alert on suspicious patterns.
Why no code fix: ACP can’t intercept what the IDE doesn’t expose. The only way to fully close this gap would be for Cursor itself to expose a hook protocol (similar to Claude Code’s PreToolUse) for internal tools. We’ve filed that feedback with Cursor.
Status: Deployment guide ships with this post — see the recommendations. Live discussions with Cursor on the hook-protocol ask.
What this means for the scorecard
If everything ships as planned, the next major scorecard refresh (target: 6 weeks out) will look like:
| Framework | Today | After fixes #1-#5 |
|---|---|---|
| CrewAI | 40/48 | 47/48 |
| LangGraph | 40/48 | 47/48 |
| Cursor | 37/48 | 39/48 (no fix for internal-tools gap) |
| Claude Code | 43/48 | 45/48 |
| Codex CLI | 43/48 | 45/48 |
| OpenAI Agents SDK | 45/48 | 47/48 |
| Anthropic Agent SDK | 46/48 | 48/48 ⭐ |
| ACP (direct) | 45/48 | 47/48 |
Six fixes, ranked. Three ship within two weeks. Six weeks to a perfect score on the cleanest path. The Cursor gap is structural — ACP can document it and build deployment patterns around it, but can't close it without IDE cooperation.
This is what shipping a benchmark and then acting on it looks like. We’ll publish a “what the benchmark told us — fix #N landed” follow-up for each item as it ships, with the new score posted publicly.
Receipts:
- Full scorecard post
- Recommended deployment patterns
- agentgovbench repo
- Roadmap tracking issues — public, with dates
Next in series: “AgentGovBench v0.3 — what’s coming” (held-out scenarios, two new categories, vendor-contributed runners).
- 1. How we think about testing AI agent governance
- 2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
- 3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
- 4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
- 5. LangGraph's StateGraph checkpoints don't replay through governance
- 6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
- 7. Claude Code's --dangerously-skip-permissions disables every governance hook
- 8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
- 9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
- 10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
- 11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
- 12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
- 13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
- 14. Reproduce AgentGovBench on your stack — full setup guide
- 15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
- 16. Recommended governance deployment patterns — pick the one that scores highest for your stack
- 17. What our benchmark told us about our own product — six fixes we're shipping · you are here