Agentic Control Plane
AgentGovBench v0.2

How agent frameworks score on governance.

48 scenarios. 8 categories. NIST-mapped. Deterministic. AgentGovBench tests what governance actually happens when an agent makes a tool call: identity propagation, policy enforcement, delegation provenance, rate limits, audit completeness, cross-tenant isolation, fail-mode discipline, and scope inheritance.

The governance tiers

Three tiers. Sixteen scenarios separate logging from enforcement.

Every governance product sits somewhere on this scoreboard. The gap between audit-only (what most framework defaults look like) and ACP (full enforcement) is where governance products actually earn their keep.

| Category | vanilla (no governance) | audit-only (framework default) | ACP (full enforcement) |
| --- | --- | --- | --- |
| Audit completeness | 1/6 | 5/6 | 6/6 |
| Cross-tenant isolation | 4/6 | 4/6 | 4/6* |
| Delegation provenance | 0/6 | 5/6 | 6/6 |
| Fail-mode discipline | 3/6 | 4/6 | 6/6 |
| Identity propagation | 0/6 | 6/6 | 6/6 |
| Per-user policy enforcement | 1/6 | 1/6 | 6/6 |
| Rate-limit cascade | 3/6 | 3/6 | 5/6 |
| Scope inheritance | 1/6 | 1/6 | 6/6 |
| **Total** | **13/48** | **29/48** | **45/48** |

*ACP declines 2 cross-tenant scenarios on the current deployment (single-tenant mode) and 1 scope-inheritance variant. All declinations are documented in the scorecard JSON — no hidden failures.
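The scorecard schema itself isn't reproduced on this page; as a rough illustration (field names are hypothetical, not the actual schema), a documented declination entry might look like:

```json
{
  "scenario": "cross-tenant-isolation-05",
  "verdict": "declined",
  "reason": "Deployment runs in single-tenant mode; scenario not applicable",
  "documented": true
}
```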

What the tiers actually buy you

The jump from 29 to 45 is sixteen scenarios of enforcement.

vanilla · 13/48
No audit, no enforcement, no identity, no policy. The floor. Some scenarios pass because they don’t require enforcement — a benign read call is correctly allowed even here.
audit-only · 29/48
Every call logged with attribution, provenance, timestamp. Nothing denied. Nothing rate-limited. No policy enforced. This is most frameworks’ realistic ceiling out of the box.
ACP · 45/48
Actual enforcement on top of audit. Per-user policies, scope narrowing, rate-limit cascades, fail-mode discipline. The 16-scenario jump separates logging libraries from governance products.
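The audit-only vs. enforcement gap can be sketched in a few lines. This is an illustrative model, not ACP's actual API: both gateways log every call with attribution, but only the enforcing one consults a per-user policy before allowing it.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    user: str
    tool: str
    tenant: str

@dataclass
class AuditOnlyGateway:
    """Audit-only tier: logs every call with attribution, denies nothing."""
    log: list = field(default_factory=list)

    def handle(self, call: ToolCall) -> bool:
        self.log.append((call.user, call.tool, call.tenant))
        return True  # always allowed

@dataclass
class EnforcingGateway(AuditOnlyGateway):
    """Enforcement tier: same audit trail, plus a per-user tool allowlist."""
    policies: dict = field(default_factory=dict)

    def handle(self, call: ToolCall) -> bool:
        self.log.append((call.user, call.tool, call.tenant))
        return call.tool in self.policies.get(call.user, set())
```

The per-user-policy rows in the table above fall out of exactly this difference: the audit-only tier passes a scenario only when "allow everything" happens to be the right answer.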
Per-framework scorecards

What each framework scores, with and without ACP.

Framework runners coming online one by one. Each runner follows the existing BaseRunner pattern — deterministic, no LLM in the loop, real gateway calls. Track each landing on the blog.

| Framework | Native | With ACP | Pattern | Status |
| --- | --- | --- | --- | --- |
| CrewAI | 13/48 | 40/48 | Decorator (@governed) | Live |
| LangChain / LangGraph | 13/48 | 40/48 | Decorator | Live |
| Claude Code | 13/48 | 43/48 | Hook | Live |
| OpenAI Agents SDK | 13/48 | 45/48 | Proxy | Live |
| Anthropic Agent SDK | 13/48 | 46/48 | Decorator | Live |
| Cursor | 13/48 | | MCP | Queued |
| OpenAI Codex CLI | 13/48 | | Hook | In progress |
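As a sketch of the decorator integration pattern (the names `governed`, `gateway.check`, and the keyword argument are assumptions for illustration, not the shipped interface), a framework tool can be wrapped so every invocation is checked against the gateway first:

```python
import functools

def governed(gateway):
    """Hypothetical decorator: route a tool call through a governance
    gateway before executing the underlying function."""
    def wrap(tool_fn):
        @functools.wraps(tool_fn)
        def inner(*args, user=None, **kwargs):
            # Assumed gateway interface: check(user=..., tool=...) -> bool
            if not gateway.check(user=user, tool=tool_fn.__name__):
                raise PermissionError(f"{tool_fn.__name__} denied for {user}")
            return tool_fn(*args, **kwargs)
        return inner
    return wrap
```

The hook and proxy patterns in the table achieve the same interception point at a different layer: a hook fires inside the agent runtime, a proxy sits in front of the tool endpoint.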
Methodology

How we designed these 48 scenarios.

Framework-agnostic
Scenarios describe what the governance layer should enforce, not how. Same scenarios run against ACP, a competitor, or nothing at all.
Deterministic
No LLM in the hot path. Tool calls are synthesized directly. The full benchmark reproduces byte-for-byte.
NIST-mapped
Each scenario cites the specific NIST AI RMF 1.0 control it exercises. Procurement teams cite NIST, not us.
Honest about failures
ACP ships with 3 documented declinations. They’re in the scorecard, the repo, and this page. No hidden failures.
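The deterministic design above can be sketched as follows. Scenarios are fixed (call, expected-verdict) pairs and the governance layer under test is reduced to a plain decision function, so the same inputs always produce the same scorecard; everything here (`SCENARIOS`, `run`, `decide`) is illustrative, not the actual BaseRunner API.

```python
# Hypothetical scenario specs: each names a synthesized tool call
# and the verdict a correct governance layer should return.
SCENARIOS = [
    {"id": "idp-01", "user": "alice", "tool": "read_docs", "expect": "allow"},
    {"id": "pol-03", "user": "alice", "tool": "drop_table", "expect": "deny"},
]

def run(scenarios, decide):
    """Score a governance layer: `decide(user, tool)` returns a verdict.
    No LLM in the loop, so the result is reproducible byte-for-byte."""
    return [(s["id"], decide(s["user"], s["tool"]) == s["expect"])
            for s in scenarios]
```

Note how an audit-only layer, which answers "allow" to everything, passes the benign-read scenario and fails the policy-denial one, which is exactly the tier gap the scoreboard measures.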

The full methodology — why we picked these eight categories, what each scenario tests, how to contribute new scenarios or runners — is in the launch post and the repo’s METHODOLOGY.md.

Reproduce the scorecard on your own deployment.

Clone the repo, point at your ACP instance, run the benchmark. Same runner, same scenarios, same numbers. If you see different results, that’s either version drift or a gap we haven’t found.