# How agent frameworks score on governance
48 scenarios. 8 categories. NIST-mapped. Deterministic. AgentGovBench tests which governance controls actually hold when an agent makes a tool call: identity propagation, policy enforcement, delegation provenance, rate limits, audit completeness, cross-tenant isolation, fail-mode discipline, scope inheritance.
Three tiers. Sixteen scenarios separate logging from enforcement.
Every governance product sits somewhere on this scoreboard. The gap between audit-only (what most framework defaults look like) and ACP (full enforcement) is where governance products actually earn their keep.
| Category | Vanilla (no governance) | Audit-only (framework default) | ACP (full enforcement) |
|---|---|---|---|
| Audit completeness | 1/6 | 5/6 | 6/6 |
| Cross-tenant isolation | 4/6 | 4/6 | 4/6* |
| Delegation provenance | 0/6 | 5/6 | 6/6 |
| Fail-mode discipline | 3/6 | 4/6 | 6/6 |
| Identity propagation | 0/6 | 6/6 | 6/6 |
| Per-user policy enforcement | 1/6 | 1/6 | 6/6 |
| Rate-limit cascade | 3/6 | 3/6 | 5/6 |
| Scope inheritance | 1/6 | 1/6 | 6/6 |
| Total | 13/48 | 29/48 | 45/48 |
*ACP declines 2 cross-tenant scenarios on the current deployment (single-tenant mode) and 1 scope-inheritance variant. All declinations are documented in the scorecard JSON — no hidden failures.
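The "documented in the scorecard JSON" claim can be illustrated with a minimal sketch. Everything below is hypothetical: the field names (`scenarios`, `status`, `reason`) and the scenario IDs are assumptions for illustration, not the repo's actual scorecard schema.

```python
# Hedged sketch: surfacing declined scenarios from a scorecard JSON.
# Field names and scenario IDs are hypothetical -- check the repo's
# actual schema before relying on any of them.
import json

scorecard_json = """
{
  "target": "acp-full-enforcement",
  "scenarios": [
    {"id": "cross-tenant-05", "status": "declined",
     "reason": "single-tenant deployment mode"},
    {"id": "cross-tenant-06", "status": "declined",
     "reason": "single-tenant deployment mode"},
    {"id": "audit-01", "status": "pass"}
  ]
}
"""

def declined(scorecard: dict) -> list:
    """Return every scenario the target declined, with its documented reason."""
    return [s for s in scorecard["scenarios"] if s["status"] == "declined"]

card = json.loads(scorecard_json)
for s in declined(card):
    print(f'{s["id"]}: {s["reason"]}')
```

The point of the structure is the footnote's own claim: a declination is a first-class record with a reason, not a silent gap in the totals.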
The jump from 29 to 45 is sixteen scenarios of enforcement.
## What each framework scores, with and without ACP
Framework runners are coming online one by one. Each runner follows the existing BaseRunner pattern: deterministic, no LLM in the loop, real gateway calls. Track each landing on the blog.
| Framework | Native | With ACP | Pattern | Status |
|---|---|---|---|---|
| CrewAI | 13/48 | 40/48 | Decorator (@governed) | Live |
| LangChain / LangGraph | 13/48 | 40/48 | Decorator | Live |
| Claude Code | 13/48 | 43/48 | Hook | Live |
| OpenAI Agents SDK | 13/48 | 45/48 | Proxy | Live |
| Anthropic Agent SDK | 13/48 | 46/48 | Decorator | Live |
| Cursor | 13/48 | — | MCP | Queued |
| OpenAI Codex CLI | 13/48 | — | Hook | In progress |
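The "Decorator" pattern in the table can be sketched as follows. The name `@governed` comes from the table above, but everything else here — the `PolicyClient` interface, the fail-closed behavior, the exception type — is an assumption for illustration, not the actual ACP API.

```python
# Hedged sketch of the decorator integration pattern from the table above.
# PolicyClient and PolicyDenied are hypothetical stand-ins; the real
# gateway call and deny semantics live in the ACP integration.
import functools

class PolicyDenied(Exception):
    pass

class PolicyClient:
    """Stand-in for a per-user gateway policy check (hypothetical interface)."""
    def __init__(self, allowed_tools):
        self.allowed_tools = set(allowed_tools)

    def check(self, user: str, tool: str) -> bool:
        return tool in self.allowed_tools

policy = PolicyClient(allowed_tools={"search_docs"})

def governed(tool_fn):
    """Enforce (not just log) a policy decision before every tool call."""
    @functools.wraps(tool_fn)
    def wrapper(user, *args, **kwargs):
        if not policy.check(user, tool_fn.__name__):
            # Fail closed: the call never reaches the tool.
            raise PolicyDenied(f"{user} may not call {tool_fn.__name__}")
        return tool_fn(user, *args, **kwargs)
    return wrapper

@governed
def search_docs(user, query):
    return f"results for {query!r}"

@governed
def delete_index(user, name):
    return f"deleted {name}"
```

With this sketch, `search_docs("alice", "rate limits")` goes through, while `delete_index("alice", "prod")` raises `PolicyDenied` before the tool body runs — which is the logging-versus-enforcement distinction the scoreboard measures.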
## How we designed these 48 scenarios
The full methodology — why we picked these eight categories, what each scenario tests, how to contribute new scenarios or runners — is in the launch post and the repo’s METHODOLOGY.md.
## Reproduce the scorecard on your own deployment
Clone the repo, point at your ACP instance, run the benchmark. Same runner, same scenarios, same numbers. If you see different results, that’s either version drift or a gap we haven’t found.
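The "same runner, same scenarios, same numbers" property rests on the runner being deterministic. A minimal sketch of what that guarantee lets you check — `run_benchmark` here is a stand-in, not the repo's actual entry point:

```python
# Hedged sketch: a deterministic runner (no LLM, no randomness) must
# produce identical per-category tallies on repeated runs. run_benchmark
# is a hypothetical stand-in for the real runner in the repo.
from collections import Counter

def run_benchmark(scenarios):
    """Tally passes per category; pure function of its input,
    so two runs over the same scenarios must agree exactly."""
    return Counter(category for category, passed in scenarios if passed)

scenarios = [
    ("audit-completeness", True),
    ("identity-propagation", True),
    ("per-user-policy", False),
]

first = run_benchmark(scenarios)
second = run_benchmark(scenarios)
assert first == second, "version drift or a gap we haven't found"
```

Any divergence between your numbers and the published scorecard is therefore attributable to the deployment or the versions, not to run-to-run noise.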