# How agent frameworks score on governance
48 scenarios. 8 categories. NIST-mapped. Deterministic. AgentGovBench tests which governance controls actually hold when an agent makes a tool call: identity propagation, policy enforcement, delegation provenance, rate limits, audit completeness, cross-tenant isolation, fail-mode discipline, scope inheritance.
Three tiers. Sixteen scenarios separate logging from enforcement.
Every governance product sits somewhere on this scoreboard. The gap between audit-only (what most framework defaults look like) and ACP (full enforcement) is where governance products actually earn their keep.
| Category | Vanilla (no governance) | Audit-only (framework default) | ACP (full enforcement) |
|---|---|---|---|
| Audit completeness | 1/6 | 5/6 | 6/6 |
| Cross-tenant isolation | 4/6 | 4/6 | 4/6* |
| Delegation provenance | 0/6 | 5/6 | 6/6 |
| Fail-mode discipline | 3/6 | 4/6 | 6/6 |
| Identity propagation | 0/6 | 6/6 | 6/6 |
| Per-user policy enforcement | 1/6 | 1/6 | 6/6 |
| Rate-limit cascade | 3/6 | 3/6 | 5/6 |
| Scope inheritance | 1/6 | 1/6 | 6/6 |
| Total | 13/48 | 29/48 | 45/48 |
*ACP declines 2 cross-tenant scenarios on the current deployment (single-tenant mode) and 1 scope-inheritance variant. All declinations are documented in the scorecard JSON — no hidden failures.
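The "documented in the scorecard JSON" claim can be illustrated with a minimal sketch. Everything below is hypothetical: the field names (`scenarios`, `status`, `reason`) and the scenario IDs are assumptions for illustration, not the repo's actual scorecard schema.

```python
# Hedged sketch: surfacing declined scenarios from a scorecard JSON.
# Field names and scenario IDs are hypothetical -- check the repo's
# actual schema before relying on any of them.
import json

scorecard_json = """
{
  "target": "acp-full-enforcement",
  "scenarios": [
    {"id": "cross-tenant-05", "status": "declined",
     "reason": "single-tenant deployment mode"},
    {"id": "cross-tenant-06", "status": "declined",
     "reason": "single-tenant deployment mode"},
    {"id": "audit-01", "status": "pass"}
  ]
}
"""

def declined(scorecard: dict) -> list:
    """Return every scenario the target declined, with its documented reason."""
    return [s for s in scorecard["scenarios"] if s["status"] == "declined"]

card = json.loads(scorecard_json)
for s in declined(card):
    print(f'{s["id"]}: {s["reason"]}')
```

The point of the structure is the footnote's own claim: a declination is a first-class record with a reason, not a silent gap in the totals.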
The jump from 29 to 45 is sixteen scenarios of enforcement.
## What each framework scores, with and without ACP
Framework runners are coming online one by one. Each runner follows the existing BaseRunner pattern: deterministic, no LLM in the loop, real gateway calls. Track each landing on the blog.
| Framework | Native | With ACP | Pattern | Status |
|---|---|---|---|---|
| CrewAI | 13/48 | 40/48 | Decorator (@governed) | Live |
| LangChain / LangGraph | 13/48 | 40/48 | Decorator | Live |
| Claude Code | 13/48 | 43/48 | Hook | Live |
| OpenAI Agents SDK | 13/48 | 45/48 | Proxy | Live |
| Anthropic Agent SDK | 13/48 | 46/48 | Decorator | Live |
| Cursor | 13/48 | — | MCP | Queued |
| OpenAI Codex CLI | 13/48 | — | Hook | In progress |
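The "Decorator" pattern in the table can be sketched as follows. The name `@governed` comes from the table above, but everything else here — the `PolicyClient` interface, the fail-closed behavior, the exception type — is an assumption for illustration, not the actual ACP API.

```python
# Hedged sketch of the decorator integration pattern from the table above.
# PolicyClient and PolicyDenied are hypothetical stand-ins; the real
# gateway call and deny semantics live in the ACP integration.
import functools

class PolicyDenied(Exception):
    pass

class PolicyClient:
    """Stand-in for a per-user gateway policy check (hypothetical interface)."""
    def __init__(self, allowed_tools):
        self.allowed_tools = set(allowed_tools)

    def check(self, user: str, tool: str) -> bool:
        return tool in self.allowed_tools

policy = PolicyClient(allowed_tools={"search_docs"})

def governed(tool_fn):
    """Enforce (not just log) a policy decision before every tool call."""
    @functools.wraps(tool_fn)
    def wrapper(user, *args, **kwargs):
        if not policy.check(user, tool_fn.__name__):
            # Fail closed: the call never reaches the tool.
            raise PolicyDenied(f"{user} may not call {tool_fn.__name__}")
        return tool_fn(user, *args, **kwargs)
    return wrapper

@governed
def search_docs(user, query):
    return f"results for {query!r}"

@governed
def delete_index(user, name):
    return f"deleted {name}"
```

With this sketch, `search_docs("alice", "rate limits")` goes through, while `delete_index("alice", "prod")` raises `PolicyDenied` before the tool body runs — which is the logging-versus-enforcement distinction the scoreboard measures.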
## How we designed these 48 scenarios
The full methodology — why we picked these eight categories, what each scenario tests, how to contribute new scenarios or runners — is in the launch post and the repo’s METHODOLOGY.md.
## Reproduce the scorecard on your own deployment
Clone the repo, point at your ACP instance, run the benchmark. Same runner, same scenarios, same numbers. If you see different results, that’s either version drift or a gap we haven’t found.
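The "same runner, same scenarios, same numbers" property rests on the runner being deterministic. A minimal sketch of what that guarantee lets you check — `run_benchmark` here is a stand-in, not the repo's actual entry point:

```python
# Hedged sketch: a deterministic runner (no LLM, no randomness) must
# produce identical per-category tallies on repeated runs. run_benchmark
# is a hypothetical stand-in for the real runner in the repo.
from collections import Counter

def run_benchmark(scenarios):
    """Tally passes per category; pure function of its input,
    so two runs over the same scenarios must agree exactly."""
    return Counter(category for category, passed in scenarios if passed)

scenarios = [
    ("audit-completeness", True),
    ("identity-propagation", True),
    ("per-user-policy", False),
]

first = run_benchmark(scenarios)
second = run_benchmark(scenarios)
assert first == second, "version drift or a gap we haven't found"
```

Any divergence between your numbers and the published scorecard is therefore attributable to the deployment or the versions, not to run-to-run noise.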