Benchmark series · Part 1 of 11

How we think about testing AI agent governance

David Crowe · April 20, 2026 · 14 min read

testing benchmark governance

tl;dr

We built AgentGovBench — an open, NIST-mapped benchmark for AI agent governance. It tests the layer between your agents and your tools: identity propagation, permission enforcement, delegation provenance, audit completeness, rate limits, cross-tenant isolation, fail-mode discipline, scope inheritance.

Then we ran it against the live ACP production gateway. The first run surfaced eleven real governance gaps — not theoretical issues, actual bugs and missing features that our users were relying on us to have. We shipped fixes for every one that was fixable within the session.

ACP now passes 45 of 48 scenarios against the deployed product, with three documented gaps that ship with the scorecard (not hidden under the rug). The benchmark code, the scenarios, the results, and the exact reasons for the three remaining failures are all public.

Anyone with a Firebase service account on an ACP project can reproduce this scorecard on their own instance in under ten minutes. If they see a different result, that’s either a version drift we haven’t caught or a gap we haven’t seen yet — either way, we want to hear about it.

This post is about why we built the benchmark, how we built it, what it found in our own product, and why we think this is the right way for an AI governance vendor to operate.

Why a new benchmark

There are good benchmarks for AI systems, and they don’t test what we do.

HarmBench, SALAD-Bench, JailbreakBench test whether a model misbehaves under adversarial input. That’s alignment.
AgentLeak, AgentDAM, PII-Scope test whether sensitive data leaks across agent channels. That’s privacy.
InjecAgent, CyberSecEval test whether tool-using agents can be tricked into misbehavior. That’s injection.

All of those measure what the model does. None measure what the governance infrastructure around the model does — identity, policy, audit, rate limits, tenant boundaries.

That’s the layer ACP lives on. It’s also the layer every enterprise buying AI agent tooling needs to evaluate. And there was no standard benchmark for it.

So we built one. The design goals:

Framework-agnostic. Scenarios describe what the governance layer should enforce, not how. A scenario runs against ACP, a competitor, a custom proxy, or nothing at all — same scenarios, different runners.
Deterministic. No LLM in the hot path for v0.x. The governance layer sees tool calls and metadata; we synthesize those directly rather than hoping a model produces them. Full benchmark reproduces byte-for-byte across runs.
Mapped to NIST AI RMF 1.0. Each scenario cites the specific control it exercises. Procurement teams can cite NIST, not us.
Vendor-neutral by construction. The reference runner is ACP. The scenarios don’t know what ACP is. Any vendor can contribute their own runner and appear on the scoreboard.
Honest about partial failures. Our own reference implementation fails three scenarios; those failures are front and center in the results file and this post.

We picked eight categories based on the threat model of multi-tenant, multi-agent systems in regulated environments:

#	Category	What breaks if this fails
1	Identity propagation	Audit attributes actions to agents, not humans
2	Per-user policy enforcement	User X performs actions X was forbidden from
3	Delegation provenance	Forensic reconstruction of multi-agent calls is impossible
4	Scope inheritance	Child agents inherit parent’s scope, privilege escalation
5	Rate limit cascade	User bypasses rate limit via subagent fan-out
6	Audit completeness	Actions happen without usable forensic records
7	Fail-mode discipline	Gateway failure behaves unpredictably
8	Cross-tenant isolation	Tenant A affects tenant B

Six scenarios per category, forty-eight total at v0.2.

How it actually works

A scenario is a YAML file. It describes:

setup: tenants, users, tools, policies
actions: a sequence of operations an adversary might attempt — direct tool calls, delegations, parallel fan-outs, gateway failures, policy changes
expected: what the governance layer must have done — allowed, denied, logged, attributed correctly

A runner implements a BaseRunner Python interface. At scenario start, it installs the policies (via whatever admin API the vendor exposes). Each action is pushed through the vendor’s real governance layer. At the end, the runner returns its tool decisions and audit log entries. The scorer compares observed against expected.

Our reference runner for ACP:

Writes each scenario’s policy to Firestore via the service account
Mints Firebase ID tokens for synthetic benchmark users (agb-alice, agb-bob, …)
Calls /govern/tool-use and /govern/tool-output on the live gateway with those tokens as Bearer auth
Reads audit entries back from Firestore’s tenants/{id}/logs collection
Emits SDK-local fallback audit entries for cases where the gateway is unreachable (fail-open scenarios)

No mocking. No stubbing. The numbers you see in the scorecard came from real HTTP round-trips to https://api.agenticcontrolplane.com, real policy writes, real audit reads.

What it found in our own product

This is the part we’re most proud of. We built the benchmark expecting it would help validate other governance products. In practice, it was most valuable for validating ours.

First run against ACP: 31 of 48 passing. Eleven real gaps.

The fixes that shipped to the deployed product over the next few hours:

Silent redact-bypass in the dashboard. The Agents tab’s “Data transform” dropdown wrote the transform field (pre-hook, for tool inputs). The post-hook reads postTransform (for tool outputs). Result: every tenant that configured “Detect & redact” in the UI had been getting audit-only on outputs. Nothing was redacted. The UI showed green. We’d never have noticed this without the benchmark. p0, shipped.
PII precision over recall. transformabl-core (our PII library) runs a broad detector set, and our first instinct was to widen it further — bank numbers, ICD codes, NPIs, passports, DOBs. Then we measured against real agent traffic and hit the opposite problem: those context-free number/ID detectors false-fire constantly on structured tool arguments — a tab ID reads as a bank number, a UUID as a passport, a timestamp as a DOB. Flagging a Bash.cd as “PII” cries wolf and erodes trust in the whole governance signal. So we shipped a precision gate (piiFilter) that post-filters detection to six high-precision types — email, credit card, SSN, IBAN, IP address, phone — the ones that need real structure (an @, a Luhn-valid card, a checksummed IBAN). Detection you can trust beats detection that’s technically broader. Shipped.
Delegation chain lost in audit. When an orchestrator agent delegated to a worker which called a tool, the audit entry had the worker’s user UID but no record of how the call got there. Multi-hop chains completely lost provenance. We added agent_chain: string[] to the hook request body and persisted it as agentChain on each audit record. Shipped.
Rate limits were per-tenant, not per-user. The rate limit bucket key was ${tenantId}:${tier} — meaning ten users in a tenant shared one 60/min budget. One user could consume the entire tenant’s quota. We fixed the key to ${tenantId}:${userUid}:${tier}. Shipped.
Rate limiter was fixed-window. Client could fire 120 calls across a window boundary against a 60/min limit. We rewrote as a sliding window. Shipped.
User-scope tool-specific overrides ignored. The policy resolution in getEffectivePolicy merged workspace and user .tools maps by key — but only tools present at both levels appeared on the merged map. A user-only tool-specific override was invisible. We fixed the gateway to consult user overrides explicitly. Shipped.
Multi-tenant hook routing. The governance hooks were only reachable at /govern/* (root). Multi-tenant users who belong to multiple workspaces couldn’t disambiguate which tenant a call belonged to. We added /:tenantSlug/govern/* mounting + an explicit-tenant preference in the identity resolver. Shipped (gateway), awaiting multi-tenant Cloud Run deploy.
Task-scoped narrowing (SDK-side). When an orchestrator declares delegated_scopes for a subagent, the subagent shouldn’t be able to pivot to tools requiring scopes outside that set — even if the underlying user has them. We implemented this in the SDK runner layer (which is where intent-aware enforcement belongs; the gateway doesn’t know the semantic intent of a delegation). Shipped.
Dashboard: PII highlighting in the activity log. Not a governance gap per se, but once we had tenants looking at audit rows with PII findings, they wanted to see which part of the text got flagged. We added a live-highlighting component that underlines detected PII by category (red for PCI, teal for HIPAA, amber for identity, indigo for contact). Shipped.
Dashboard: custom PII patterns UI. Some customers have tenant-specific canary tokens, internal ID formats, client codes that don’t fit any standard recognizer. We exposed customPiiPatterns as a first-class feature in the dashboard with a live regex test panel and a starter library (AWS keys, Stripe keys, JWTs, GitHub PATs, UUIDs). Shipped.
Runner policy-shape bug. Our own benchmark runner wrote user policies with the wrong document shape; it didn’t match what the gateway’s policy merger expected. This was a runner bug, not a product bug — but finding it validated the “eat your own dogfood” principle. Fixed.

After fixes: 45 of 48 passing.

The three we didn’t fix (and why)

Benchmarks become credible when they’re honest about partial failures. Ours ships with three scenarios ACP doesn’t pass, with the reasons visible in the scorecard:

`cross_tenant_isolation.03_user_scope_does_not_leak`

`cross_tenant_isolation.05_admin_cannot_cross`

These scenarios test that a request forged to claim a tenant the caller isn’t a member of gets rejected. The gateway code to honor path-based tenant routing is already shipped (commit a920e5a — resolveHookIdentity prefers the URL slug when set, verifies membership before honoring). But the deployed gateway runs in single-tenant mode (TENANT_ID env, all requests resolve to the default tenant). Flipping Cloud Run to multi-tenant mode is a config change with real blast-radius on existing customer traffic.

We have a plan to roll this out (migrate existing callers to include the tenant slug, then flip the mode). It’s queued. We declined both scenarios in the current scorecard with exactly this reason printed alongside the result.

`scope_inheritance.04_task_narrowing` (server-side)

The benchmark has a version of this scenario passing — because we implemented task-scope narrowing in the SDK layer (where intent-aware enforcement belongs; the gateway doesn’t and shouldn’t know what “delegation” means semantically). A variant that forces the assertion to a gateway-only check would still fail. We consider this the correct design: SDKs enforce delegation semantics, gateways enforce tenant and user scope. A future scenario may split these cleanly.

Three tiers — what each level of governance actually buys you

The more interesting scoreboard is how an AI system scores depending on the shape of governance it has. We shipped three reference runners:

Category	vanilla (no governance)	audit-only (framework default)	ACP (full governance)
audit_completeness	1/6	5/6	6/6
cross_tenant_isolation	4/6	4/6	4/6 (2 declined)
delegation_provenance	0/6	5/6	6/6
fail_mode_discipline	3/6	4/6	6/6
identity_propagation	0/6	6/6	6/6
per_user_policy_enforcement	1/6	1/6	6/6
rate_limit_cascade	3/6	3/6	5/6
scope_inheritance	1/6	1/6	6/6
total	13/48	29/48	45/48

What each tier represents:

vanilla (13/48) — the absolute floor. No audit, no enforcement, no identity, no policy. Every call allowed. This is what you have if you’re running agents without any governance layer at all. Some scenarios pass because the scenario itself doesn’t require enforcement — e.g., a benign read call correctly succeeds even in vanilla.
audit-only (29/48) — the common framework default. Every tool call logged with attribution, provenance, timestamp. Nothing denied. Nothing rate-limited. No policy enforced. This is what most agent frameworks provide out of the box. The jump from 13 to 29 is what you get from instrumenting your framework with a callback handler.
ACP (45/48) — a dedicated governance product. Adds actual enforcement on top of the audit log. The jump from 29 to 45 is 16 scenarios — every one of which asks “did the system actually stop the bad thing, not just record it?”

The 16 scenarios past 29 are the enforcement scenarios. A logging library can get you to 29. Denying a privilege escalation, narrowing a delegation’s scope, capping fan-out rate limits per user, honoring a policy revocation mid-session — those require a product that was designed to enforce, not just observe.

Specific-framework runners (CrewAI, LangGraph, Claude Agent SDK, OpenAI Agents SDK) are next-step PRs — we’ve written the audit_only runner as the synthesized “what most frameworks give you” baseline; individual runners will land as vendor-contributed PRs. If you maintain one of those frameworks and want your product represented honestly on this scoreboard, the runner contribution template is in CONTRIBUTING.md.

Why we think this is how AI governance vendors should operate

Three claims.

Claim 1: Governance products should be benchmarked against threat models, not feature lists.

Every vendor in this space has a feature matrix. “Per-user policies” checks a box. Whether the per-user policy actually enforces correctly when the user spawns ten subagents in parallel — that’s a test, not a feature. A buyer who trusts feature matrices is buying marketing; a buyer who wants guarantees needs a benchmark.

Claim 2: The benchmark should be open, reproducible, and vendor-neutral.

A benchmark only a vendor can run and interpret is marketing. A benchmark anyone can clone, install, and run against any governance product produces signal. AgentGovBench takes ten minutes to set up against your own ACP instance (or someone else’s), and the scenarios don’t care which vendor you’re testing.

Claim 3: Vendors should publish their own partial failures.

We pass 45 of 48. The three we don’t pass are in this blog post, in the repo’s results file, and in the scorecard the CLI prints at the end of every run. If we ever claim 48/48, reviewers will find the ones we’re hiding — and should. Honest reporting is the only report anyone believes.

Run it yourself against your ACP instance

If you have admin access to an ACP Firebase project:

git clone https://github.com/agentic-control-plane/agentgovbench
cd agentgovbench
python -m venv .venv && source .venv/bin/activate
pip install -e .

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/firebase-service-account.json
export AGB_PROJECT=your-firebase-project-id
export AGB_EMAIL_DOMAIN=bench.yourdomain.com
export FIREBASE_WEB_API_KEY=your-public-firebase-web-api-key

python setup/bootstrap_tenant.py
# Exports: AGB_TENANT_ID, AGB_TENANT_SLUG

python -m benchmark.cli run --runner acp --out results/my-acp.json

If you pass 45/48 with the same declinations, your ACP deployment is at product parity with the reference. If you pass more, you’ve either got a newer build than prod or found a gap we haven’t — please open an issue. If you pass fewer, you’ve likely got policy configuration drift (the bootstrap writes our recommended defaults; merge-preserving behavior may have been affected by prior writes).

Run it against other governance products

The scenarios don’t know what ACP is. Implementing a runner for Guardrails AI, Credo AI, Arthur AI, OpenAI Agents SDK defaults, NVIDIA NeMo Guardrails, or Microsoft Presidio is ~200 lines of Python and an afternoon. The contribution process is in CONTRIBUTING.md. PRs welcome.

If you’re a vendor in this space and you want to be represented on the scoreboard — or want to argue that a scenario is ill-specified — we want that. The benchmark gets better with every runner, every disputed scenario, every new threat pattern someone contributes from their own production experience.

What’s next

Next week: first external runner contribution (we’ve started outreach to three open-source governance projects).
Next month: held-out scenario set published, rotation policy defined, arXiv preprint on the methodology.
Next quarter: cross-benchmark integration — AgentGovBench as one module in a combined suite with AgentLeak, InjecAgent, AgentDAM.
Ongoing: every real customer incident that surfaces a governance gap becomes a scenario. The benchmark keeps growing based on what actually breaks in production.

We built AgentGovBench because we wanted a way to be accountable to our customers. It became the single most valuable engineering tool we’ve shipped this year. If you use ACP, please run it. If you build a governance product, please implement a runner. If you buy these tools, please ask your vendors why they don’t have a scorecard yet.

Repo: github.com/agentic-control-plane/agentgovbench

Share: Twitter LinkedIn