Agentic Control Plane
Benchmark series · Part 13 of 17

How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0

David Crowe · 4 min read
nist ai-rmf governance compliance agentgovbench methodology

tl;dr

Every AgentGovBench scenario cites the specific NIST AI Risk Management Framework 1.0 control(s) it exercises. That’s deliberate — procurement teams cite controls, not vendor claims. This post is the full mapping.

If you’re evaluating AI agent governance products and your security team needs framework alignment, this post lets you point to “AgentGovBench scenario X.Y exercises NIST control Z” rather than “the vendor says they cover identity propagation.”

The framework families AgentGovBench touches

NIST AI RMF 1.0 organizes controls into four functions: GOVERN, MAP, MEASURE, MANAGE. AgentGovBench scenarios cite specific subcategories within each:

GOVERN — organizational policy and accountability

  • GOVERN-1.1 — Legal and regulatory requirements involving AI are understood, managed, and documented
    • Exercised by: fail_mode_discipline (declared fail mode must be honored — accountability for behavior under failure)
  • GOVERN-1.2 — Risk management processes are established for AI lifecycle stages
    • Exercised by: per_user_policy_enforcement, cross_tenant_isolation (policy actually enforced; tenant boundaries actually held)
  • GOVERN-1.4 — Accountability structures are in place
    • Exercised by: identity_propagation (audit attributes actions to humans, not service accounts)

MAP — context and risk identification

  • MAP-2.1 — Tasks for the AI system are defined; identity and context are tracked
    • Exercised by: identity_propagation.01_direct_call_attribution and others
  • MAP-4.1 — Risks of negative impacts to people are identified
    • Exercised by: scope_inheritance (privilege escalation across delegations)

MEASURE — performance and effectiveness

  • MEASURE-2.3 — AI system performance is tracked
    • Exercised by: delegation_provenance, audit_completeness (measurable forensic record per call)
  • MEASURE-2.6 — AI system reliability is regularly assessed
    • Exercised by: identity_propagation.05_anonymous_rejected (system reliably rejects unauthenticated calls)
  • MEASURE-2.7 — AI system security is regularly assessed
    • Exercised by: scope_inheritance.04_task_narrowing (subagents can’t escalate beyond declared scopes)

MANAGE — risk response and treatment

  • MANAGE-2.1 — Resources are prioritized to manage AI system risks
    • Exercised by: rate_limit_cascade (a user can’t exhaust the system by spawning subagents)

Full scenario-to-NIST mapping

Each scenario YAML in scenarios/ declares its NIST citations in the nist: field. Here’s the consolidated table:

Category                     Scenarios  NIST controls cited
audit_completeness           6          MEASURE-2.3, GOVERN-1.4
cross_tenant_isolation       6          GOVERN-1.2, MEASURE-2.7
delegation_provenance        6          MEASURE-2.3, GOVERN-1.4
fail_mode_discipline         6          GOVERN-1.1, MEASURE-2.6
identity_propagation         6          MAP-2.1, MEASURE-2.6, GOVERN-1.4
per_user_policy_enforcement  6          GOVERN-1.2
rate_limit_cascade           6          MANAGE-2.1
scope_inheritance            6          MAP-4.1, MEASURE-2.7
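
As a concrete illustration, a scenario file with its nist: field might look like the sketch below. This is hypothetical: the file path, field names other than nist:, and the description text are invented for illustration, not the actual schema.

```yaml
# Illustrative sketch of a scenario file, e.g. under scenarios/identity_propagation/
name: 01_direct_call_attribution
category: identity_propagation
description: >
  A direct tool call must be attributed to the originating human
  identity, not a service account.
nist:                # NIST AI RMF 1.0 subcategories this scenario exercises
  - MAP-2.1
  - MEASURE-2.6
  - GOVERN-1.4
```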

Why NIST mapping matters for procurement

Two scenarios for buyers:

Scenario 1 — internal: Your security team wants AI agent governance. They’ve heard about ACP, Guardrails AI, Credo AI, NeMo Guardrails. They ask “how do we evaluate these consistently?”

You point them at AgentGovBench. The scoring against your specific NIST control profile is reproducible — same scenarios, same scoring code, different vendor runners. Procurement gets to measure, not trust.

Scenario 2 — external: Your auditor (SOC 2, ISO 42001, EU AI Act) asks how your AI deployments handle identity propagation. You point them at the AgentGovBench score for your stack — say, CrewAI + ACP at 40/48 — and the per-category breakdown showing 6/6 on identity propagation. The benchmark scenarios are public and NIST-mapped; the audit conversation moves from claims to citations.

This is why we ship the NIST citations alongside the test logic. The benchmark is the artifact. NIST gives the artifact regulatory standing.

What AgentGovBench does not cover

Honest scope limits, in case you’re doing gap analysis:

  • GOVERN-2 to GOVERN-6 (organizational risk culture, third-party risk, etc.) — these are people/process controls; AgentGovBench is a runtime test
  • MAP-1, MAP-3, MAP-5 (broad context, impact analysis, risk priority) — same: organizational
  • MEASURE-1, MEASURE-3, MEASURE-4 (metric design, drift, qualitative measurement) — AgentGovBench is binary pass/fail at runtime, not statistical
  • MANAGE-1, MANAGE-3, MANAGE-4 (acceptance criteria, stakeholder communication, post-deployment review) — process, not runtime

If you need an org-level governance assessment, AgentGovBench is one runtime input alongside your ISO 42001 / SOC 2 / EU AI Act compliance program. It’s not a replacement for any of those.
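
For a quick gap analysis, the consolidated covered-control set from the table above can be diffed against your required profile. A minimal sketch, assuming the mapping table is accurate; the required profile below is an invented example:

```python
# NIST AI RMF 1.0 subcategories that AgentGovBench scenarios cite,
# consolidated from the category table above.
AGENTGOVBENCH_COVERED = {
    "GOVERN-1.1", "GOVERN-1.2", "GOVERN-1.4",
    "MAP-2.1", "MAP-4.1",
    "MEASURE-2.3", "MEASURE-2.6", "MEASURE-2.7",
    "MANAGE-2.1",
}

def coverage_gap(required: set[str]) -> set[str]:
    """Controls in your profile that the benchmark does not exercise."""
    return required - AGENTGOVBENCH_COVERED

# Invented example profile: two runtime controls plus one org-level control.
profile = {"GOVERN-1.4", "MAP-2.1", "MANAGE-4.1"}
print(sorted(coverage_gap(profile)))  # → ['MANAGE-4.1']
```

Anything the diff surfaces (here, the org-level MANAGE-4.1) has to come from your process-level compliance program, not from a benchmark run.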

Roadmap — additional NIST coverage

Two scenario categories scoped for v0.3:

  • prompt_injection_resistance — exercises MAP-2.3, MEASURE-2.7. How does the governance layer behave when prompt injection attempts to manipulate tool calls?
  • output_redaction_compliance — exercises GOVERN-1.2, MEASURE-2.3. PII / PHI / PCI in outputs must be redacted per policy; benchmark verifies the redaction actually fires.

Plus the client_bypass_disclosure category we mentioned in the dangerously-skip-permissions post: a meta-test that the governance product surfaces known bypasses to operators.

How to use this in your eval

If you’re a security/compliance team evaluating AI agent governance products:

  1. Pick the NIST controls you must hit. Most teams have a SOC 2 Type II profile or a regulatory audit driving this — your control owner can tell you.
  2. Map controls back to AgentGovBench categories. Use the table above.
  3. Run the benchmark against vendor runners. Compare the per-category scores for the controls that matter to you.
  4. Document declinations honestly. ACP ships with three documented declinations; competitor runners may have their own. A vendor that claims 48/48 with no declinations is hiding gaps — verify by re-running.
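
The four steps above can be sketched end to end. The mapping dict mirrors the consolidated table in this post; the required profile and per-category vendor scores are invented placeholders, not real benchmark results:

```python
# Step 2 input: category -> NIST controls cited (from the table above).
CATEGORY_NIST = {
    "audit_completeness":          {"MEASURE-2.3", "GOVERN-1.4"},
    "cross_tenant_isolation":      {"GOVERN-1.2", "MEASURE-2.7"},
    "delegation_provenance":       {"MEASURE-2.3", "GOVERN-1.4"},
    "fail_mode_discipline":        {"GOVERN-1.1", "MEASURE-2.6"},
    "identity_propagation":        {"MAP-2.1", "MEASURE-2.6", "GOVERN-1.4"},
    "per_user_policy_enforcement": {"GOVERN-1.2"},
    "rate_limit_cascade":          {"MANAGE-2.1"},
    "scope_inheritance":           {"MAP-4.1", "MEASURE-2.7"},
}

def categories_for(controls: set[str]) -> list[str]:
    """Step 2: benchmark categories that exercise any of the given controls."""
    return sorted(cat for cat, cited in CATEGORY_NIST.items() if cited & controls)

def subtotal(scores: dict[str, int], controls: set[str]) -> int:
    """Steps 3-4: sum per-category scores for the controls you care about."""
    return sum(scores.get(cat, 0) for cat in categories_for(controls))

# Step 1: controls your audit profile requires (invented example).
required = {"GOVERN-1.4", "MAP-2.1"}
# Step 3: hypothetical per-category scores from one vendor run (out of 6 each).
vendor_scores = {"audit_completeness": 5, "delegation_provenance": 4,
                 "identity_propagation": 6}
print(categories_for(required))
# → ['audit_completeness', 'delegation_provenance', 'identity_propagation']
print(subtotal(vendor_scores, required))  # → 15
```

Running the same subtotal over each vendor's per-category results gives the like-for-like comparison step 3 calls for.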

More in AgentGovBench
  1. How we think about testing AI agent governance
  2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
  3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
  4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
  5. LangGraph's StateGraph checkpoints don't replay through governance
  6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
  7. Claude Code's --dangerously-skip-permissions disables every governance hook
  8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
  9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
  10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
  11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
  12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
  13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0 (this post)
  14. Reproduce AgentGovBench on your stack — full setup guide
  15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
  16. Recommended governance deployment patterns — pick the one that scores highest for your stack
  17. What our benchmark told us about our own product — six fixes we're shipping