How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
tl;dr
Every AgentGovBench scenario cites the specific NIST AI Risk Management Framework 1.0 control(s) it exercises. That’s deliberate — procurement teams cite controls, not vendor claims. This post is the full mapping.
If you’re evaluating AI agent governance products and your security team needs framework alignment, this post lets you point to “AgentGovBench scenario X.Y exercises NIST control Z” rather than “the vendor says they cover identity propagation.”
The framework families AgentGovBench touches
NIST AI RMF 1.0 organizes controls into four functions: GOVERN, MAP, MEASURE, MANAGE. AgentGovBench scenarios cite specific subcategories within each:
GOVERN — organizational policy and accountability
- GOVERN-1.1 — Legal and regulatory requirements involving AI are understood, managed, and documented
  - Exercised by: `fail_mode_discipline` (declared fail mode must be honored — accountability for behavior under failure)
- GOVERN-1.2 — Risk management processes are established for AI lifecycle stages
  - Exercised by: `per_user_policy_enforcement`, `cross_tenant_isolation` (policy actually enforced; tenant boundaries actually held)
- GOVERN-1.4 — Accountability structures are in place
  - Exercised by: `identity_propagation` (audit attributes actions to humans, not service accounts)
MAP — context and risk identification
- MAP-2.1 — Tasks for the AI system are defined; identity and context are tracked
  - Exercised by: `identity_propagation.01_direct_call_attribution` and others
- MAP-4.1 — Risks of negative impacts to people are identified
  - Exercised by: `scope_inheritance` (privilege escalation across delegations)
MEASURE — performance and effectiveness
- MEASURE-2.3 — AI system performance is tracked
  - Exercised by: `delegation_provenance`, `audit_completeness` (measurable forensic record per call)
- MEASURE-2.6 — AI system reliability is regularly assessed
  - Exercised by: `identity_propagation.05_anonymous_rejected` (system reliably rejects unauthenticated calls)
- MEASURE-2.7 — AI system security is regularly assessed
  - Exercised by: `scope_inheritance.04_task_narrowing` (subagents can’t escalate beyond declared scopes)
MANAGE — risk response and treatment
- MANAGE-2.1 — Resources are prioritized to manage AI system risks
  - Exercised by: `rate_limit_cascade` (a user can’t exhaust the system by spawning subagents)
Full scenario-to-NIST mapping
Each scenario YAML in `scenarios/` declares its NIST citations in its `nist:` field. Here’s the consolidated table:
| Category | Scenario count | NIST controls cited |
|---|---|---|
| `audit_completeness` | 6 | MEASURE-2.3, GOVERN-1.4 |
| `cross_tenant_isolation` | 6 | GOVERN-1.2, MEASURE-2.7 |
| `delegation_provenance` | 6 | MEASURE-2.3, GOVERN-1.4 |
| `fail_mode_discipline` | 6 | GOVERN-1.1, MEASURE-2.6 |
| `identity_propagation` | 6 | MAP-2.1, MEASURE-2.6, GOVERN-1.4 |
| `per_user_policy_enforcement` | 6 | GOVERN-1.2 |
| `rate_limit_cascade` | 6 | MANAGE-2.1 |
| `scope_inheritance` | 6 | MAP-4.1, MEASURE-2.7 |
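For reference, here is roughly what a scenario file carrying its citations looks like. This is a sketch, not a copy of a real file: only the `nist:` field is described in this post, and every other key name here is an assumption.

```yaml
# scenarios/identity_propagation/05_anonymous_rejected.yaml (illustrative)
# Only the `nist:` field is documented above; all other keys are assumed.
id: identity_propagation.05_anonymous_rejected
description: >
  The governance layer must reject tool calls that arrive without an
  authenticated end-user identity.
nist:
  - MEASURE-2.6   # reliability: unauthenticated calls are reliably rejected
```

Because the citations live in the scenario files themselves, the table above can be regenerated mechanically rather than maintained by hand.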
Why NIST mapping matters for procurement
Two scenarios for buyers:
Scenario 1 — internal: Your security team wants AI agent governance. They’ve heard about ACP, Guardrails AI, Credo AI, NeMo Guardrails. They ask “how do we evaluate these consistently?”
You point them at AgentGovBench. The scoring against your specific NIST control profile is reproducible — same scenarios, same scoring code, different vendor runners. Procurement gets to measure, not trust.
Scenario 2 — external: Your auditor (SOC 2, ISO 42001, EU AI Act) asks how your AI deployments handle identity propagation. You point them at the AgentGovBench score for your stack — say, CrewAI + ACP at 40/48 — and the per-category breakdown showing 6/6 on identity propagation. The benchmark scenarios are public and NIST-mapped; the audit conversation moves from claims to citations.
This is why we ship the NIST citations alongside the test logic. The benchmark is the artifact. NIST gives the artifact regulatory standing.
What AgentGovBench does not cover
Honest scope limits, in case you’re doing gap analysis:
- GOVERN-2 to GOVERN-6 (organizational risk culture, third-party risk, etc.) — these are people/process controls; AgentGovBench is a runtime test
- MAP-1, MAP-3, MAP-5 (broad context, impact analysis, risk priority) — same: organizational
- MEASURE-1, MEASURE-3, MEASURE-4 (metric design, drift, qualitative measurement) — AgentGovBench is binary pass/fail at runtime, not statistical
- MANAGE-1, MANAGE-3, MANAGE-4 (acceptance criteria, stakeholder communication, post-deployment review) — process, not runtime
If you need an org-level governance assessment, AgentGovBench is one runtime input alongside your ISO 42001 / SOC 2 / EU AI Act compliance program. It’s not a replacement for any of those.
Roadmap — additional NIST coverage
Two scenario categories scoped for v0.3:
- `prompt_injection_resistance` — exercises MAP-2.3, MEASURE-2.7. How does the governance layer behave when prompt injection attempts to manipulate tool calls?
- `output_redaction_compliance` — exercises GOVERN-1.2, MEASURE-2.3. PII / PHI / PCI in outputs must be redacted per policy; the benchmark verifies the redaction actually fires.
Plus the client_bypass_disclosure category we mentioned in the dangerously-skip-permissions post: a meta-test that the governance product surfaces known bypasses to operators.
How to use this in your eval
If you’re a security/compliance team evaluating AI agent governance products:
- Pick the NIST controls you must hit. Most teams have a SOC 2 Type II profile or a regulatory audit driving this — your control owner can tell you.
- Map controls back to AgentGovBench categories. Use the table above.
- Run the benchmark against vendor runners. Compare the per-category scores for the controls that matter to you.
- Document declinations honestly. ACP ships with three documented declinations; competitor runners may have their own. A vendor that claims 48/48 with no declinations is hiding gaps — verify by re-running.
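For step 2, the mapping table above can be inverted in a few lines so a control owner’s list of required NIST subcategories maps directly to the benchmark categories to run. A minimal sketch, assuming nothing beyond the table in this post; `categories_for` is a hypothetical helper, not an AgentGovBench API:

```python
# Category -> NIST controls cited, transcribed from the table above.
CATEGORY_TO_CONTROLS = {
    "audit_completeness": ["MEASURE-2.3", "GOVERN-1.4"],
    "cross_tenant_isolation": ["GOVERN-1.2", "MEASURE-2.7"],
    "delegation_provenance": ["MEASURE-2.3", "GOVERN-1.4"],
    "fail_mode_discipline": ["GOVERN-1.1", "MEASURE-2.6"],
    "identity_propagation": ["MAP-2.1", "MEASURE-2.6", "GOVERN-1.4"],
    "per_user_policy_enforcement": ["GOVERN-1.2"],
    "rate_limit_cascade": ["MANAGE-2.1"],
    "scope_inheritance": ["MAP-4.1", "MEASURE-2.7"],
}


def categories_for(controls):
    """Return the benchmark categories exercising any of the given NIST controls."""
    wanted = set(controls)
    return sorted(
        category
        for category, cited in CATEGORY_TO_CONTROLS.items()
        if wanted & set(cited)
    )


# Example: a profile that cares about accountability and security assessment.
print(categories_for(["GOVERN-1.4", "MEASURE-2.7"]))
# -> ['audit_completeness', 'cross_tenant_isolation',
#     'delegation_provenance', 'identity_propagation', 'scope_inheritance']
```

The output is the subset of categories whose per-category scores you compare across vendor runners in step 3.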
Receipts and references
- NIST AI RMF 1.0 (NIST.AI.100-1)
- NIST AI RMF Playbook
- agentgovbench repo
- Methodology post
- Full scorecard
- /benchmark page — current scores
1. How we think about testing AI agent governance
2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
5. LangGraph's StateGraph checkpoints don't replay through governance
6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
7. Claude Code's --dangerously-skip-permissions disables every governance hook
8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0 · you are here
14. Reproduce AgentGovBench on your stack — full setup guide
15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
16. Recommended governance deployment patterns — pick the one that scores highest for your stack
17. What our benchmark told us about our own product — six fixes we're shipping