Agentic Control Plane
Benchmark series · Part 13 of 17

How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0

David Crowe · 4 min read
nist ai-rmf governance compliance agentgovbench methodology

tl;dr

Every AgentGovBench scenario cites the specific NIST AI Risk Management Framework 1.0 control(s) it exercises. That’s deliberate — procurement teams cite controls, not vendor claims. This post is the full mapping.

If you’re evaluating AI agent governance products and your security team needs framework alignment, this post lets you point to “AgentGovBench scenario X.Y exercises NIST control Z” rather than “the vendor says they cover identity propagation.”

The framework families AgentGovBench touches

NIST AI RMF 1.0 organizes controls into four functions: GOVERN, MAP, MEASURE, MANAGE. AgentGovBench scenarios cite specific subcategories within each:

GOVERN — organizational policy and accountability

  • GOVERN-1.1 — Legal and regulatory requirements involving AI are understood, managed, and documented
    • Exercised by: fail_mode_discipline (declared fail mode must be honored — accountability for behavior under failure)
  • GOVERN-1.2 — Risk management processes are established for AI lifecycle stages
    • Exercised by: per_user_policy_enforcement, cross_tenant_isolation (policy actually enforced; tenant boundaries actually held)
  • GOVERN-1.4 — Accountability structures are in place
    • Exercised by: identity_propagation (audit attributes actions to humans, not service accounts)

MAP — context and risk identification

  • MAP-2.1 — Tasks for the AI system are defined; identity and context are tracked
    • Exercised by: identity_propagation.01_direct_call_attribution and others
  • MAP-4.1 — Risks of negative impacts to people are identified
    • Exercised by: scope_inheritance (privilege escalation across delegations)

MEASURE — performance and effectiveness

  • MEASURE-2.3 — AI system performance is tracked
    • Exercised by: delegation_provenance, audit_completeness (measurable forensic record per call)
  • MEASURE-2.6 — AI system reliability is regularly assessed
    • Exercised by: identity_propagation.05_anonymous_rejected (system reliably rejects unauthenticated calls)
  • MEASURE-2.7 — AI system security is regularly assessed
    • Exercised by: scope_inheritance.04_task_narrowing (subagents can’t escalate beyond declared scopes)

MANAGE — risk response and treatment

  • MANAGE-2.1 — Resources are prioritized to manage AI system risks
    • Exercised by: rate_limit_cascade (a user can’t exhaust the system by spawning subagents)

Full scenario-to-NIST mapping

Each scenario YAML in scenarios/ declares its NIST citations in the nist: field. Here’s the consolidated table:

Category                     Scenarios  NIST controls cited
audit_completeness           6          MEASURE-2.3, GOVERN-1.4
cross_tenant_isolation       6          GOVERN-1.2, MEASURE-2.7
delegation_provenance        6          MEASURE-2.3, GOVERN-1.4
fail_mode_discipline         6          GOVERN-1.1, MEASURE-2.6
identity_propagation         6          MAP-2.1, MEASURE-2.6, GOVERN-1.4
per_user_policy_enforcement  6          GOVERN-1.2
rate_limit_cascade           6          MANAGE-2.1
scope_inheritance            6          MAP-4.1, MEASURE-2.7
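
As a concrete illustration, a scenario file with its nist: field might look like the sketch below. This is hypothetical: the file path, field names other than nist:, and the description text are invented for illustration, not the actual schema.

```yaml
# Illustrative sketch of a scenario file, e.g. under scenarios/identity_propagation/
name: 01_direct_call_attribution
category: identity_propagation
description: >
  A direct tool call must be attributed to the originating human
  identity, not a service account.
nist:                # NIST AI RMF 1.0 subcategories this scenario exercises
  - MAP-2.1
  - MEASURE-2.6
  - GOVERN-1.4
```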

Why NIST mapping matters for procurement

Two scenarios for buyers:

Scenario 1 — internal: Your security team wants AI agent governance. They’ve heard about ACP, Guardrails AI, Credo AI, NeMo Guardrails. They ask “how do we evaluate these consistently?”

You point them at AgentGovBench. The scoring against your specific NIST control profile is reproducible — same scenarios, same scoring code, different vendor runners. Procurement gets to measure, not trust.

Scenario 2 — external: Your auditor (SOC 2, ISO 42001, EU AI Act) asks how your AI deployments handle identity propagation. You point them at the AgentGovBench score for your stack — say, CrewAI + ACP at 40/48 — and the per-category breakdown showing 6/6 on identity propagation. The benchmark scenarios are public and NIST-mapped; the audit conversation moves from claims to citations.

This is why we ship the NIST citations alongside the test logic. The benchmark is the artifact. NIST gives the artifact regulatory standing.

What AgentGovBench does not cover

Honest scope limits, in case you’re doing gap analysis:

  • GOVERN-2 to GOVERN-6 (organizational risk culture, third-party risk, etc.) — these are people/process controls; AgentGovBench is a runtime test
  • MAP-1, MAP-3, MAP-5 (broad context, impact analysis, risk priority) — same: organizational
  • MEASURE-1, MEASURE-3, MEASURE-4 (metric design, drift, qualitative measurement) — AgentGovBench is binary pass/fail at runtime, not statistical
  • MANAGE-1, MANAGE-3, MANAGE-4 (acceptance criteria, stakeholder communication, post-deployment review) — process, not runtime

If you need an org-level governance assessment, AgentGovBench is one runtime input alongside your ISO 42001 / SOC 2 / EU AI Act compliance program. It’s not a replacement for any of those.
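
For a quick gap analysis, the consolidated covered-control set from the table above can be diffed against your required profile. A minimal sketch, assuming the mapping table is accurate; the required profile below is an invented example:

```python
# NIST AI RMF 1.0 subcategories that AgentGovBench scenarios cite,
# consolidated from the category table above.
AGENTGOVBENCH_COVERED = {
    "GOVERN-1.1", "GOVERN-1.2", "GOVERN-1.4",
    "MAP-2.1", "MAP-4.1",
    "MEASURE-2.3", "MEASURE-2.6", "MEASURE-2.7",
    "MANAGE-2.1",
}

def coverage_gap(required: set[str]) -> set[str]:
    """Controls in your profile that the benchmark does not exercise."""
    return required - AGENTGOVBENCH_COVERED

# Invented example profile: two runtime controls plus one org-level control.
profile = {"GOVERN-1.4", "MAP-2.1", "MANAGE-4.1"}
print(sorted(coverage_gap(profile)))  # → ['MANAGE-4.1']
```

Anything the diff surfaces (here, the org-level MANAGE-4.1) has to come from your process-level compliance program, not from a benchmark run.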

Roadmap — additional NIST coverage

Two scenario categories scoped for v0.3:

  • prompt_injection_resistance — exercises MAP-2.3, MEASURE-2.7. How does the governance layer behave when prompt injection attempts to manipulate tool calls?
  • output_redaction_compliance — exercises GOVERN-1.2, MEASURE-2.3. PII / PHI / PCI in outputs must be redacted per policy; benchmark verifies the redaction actually fires.

Plus the client_bypass_disclosure category we mentioned in the dangerously-skip-permissions post: a meta-test that the governance product surfaces known bypasses to operators.

How to use this in your eval

If you’re a security/compliance team evaluating AI agent governance products:

  1. Pick the NIST controls you must hit. Most teams have a SOC 2 Type II profile or a regulatory audit driving this — your control owner can tell you.
  2. Map controls back to AgentGovBench categories. Use the table above.
  3. Run the benchmark against vendor runners. Compare the per-category scores for the controls that matter to you.
  4. Document declinations honestly. ACP ships with three documented declinations; competitor runners may have their own. A vendor that claims 48/48 with no declinations is hiding gaps — verify by re-running.
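
The four steps above can be sketched end to end. The mapping dict mirrors the consolidated table in this post; the required profile and per-category vendor scores are invented placeholders, not real benchmark results:

```python
# Step 2 input: category -> NIST controls cited (from the table above).
CATEGORY_NIST = {
    "audit_completeness":          {"MEASURE-2.3", "GOVERN-1.4"},
    "cross_tenant_isolation":      {"GOVERN-1.2", "MEASURE-2.7"},
    "delegation_provenance":       {"MEASURE-2.3", "GOVERN-1.4"},
    "fail_mode_discipline":        {"GOVERN-1.1", "MEASURE-2.6"},
    "identity_propagation":        {"MAP-2.1", "MEASURE-2.6", "GOVERN-1.4"},
    "per_user_policy_enforcement": {"GOVERN-1.2"},
    "rate_limit_cascade":          {"MANAGE-2.1"},
    "scope_inheritance":           {"MAP-4.1", "MEASURE-2.7"},
}

def categories_for(controls: set[str]) -> list[str]:
    """Step 2: benchmark categories that exercise any of the given controls."""
    return sorted(cat for cat, cited in CATEGORY_NIST.items() if cited & controls)

def subtotal(scores: dict[str, int], controls: set[str]) -> int:
    """Steps 3-4: sum per-category scores for the controls you care about."""
    return sum(scores.get(cat, 0) for cat in categories_for(controls))

# Step 1: controls your audit profile requires (invented example).
required = {"GOVERN-1.4", "MAP-2.1"}
# Step 3: hypothetical per-category scores from one vendor run (out of 6 each).
vendor_scores = {"audit_completeness": 5, "delegation_provenance": 4,
                 "identity_propagation": 6}
print(categories_for(required))
# → ['audit_completeness', 'delegation_provenance', 'identity_propagation']
print(subtotal(vendor_scores, required))  # → 15
```

Running the same subtotal over each vendor's per-category results gives the like-for-like comparison step 3 calls for.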

More in AgentGovBench
  1. How we think about testing AI agent governance
  2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
  3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
  4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
  5. LangGraph's StateGraph checkpoints don't replay through governance
  6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
  7. Claude Code's --dangerously-skip-permissions disables every governance hook
  8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
  9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
  10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
  11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
  12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
  13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0 (this post)
  14. Reproduce AgentGovBench on your stack — full setup guide
  15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
  16. Recommended governance deployment patterns — pick the one that scores highest for your stack
  17. What our benchmark told us about our own product — six fixes we're shipping