Reproduce AgentGovBench on your stack — full setup guide
tl;dr
If you have admin access to an ACP Firebase project, you can reproduce every scorecard from this series on your own deployment in under fifteen minutes. This post: exact commands, expected output, and the gotchas that are easy to miss.
Prerequisites
You need:
- Python 3.11 or 3.13 (3.14 has tiktoken build issues with current crewai)
- A Firebase service account JSON with admin rights on your ACP project
- `AGB_TENANT_ID` and `AGB_TENANT_SLUG` — generated by `setup/bootstrap_tenant.py` on first run
- `FIREBASE_WEB_API_KEY` — your project's public Firebase web API key (from Firebase console → Project Settings → General)
- A clean shell environment — no leftover `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` interfering
You do NOT need:
- An LLM API key. AgentGovBench is fully deterministic — no LLM in the hot path. The `vanilla`, `audit_only`, `acp`, and per-framework runners all synthesize tool calls directly.
- A production-load gateway. Even a fresh ACP install handles 48 scenarios easily.
Setup
```bash
# 1. Clone
git clone https://github.com/agentic-control-plane/agentgovbench
cd agentgovbench

# 2. Python venv with the right interpreter
python3.13 -m venv .venv
source .venv/bin/activate
pip install -e .

# 3. Install the framework you want to benchmark
pip install crewai acp-crewai          # for crewai_native + crewai_acp
pip install langchain-core langgraph   # for langgraph_native + langgraph_acp
# acp-langchain comes bundled with the agentgovbench install

# 4. Configure your ACP project
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-firebase-sa.json
export AGB_PROJECT=your-firebase-project-id
export AGB_EMAIL_DOMAIN=bench.yourdomain.com   # any domain you control
export FIREBASE_WEB_API_KEY=AIzaSy...your-public-key

# 5. Bootstrap a clean benchmark tenant + synthetic users
python setup/bootstrap_tenant.py
# Prints AGB_TENANT_ID and AGB_TENANT_SLUG — capture them
```
The bootstrap script:
- Creates a fresh tenant in your ACP project
- Provisions five synthetic Firebase users (`agb-alice`, `agb-bob`, `agb-carol`, `agb-dan`, `agb-eve`)
- Writes `setup/benchmark_env.yaml` with the tenant + user UIDs
- Sets recommended default policies on the tenant
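If you script the bootstrap, the printed values are easy to capture programmatically. A small sketch, assuming the script prints them as `KEY=value` lines (check your actual output format and adjust the pattern):

```python
import re

def parse_bootstrap_output(text: str) -> dict[str, str]:
    """Pull AGB_TENANT_ID / AGB_TENANT_SLUG out of the bootstrap stdout.

    Assumes the script prints them as KEY=value lines; adjust the
    pattern if your output format differs.
    """
    env = {}
    for key in ("AGB_TENANT_ID", "AGB_TENANT_SLUG"):
        match = re.search(rf"^{key}=(\S+)\s*$", text, re.MULTILINE)
        if match:
            env[key] = match.group(1)
    return env

# Example: feed it captured stdout and emit eval-able export lines
sample = "Created tenant.\nAGB_TENANT_ID=t_abc123\nAGB_TENANT_SLUG=agentgovbench\n"
for key, value in parse_bootstrap_output(sample).items():
    print(f"export {key}={value}")
```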
Capture the printed `AGB_TENANT_ID` — you'll need it for runs:

```bash
export AGB_TENANT_ID=...               # from bootstrap output
export AGB_TENANT_SLUG=agentgovbench
```
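Before the first run, a quick preflight sketch that checks your shell against the prerequisites above (variable names are the ones from this guide; the forbidden-key check catches the leftover LLM keys mentioned in the prerequisites):

```python
import os

REQUIRED = [
    "GOOGLE_APPLICATION_CREDENTIALS",
    "AGB_PROJECT",
    "AGB_EMAIL_DOMAIN",
    "FIREBASE_WEB_API_KEY",
    "AGB_TENANT_ID",
    "AGB_TENANT_SLUG",
]
# Leftover LLM keys mean a dirty shell; the benchmark never needs them
FORBIDDEN = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]

def preflight(env: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means good to go."""
    problems = [f"missing: {key}" for key in REQUIRED if not env.get(key)]
    problems += [f"leftover key, unset it: {key}" for key in FORBIDDEN if env.get(key)]
    return problems

# Check the live shell with: print(preflight(dict(os.environ)))
```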
Running benchmarks
```bash
# Reference runners (no framework dependency)
python -m benchmark.cli run --runner vanilla --out results/vanilla.json
python -m benchmark.cli run --runner audit_only --out results/audit-only.json
python -m benchmark.cli run --runner acp --out results/acp-mine.json

# Per-framework runners (require the framework installed)
python -m benchmark.cli run --runner crewai_native --out results/crewai-native-mine.json
python -m benchmark.cli run --runner crewai_acp --out results/crewai-acp-mine.json
python -m benchmark.cli run --runner langgraph_native --out results/langgraph-native-mine.json
python -m benchmark.cli run --runner langgraph_acp --out results/langgraph-acp-mine.json
# etc.

# Limit to one category for quick iteration
python -m benchmark.cli run --runner acp --category identity_propagation
```
Each gateway-backed run takes ~5–15 minutes for the full 48 scenarios. Native runners (no gateway) finish in seconds.
Expected output
You should see output like:
```text
[1/48]  ✓ audit_completeness.01_required_fields (124ms)
[2/48]  ✓ audit_completeness.02_denial_logged (215ms)
...
[48/48] ✗ scope_inheritance.06_viewer_cannot_write (98ms)
        ✗ tool_denied — some calls were allowed

======================================================================
  AgentGovBench spec v0.2  library 2026.04
  Runner: acp (Agentic Control Plane 0.4.0) — agenticcontrolplane.com
======================================================================
Category                    Pass    Rate
--------------------------------------------------------
audit_completeness          6/6     100%
cross_tenant_isolation      4/6      67%
...
total                       45/48
```
If you reproduce 45/48 for the `acp` runner, with the same three documented failing scenarios, your ACP deployment is at parity with the published reference. That's the success case.
If you see different numbers, that's the benchmark doing its job. One of three things is true:
- You’re on an older ACP version (upgrade and rerun)
- You’ve found a real governance gap we haven’t seen yet (please file an issue!)
- Your tenant has policy configuration drift from the bootstrap defaults (re-run bootstrap)
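To see exactly which scenarios diverge from the published reference, you can diff two results files. This sketch assumes a results JSON shaped like `{"scenarios": [{"id": "...", "passed": true}, ...]}` — inspect your own output files for the real schema before relying on it:

```python
import json

def load_passes(path: str) -> dict[str, bool]:
    # Assumed file shape: {"scenarios": [{"id": "...", "passed": true}, ...]}
    # The real schema may differ; check your results files.
    with open(path) as f:
        data = json.load(f)
    return {s["id"]: s["passed"] for s in data["scenarios"]}

def diff_results(reference: dict[str, bool], mine: dict[str, bool]) -> list[str]:
    """One line per scenario whose pass/fail status differs."""
    lines = []
    for sid in sorted(reference.keys() | mine.keys()):
        ref, ours = reference.get(sid), mine.get(sid)
        if ref != ours:
            lines.append(f"{sid}: reference={ref} yours={ours}")
    return lines
```

Usage would look like `diff_results(load_passes("results/reference.json"), load_passes("results/acp-mine.json"))` (the reference filename here is illustrative).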
Common issues
`ModuleNotFoundError: firebase_admin` — run `pip install firebase-admin requests pyyaml` (these are runner dependencies, but pip occasionally misses them in editable installs).

Could not load Firebase service account — check that `GOOGLE_APPLICATION_CREDENTIALS` points at a valid JSON file. Verify with `gcloud auth application-default print-access-token` (a separate test, but it should succeed).

401 from `/govern/tool-use` — the minted Firebase ID token is invalid or signed by an IdP your ACP doesn't trust. Verify that `FIREBASE_WEB_API_KEY` is correct and matches the project of `GOOGLE_APPLICATION_CREDENTIALS`.

Every call shows `decision: allow` with reason `fail-open` — the runner can't reach the gateway. Check `ACP_BASE_URL` (defaults to `https://api.agenticcontrolplane.com`) and try `curl -sf $ACP_BASE_URL/health`.

Rate-limit scenarios are noisy — if you ran multiple ACP-paired runners in parallel, the rate-limit scenarios will see inflated counts. Run them serially, or accept ~1–2 scenarios of noise in the `rate_limit_cascade` category.

Scores differ from the published numbers by 1–2 scenarios — usually the `rate_limit_cascade.01` sliding-window edge case (within the 5% tolerance band per our methodology). If the difference is larger, please file an issue with your `setup/benchmark_env.yaml` and the failing scenario log.
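The serial-run advice above is easy to script. A sketch (the CLI invocation is the one from the run commands earlier; the runner list is yours to edit):

```python
import subprocess
import sys

def build_commands(runners: list[str], results_dir: str = "results") -> list[list[str]]:
    """One benchmark.cli invocation per runner; execute them one at a
    time so rate-limit scenarios never see each other's traffic."""
    return [
        [sys.executable, "-m", "benchmark.cli", "run",
         "--runner", runner, "--out", f"{results_dir}/{runner}.json"]
        for runner in runners
    ]

# Execute serially, stopping at the first failure:
#   for cmd in build_commands(["acp", "crewai_acp", "langgraph_acp"]):
#       subprocess.run(cmd, check=True)
```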
Running against your own framework
The full series of per-framework scorecards uses the same workflow:
| Framework | Install command | Runner names |
|---|---|---|
| CrewAI | `pip install crewai acp-crewai` | `crewai_native`, `crewai_acp` |
| LangChain/LangGraph | `pip install langchain-core langgraph` | `langgraph_native`, `langgraph_acp` |
| Claude Code | (CLI, no pip install) | `claude_code_native`, `claude_code_acp` |
| Codex CLI | (CLI, no pip install) | `codex_native`, `codex_acp` |
| OpenAI Agents SDK | `pip install openai openai-agents` | `openai_agents_native`, `openai_agents_acp` |
| Anthropic Agent SDK | (TS, runner is Python representation) | `anthropic_agent_sdk_native`, `anthropic_agent_sdk_acp` |
| Cursor | (IDE, runner is Python representation) | `cursor_native`, `cursor_acp` |
For the TS frameworks (Anthropic Agent SDK), the runner is a Python representation of the dispatch path — same gateway observables as if you’d actually run the TS code, but reproducible without a Node toolchain.
Running against a competitor governance product
Implement the `BaseRunner` interface from `benchmark/runner.py` and drop your runner in `runners/<your-product>.py`. A typical runner is ~200 lines of Python.
```python
from benchmark.runner import RunnerMetadata, StatefulRunner

class Runner(StatefulRunner):
    @property
    def metadata(self) -> RunnerMetadata:
        return RunnerMetadata(
            name="your_product",
            version="1.0",
            product="Your Governance Product",
            vendor="yourcompany.com",
        )

    def execute_action(self, action):
        # Push the action through your product's governance pipeline
        # and return a ToolOutcome
        ...

    def audit_log(self):
        # Return list[AuditEntry] from your product's audit
        ...
```
PR your runner to the repo. We’ll review and merge if it passes the contribution checks.
Receipts and references
- agentgovbench repo — clone this
- setup/bootstrap_tenant.py — what the bootstrap does
- CONTRIBUTING.md — runner contribution template
- Methodology post — why we built it this way
- NIST mapping — control coverage for procurement
- Full scorecard
- 1. How we think about testing AI agent governance
- 2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
- 3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
- 4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
- 5. LangGraph's StateGraph checkpoints don't replay through governance
- 6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
- 7. Claude Code's --dangerously-skip-permissions disables every governance hook
- 8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
- 9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
- 10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
- 11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
- 12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
- 13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
- 14. Reproduce AgentGovBench on your stack — full setup guide · you are here
- 15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
- 16. Recommended governance deployment patterns — pick the one that scores highest for your stack
- 17. What our benchmark told us about our own product — six fixes we're shipping