Agentic Control Plane
Benchmark series · Part 14 of 17

Reproduce AgentGovBench on your stack — full setup guide

David Crowe · 6 min read
tutorial · benchmark · reproducibility · agentgovbench

tl;dr

If you have admin access to an ACP Firebase project, you can reproduce every scorecard from this series on your own deployment in under fifteen minutes. This post: exact commands, expected output, and the gotchas that are easy to miss.

Prerequisites

You need:

  • Python 3.11 or 3.13 (3.14 has tiktoken build issues with current crewai)
  • A Firebase service account JSON with admin rights on your ACP project
  • AGB_TENANT_ID and AGB_TENANT_SLUG — generated by setup/bootstrap_tenant.py on first run
  • FIREBASE_WEB_API_KEY — your project’s public Firebase web API key (from Firebase console → Project Settings → General)
  • A clean shell environment — no leftover OPENAI_API_KEY or ANTHROPIC_API_KEY interfering

You do NOT need:

  • An LLM API key. AgentGovBench is fully deterministic — no LLM in the hot path. The vanilla, audit_only, acp, and per-framework runners all synthesize tool calls directly.
  • A production-load gateway. Even a fresh ACP install handles 48 scenarios easily.
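Before bootstrapping, it's worth checking the interpreter and shell in one pass. A minimal preflight sketch; the variable names match the setup section below, and treating a leftover OPENAI_API_KEY / ANTHROPIC_API_KEY as a warning rather than a hard error is my own choice:

```python
import os
import sys

# Env vars the setup section exports before bootstrap
REQUIRED = [
    "GOOGLE_APPLICATION_CREDENTIALS",
    "AGB_PROJECT",
    "AGB_EMAIL_DOMAIN",
    "FIREBASE_WEB_API_KEY",
]
# Keys that should NOT be set: the benchmark is LLM-free
LEFTOVERS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]

def preflight(env=None):
    """Return a list of problems with the interpreter and shell environment."""
    env = os.environ if env is None else env
    problems = []
    if sys.version_info[:2] not in ((3, 11), (3, 13)):
        problems.append(
            f"Python {sys.version_info[0]}.{sys.version_info[1]}; use 3.11 or 3.13"
        )
    for name in REQUIRED:
        if not env.get(name):
            problems.append(f"missing {name}")
    for name in LEFTOVERS:
        if name in env:
            problems.append(f"leftover {name} may interfere; unset it")
    return problems

if __name__ == "__main__":
    for p in preflight():
        print("!", p)
```

Run it after the `export` lines in the next section; an empty output means you're clear to bootstrap.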

Setup

# 1. Clone
git clone https://github.com/agentic-control-plane/agentgovbench
cd agentgovbench

# 2. Python venv with the right interpreter
python3.13 -m venv .venv
source .venv/bin/activate
pip install -e .

# 3. Install the framework you want to benchmark
pip install crewai acp-crewai           # for crewai_native + crewai_acp
pip install langchain-core langgraph    # for langgraph_native + langgraph_acp
# acp-langchain is bundled with the agentgovbench install

# 4. Configure your ACP project
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-firebase-sa.json
export AGB_PROJECT=your-firebase-project-id
export AGB_EMAIL_DOMAIN=bench.yourdomain.com   # any domain you control
export FIREBASE_WEB_API_KEY=AIzaSy...your-public-key

# 5. Bootstrap a clean benchmark tenant + synthetic users
python setup/bootstrap_tenant.py
# Prints AGB_TENANT_ID and AGB_TENANT_SLUG — capture them

The bootstrap script:

  • Creates a fresh tenant in your ACP project
  • Provisions five synthetic Firebase users (agb-alice, agb-bob, agb-carol, agb-dan, agb-eve)
  • Writes a setup/benchmark_env.yaml with the tenant + user UIDs
  • Sets recommended default policies on the tenant

Capture the printed AGB_TENANT_ID — you’ll need it for runs:

export AGB_TENANT_ID=...   # from bootstrap output
export AGB_TENANT_SLUG=agentgovbench

Running benchmarks

# Reference runners (no framework dependency)
python -m benchmark.cli run --runner vanilla       --out results/vanilla.json
python -m benchmark.cli run --runner audit_only    --out results/audit-only.json
python -m benchmark.cli run --runner acp           --out results/acp-mine.json

# Per-framework runners (require framework installed)
python -m benchmark.cli run --runner crewai_native --out results/crewai-native-mine.json
python -m benchmark.cli run --runner crewai_acp    --out results/crewai-acp-mine.json
python -m benchmark.cli run --runner langgraph_native --out results/langgraph-native-mine.json
python -m benchmark.cli run --runner langgraph_acp    --out results/langgraph-acp-mine.json
# etc.

# Limit to one category for quick iteration
python -m benchmark.cli run --runner acp --category identity_propagation

Each gateway-backed run takes ~5-15 minutes for the full 48 scenarios. Native runners (no gateway) finish in seconds.
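Once you have a results file, a category-level diff against the published scorecard is the fastest way to spot drift. A sketch; the dicts below are hand-built to mirror the scorecard table in the next section, since I'm not assuming the exact schema of the `--out` JSON files:

```python
def diff_scorecards(mine: dict[str, int], published: dict[str, int]) -> dict[str, int]:
    """Per-category pass-count delta (mine minus published); nonzero entries only."""
    categories = set(mine) | set(published)
    return {
        c: mine.get(c, 0) - published.get(c, 0)
        for c in categories
        if mine.get(c, 0) != published.get(c, 0)
    }

# Example using two categories from the scorecard format shown below
published = {"audit_completeness": 6, "cross_tenant_isolation": 4}
mine = {"audit_completeness": 6, "cross_tenant_isolation": 3}
print(diff_scorecards(mine, published))  # {'cross_tenant_isolation': -1}
```

An empty dict means you match the reference category-for-category; a negative entry tells you exactly which category to rerun with `--category` for quick iteration.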

Expected output

You should see output like:

[1/48] ✓ audit_completeness.01_required_fields  (124ms)
[2/48] ✓ audit_completeness.02_denial_logged    (215ms)
...
[48/48] ✗ scope_inheritance.06_viewer_cannot_write  (98ms)
    ✗ tool_denied — some calls were allowed

======================================================================
AgentGovBench  spec v0.2  library 2026.04
Runner: acp (Agentic Control Plane 0.4.0) — agenticcontrolplane.com
======================================================================

Category                               Pass     Rate
--------------------------------------------------------
audit_completeness                     6/6      100%
cross_tenant_isolation                 4/6       67%
...
total                                 45/48

If you reproduce 45/48 for the acp runner, with the same three documented failing scenarios, your ACP deployment is at parity with the published reference. That's the success case.

If you see different numbers, the benchmark is doing its job. The likely causes:

  • You’re on an older ACP version (upgrade and rerun)
  • You’ve found a real governance gap we haven’t seen yet (please file an issue!)
  • Your tenant has policy configuration drift from the bootstrap defaults (re-run bootstrap)

Common issues

ModuleNotFoundError: firebase_admin — Run pip install firebase-admin requests pyyaml (these are dependencies of the runner but pip occasionally misses them in editable installs).

Could not load Firebase service account — Check GOOGLE_APPLICATION_CREDENTIALS points at a valid JSON file. Verify with gcloud auth application-default print-access-token (separate test, but should succeed).

401 from /govern/tool-use — The minted Firebase ID token is invalid or signed by an IdP your ACP doesn’t trust. Verify FIREBASE_WEB_API_KEY is correct and matches the project of GOOGLE_APPLICATION_CREDENTIALS.

Every call shows decision: allow with reason fail-open — The runner can’t reach the gateway. Check ACP_BASE_URL (defaults to https://api.agenticcontrolplane.com) and curl -sf $ACP_BASE_URL/health.

Rate-limit scenarios are noisy — If you ran multiple ACP-paired runners in parallel, the rate-limit scenarios will see inflated counts. Run them serially, or accept ~1-2 scenarios of noise in the rate_limit_cascade category.

Scores differ from published by 1-2 scenarios — Usually the rate_limit_cascade.01 sliding-window edge case (within 5% tolerance band per our methodology). If the difference is larger, please file an issue with your setup/benchmark_env.yaml and the failing scenario log.
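For the service-account and 401 failure modes above, a local sanity check catches the two most common mistakes (bad path, key from the wrong project) before you touch the gateway. A sketch; the field names are the standard GCP service-account key JSON fields, not anything AgentGovBench-specific:

```python
import json

def check_service_account(path: str, expected_project: str) -> list[str]:
    """Sanity-check a service-account key file against AGB_PROJECT."""
    try:
        with open(path) as f:
            sa = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot load {path}: {exc}"]
    problems = []
    # Standard fields in every GCP service-account key JSON
    if sa.get("type") != "service_account":
        problems.append("not a service-account key file")
    if sa.get("project_id") != expected_project:
        problems.append(
            f"project_id {sa.get('project_id')!r} does not match "
            f"AGB_PROJECT {expected_project!r}"
        )
    if "private_key" not in sa or "client_email" not in sa:
        problems.append("missing private_key/client_email fields")
    return problems
```

If `project_id` doesn't match AGB_PROJECT, your minted ID tokens will be signed for one project while the gateway trusts another, which is exactly the 401 case described above.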

Running against your own framework

The full series of per-framework scorecards uses the same workflow:

Framework            Install command                         Runner names
CrewAI               pip install crewai acp-crewai           crewai_native, crewai_acp
LangChain/LangGraph  pip install langchain-core langgraph    langgraph_native, langgraph_acp
Claude Code          (CLI, no pip install)                   claude_code_native, claude_code_acp
Codex CLI            (CLI, no pip install)                   codex_native, codex_acp
OpenAI Agents SDK    pip install openai openai-agents        openai_agents_native, openai_agents_acp
Anthropic Agent SDK  (TS, runner is Python representation)   anthropic_agent_sdk_native, anthropic_agent_sdk_acp
Cursor               (IDE, runner is Python representation)  cursor_native, cursor_acp

For the TS frameworks (Anthropic Agent SDK), the runner is a Python representation of the dispatch path — same gateway observables as if you’d actually run the TS code, but reproducible without a Node toolchain.

Running against a competitor governance product

Implement the runner interface from benchmark/runner.py and drop your runner in runners/<your-product>.py; a typical implementation is ~200 lines of Python.

from benchmark.runner import RunnerMetadata, StatefulRunner

class Runner(StatefulRunner):
    @property
    def metadata(self) -> RunnerMetadata:
        return RunnerMetadata(
            name="your_product",
            version="1.0",
            product="Your Governance Product",
            vendor="yourcompany.com",
        )

    def execute_action(self, action):
        # Push the action through your product's governance pipeline
        # Return a ToolOutcome
        ...

    def audit_log(self):
        # Return list[AuditEntry] from your product's audit
        ...

PR your runner to the repo. We’ll review and merge if it passes the contribution checks.
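Outside the repo you can't import benchmark.runner, but the contract is small enough to illustrate with stand-ins. A toy sketch, with hypothetical dataclass shapes for ToolOutcome and AuditEntry (the real types live in the repo) and a static allowlist standing in for a real governance pipeline:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the repo's ToolOutcome / AuditEntry types
@dataclass
class ToolOutcome:
    allowed: bool
    reason: str

@dataclass
class AuditEntry:
    tool: str
    decision: str

class ToyRunner:
    """Minimal governance pipeline: a static allowlist plus an audit trail."""

    def __init__(self, allowlist: set[str]):
        self.allowlist = allowlist
        self._audit: list[AuditEntry] = []

    def execute_action(self, action: dict) -> ToolOutcome:
        # Decide, then record: every decision lands in the audit trail,
        # which is what the audit_completeness scenarios check for.
        tool = action["tool"]
        allowed = tool in self.allowlist
        self._audit.append(AuditEntry(tool, "allow" if allowed else "deny"))
        return ToolOutcome(allowed, "allowlisted" if allowed else "tool_denied")

    def audit_log(self) -> list[AuditEntry]:
        return list(self._audit)
```

The key property the harness exercises is that execute_action and audit_log agree: a denial that never reaches the audit trail fails the scenario even though the call was blocked.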


More in AgentGovBench
  1. How we think about testing AI agent governance
  2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
  3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
  4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
  5. LangGraph's StateGraph checkpoints don't replay through governance
  6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
  7. Claude Code's --dangerously-skip-permissions disables every governance hook
  8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
  9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
  10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
  11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
  12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
  13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
  14. Reproduce AgentGovBench on your stack — full setup guide · you are here
  15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
  16. Recommended governance deployment patterns — pick the one that scores highest for your stack
  17. What our benchmark told us about our own product — six fixes we're shipping