Agentic Control Plane
Benchmark series · Part 14 of 17

Reproduce AgentGovBench on your stack — full setup guide

David Crowe · 6 min read
tutorial · benchmark · reproducibility · agentgovbench

tl;dr

If you have admin access to an ACP Firebase project, you can reproduce every scorecard from this series on your own deployment in under fifteen minutes. This post: exact commands, expected output, and the gotchas that are easy to miss.

Prerequisites

You need:

  • Python 3.11 or 3.13 (3.14 has tiktoken build issues with current crewai)
  • A Firebase service account JSON with admin rights on your ACP project
  • AGB_TENANT_ID and AGB_TENANT_SLUG — generated by setup/bootstrap_tenant.py on first run
  • FIREBASE_WEB_API_KEY — your project’s public Firebase web API key (from Firebase console → Project Settings → General)
  • A clean shell environment — no leftover OPENAI_API_KEY or ANTHROPIC_API_KEY interfering

You do NOT need:

  • An LLM API key. AgentGovBench is fully deterministic — no LLM in the hot path. The vanilla, audit_only, acp, and per-framework runners all synthesize tool calls directly.
  • A production-load gateway. Even a fresh ACP install handles 48 scenarios easily.
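Before bootstrapping, it's worth checking the interpreter and shell in one pass. A minimal preflight sketch; the variable names match the setup section below, and treating a leftover OPENAI_API_KEY / ANTHROPIC_API_KEY as a warning rather than a hard error is my own choice:

```python
import os
import sys

# Env vars the setup section exports before bootstrap
REQUIRED = [
    "GOOGLE_APPLICATION_CREDENTIALS",
    "AGB_PROJECT",
    "AGB_EMAIL_DOMAIN",
    "FIREBASE_WEB_API_KEY",
]
# Keys that should NOT be set: the benchmark is LLM-free
LEFTOVERS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]

def preflight(env=None):
    """Return a list of problems with the interpreter and shell environment."""
    env = os.environ if env is None else env
    problems = []
    if sys.version_info[:2] not in ((3, 11), (3, 13)):
        problems.append(
            f"Python {sys.version_info[0]}.{sys.version_info[1]}; use 3.11 or 3.13"
        )
    for name in REQUIRED:
        if not env.get(name):
            problems.append(f"missing {name}")
    for name in LEFTOVERS:
        if name in env:
            problems.append(f"leftover {name} may interfere; unset it")
    return problems

if __name__ == "__main__":
    for p in preflight():
        print("!", p)
```

Run it after the `export` lines in the next section; an empty output means you're clear to bootstrap.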

Setup

# 1. Clone
git clone https://github.com/agentic-control-plane/agentgovbench
cd agentgovbench

# 2. Python venv with the right interpreter
python3.13 -m venv .venv
source .venv/bin/activate
pip install -e .

# 3. Install the framework you want to benchmark
pip install crewai acp-crewai           # for crewai_native + crewai_acp
pip install langchain-core langgraph    # for langgraph_native + langgraph_acp
# acp-langchain is bundled with the agentgovbench install

# 4. Configure your ACP project
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-firebase-sa.json
export AGB_PROJECT=your-firebase-project-id
export AGB_EMAIL_DOMAIN=bench.yourdomain.com   # any domain you control
export FIREBASE_WEB_API_KEY=AIzaSy...your-public-key

# 5. Bootstrap a clean benchmark tenant + synthetic users
python setup/bootstrap_tenant.py
# Prints AGB_TENANT_ID and AGB_TENANT_SLUG — capture them

The bootstrap script:

  • Creates a fresh tenant in your ACP project
  • Provisions five synthetic Firebase users (agb-alice, agb-bob, agb-carol, agb-dan, agb-eve)
  • Writes a setup/benchmark_env.yaml with the tenant + user UIDs
  • Sets recommended default policies on the tenant

Capture the printed AGB_TENANT_ID — you’ll need it for runs:

export AGB_TENANT_ID=...   # from bootstrap output
export AGB_TENANT_SLUG=agentgovbench

Running benchmarks

# Reference runners (no framework dependency)
python -m benchmark.cli run --runner vanilla       --out results/vanilla.json
python -m benchmark.cli run --runner audit_only    --out results/audit-only.json
python -m benchmark.cli run --runner acp           --out results/acp-mine.json

# Per-framework runners (require framework installed)
python -m benchmark.cli run --runner crewai_native --out results/crewai-native-mine.json
python -m benchmark.cli run --runner crewai_acp    --out results/crewai-acp-mine.json
python -m benchmark.cli run --runner langgraph_native --out results/langgraph-native-mine.json
python -m benchmark.cli run --runner langgraph_acp    --out results/langgraph-acp-mine.json
# etc.

# Limit to one category for quick iteration
python -m benchmark.cli run --runner acp --category identity_propagation

Each gateway-backed run takes ~5-15 minutes for the full 48 scenarios. Native runners (no gateway) finish in seconds.
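Once you have a results file, a category-level diff against the published scorecard is the fastest way to spot drift. A sketch; the dicts below are hand-built to mirror the scorecard table in the next section, since I'm not assuming the exact schema of the `--out` JSON files:

```python
def diff_scorecards(mine: dict[str, int], published: dict[str, int]) -> dict[str, int]:
    """Per-category pass-count delta (mine minus published); nonzero entries only."""
    categories = set(mine) | set(published)
    return {
        c: mine.get(c, 0) - published.get(c, 0)
        for c in categories
        if mine.get(c, 0) != published.get(c, 0)
    }

# Example using two categories from the scorecard format shown below
published = {"audit_completeness": 6, "cross_tenant_isolation": 4}
mine = {"audit_completeness": 6, "cross_tenant_isolation": 3}
print(diff_scorecards(mine, published))  # {'cross_tenant_isolation': -1}
```

An empty dict means you match the reference category-for-category; a negative entry tells you exactly which category to rerun with `--category` for quick iteration.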

Expected output

You should see output like:

[1/48] ✓ audit_completeness.01_required_fields  (124ms)
[2/48] ✓ audit_completeness.02_denial_logged    (215ms)
...
[48/48] ✗ scope_inheritance.06_viewer_cannot_write  (98ms)
    ✗ tool_denied — some calls were allowed

======================================================================
AgentGovBench  spec v0.2  library 2026.04
Runner: acp (Agentic Control Plane 0.4.0) — agenticcontrolplane.com
======================================================================

Category                               Pass     Rate
--------------------------------------------------------
audit_completeness                     6/6      100%
cross_tenant_isolation                 4/6       67%
...
total                                 45/48

If you reproduce 45/48 for the acp runner, with the same three documented failing scenarios, your ACP deployment is at parity with the published reference. That's the success case.

If you see different numbers, the benchmark is doing its job. The likely causes:

  • You’re on an older ACP version (upgrade and rerun)
  • You’ve found a real governance gap we haven’t seen yet (please file an issue!)
  • Your tenant has policy configuration drift from the bootstrap defaults (re-run bootstrap)

Common issues

ModuleNotFoundError: firebase_admin — Run pip install firebase-admin requests pyyaml (these are dependencies of the runner but pip occasionally misses them in editable installs).

Could not load Firebase service account — Check GOOGLE_APPLICATION_CREDENTIALS points at a valid JSON file. Verify with gcloud auth application-default print-access-token (separate test, but should succeed).

401 from /govern/tool-use — The minted Firebase ID token is invalid or signed by an IdP your ACP doesn’t trust. Verify FIREBASE_WEB_API_KEY is correct and matches the project of GOOGLE_APPLICATION_CREDENTIALS.

Every call shows decision: allow with reason fail-open — The runner can’t reach the gateway. Check ACP_BASE_URL (defaults to https://api.agenticcontrolplane.com) and curl -sf $ACP_BASE_URL/health.

Rate-limit scenarios are noisy — If you ran multiple ACP-paired runners in parallel, the rate-limit scenarios will see inflated counts. Run them serially, or accept ~1-2 scenarios of noise in the rate_limit_cascade category.

Scores differ from published by 1-2 scenarios — Usually the rate_limit_cascade.01 sliding-window edge case (within 5% tolerance band per our methodology). If the difference is larger, please file an issue with your setup/benchmark_env.yaml and the failing scenario log.
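For the service-account and 401 failure modes above, a local sanity check catches the two most common mistakes (bad path, key from the wrong project) before you touch the gateway. A sketch; the field names are the standard GCP service-account key JSON fields, not anything AgentGovBench-specific:

```python
import json

def check_service_account(path: str, expected_project: str) -> list[str]:
    """Sanity-check a service-account key file against AGB_PROJECT."""
    try:
        with open(path) as f:
            sa = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot load {path}: {exc}"]
    problems = []
    # Standard fields in every GCP service-account key JSON
    if sa.get("type") != "service_account":
        problems.append("not a service-account key file")
    if sa.get("project_id") != expected_project:
        problems.append(
            f"project_id {sa.get('project_id')!r} does not match "
            f"AGB_PROJECT {expected_project!r}"
        )
    if "private_key" not in sa or "client_email" not in sa:
        problems.append("missing private_key/client_email fields")
    return problems
```

If `project_id` doesn't match AGB_PROJECT, your minted ID tokens will be signed for one project while the gateway trusts another, which is exactly the 401 case described above.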

Running against your own framework

The full series of per-framework scorecards uses the same workflow:

Framework            Install command                         Runner names
CrewAI               pip install crewai acp-crewai           crewai_native, crewai_acp
LangChain/LangGraph  pip install langchain-core langgraph    langgraph_native, langgraph_acp
Claude Code          (CLI, no pip install)                   claude_code_native, claude_code_acp
Codex CLI            (CLI, no pip install)                   codex_native, codex_acp
OpenAI Agents SDK    pip install openai openai-agents        openai_agents_native, openai_agents_acp
Anthropic Agent SDK  (TS, runner is Python representation)   anthropic_agent_sdk_native, anthropic_agent_sdk_acp
Cursor               (IDE, runner is Python representation)  cursor_native, cursor_acp

For the TS frameworks (Anthropic Agent SDK), the runner is a Python representation of the dispatch path — same gateway observables as if you’d actually run the TS code, but reproducible without a Node toolchain.

Running against a competitor governance product

Implement the runner interface from benchmark/runner.py and drop your runner in runners/<your-product>.py; a typical implementation is ~200 lines of Python.

from benchmark.runner import RunnerMetadata, StatefulRunner

class Runner(StatefulRunner):
    @property
    def metadata(self) -> RunnerMetadata:
        return RunnerMetadata(
            name="your_product",
            version="1.0",
            product="Your Governance Product",
            vendor="yourcompany.com",
        )

    def execute_action(self, action):
        # Push the action through your product's governance pipeline
        # Return a ToolOutcome
        ...

    def audit_log(self):
        # Return list[AuditEntry] from your product's audit
        ...

PR your runner to the repo. We’ll review and merge if it passes the contribution checks.
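Outside the repo you can't import benchmark.runner, but the contract is small enough to illustrate with stand-ins. A toy sketch, with hypothetical dataclass shapes for ToolOutcome and AuditEntry (the real types live in the repo) and a static allowlist standing in for a real governance pipeline:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the repo's ToolOutcome / AuditEntry types
@dataclass
class ToolOutcome:
    allowed: bool
    reason: str

@dataclass
class AuditEntry:
    tool: str
    decision: str

class ToyRunner:
    """Minimal governance pipeline: a static allowlist plus an audit trail."""

    def __init__(self, allowlist: set[str]):
        self.allowlist = allowlist
        self._audit: list[AuditEntry] = []

    def execute_action(self, action: dict) -> ToolOutcome:
        # Decide, then record: every decision lands in the audit trail,
        # which is what the audit_completeness scenarios check for.
        tool = action["tool"]
        allowed = tool in self.allowlist
        self._audit.append(AuditEntry(tool, "allow" if allowed else "deny"))
        return ToolOutcome(allowed, "allowlisted" if allowed else "tool_denied")

    def audit_log(self) -> list[AuditEntry]:
        return list(self._audit)
```

The key property the harness exercises is that execute_action and audit_log agree: a denial that never reaches the audit trail fails the scenario even though the call was blocked.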


More in AgentGovBench
  1. How we think about testing AI agent governance
  2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
  3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
  4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
  5. LangGraph's StateGraph checkpoints don't replay through governance
  6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
  7. Claude Code's --dangerously-skip-permissions disables every governance hook
  8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
  9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
  10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
  11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
  12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
  13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
  14. Reproduce AgentGovBench on your stack — full setup guide · you are here
  15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
  16. Recommended governance deployment patterns — pick the one that scores highest for your stack
  17. What our benchmark told us about our own product — six fixes we're shipping