CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
## tl;dr
We ran CrewAI through AgentGovBench’s 48 deterministic governance scenarios — twice. Once with a clean CrewAI install, once with acp-crewai wrapping the tools.
| Configuration | Score |
|---|---|
| CrewAI OSS (no callbacks wired) | 13/48 — same as the vanilla floor |
| CrewAI + ACP (`@governed`) | 40/48 — full enforcement on identity, policy, audit, rate, fail-mode |
The 27-scenario jump is what enforcement buys you. The 5-scenario gap from pure ACP (45/48) is concentrated in delegation_provenance (2/6 vs 6/6) and scope_inheritance (4/6 vs 6/6) — not because ACP doesn’t work, but because the @governed decorator doesn’t yet thread CrewAI’s task-handoff context into the per-call audit metadata. Specific, fixable, on the roadmap. Detail below.
Both runners are open source. Both results JSONs are in the agentgovbench repo. The benchmark itself ships with three published baseline runners (vanilla, audit_only, acp) — the new crewai_native and crewai_acp runners follow the same BaseRunner interface and are deterministic, no LLM in the hot path.
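For orientation, here is a hedged sketch of the runner contract described above. The real `BaseRunner` lives in the agentgovbench repo and its method names may differ; everything below is illustrative shape, not the repo's actual API:

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseRunner(ABC):
    """Hypothetical sketch of the shared runner interface."""

    name: str

    @abstractmethod
    def run_scenario(self, scenario: dict[str, Any]) -> dict[str, Any]:
        """Execute one deterministic scenario; no LLM in the hot path."""


class CrewAINativeRunner(BaseRunner):
    name = "crewai_native"

    def run_scenario(self, scenario: dict[str, Any]) -> dict[str, Any]:
        # Deterministic: replay the scenario's scripted tool calls and
        # record what a bare CrewAI install lets through.
        return {"scenario": scenario["id"], "passed": False}


result = CrewAINativeRunner().run_scenario({"id": "identity_propagation_01"})
```

The point of the shared interface is that every framework pair (native and + ACP) is scored by identical scenario replays, so the delta is attributable to governance, not prompt luck.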
## Why CrewAI native scores at the floor
CrewAI OSS gives you `@tool`-decorated functions, `Agent` / `Crew` / `Task` composition, sequential and hierarchical processes, and an `allow_delegation` flag. It does not give you, out of the box:

- **Identity on tool calls.** Tools dispatch from the agent's context, not the end user's. The end user's identity doesn't flow to the tool unless you wire it yourself.
- **Per-tool policy enforcement.** Whatever tools you pass to `Agent(tools=[...])` are callable. There's no concept of "this tool requires scope X."
- **Per-user rate limits.** No user concept means no per-user budget.
- **An audit log.** `task_callback` and `step_callback` exist as optional hooks. Default is `None`. No callback wired = no audit.
- **A policy document.** No notion of workspace policy at any level.
- **Fail-mode discipline.** Nothing to fail because there's no governance layer.
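To make the audit bullet concrete: even wiring a callback yourself only gets you structured records if you build them. A minimal sketch of what "real audit" (versus a print statement) looks like, with illustrative field names — CrewAI itself defines none of this:

```python
import json
import time
import uuid
from typing import Any


def audit_task_callback(task_output: Any) -> dict:
    """Hypothetical structured audit record for Crew(task_callback=...).

    Field names are illustrative; CrewAI only gives you the hook, not
    the record format, the trace ID, or anywhere to query it.
    """
    record = {
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "event": "Task.Completed",
        "output_preview": str(task_output)[:200],
    }
    # In production this goes to an append-only store, not stdout.
    print(json.dumps(record))
    return record


rec = audit_task_callback("summary of q3 numbers")
```

Everything beyond the hook signature — attribution, retention, queryability — is on you, which is the gap the benchmark's audit_completeness category measures.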
So the categories that require any of these — identity_propagation (0/6), delegation_provenance (0/6), audit_completeness (1/6), per_user_policy_enforcement (1/6) — fail at the structural level. The categories that can pass without governance because they assert on benign baseline behaviors score 3–4/6 (cross_tenant_isolation, fail_mode_discipline, rate_limit_cascade); scope_inheritance manages only 1/6.
Total: 13/48. Identical to vanilla. This isn’t a CrewAI flaw — it’s a clarification of what CrewAI’s scope is. CrewAI is an orchestration framework, not a governance layer. Using it as one is a mismatch.
## What ACP adds
acp-crewai is a small adapter:
```python
from crewai.tools import tool
from acp_crewai import governed, install_crew_hooks, set_context

@tool("send_email")
@governed("send_email")
def send_email(to: str, subject: str, body: str) -> str:
    return sendmail(to, subject, body)
```
Stack `@governed` under `@tool`. Before each tool call, the wrapper POSTs to ACP's `/govern/tool-use` with the tool name, input, session ID, and `Authorization: Bearer <user-jwt>`. ACP decides allow / deny / redact based on workspace policy, the end user's scopes, rate limits, and PII detection.
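The wrapper's control flow can be sketched without the network call. Everything below — `govern_tool_use`, the context store, the verdict shape — is a local stand-in for the real ACP client, not the adapter's actual API:

```python
import functools
from typing import Any, Callable

# Module-level context the set_context() call populates (sketch).
_context: dict[str, Any] = {}


def set_context(**kwargs: Any) -> None:
    """Record per-request context (user JWT, session ID) for the wrapper."""
    _context.update(kwargs)


def govern_tool_use(tool_name: str, tool_input: dict, context: dict) -> dict:
    """Local stub for the decision call; the real wrapper POSTs to ACP."""
    if context.get("user_jwt") is None:
        return {"decision": "deny", "reason": "no end-user identity"}
    return {"decision": "allow"}


def governed(tool_name: str) -> Callable:
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> str:
            verdict = govern_tool_use(
                tool_name, {"args": args, "kwargs": kwargs}, _context
            )
            if verdict["decision"] != "allow":
                return f"[governed] {tool_name} denied: {verdict['reason']}"
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@governed("send_email")
def send_email(to: str, subject: str, body: str) -> str:
    return f"sent to {to}"


set_context(user_jwt=None)
denied = send_email("a@b.co", "hi", "...")    # no identity -> denied
set_context(user_jwt="eyJ...")
allowed = send_email("a@b.co", "hi", "...")   # identity present -> runs
```

The key property: the governed function never executes on a deny, so enforcement is pre-call, not post-hoc logging.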
For inter-agent delegation, `install_crew_hooks(crew)` attaches `task_callback` and `step_callback` to the Crew so handoffs emit synthetic `Agent.Handoff` audit events.
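A hedged sketch of what that installation amounts to — a minimal stand-in `Crew` is used here so the snippet runs without CrewAI, and the event shape is an assumption from this post, not the adapter's real schema:

```python
from typing import Any, Callable, Optional


class Crew:
    """Stand-in for crewai.Crew: just the two callback slots."""

    def __init__(self) -> None:
        self.task_callback: Optional[Callable] = None
        self.step_callback: Optional[Callable] = None


audit_log: list[dict[str, Any]] = []


def install_crew_hooks(crew: Crew) -> None:
    """Sketch: wire the optional hooks to emit synthetic handoff events."""

    def on_task_done(task_output: Any) -> None:
        # Each completed task is treated as a handoff to the next agent.
        audit_log.append({"event": "Agent.Handoff", "output": str(task_output)})

    crew.task_callback = on_task_done


crew = Crew()
install_crew_hooks(crew)
crew.task_callback("draft complete")  # CrewAI itself would invoke this
```

One call at crew-construction time, and every handoff shows up in the audit stream instead of vanishing between tasks.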
This is the entire integration. Same patterns as the Claude Code hook, the LangChain decorator, and the Anthropic Agent SDK wrapper — same governance pipeline behind all of them.
The score jumps to 40/48:
| Category | Native | + ACP | Note |
|---|---|---|---|
| Audit completeness | 1/6 | 6/6 | Every call logged with attribution and trace ID. |
| Cross-tenant isolation | 4/6 | 4/6 | Two declined (single-tenant deployment mode — same as pure ACP). |
| Delegation provenance | 0/6 | 2/6 | Gap: see below. |
| Fail-mode discipline | 3/6 | 6/6 | Fail-open vs fail-closed honored per scenario. |
| Identity propagation | 0/6 | 6/6 | End-user JWT verified on every call. |
| Per-user policy enforcement | 1/6 | 6/6 | Allow / deny / rate-limit per identity. |
| Rate-limit cascade | 3/6 | 6/6 | Fan-out aggregated per user, not per tenant. |
| Scope inheritance | 1/6 | 4/6 | Gap: same root cause as delegation provenance. |
| **Total** | **13/48** | **40/48** | |
## The gap — and why it ships honest
The 5-scenario gap from pure ACP (45/48) is concentrated in one root cause:
`@governed` doesn't yet thread CrewAI's task-handoff chain into per-call `agent_chain` metadata.
When CrewAI’s Hierarchical Process delegates from a manager agent to a worker, or when a Sequential Process passes a task’s output into the next task’s context, install_crew_hooks(crew) records the handoff as a synthetic Agent.Handoff audit event. That’s good. What it doesn’t do is propagate the chain into the next tool call’s context, so when the worker subsequently calls a tool, the gateway sees the call as originating from a top-level agent rather than as the third hop in a chain.
Result: scenarios that assert “the audit shows the worker as the third hop in a three-agent chain” fail. The decision is correct (the right user is allowed/denied), but the provenance metadata is lighter than it should be.
This is the kind of gap a benchmark exists to surface. We could have not shipped the runner, or shipped it without these scenarios. We shipped it as-is with the failures front and center, the same way the original ACP runner ships with three documented declinations. Honest reporting is the only report anyone believes.
The fix is in the SDK adapter: thread the active chain into `set_context(agent_chain=...)` so the `@governed` wrapper picks it up on the next call. Tracking issue is [TODO]. Expected to land in acp-crewai@0.2.0.
## What this means for your CrewAI deployment
If you’re running CrewAI in production:
- **You're at vanilla unless you've wired callbacks.** "We have CrewAI logging" usually means there's a `task_callback` printing to stdout. That's not audit; it's debug output. Real audit needs structured records, attribution, trace IDs, and a place to query them. None of that is default.
- **`acp-crewai` is one decorator and one `set_context()` call.** The integration guide is in /integrations/crewai. A runnable starter (FastAPI + 2-agent crew + 3 tools designed to demo allow / redact / deny) is at crewai-acp-starter.
- **The benchmark is reproducible.** If you want to verify these numbers against your own ACP instance:

  ```bash
  git clone https://github.com/agentic-control-plane/agentgovbench
  cd agentgovbench
  python -m venv .venv && source .venv/bin/activate
  pip install -e .
  pip install crewai acp-crewai
  # Point at your ACP project (see README.md for full env setup)
  export GOOGLE_APPLICATION_CREDENTIALS=/path/to/firebase-creds.json
  export AGB_TENANT_ID=...
  export FIREBASE_WEB_API_KEY=...
  python -m benchmark.cli run --runner crewai_native --out results/my-crewai-native.json
  python -m benchmark.cli run --runner crewai_acp --out results/my-crewai-acp.json
  ```

  You should match the published results. If you don't, that's either version drift on your ACP deployment or a bug we haven't seen — please file an issue.
## What's next
This is the first in a series. Over the next few weeks we’re publishing the same pair-of-runners treatment for LangChain / LangGraph, Claude Code, OpenAI Agents SDK, Anthropic Agent SDK, and Cursor. Each comes with two runners + a results JSON + a post like this one. The full per-framework scoreboard will live on /benchmark.
Subscribe to the blog or follow @reducibl on X for the next drop.
Receipts:
- agentgovbench repo
- crewai_native runner source
- crewai_acp runner source
- crewai-native-v0.1.json results
- crewai-acp-v0.1.json results
- /integrations/crewai install guide
- crewai-acp-starter — runnable reference
- Methodology post — How we think about testing agent governance
- 1. How we think about testing AI agent governance
- 2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48. · you are here
- 3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
- 4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
- 5. LangGraph's StateGraph checkpoints don't replay through governance
- 6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
- 7. Claude Code's --dangerously-skip-permissions disables every governance hook
- 8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
- 9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
- 10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
- 11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
- 12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
- 13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
- 14. Reproduce AgentGovBench on your stack — full setup guide
- 15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.