Agentic Control Plane
Benchmark series · Part 2 of 15
AgentGovBench →

CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.

David Crowe · 6 min read
Tags: crewai · benchmark · governance · agentgovbench

tl;dr

We ran CrewAI through AgentGovBench’s 48 deterministic governance scenarios — twice. Once with a clean CrewAI install, once with acp-crewai wrapping the tools.

Configuration                      Score
CrewAI OSS (no callbacks wired)    13/48 — same as the vanilla floor
CrewAI + ACP (@governed)           40/48 — full enforcement on identity, policy, audit, rate, fail-mode

The 27-scenario jump is what enforcement buys you. The 5-scenario gap from pure ACP (45/48) is concentrated in delegation_provenance (2/6 vs 6/6) and scope_inheritance (4/6 vs 6/6) — not because ACP doesn’t work, but because the @governed decorator doesn’t yet thread CrewAI’s task-handoff context into the per-call audit metadata. Specific, fixable, on the roadmap. Detail below.

Both runners are open source. Both results JSONs are in the agentgovbench repo. The benchmark itself ships with three published baseline runners (vanilla, audit_only, acp) — the new crewai_native and crewai_acp runners follow the same BaseRunner interface and are deterministic, no LLM in the hot path.

Why CrewAI native scores at the floor

CrewAI OSS gives you @tool-decorated functions, Agent / Crew / Task composition, sequential and hierarchical processes, and an allow_delegation flag. It does not give you, out of the box:

  • Identity on tool calls. Tools dispatch from the agent’s context, not the end user’s. The end user’s identity doesn’t flow to the tool unless you wire it yourself.
  • Per-tool policy enforcement. Whatever tools you pass to Agent(tools=[...]) are callable. There’s no concept of “this tool requires scope X.”
  • Per-user rate limits. No user concept means no per-user budget.
  • An audit log. task_callback and step_callback exist as optional hooks. Default is None. No callback wired = no audit.
  • A policy document. No notion of workspace policy at any level.
  • Fail-mode discipline. Nothing to fail because there’s no governance layer.

So the categories that require any of these fail at the structural level: identity_propagation (0/6), delegation_provenance (0/6), audit_completeness (1/6), per_user_policy_enforcement (1/6), and scope_inheritance (1/6). The categories that can pass without governance because they assert on benign baseline behaviors — cross_tenant_isolation, fail_mode_discipline, rate_limit_cascade — score 3-4/6.

Total: 13/48. Identical to vanilla. This isn’t a CrewAI flaw — it’s a clarification of what CrewAI’s scope is. CrewAI is an orchestration framework, not a governance layer. Using it as one is a mismatch.

What ACP adds

acp-crewai is a small adapter:

from crewai.tools import tool
from acp_crewai import governed, install_crew_hooks, set_context

@tool("send_email")
@governed("send_email")
def send_email(to: str, subject: str, body: str) -> str:
    return sendmail(to, subject, body)

Stack @governed under @tool. Before each tool call, the wrapper POSTs to ACP’s /govern/tool-use with the tool name, input, session ID, and Authorization: Bearer <user-jwt>. ACP decides allow / deny / redact based on workspace policy, the end user’s scopes, rate limits, and PII detection.
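To make the pattern concrete, here is a minimal, self-contained sketch of what a decorator like @governed does before dispatch. This is not the real acp-crewai internals: the real wrapper issues an HTTP POST to /govern/tool-use with the user's bearer token, while this sketch swaps in an injectable decision function (govern_tool_use and the _context keyword are illustrative stand-ins).

```python
import functools
from typing import Callable

# Illustrative stand-in for ACP's /govern/tool-use decision. The real
# adapter POSTs tool name, input, session ID, and the user's JWT.
def govern_tool_use(tool_name: str, tool_input: dict, context: dict) -> dict:
    # Toy policy: allow only tools listed in the caller's scopes.
    allowed = context.get("scopes", set())
    return {"decision": "allow" if tool_name in allowed else "deny"}

def governed(tool_name: str, decide: Callable = govern_tool_use):
    """Wrap a tool so every call is checked before the function runs."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, _context=None, **kwargs):
            verdict = decide(tool_name, {"args": args, "kwargs": kwargs}, _context or {})
            if verdict["decision"] != "allow":
                raise PermissionError(f"{tool_name}: {verdict['decision']} by policy")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@governed("send_email")
def send_email(to: str, subject: str, body: str) -> str:
    return f"sent to {to}"
```

The key design point carries over to the real adapter: the governance check happens before the tool body runs, so a deny never reaches the underlying function.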

For inter-agent delegation, install_crew_hooks(crew) attaches task_callback and step_callback to the Crew so handoffs emit synthetic Agent.Handoff audit events.
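A rough sketch of the callback side, again self-contained rather than the real acp-crewai code: a task_callback that turns each completed task into a synthetic Agent.Handoff audit event. The field names read off task_output (agent, description) follow CrewAI's TaskOutput shape; treat them as assumptions here.

```python
# Hypothetical sketch of what install_crew_hooks(crew) wires up.
class HandoffAudit:
    def __init__(self):
        self.events = []

    def task_callback(self, task_output):
        # Record the handoff as a synthetic audit event.
        self.events.append({
            "type": "Agent.Handoff",
            "from_agent": getattr(task_output, "agent", None),
            "task": getattr(task_output, "description", None),
        })

audit = HandoffAudit()
# Real wiring (assumed): Crew(..., task_callback=audit.task_callback)
```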

This is the entire integration. Same patterns as the Claude Code hook, the LangChain decorator, and the Anthropic Agent SDK wrapper — same governance pipeline behind all of them.

The score jumps to 40/48:

Category                      Native   + ACP   Note
Audit completeness            1/6      6/6     Every call logged with attribution and trace ID.
Cross-tenant isolation        4/6      4/6     Two declined (single-tenant deployment mode — same as pure ACP).
Delegation provenance         0/6      2/6     Gap: see below.
Fail-mode discipline          3/6      6/6     Fail-open vs fail-closed honored per scenario.
Identity propagation          0/6      6/6     End-user JWT verified on every call.
Per-user policy enforcement   1/6      6/6     Allow / deny / rate-limit per identity.
Rate-limit cascade            3/6      6/6     Fan-out aggregated per user, not per tenant.
Scope inheritance             1/6      4/6     Gap: same root cause as delegation provenance.
Total                         13/48    40/48

The gap — and why it ships honest

The 5-scenario gap from pure ACP (45/48) is concentrated in one root cause:

@governed doesn’t yet thread CrewAI’s task-handoff chain into per-call agent_chain metadata.

When CrewAI’s Hierarchical Process delegates from a manager agent to a worker, or when a Sequential Process passes a task’s output into the next task’s context, install_crew_hooks(crew) records the handoff as a synthetic Agent.Handoff audit event. That’s good. What it doesn’t do is propagate the chain into the next tool call’s context, so when the worker subsequently calls a tool, the gateway sees the call as originating from a top-level agent rather than as the third hop in a chain.

Result: scenarios that assert “the audit shows the worker as the third hop in a three-agent chain” fail. The decision is correct (the right user is allowed/denied), but the provenance metadata is lighter than it should be.

This is the kind of gap a benchmark exists to surface. We could have not shipped the runner, or shipped it without these scenarios. We shipped it as-is with the failures front and center, the same way the original ACP runner ships with three documented declinations. Honest reporting is the only report anyone believes.

The fix is in the SDK adapter: thread the active chain into set_context(agent_chain=...) so the @governed wrapper picks it up on the next call. Tracking issue is [TODO]. Expected to land in acp-crewai@0.2.0.
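One plausible shape for that fix, sketched with Python's contextvars so the chain survives across the handoff boundary. The names record_handoff and current_chain are illustrative, not the real acp-crewai API; only set_context(agent_chain=...) comes from the source.

```python
import contextvars

# Hypothetical sketch: keep the delegation chain in a context variable so
# the next governed tool call can attach full provenance metadata.
_agent_chain: contextvars.ContextVar = contextvars.ContextVar("agent_chain", default=())

def record_handoff(agent_name: str) -> None:
    # Called from the task/step callback when control passes between agents.
    _agent_chain.set(_agent_chain.get() + (agent_name,))

def current_chain() -> list:
    # Read by the governed wrapper and sent as agent_chain audit metadata,
    # so a worker's tool call shows up as hop N rather than a top-level call.
    return list(_agent_chain.get())

record_handoff("manager")
record_handoff("worker")
```

With something like this in place, the gateway would see the worker's tool call as the second hop under the manager instead of a fresh top-level agent.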

What this means for your CrewAI deployment

If you’re running CrewAI in production:

  1. You’re at vanilla unless you’ve wired callbacks. “We have CrewAI logging” usually means there’s a task_callback printing to stdout. That’s not audit; it’s debug output. Real audit needs structured records, attribution, trace IDs, and a place to query them. None of that is default.

  2. acp-crewai is one decorator and one set_context() call. The integration is in /integrations/crewai. A runnable starter (FastAPI + 2-agent crew + 3 tools designed to demo allow / redact / deny) is at crewai-acp-starter.

  3. The benchmark is reproducible. If you want to verify these numbers against your own ACP instance:

    git clone https://github.com/agentic-control-plane/agentgovbench
    cd agentgovbench
    python -m venv .venv && source .venv/bin/activate
    pip install -e .
    pip install crewai acp-crewai
    
    # Point at your ACP project (see README.md for full env setup)
    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/firebase-creds.json
    export AGB_TENANT_ID=...
    export FIREBASE_WEB_API_KEY=...
    
    python -m benchmark.cli run --runner crewai_native --out results/my-crewai-native.json
    python -m benchmark.cli run --runner crewai_acp --out results/my-crewai-acp.json
    

    You should match the published results. If you don’t, that’s either version drift on your ACP deployment or a bug we haven’t seen — please file an issue.

What’s next

This is the first framework write-up in the series. Over the next few weeks we're publishing the same pair-of-runners treatment for LangChain / LangGraph, Claude Code, OpenAI Agents SDK, Anthropic Agent SDK, and Cursor. Each comes with two runners, a results JSON, and a post like this one. The full per-framework scoreboard will live on /benchmark.

Subscribe to the blog or follow @reducibl on X for the next drop.


