CrewAI's task handoffs lose the audit trail — here's the gap and the fix
tl;dr
In yesterday’s CrewAI scorecard, CrewAI + ACP hit 40/48 on AgentGovBench — 5 below pure ACP’s 45/48. The gap is concentrated in delegation_provenance (2/6 vs 6/6) and scope_inheritance (4/6 vs 6/6). One root cause:
When CrewAI hands off work between agents, the chain context never makes it into the next tool call’s audit metadata.
You’ll see correct allow/deny decisions. You’ll see audit rows for every call. What you won’t see is provenance — the worker’s tool call shows up in the audit log as a top-level call by the user, not as the third hop in a manager → worker → tool chain.
This post: what’s actually happening under the hood, why it matters for forensic reconstruction and EU AI Act Article 14 compliance, and the SDK-side fix that lands in acp-crewai@0.2.0.
The setup
Take a typical Hierarchical Process crew:
```python
manager = Agent(
    role="orchestrator",
    goal="Triage incoming work; delegate to specialists",
    allow_delegation=True,
)
researcher = Agent(
    role="researcher",
    goal="Gather context",
    tools=[web_search, customer_lookup],  # both @governed
)
writer = Agent(
    role="writer",
    goal="Compose the response",
    tools=[send_email],  # @governed
)

crew = Crew(
    agents=[manager, researcher, writer],
    tasks=[triage_task],
    process=Process.hierarchical,
)

install_crew_hooks(crew)  # ACP audits inter-agent handoffs
crew.kickoff()
```
The user runs this. The manager decides to delegate to the researcher. The researcher calls customer_lookup. The researcher hands off to the writer. The writer calls send_email.
In the audit log, you should see (and would, in pure-ACP):
Hop 1 HUMAN alice → delegate to manager
Hop 2 AGENT manager → delegate to researcher
Hop 3 AGENT researcher · customer_lookup → ALLOW
Hop 4 AGENT researcher → handoff to writer
Hop 5 AGENT writer · send_email → ALLOW
What you actually see in CrewAI + acp-crewai@0.1.0:
- customer_lookup → ALLOW (no chain)
- Agent.Handoff → ALLOW (recorded by install_crew_hooks)
- send_email → ALLOW (no chain)
The handoffs are recorded as separate Agent.Handoff events (good — that’s what install_crew_hooks does). But the next tool call from the worker doesn’t carry the chain context. So the audit row for send_email shows actor=writer (or worse, the originating user) with no record of how the call got there.
What @governed sees today
The @governed decorator looks at acp_governance.get_context() to decide what to forward to the gateway. The context is a plain dataclass:
```python
@dataclass
class GovernanceContext:
    user_token: str
    session_id: str
    agent_tier: Optional[str] = None
    agent_name: Optional[str] = None
    # No agent_chain field — yet.
```
You set it once per request:
```python
@app.post("/run")
def run(req: Request, authorization: str = Header(...)):
    set_context(user_token=authorization.removeprefix("Bearer "))
    crew.kickoff()
```
crew.kickoff() then runs to completion. CrewAI’s internal task scheduling moves work between agents. Tool calls fire from whichever agent context is active. But set_context() was called once at the top — it doesn’t update as work moves between agents.
Result: every @governed tool call from any agent in the crew sees the same agent_tier / agent_name / (missing) chain. The gateway gets no signal that this call is the third hop in a chain.
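To make the failure mode concrete, here is a minimal stand-in for this pattern. This is a sketch, not the acp_governance source; the gateway call is stubbed out, and the names beyond those shown in the post are mine:

```python
from dataclasses import dataclass, asdict
from functools import wraps
from typing import Callable, Optional

@dataclass
class GovernanceContext:
    user_token: str
    session_id: str = ""
    agent_tier: Optional[str] = None
    agent_name: Optional[str] = None

_context: Optional[GovernanceContext] = None

def set_context(**fields) -> None:
    # One global snapshot per request; this is the crux of the bug.
    global _context
    _context = GovernanceContext(**fields)

def get_context() -> Optional[GovernanceContext]:
    return _context

def governed(tool: Callable) -> Callable:
    # Forwards whatever the current snapshot holds. It has no way to
    # know which agent in the crew is actually executing this call.
    @wraps(tool)
    def wrapper(*args, **kwargs):
        payload = {"tool": tool.__name__, **asdict(get_context())}
        # gateway.govern_tool_use(payload)  # hypothetical gateway call
        return tool(*args, **kwargs)
    return wrapper
```

Every wrapped tool reads the same snapshot, so a call fired by the writer three hops deep is indistinguishable from one fired by the manager at hop one.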
Why this matters
Three concrete consequences:
1. Forensic reconstruction breaks. When a tool call goes wrong — wrong customer emailed, wrong file written, wrong API hit — the audit log is supposed to tell you which agent in which chain made that call. Without chain provenance, you have a list of send_email calls attributed to “the writer agent” with no way to reconstruct which delegation path led here. This makes incident response a manual git-grep through CrewAI’s verbose logs.
2. EU AI Act Article 14 transparency requirements. Article 14 requires “appropriate transparency about the operation” of high-risk AI systems, including the ability to “interpret the system’s output.” For multi-agent systems, that means being able to trace which agent in which chain produced an action. An audit log that says “writer sent the email” without recording which manager delegated to which researcher who handed off to which writer doesn’t meet this bar.
3. Scope-narrowing across delegation can’t enforce. ACP’s task-scoped narrowing (the SDK-side feature that says “this delegated subagent’s effective scope is parent ∩ declared”) needs to know what chain a call is in to know which “parent” to intersect against. Without chain context, the gateway can’t enforce that a worker’s tool call respects the manager’s narrowed handoff scopes.
These aren’t theoretical. Scenarios delegation_provenance.01_chain_recorded, .03_three_hop_chain, .04_chain_preserved_on_deny, and scope_inheritance.04_task_narrowing in AgentGovBench all assert exactly these properties. CrewAI + @governed fails them today.
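The narrowing rule from point 3 is plain set math; a sketch (function names are mine, not the SDK's):

```python
def effective_scope(parent: set[str], declared: set[str]) -> set[str]:
    # A delegated agent's effective scope is parent ∩ declared:
    # it can never exceed what its delegator was allowed to do.
    return parent & declared

def chain_scope(declared_by_hop: list[set[str]]) -> set[str]:
    # Narrowing applies at every hop, root first, which is exactly
    # why the gateway needs the full chain, not just the caller name.
    eff = declared_by_hop[0]
    for declared in declared_by_hop[1:]:
        eff = effective_scope(eff, declared)
    return eff
```

Without the chain, the gateway has the last hop's declared scopes but nothing to intersect them against.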
The fix — acp-crewai@0.2.0
Two changes:
1. Extend GovernanceContext with agent_chain: list[str]. The chain is the ordered list of agent names from root to current. Forwarded to the gateway as agent_chain on every /govern/tool-use call. Gateway already accepts this field (it’s how the Claude Code hook propagates subagent chains).
2. Make install_crew_hooks(crew) update the active context as agents change. When CrewAI’s step_callback fires for a new agent, push that agent name onto the chain in the active context. When task_callback fires for a task transition, update accordingly. The @governed wrapper picks up the updated context on the next call.
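A sketch of how change 2 can work. Names and wiring here are illustrative until the PR lands; the idea is to keep the chain in a ContextVar so @governed reads an up-to-date value on every call:

```python
from contextvars import ContextVar

# Delegation chain, root → current. A ContextVar keeps it scoped to
# the active request/task context rather than a process-wide global.
_agent_chain: ContextVar[tuple[str, ...]] = ContextVar("agent_chain", default=())

def push_agent(role: str) -> None:
    chain = _agent_chain.get()
    if not chain or chain[-1] != role:  # skip duplicate consecutive hops
        _agent_chain.set(chain + (role,))

def current_chain() -> list[str]:
    # What @governed would forward as agent_chain on /govern/tool-use.
    return list(_agent_chain.get())

# Wiring into CrewAI's callbacks (payload shape varies by version,
# so treat this line as pseudocode):
#   crew = Crew(..., step_callback=lambda step: push_agent(step.agent.role))
```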
The patch is small (~50 lines in acp-crewai/_crew_hooks.py plus the context field). The behavior change is significant — delegation_provenance jumps from 2/6 to 6/6, scope_inheritance from 4/6 to 6/6, total CrewAI + ACP score from 40/48 → 44/48.
We’re shipping this in acp-crewai@0.2.0 next week. Tracking issue and PR are at [TODO]. The benchmark run will be re-published with the new score and the gap explicitly closed in results/crewai-acp-v0.2.json.
Until 0.2.0 — what to do
If you’re running CrewAI + acp-crewai@0.1.x in production today and care about delegation provenance:
- Use Sequential Process where possible. Sequential handoffs (`Task N.output → Task N+1.context`) are pure data passing — no agent context switch. The worker is still the same agent that ran the previous task, so the chain context degenerates to one agent and the audit reflects that correctly.
- Avoid Hierarchical Process for high-stakes tool calls. If a tool call needs full audit provenance — anything touching customer data, money, deploys, communications — keep that tool's caller a top-level agent rather than a manager-delegated subagent.
- Manually thread the chain via `set_context(agent_name=...)` at handoff points. Inside your `task_callback`, call `set_context(agent_name=next_agent.role)` before the next task starts. This doesn't give you the full chain, but it at least gives the gateway the current agent name. Imperfect, but better than nothing.
These are workarounds. The fix is the SDK update — and it lands soon.
Why we ship this honestly
Benchmarks become credible when they’re honest about partial failures. We could have not shipped the crewai_acp runner. We could have shipped it without these scenarios. We could have quietly waited until 0.2.0 to launch the per-framework series.
We chose to ship now, with the gap visible, the root cause documented, and the fix dated. That’s the same standard the original ACP scorecard holds itself to (3 documented declinations against pure ACP’s 45/48). It’s the only standard a benchmark can credibly publish under.
If you maintain a competitor framework or governance product, AgentGovBench is open. PRs welcome. So is reproducing these numbers against your own ACP instance — see the benchmark page for how.
Next in the series: LangGraph runner pair + scorecard. LangGraph’s tool dispatch is the same @tool pattern as CrewAI but its supervisor-worker StateGraph has a different handoff model worth its own post.
Receipts:
- crewai_acp runner source
- delegation_provenance scenarios
- Yesterday’s scorecard post
- acp-crewai on GitHub
- 1. How we think about testing AI agent governance
- 2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
- 3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix · you are here
- 4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
- 5. LangGraph's StateGraph checkpoints don't replay through governance
- 6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
- 7. Claude Code's --dangerously-skip-permissions disables every governance hook
- 8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
- 9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
- 10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
- 11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
- 12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
- 13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
- 14. Reproduce AgentGovBench on your stack — full setup guide
- 15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.