Agentic Control Plane
Benchmark series · Part 3 of 15

CrewAI's task handoffs lose the audit trail — here's the gap and the fix

David Crowe · 7 min read
crewai governance audit agent-delegation agentgovbench

tl;dr

In yesterday’s CrewAI scorecard, CrewAI + ACP hit 40/48 on AgentGovBench — 5 below pure ACP’s 45/48. The gap is concentrated in delegation_provenance (2/6 vs 6/6) and scope_inheritance (4/6 vs 6/6). One root cause:

When CrewAI hands off work between agents, the chain context never makes it into the next tool call’s audit metadata.

You’ll see correct allow/deny decisions. You’ll see audit rows for every call. What you won’t see is provenance — the worker’s tool call shows up in the audit log as a top-level call by the user, not as the third hop in a manager → worker → tool chain.

This post: what’s actually happening under the hood, why it matters for forensic reconstruction and EU AI Act Article 14 compliance, and the SDK-side fix that lands in acp-crewai@0.2.0.

The setup

Take a typical Hierarchical Process crew:

from crewai import Agent, Crew, Process

manager = Agent(
    role="orchestrator",
    goal="Triage incoming work; delegate to specialists",
    allow_delegation=True,
)

researcher = Agent(
    role="researcher",
    goal="Gather context",
    tools=[web_search, customer_lookup],   # both @governed
)

writer = Agent(
    role="writer",
    goal="Compose the response",
    tools=[send_email],                     # @governed
)

crew = Crew(
    agents=[manager, researcher, writer],
    tasks=[triage_task],
    process=Process.hierarchical,
)
install_crew_hooks(crew)   # ACP audits inter-agent handoffs
crew.kickoff()

The user runs this. The manager decides to delegate to the researcher. The researcher calls customer_lookup. The researcher hands off to the writer. The writer calls send_email.

In the audit log, you should see (and would, in pure-ACP):

Hop 1  HUMAN  alice                                    →  delegate to manager
Hop 2  AGENT  manager                                  →  delegate to researcher
Hop 3  AGENT  researcher  · customer_lookup            →  ALLOW
Hop 4  AGENT  researcher                               →  handoff to writer
Hop 5  AGENT  writer      · send_email                 →  ALLOW

What you actually see in CrewAI + acp-crewai@0.1.0:

                          ?  customer_lookup           →  ALLOW   (no chain)
                          ?  Agent.Handoff             →  ALLOW   (recorded by install_crew_hooks)
                          ?  send_email                →  ALLOW   (no chain)

The handoffs are recorded as separate Agent.Handoff events (good — that’s what install_crew_hooks does). But the next tool call from the worker doesn’t carry the chain context. So the audit row for send_email shows actor=writer (or worse, the originating user) with no record of how the call got there.
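To make the difference concrete, here is a minimal sketch of the two row shapes. The AuditRow fields and the describe helper are illustrative assumptions, not ACP's actual audit schema; the point is only that a row without a chain cannot be rendered as a delegation path.

```python
from dataclasses import dataclass, field


@dataclass
class AuditRow:
    # Hypothetical row shape; ACP's real audit schema may differ.
    actor: str
    tool: str
    decision: str
    agent_chain: list[str] = field(default_factory=list)


def describe(row: AuditRow) -> str:
    """Render a row as a delegation path, or flag it when provenance is missing."""
    if not row.agent_chain:
        return f"? · {row.tool} → {row.decision} (no chain)"
    path = " → ".join(row.agent_chain)
    return f"{path} · {row.tool} → {row.decision}"


# Same tool call, same decision; only the chain differs.
with_chain = AuditRow(
    "writer", "send_email", "ALLOW",
    ["alice", "manager", "researcher", "writer"],
)
without_chain = AuditRow("writer", "send_email", "ALLOW")

print(describe(with_chain))
print(describe(without_chain))
```

Both rows carry a correct decision. Only the first one lets you answer "how did this call get here?" without going back to the framework's own logs.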

What @governed sees today

The @governed decorator looks at acp_governance.get_context() to decide what to forward to the gateway. The context is a plain dataclass:

from dataclasses import dataclass
from typing import Optional

@dataclass
class GovernanceContext:
    user_token: str
    session_id: str
    agent_tier: Optional[str] = None
    agent_name: Optional[str] = None
    # No agent_chain field — yet.

You set it once per request:

from fastapi import FastAPI, Header, Request

app = FastAPI()

@app.post("/run")
def run(req: Request, authorization: str = Header(...)):
    set_context(user_token=authorization.removeprefix("Bearer "))
    crew.kickoff()

crew.kickoff() then runs to completion. CrewAI’s internal task scheduling moves work between agents. Tool calls fire from whichever agent context is active. But set_context() was called once at the top — it doesn’t update as work moves between agents.

Result: every @governed tool call from any agent in the crew sees the same agent_tier / agent_name / (missing) chain. The gateway gets no signal that this call is the third hop in a chain.
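The set-once behavior is easy to reproduce with a toy stand-in. Everything below is a simplified sketch, not the SDK's code: Ctx, the contextvars storage, the governed wrapper, and the forwarded list (standing in for what would be sent to the gateway) are all assumptions made for illustration.

```python
from contextvars import ContextVar
from dataclasses import dataclass
from functools import wraps
from typing import Optional


@dataclass
class Ctx:
    user_token: str
    agent_name: Optional[str] = None


_ctx: ContextVar[Optional[Ctx]] = ContextVar("ctx", default=None)
forwarded = []  # stand-in for the metadata a gateway would receive


def set_context(**fields) -> None:
    _ctx.set(Ctx(**fields))


def governed(fn):
    """Toy wrapper: record whatever the context says at call time."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        ctx = _ctx.get()
        forwarded.append((fn.__name__, ctx.agent_name if ctx else None))
        return fn(*args, **kwargs)
    return wrapper


@governed
def customer_lookup(customer_id):
    return {"id": customer_id}


@governed
def send_email(to):
    return "sent"


set_context(user_token="tok")   # called once, at the top of the request
customer_lookup("c1")           # logically the researcher's call
send_email("a@example.com")     # logically the writer's call
print(forwarded)
```

Both entries in forwarded carry the same agent_name, because nothing between the two calls told the context that the active agent changed.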

Why this matters

Three concrete consequences:

1. Forensic reconstruction breaks. When a tool call goes wrong (wrong customer emailed, wrong file written, wrong API hit), the audit log is supposed to tell you which agent in which chain made that call. Without chain provenance, you have a list of send_email calls attributed to “the writer agent” with no way to reconstruct which delegation path led to each one. Incident response becomes a manual grep through CrewAI’s verbose logs.

2. EU AI Act Article 14 oversight requirements. Article 14 (human oversight) requires that the people overseeing a high-risk AI system be able to correctly interpret its output; the companion transparency obligation in Article 13 requires the system’s operation to be transparent enough for deployers to do so. For multi-agent systems, that means being able to trace which agent in which chain produced an action. An audit log that says “writer sent the email” without recording which manager delegated to which researcher who handed off to which writer doesn’t meet this bar.

3. Scope narrowing across delegation can’t be enforced. ACP’s task-scoped narrowing (the SDK-side feature that says “this delegated subagent’s effective scope is parent ∩ declared”) needs to know which chain a call is in to know which “parent” to intersect against. Without chain context, the gateway can’t enforce that a worker’s tool call respects the manager’s narrowed handoff scopes.

These aren’t theoretical. Scenarios delegation_provenance.01_chain_recorded, .03_three_hop_chain, .04_chain_preserved_on_deny, and scope_inheritance.04_task_narrowing in AgentGovBench all assert exactly these properties. CrewAI + @governed fails them today.
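The scope-narrowing point is the easiest of the three to show in code. The sketch below applies the "parent ∩ declared" rule hop by hop; the scope names and the effective_scope helper are hypothetical, and the gateway's real enforcement logic is not shown in this post.

```python
def effective_scope(chain_scopes: list[set[str]]) -> set[str]:
    """Intersect scopes from root to current hop: each hop can narrow, never widen."""
    if not chain_scopes:
        return set()
    scope = set(chain_scopes[0])
    for declared in chain_scopes[1:]:
        scope &= declared
    return scope


manager_scope = {"crm.read", "email.send", "web.search"}
handoff_scope = {"crm.read", "web.search"}      # what the manager delegates
worker_declared = {"crm.read", "email.send"}    # what the worker declares

# With the chain, the worker ends up with only what every hop allowed.
with_chain = effective_scope([manager_scope, handoff_scope, worker_declared])

# Without the chain, there is no parent to intersect against.
without_chain = effective_scope([worker_declared])

print(with_chain)
print(without_chain)
```

With the chain, email.send is narrowed away at the handoff. Without it, the worker's own declaration is all there is to check against, so the narrowing silently disappears.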

The fix — acp-crewai@0.2.0

Two changes:

1. Extend GovernanceContext with agent_chain: list[str]. The chain is the ordered list of agent names from root to current, forwarded to the gateway as agent_chain on every /govern/tool-use call. The gateway already accepts this field (it’s how the Claude Code hook propagates subagent chains).

2. Make install_crew_hooks(crew) update the active context as agents change. When CrewAI’s step_callback fires for a new agent, push that agent name onto the chain in the active context. When task_callback fires for a task transition, update accordingly. The @governed wrapper picks up the updated context on the next call.

The patch is small (~50 lines in acp-crewai/_crew_hooks.py plus the context field). The behavior change is significant: delegation_provenance jumps from 2/6 to 6/6, scope_inheritance from 4/6 to 6/6, and the total CrewAI + ACP score from 40/48 to 44/48.
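For a rough picture of the two changes together, here is a self-contained sketch. It is not the 0.2.0 patch: the contextvars storage, on_agent_active (standing in for whatever the step_callback handler does), and tool_use_payload are assumptions, and popping the chain when control returns to a parent agent is left out.

```python
from contextvars import ContextVar
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass
class GovernanceContext:
    user_token: str
    session_id: str
    agent_tier: Optional[str] = None
    agent_name: Optional[str] = None
    agent_chain: list[str] = field(default_factory=list)  # change 1


_active: ContextVar[Optional[GovernanceContext]] = ContextVar("acp_ctx", default=None)


def set_context(**fields) -> None:
    _active.set(GovernanceContext(**fields))


def on_agent_active(agent_name: str) -> None:
    """Change 2 (hypothetical hook body): push the agent unless it is already the current hop."""
    ctx = _active.get()
    if ctx is None:
        return
    if ctx.agent_chain and ctx.agent_chain[-1] == agent_name:
        return  # repeated callbacks for the same agent add nothing
    _active.set(replace(ctx, agent_name=agent_name,
                        agent_chain=[*ctx.agent_chain, agent_name]))


def tool_use_payload(tool: str) -> dict:
    """What a governed tool call would forward, given the active context."""
    ctx = _active.get()
    if ctx is None:
        raise RuntimeError("set_context() was never called")
    return {"tool": tool, "agent_name": ctx.agent_name,
            "agent_chain": list(ctx.agent_chain)}


set_context(user_token="tok", session_id="s1")
for name in ["manager", "researcher", "researcher", "writer"]:
    on_agent_active(name)
payload = tool_use_payload("send_email")
print(payload)
```

The design choice worth noting is that the tool wrapper stays untouched: it keeps reading the active context, and the hooks are the only thing that has to learn about agent transitions.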

We’re shipping this in acp-crewai@0.2.0 next week. Tracking issue and PR are at [TODO]. The benchmark run will be re-published with the new score and the gap explicitly closed in results/crewai-acp-v0.2.json.

Until 0.2.0 — what to do

If you’re running CrewAI + acp-crewai@0.1.x in production today and care about delegation provenance:

  1. Use Sequential Process where possible. Sequential handoffs (Task N.output → Task N+1.context) are pure data passing — no agent context switch. The worker is still the same agent that ran the previous task, so chain context degenerates to one agent and the audit reflects that correctly.

  2. Avoid Hierarchical Process for high-stakes tool calls. If a tool call needs full audit provenance — anything touching customer data, money, deploys, communications — keep that tool’s caller a top-level agent rather than a manager-delegated subagent.

  3. Manually thread chain via set_context(agent_name=...) at handoff points. Inside your task_callback, call set_context(agent_name=next_agent.role) before the next task starts. This doesn’t give you the full chain but at least gives the gateway the current agent name. Imperfect but better than nothing.

These are workarounds. The fix is the SDK update — and it lands soon.
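Workaround 3 can be sketched as a small callback factory. This assumes you know the order of upcoming agents ahead of time and that set_context(agent_name=...) updates the current agent without discarding the rest of the context; both are assumptions to verify against your setup. make_handoff_callback and fake_set_context are hypothetical names used only for this example.

```python
from typing import Callable


def make_handoff_callback(next_roles: list[str],
                          set_context: Callable[..., None]) -> Callable[[object], None]:
    """Return a callback that names the next agent each time a task finishes."""
    remaining = iter(next_roles)

    def task_callback(task_output: object) -> None:
        role = next(remaining, None)
        if role is not None:
            set_context(agent_name=role)

    return task_callback


# Stand-in for the SDK's set_context so the sketch runs on its own.
recorded = []


def fake_set_context(**fields) -> None:
    recorded.append(fields)


callback = make_handoff_callback(["writer"], fake_set_context)
callback(object())   # researcher's task finished; writer is next
callback(object())   # writer's task finished; nothing left to announce
print(recorded)
```

This only ever reports the current agent, never the path that led to it, which is why it is a stopgap rather than a substitute for the chain field.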

Why we ship this honest

Benchmarks become credible when they’re honest about partial failures. We could have not shipped the crewai_acp runner. We could have shipped it without these scenarios. We could have quietly waited until 0.2.0 to launch the per-framework series.

We chose to ship now, with the gap visible, the root cause documented, and the fix dated. That’s the same standard the original ACP scorecard holds itself to (3 documented declinations against pure ACP’s 45/48). It’s the only standard a benchmark can credibly publish under.

If you maintain a competitor framework or governance product, AgentGovBench is open. PRs welcome. So is reproducing these numbers against your own ACP instance — see the benchmark page for how.


Next in the series: LangGraph runner pair + scorecard. LangGraph’s tool dispatch is the same @tool pattern as CrewAI but its supervisor-worker StateGraph has a different handoff model worth its own post.
