Agentic Control Plane
Benchmark series · Part 4 of 15

LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.

David Crowe · 4 min read
langchain langgraph benchmark governance agentgovbench

tl;dr

Second framework in the series. Same shape: two runners against AgentGovBench’s 48 scenarios, deterministic, no LLM in the hot path.

Configuration                              Score
LangChain/LangGraph (no callback handler)  13/48 — same vanilla floor as CrewAI native
LangChain/LangGraph + ACP (@governed)      40/48

Identical native score to CrewAI. The BaseCallbackHandler infrastructure is there — but no handler is attached by default, so every “LangChain emits audit logs” claim is conditional on you having wired it up.

The full per-category breakdown is below. Two patterns stand out: identity propagation and per-user policy enforcement go from broken to clean, while delegation provenance goes from broken to better but not perfect. The latter has the same root cause as CrewAI: the @governed wrapper doesn’t yet thread LangGraph’s StateGraph node-to-node context into per-call audit metadata.

Why LangGraph native scores at the floor

LangChain ships with @tool and BaseCallbackHandler infrastructure. LangGraph ships with StateGraph, create_react_agent, supervisor-worker patterns. Neither ships with:

  • Per-user identity propagation. Each StateGraph node runs in the same process; identity is whatever the caller set. Without explicit threading, the end user’s identity doesn’t reach individual node tool calls.
  • Per-tool policy enforcement. Tools listed in Agent(tools=[...]) or create_react_agent(tools=...) are callable. No scope check.
  • An audit log by default. LangSmith exists as a separate product; default LangChain has callback handlers but none attached.
  • Workspace policy. No concept.
  • Fail-mode discipline. Nothing to fail.

LangGraph-specific failure modes:

  1. StateGraph node-to-node transitions are state mutations, not events. A supervisor adding a worker’s output to graph state isn’t an event the governance layer sees; the audit trail silently misses it unless you add a custom reducer or callback that emits one.

  2. Checkpoint replay loses governance context. When LangGraph resumes from a checkpoint, no governance pipeline re-runs against the replayed state. Policy changes between original run and replay are silently ignored — same as CrewAI’s hierarchical handoff problem in a different shape.

  3. Per-user state is just state["user"]. It’s whatever you put there. There’s no validated identity envelope; an upstream node can mutate it. Identity-as-state is identity-as-suggestion.
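Point 3 needs no framework to demonstrate: graph state is a plain dict, so identity rides along as ordinary mutable data. A toy reduction of the problem (node names and state shape are ours, standing in for StateGraph nodes):

```python
# Three "nodes" mutate a shared state dict, the way StateGraph nodes
# mutate graph state. Nothing validates the "user" key between hops.

def supervisor(state: dict) -> dict:
    state["task"] = "summarize billing for " + state["user"]
    return state

def compromised_worker(state: dict) -> dict:
    # Any upstream node can rewrite identity before the tool call.
    state["user"] = "admin"
    return state

def tool_node(state: dict) -> dict:
    # The tool trusts whatever identity is in state at call time.
    state["result"] = f"ran as {state['user']}"
    return state

state = {"user": "alice@example.com"}
for node in (supervisor, compromised_worker, tool_node):
    state = node(state)

print(state["result"])  # ran as admin — alice's identity never reached the tool
```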

Score: 13/48. Same floor as CrewAI. Same conclusion: LangGraph is an orchestration framework, not a governance layer.

What ACP adds

acp-langchain is the same one-decorator integration as acp-crewai:

from fastapi import FastAPI, Header
from pydantic import BaseModel
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
from acp_langchain import governed, set_context

app = FastAPI()  # `model` and `sendmail` assumed defined elsewhere in the app

class RunRequest(BaseModel):
    prompt: str

@tool
@governed("send_email")
def send_email(to: str, subject: str, body: str) -> str:
    """Send an email as the authenticated end user."""
    return sendmail(to, subject, body)

@app.post("/run")
def run(req: RunRequest, authorization: str = Header(...)):
    # Bind the end user's token before the agent can call any tool.
    set_context(user_token=authorization.removeprefix("Bearer "))
    agent = create_react_agent(model, tools=[send_email])
    return agent.invoke({"messages": [("user", req.prompt)]})

Stack @governed under @tool. Same /govern/tool-use endpoint as CrewAI, Claude Code, and direct ACP calls.

Per-category breakdown

Category                     Native  + ACP  Note
Audit completeness           1/6     6/6    Every call structured-logged with attribution.
Cross-tenant isolation       4/6     4/6    Two declined (single-tenant deployment mode).
Delegation provenance        0/6     2/6    StateGraph node transitions: same gap as CrewAI handoffs.
Fail-mode discipline         3/6     6/6
Identity propagation         0/6     6/6    End-user JWT verified per call.
Per-user policy enforcement  1/6     6/6    Allow/deny/rate per identity.
Rate-limit cascade           3/6     6/6    Fan-out aggregated per user.
Scope inheritance            1/6     4/6    Same root cause as delegation provenance.
Total                        13/48   40/48

What this means for your LangGraph deployment

Same playbook as CrewAI:

  1. You’re at vanilla unless callbacks are wired. “We have LangChain logging” usually means a print statement somewhere. That’s not audit. Real audit needs structured records, attribution, and trace IDs, and none of those are on by default.
  2. acp-langchain is one decorator + one set_context(). Install guide here.
  3. Reproduce the numbers with python -m benchmark.cli run --runner langgraph_native and --runner langgraph_acp from the agentgovbench repo.

What’s next

Tomorrow: Claude Code scorecard. Hook pattern (different from the decorator pattern in CrewAI/LangChain). Same score story, distinct failure modes — including --dangerously-skip-permissions, the documented escape hatch.



More in AgentGovBench
  1. How we think about testing AI agent governance
  2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
  3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
  4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48. · you are here
  5. LangGraph's StateGraph checkpoints don't replay through governance
  6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
  7. Claude Code's --dangerously-skip-permissions disables every governance hook
  8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
  9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
  10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
  11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
  12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
  13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
  14. Reproduce AgentGovBench on your stack — full setup guide
  15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.