LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
tl;dr
Second framework in the series. Same shape: two runners against AgentGovBench’s 48 scenarios, deterministic, no LLM in the hot path.
| Configuration | Score |
|---|---|
| LangChain/LangGraph (no callback handler) | 13/48 — same vanilla floor as CrewAI native |
| LangChain/LangGraph + ACP (`@governed`) | 40/48 |
Identical native score to CrewAI. The BaseCallbackHandler infrastructure is there — but no handler is attached by default, so every “LangChain emits audit logs” claim is conditional on you having wired it up.
The full per-category breakdown is below. The two patterns we expect to see (and discuss): identity propagation and per-user policy enforcement going from broken to clean, and delegation provenance going from broken to better but not perfect — the same root cause as CrewAI (the @governed wrapper doesn’t yet thread LangGraph’s StateGraph node-to-node context into per-call audit metadata).
Why LangGraph native scores at the floor
LangChain ships with `@tool` and `BaseCallbackHandler` infrastructure. LangGraph ships with `StateGraph`, `create_react_agent`, and supervisor-worker patterns. Neither ships with:

- Per-user identity propagation. Each `StateGraph` node runs in the same process; identity is whatever the caller set. Without explicit threading, the end user’s identity doesn’t reach individual node tool calls.
- Per-tool policy enforcement. Tools listed in `Agent(tools=[...])` or `create_react_agent(tools=...)` are callable. No scope check.
- An audit log by default. LangSmith exists as a separate product; default LangChain has callback handlers but none attached.
- Workspace policy. No concept.
- Fail-mode discipline. Nothing to fail.
LangGraph-specific failure modes:
- `StateGraph` node-to-node transitions are state mutations, not events. A supervisor adding a worker’s output to graph state isn’t an event the governance layer sees. Audit silently misses it unless you add a custom reducer or callback that fires.
- Checkpoint replay loses governance context. When LangGraph resumes from a checkpoint, no governance pipeline re-runs against the replayed state. Policy changes between the original run and the replay are silently ignored — same as CrewAI’s hierarchical handoff problem in a different shape.
- Per-user state is just `state["user"]`. It’s whatever you put there. There’s no validated identity envelope; an upstream node can mutate it. Identity-as-state is identity-as-suggestion.
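The last point can be made concrete with plain Python. This is a sketch, not LangGraph code: the node functions and state keys (`research_node`, `email_node`, `sent_as`) are illustrative, but the mechanism is the same — when identity is just a dict entry threaded through nodes, any node can rewrite it and nothing downstream can tell.

```python
# Sketch of "identity-as-state is identity-as-suggestion": identity lives
# in a mutable dict passed node to node, so any node can overwrite it.

def research_node(state: dict) -> dict:
    # A buggy or compromised node silently escalates its own identity.
    state["user"] = "admin"
    return state

def email_node(state: dict) -> dict:
    # Downstream nodes have no way to detect the mutation.
    state["sent_as"] = state["user"]
    return state

state = {"user": "alice@example.com"}
for node in (research_node, email_node):
    state = node(state)

print(state["sent_as"])  # -> admin, not the end user who started the run
```

A validated identity envelope (a signed token verified per call, as ACP does) closes this: nodes can still write to state, but the governance layer doesn’t trust state for identity.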
Score: 13/48. Same floor as CrewAI. Same conclusion: LangGraph is an orchestration framework, not a governance layer.
What ACP adds
`acp-langchain` is the same one-decorator integration as `acp-crewai`:

```python
from fastapi import FastAPI, Header
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
from acp_langchain import governed, set_context

app = FastAPI()

@tool
@governed("send_email")
def send_email(to: str, subject: str, body: str) -> str:
    return sendmail(to, subject, body)  # sendmail: your mail transport

@app.post("/run")
def run(req: Request, authorization: str = Header(...)):
    # Request: your request-body model with a .prompt field
    # Thread the end user's bearer token into governance context.
    set_context(user_token=authorization.removeprefix("Bearer "))
    agent = create_react_agent(model, tools=[send_email])  # model: your chat model
    return agent.invoke({"messages": [("user", req.prompt)]})
```
Stack `@governed` under `@tool`. Same `/govern/tool-use` endpoint as CrewAI, Claude Code, and direct ACP calls.
Per-category breakdown
| Category | Native | + ACP | Note |
|---|---|---|---|
| Audit completeness | 1/6 | 6/6 | Every call structured-logged with attribution. |
| Cross-tenant isolation | 4/6 | 4/6 | Two declined (single-tenant deployment mode). |
| Delegation provenance | 0/6 | 2/6 | StateGraph node transitions: same gap as CrewAI handoffs. |
| Fail-mode discipline | 3/6 | 6/6 | |
| Identity propagation | 0/6 | 6/6 | End-user JWT verified per call. |
| Per-user policy enforcement | 1/6 | 6/6 | Allow/deny/rate per identity. |
| Rate-limit cascade | 3/6 | 6/6 | Fan-out aggregated per user. |
| Scope inheritance | 1/6 | 4/6 | Same root cause as delegation_provenance. |
| Total | 13/48 | 40/48 | |
What this means for your LangGraph deployment
Same playbook as CrewAI:
- You’re at vanilla unless callbacks are wired. “We have LangChain logging” usually means a print statement somewhere. That’s not audit. Real audit needs structured records, attribution, and trace IDs — none of which are on by default.
- `acp-langchain` is one decorator plus one `set_context()` call. Install guide here.
- Reproduce the numbers with `python -m benchmark.cli run --runner langgraph_native` and `--runner langgraph_acp` from the agentgovbench repo.
What’s next
Tomorrow: Claude Code scorecard. Hook pattern (different from the decorator pattern in CrewAI/LangChain). Same score story, distinct failure modes — including `--dangerously-skip-permissions`, the documented escape hatch.
Receipts:
- agentgovbench repo
- langgraph_native runner
- langgraph_acp runner
- langgraph-native-v0.1.json results
- langgraph-acp-v0.1.json results
- /integrations/langgraph install guide
- Yesterday’s CrewAI post
- Methodology post
- 1. How we think about testing AI agent governance
- 2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
- 3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
- 4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48. · you are here
- 5. LangGraph's StateGraph checkpoints don't replay through governance
- 6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
- 7. Claude Code's --dangerously-skip-permissions disables every governance hook
- 8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
- 9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
- 10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
- 11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
- 12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
- 13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
- 14. Reproduce AgentGovBench on your stack — full setup guide
- 15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.