Agentic Control Plane
Benchmark series · Part 5 of 15

LangGraph's StateGraph checkpoints don't replay through governance

David Crowe · 5 min read
langgraph governance stategraph checkpoint agentgovbench

tl;dr

LangGraph’s checkpoint mechanism is one of its best features — pause a graph, resume hours later, replay from a fork point. It’s also a governance blind spot.

When the graph resumes, the @governed wrapper re-runs against the new world (current policy, current rate-limit budget, current PII rules). But the previous tool calls — the ones embedded in the checkpoint state — don’t re-run. They’re frozen in the state.

If policy changed between the original run and the replay (a tool was deauthorized, a user was suspended, a rate-limit was tightened), the replayed graph carries forward decisions that wouldn’t be made today. The audit log shows the original allow; the world has moved on.

This is the second LangGraph failure mode in our benchmark series — distinct from the CrewAI handoff gap, but with a related root cause.

The setup

A typical LangGraph supervisor-worker pattern:

from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    messages: list  # minimal schema for illustration

# supervisor_fn, worker_a_fn, worker_b_fn, routing_fn defined elsewhere
builder = StateGraph(State)
builder.add_node("supervisor", supervisor_fn)
builder.add_node("worker_a", worker_a_fn)
builder.add_node("worker_b", worker_b_fn)
builder.add_edge(START, "supervisor")  # entry point (required to compile)
builder.add_conditional_edges("supervisor", routing_fn)
builder.add_edge("worker_a", END)
builder.add_edge("worker_b", END)

graph = builder.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "alice-session-42"}}

The tools the workers call are decorated with both @tool and @governed("..."). Each call hits ACP. So far, so good.
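The @governed wrapper itself is ACP's; as a mental model only, a self-contained stand-in might look like this (the POLICY dict, AUDIT_LOG list, and decorator body are illustrative sketches, not the acp-langchain API — a real deployment evaluates policy at the ACP gateway and writes durable audit rows):

```python
import functools

# Illustrative stand-ins, not the acp-langchain API.
POLICY = {"db_query": True}   # current per-tool allow/deny
AUDIT_LOG = []                # append-only audit records

def governed(tool_name):
    """Sketch of the governance pattern: check current policy, audit, then call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            allowed = POLICY.get(tool_name, False)
            AUDIT_LOG.append({"tool": tool_name, "allowed": allowed})
            if not allowed:
                raise PermissionError(f"{tool_name} denied by current policy")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@governed("db_query")
def db_query(sql: str) -> str:
    return f"rows for: {sql}"
```

The key property — and the root of the checkpoint problem — is that the check runs only when the wrapped function is actually invoked; a cached result sitting in checkpointed state never passes through the wrapper again.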

Now the user’s request hits the supervisor. It routes to worker_a. worker_a calls db_query (@governed). ACP allows. Audit row written. State checkpointed.

Then the request times out, or the user navigates away, or something crashes.

An hour later, the user resumes:

graph.invoke(None, config=config)  # resume from checkpoint

The checkpoint replays the graph from the saved state. worker_a already ran, so it’s not re-invoked. The graph picks up at the routing decision after worker_a. Now the supervisor decides to invoke worker_b. worker_b calls notify_user (@governed). ACP evaluates against current policy and audits the new call.

But what about worker_a’s db_query from an hour ago? Its result is in the state. Other nodes consume it. None of those consumers go through governance — they’re just reading state. If db_query returned PII that would now be blocked, it’s still in the state, still feeding downstream tools.

The specific failures

In AgentGovBench, this maps to two scenarios that LangGraph + ACP fails:

per_user_policy_enforcement.05_revocation_during_session — A scenario where a user’s permission to call a sensitive tool is revoked partway through a session. After revocation, the tool is correctly denied. But if a previous call returned data that the next call uses, the second call sees data the policy now forbids surfacing. ACP doesn’t see the chain because the second call’s input came from state, not a fresh tool call.

scope_inheritance.06_state_carries_unredacted — Tool A returned PII; the post-output redaction was applied to the audit log entry, not to the in-memory state. State now contains unredacted PII. Tool B (same session, same user, but different policy because it’s send_email) reads from state and proceeds — the pre-tool-use check on tool B sees the input but doesn’t know it traces back to a redacted output.

Both failures share a root cause: LangGraph state is opaque to the governance layer. The governance layer sees individual tool calls; it doesn’t see the data passing between them.
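The blind spot fits in a few lines of plain Python — a simulation of the scenario-05 timeline, not SDK code (the policy and state dicts below stand in for ACP policy and the LangGraph checkpoint):

```python
policy = {"db_query": True, "notify_user": True}
state = {}  # stands in for the checkpointed LangGraph state

# t=10:00 — worker_a calls db_query; governance checks and allows;
# the output is frozen into the checkpoint.
assert policy["db_query"]
state["db_result"] = "alice@example.com"  # PII, now cached in state

# t=11:00 — revocation: db_query is no longer permitted for this user.
policy["db_query"] = False

# t=12:00 — resume. worker_a is not re-invoked, so db_query is never
# re-checked. notify_user's own pre-call check passes: it sees only its
# input, not that the input is the now-forbidden output of db_query.
assert policy["notify_user"]
message = f"Contact on file: {state['db_result']}"  # revoked data flows on
```

Nothing in the 12:00 check fails, yet the message carries data that current policy forbids — exactly what scenario 05 flags.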

Why this matters

Three audiences should care:

1. Compliance teams. EU AI Act Article 13 requires that high-risk systems operate in a way that is sufficiently transparent to those overseeing them. A system whose audit log shows “tool A allowed at 10:00, policy revoked at 11:00, tool B allowed at 12:00 with input that was the forbidden output of tool A” doesn’t read as transparent — it reads as obscure.

2. Security incident responders. When an incident hits, the timeline matters. Replayed state masks the timeline — calls made from a checkpoint resumed today appear in the audit log as today’s calls, even though the data feeding them comes from yesterday’s pre-revocation state.

3. Anyone running long-lived LangGraph sessions. Customer support agents, multi-turn research bots, anything that uses MemorySaver or external checkpointers (PostgresSaver, RedisSaver). The longer the session, the more divergence between when data was governed and when it’s used.

What to do today

Three workarounds, in order of how invasive they are:

1. Re-evaluate state on resume. Before graph.invoke(None, config=config), walk the checkpoint state and re-run the relevant @governed checks on cached tool outputs. The SDK doesn’t do this automatically — you’d write it as a wrapper.

2. Shorten checkpoint lifetimes. Configure your checkpointer to expire state aggressively (hours, not days). Long-lived state is the risk; short-lived state limits exposure.

3. Move PII redaction upstream. If the post-output redaction also rewrote the state (not just the audit), state would be safe to replay. This requires either an SDK change to acp-langchain (have @governed rewrite the value passed back to LangGraph) or a custom callback.
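The first workaround can be sketched as a pre-resume sweep over the checkpoint. `graph.get_state(config).values` is real LangGraph API; the `tool_outputs` state key and the `recheck` callable are illustrative conventions you would define yourself:

```python
def find_stale_outputs(state_values, recheck):
    """Return cached tool outputs that current policy would no longer allow.

    state_values: the mapping from graph.get_state(config).values.
    recheck: callable (tool_name, output) -> bool against *current* policy.
    Both the 'tool_outputs' key and recheck are illustrative conventions.
    """
    return [
        entry
        for entry in state_values.get("tool_outputs", [])
        if not recheck(entry["tool"], entry["output"])
    ]

# Sketch of use before resuming:
#   stale = find_stale_outputs(graph.get_state(config).values, recheck)
#   if stale:
#       ...abort the resume, redact the state, or annotate the audit log
#   graph.invoke(None, config=config)
```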

What we’re shipping

acp-langchain@0.2.0 will add a governed_state_resume(config) helper that walks a checkpoint state, identifies tool outputs by trace ID, and re-evaluates each through the gateway against current policy. Stale tool outputs that would no longer be allowed are surfaced as a list — your application decides whether to abort the resume, redact the state, or proceed with an audit annotation.

This is the right shape because it puts the call on the application: “your state was governed when written; on resume, here’s what changed; you decide.”

Tracking issue and design doc: [TODO]. Expected in acp-langchain@0.2.0.

What’s next in the series

Tomorrow: Claude Code scorecard. Same vanilla-floor → ACP-enforced jump, plus the documented --dangerously-skip-permissions gap.



More in AgentGovBench
  1. How we think about testing AI agent governance
  2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
  3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
  4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
  5. LangGraph's StateGraph checkpoints don't replay through governance · you are here
  6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
  7. Claude Code's --dangerously-skip-permissions disables every governance hook
  8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
  9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
  10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
  11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
  12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
  13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
  14. Reproduce AgentGovBench on your stack — full setup guide
  15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
