Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
tl;dr
Sixth framework. Same hook pattern as Claude Code. Same score.
| Configuration | Score |
|---|---|
| Codex CLI (no PreToolUse hook) | 13/48 — vanilla floor (sixth framework confirmation) |
| Codex CLI + ACP hook | 43/48 — identical to Claude Code |
The identical score isn’t a coincidence — same pattern shape, same gateway, same scorecard outcome. Worth noting because it confirms the hook pattern produces a consistent ~43/48 ceiling.
But Codex CLI has one meaningful governance differentiator over Claude Code: its auto-approve mode keeps hooks firing. Where Claude Code’s --dangerously-skip-permissions disables every hook entirely (the audit-silence gap we covered), Codex CLI’s auto mode just suppresses the interactive prompt — ACP’s hook still runs, audit still populates.
For teams running unattended coding agents, this matters.
Why Codex CLI native scores at the floor
OpenAI’s Codex CLI is a terminal coding agent — analogous to Claude Code, with similar PreToolUse hook semantics and a similar deployment shape. Without an ACP hook installed:
- Tool calls run with interactive permission prompts (or auto-approve in --auto)
- TTY output captures all tool inputs/outputs (debug, not audit)
- MCP server connections work but carry no end-user identity
- No SIEM-ingestible audit log, no policy enforcement
Score: 13/48 — same as every other framework’s bare default. Six frameworks now confirmed at this floor: CrewAI, LangGraph, Claude Code, OpenAI Agents SDK, Anthropic Agent SDK, Codex CLI.
What ACP adds
Same ~/.acp/govern.mjs hook script, same ~/.codex/config.json registration. Install:
curl -sf https://agenticcontrolplane.com/install.sh | bash
The installer detects Codex CLI and registers the hook for both PreToolUse and PostToolUse events. Every tool call dispatched by Codex CLI (Bash, Edit, Write, MCP, etc.) flows through /govern/tool-use → workspace policy → audit.
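As a rough sketch of the hook shape, here is what a PreToolUse decision might look like. This is illustrative only — the function, field names, and policy shape are assumptions, not the actual ACP govern.mjs implementation or payload schema:

```javascript
// Hypothetical sketch of a PreToolUse governance decision. NOT the real
// ACP govern.mjs — field names and policy shape are assumptions.
function governToolUse(event, policy) {
  // The hook receives the tool name and its input for every dispatched call.
  const { tool } = event;
  // Deny-listed tools are blocked before execution; in the real integration,
  // every decision (allow or deny) would also be written to the audit log.
  if (policy.deniedTools.includes(tool)) {
    return { decision: "deny", reason: `tool ${tool} blocked by workspace policy` };
  }
  return { decision: "allow" };
}

// Example: a workspace policy that blocks raw Bash for this user.
const policy = { deniedTools: ["Bash"] };
console.log(governToolUse({ tool: "Bash", input: "rm -rf /tmp/x" }, policy).decision); // "deny"
console.log(governToolUse({ tool: "Edit", input: {} }, policy).decision);              // "allow"
```

The key property is that the decision happens before the tool runs — the hook sits in the dispatch path, not alongside it.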
Full Codex CLI integration guide →
Per-category breakdown — identical to Claude Code
| Category | Native | + ACP | Same as Claude Code? |
|---|---|---|---|
| Audit completeness | 1/6 | 6/6 | ✓ |
| Cross-tenant isolation | 4/6 | 4/6 | ✓ |
| Delegation provenance | 0/6 | 6/6 | ✓ |
| Fail-mode discipline | 3/6 | 4/6 | ✓ (fail_open_honored declined) |
| Identity propagation | 0/6 | 6/6 | ✓ |
| Per-user policy enforcement | 1/6 | 6/6 | ✓ |
| Rate-limit cascade | 3/6 | 5/6 | ✓ |
| Scope inheritance | 1/6 | 6/6 | ✓ |
| Total | 13/48 | 43/48 | ✓ identical |
Same hook protocol → same payload shape → same scorecard. This is the consistency a benchmark should produce when two integrations share an architectural pattern.
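To make that concrete, both integrations hand the governance endpoint the same event envelope. The field names below are illustrative assumptions, not the documented schema:

```javascript
// Hypothetical event envelope shared by the Claude Code and Codex CLI
// hook integrations; field names are illustrative, not the ACP schema.
const event = {
  source: "codex-cli",       // or "claude-code" — in this sketch, the only field that differs
  hook: "PreToolUse",
  tool: "Write",
  input: { path: "notes.md", content: "draft" },
  user: "alice@example.com", // end-user identity propagated by ACP
};
// The governance endpoint sees the same fields either way, which is why
// the two integrations produce the same scorecard.
console.log(Object.keys(event).join(",")); // "source,hook,tool,input,user"
```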
The one governance differentiator: auto mode
Both Codex CLI and Claude Code have ways to suppress the interactive permission prompt. They differ in what happens to hooks:
| | Claude Code --dangerously-skip-permissions | Codex CLI --auto |
|---|---|---|
| Interactive prompt | Suppressed | Suppressed |
| PreToolUse hook fires | No ❌ | Yes ✓ |
| PostToolUse hook fires | No ❌ | Yes ✓ |
| ACP audit log populates | No | Yes |
| Server-side detection | Hard | Trivial |
This is a meaningful trust difference. In Claude Code, a --dangerously-skip-permissions user is invisible to governance for the duration of that session. In Codex CLI, an --auto user leaves the full audit trail — they just skipped the interactive Y/n.
Practically: if your team runs unattended coding agents (CI tasks, scheduled jobs, batch processing), Codex CLI’s auto mode preserves audit. Claude Code’s escape hatch breaks it.
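If you do run Claude Code unattended, you can monitor for the resulting gap server-side. A hypothetical sketch — it assumes you can join your own session records against ACP audit entries; the function and field names are illustrative, not part of any ACP API:

```javascript
// Hypothetical audit-silence check: flag sessions that produced tool
// activity but wrote no audit entries. Names are illustrative only.
function findAuditSilentSessions(sessions, auditEntries) {
  const audited = new Set(auditEntries.map((e) => e.sessionId));
  return sessions
    .filter((s) => s.toolCallCount > 0 && !audited.has(s.id))
    .map((s) => s.id);
}

const sessions = [
  { id: "s1", toolCallCount: 12 }, // normal interactive session
  { id: "s2", toolCallCount: 30 }, // skip-permissions session: hooks never fired
];
const auditEntries = [{ sessionId: "s1" }];
console.log(findAuditSilentSessions(sessions, auditEntries)); // [ 's2' ]
```

With Codex CLI’s auto mode, this check has nothing to catch — the hook keeps firing, so every active session shows up in the audit log.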
This isn’t a feature claim from us — it’s a documented OpenAI design choice we’re surfacing because it affects governance posture.
What this means for your team
If you’re choosing between Codex CLI and Claude Code for a team deployment with audit requirements:
For interactive use: scores are identical. Pick on UX preference.
For unattended / batch use: Codex CLI is structurally safer because hooks survive auto mode. Claude Code’s --dangerously-skip-permissions is an audit-silence event you have to detect and mitigate (three mitigations here).
For multi-agent / delegation-heavy workflows: both are constrained by the hook pattern’s lack of orchestration visibility, same as decorator-pattern frameworks. Consider the proxy or handler-wrapper patterns instead.
What’s next in the series
- Cursor scorecard — landing today, MCP pattern with a structural gap
- Big reveal post — full scorecard, all seven frameworks, all four patterns
- NIST methodology post — how the 48 scenarios map to AI RMF 1.0
- Reproducibility — running the benchmark on your stack
Receipts:
- codex_native runner
- codex_acp runner
- Results JSON
- /integrations/codex install guide
- Claude Code scorecard — for direct comparison
- Decorator vs proxy vs hook
- 1. How we think about testing AI agent governance
- 2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
- 3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
- 4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
- 5. LangGraph's StateGraph checkpoints don't replay through governance
- 6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
- 7. Claude Code's --dangerously-skip-permissions disables every governance hook
- 8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
- 9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
- 10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
- 11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code. · you are here
- 12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
- 13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
- 14. Reproduce AgentGovBench on your stack — full setup guide
- 15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
- 16. Recommended governance deployment patterns — pick the one that scores highest for your stack
- 17. What our benchmark told us about our own product — six fixes we're shipping