Agentic Control Plane
Benchmark series · Part 11 of 17

Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.

David Crowe · 3 min read

tl;dr

Sixth framework. Same hook pattern as Claude Code. Same score.

Configuration                    Score
Codex CLI (no PreToolUse hook)   13/48 — vanilla floor (sixth framework confirmation)
Codex CLI + ACP hook             43/48 — identical to Claude Code

The identical score isn’t a coincidence: same pattern shape, same gateway, same scorecard outcome. It’s worth noting because it confirms that the hook pattern produces a consistent ~43/48 ceiling.

But Codex CLI has one meaningful governance differentiator over Claude Code: its auto-approve mode keeps hooks firing. Where Claude Code’s --dangerously-skip-permissions disables every hook entirely (the audit-silence gap we covered), Codex CLI’s auto mode just suppresses the interactive prompt — ACP’s hook still runs, audit still populates.

For teams running unattended coding agents, this matters.

Why Codex CLI native scores at the floor

OpenAI’s Codex CLI is a terminal coding agent analogous to Claude Code: similar PreToolUse hook semantics, similar deployment shape. Without an ACP hook installed:

  • Tool calls run with interactive permission prompts (or auto-approve in --auto)
  • TTY output captures all tool inputs/outputs (debug, not audit)
  • MCP server connections work but carry no end-user identity
  • No SIEM-ingestible audit log, no policy enforcement

Score: 13/48 — same as every other framework’s bare default. Seven frameworks now confirmed at this floor: CrewAI, LangGraph, Claude Code, OpenAI Agents SDK, Anthropic Agent SDK, Cursor, Codex CLI.

What ACP adds

Same ~/.acp/govern.mjs hook script, same ~/.codex/config.json registration. Install:

curl -sf https://agenticcontrolplane.com/install.sh | bash

The installer detects Codex CLI and registers the hook for both PreToolUse and PostToolUse events. Every tool call dispatched by Codex CLI (Bash, Edit, Write, MCP, etc.) flows through /govern/tool-use → workspace policy → audit.
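For reference, the registration the installer writes might look roughly like the snippet below. This is a hypothetical shape for illustration only; Codex CLI’s actual config schema may differ, so treat the integration guide as authoritative.

```json
{
  "hooks": {
    "PreToolUse": "~/.acp/govern.mjs",
    "PostToolUse": "~/.acp/govern.mjs"
  }
}
```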

Full Codex CLI integration guide →
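The hook script itself follows the same stdin JSON protocol covered in the Claude Code post. Below is a minimal sketch of that shape, assuming a Claude Code-style protocol (event JSON on stdin, decision on stdout, exit code 0 to allow and 2 to block); the field names (`tool_name`, `tool_input`) and the verdict shape are hypothetical stand-ins, not the real ACP wire format.

```javascript
#!/usr/bin/env node
// Hypothetical sketch of an ACP-style govern.mjs hook.
// Assumes a Claude Code-style protocol: the agent writes a JSON event
// to stdin; the hook replies on stdout and signals allow/block via its
// exit code. Field names and the verdict shape are illustrative only.

const ACP_URL = process.env.ACP_URL ?? "https://localhost:8443";

// Pure decision helper, kept separate so the policy shape is testable.
function decide(verdict) {
  return verdict.allow
    ? { permissionDecision: "allow" }
    : { permissionDecision: "deny", reason: verdict.reason ?? "blocked by policy" };
}

async function main() {
  // Read the full PreToolUse/PostToolUse event from stdin.
  const chunks = [];
  for await (const chunk of process.stdin) chunks.push(chunk);
  const event = JSON.parse(Buffer.concat(chunks).toString("utf8"));

  // Forward the tool call to the governance gateway for policy + audit.
  const res = await fetch(`${ACP_URL}/govern/tool-use`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ tool: event.tool_name, input: event.tool_input }),
  });
  const verdict = await res.json();

  process.stdout.write(JSON.stringify(decide(verdict)));
  process.exit(verdict.allow ? 0 : 2);
}

// Only run when invoked as the hook itself, so decide() stays importable.
if (process.argv[1]?.endsWith("govern.mjs")) main();
```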

Per-category breakdown — identical to Claude Code

Category                      Native   + ACP    Same as Claude Code?
Audit completeness            1/6      6/6
Cross-tenant isolation        4/6      4/6
Delegation provenance         0/6      6/6
Fail-mode discipline          3/6      4/6      ✓ (fail_open_honored declined)
Identity propagation          0/6      6/6
Per-user policy enforcement   1/6      6/6
Rate-limit cascade            3/6      5/6
Scope inheritance             1/6      6/6
Total                         13/48    43/48    ✓ identical

Same hook protocol → same payload shape → same scorecard. This is the consistency a benchmark should produce when two integrations share an architectural pattern.

The one governance differentiator: auto mode

Both Codex CLI and Claude Code have ways to suppress the interactive permission prompt. They differ in what happens to hooks:

                           Claude Code                        Codex CLI
                           --dangerously-skip-permissions     --auto
Interactive prompt         Suppressed                         Suppressed
PreToolUse hook fires      No                                 Yes
PostToolUse hook fires     No                                 Yes
ACP audit log populates    No                                 Yes
Server-side detection      Hard                               Trivial
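“Trivial” here means an audit backend can spot the gap mechanically: a session that shows up in the event stream but never emits a hook event. A minimal sketch of such a check, with a hypothetical event shape (`sessionId`, `type`) rather than the real ACP schema:

```javascript
// Flag sessions whose hooks went silent: the session appears in the
// audit stream but never produced a PreToolUse event, which is what a
// Claude Code --dangerously-skip-permissions run looks like server-side.
// The event shape { sessionId, type } is illustrative, not ACP's schema.
function silentSessions(events) {
  const hooked = new Map(); // sessionId -> saw at least one hook event
  for (const e of events) {
    const prev = hooked.get(e.sessionId) ?? false;
    hooked.set(e.sessionId, prev || e.type === "PreToolUse");
  }
  return [...hooked].filter(([, ok]) => !ok).map(([id]) => id);
}
```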

This is a meaningful trust difference. In Claude Code, a --dangerously-skip-permissions user is invisible to governance for the duration of that session. In Codex CLI, an --auto user leaves the full audit trail; they just skipped the interactive Y/n.

Practically: if your team runs unattended coding agents (CI tasks, scheduled jobs, batch processing), Codex CLI’s auto mode preserves audit. Claude Code’s escape hatch breaks it.

This isn’t a feature claim from us — it’s a documented OpenAI design choice we’re surfacing because it affects governance posture.

What this means for your team

If you’re choosing between Codex CLI and Claude Code for a team deployment with audit requirements:

For interactive use: scores are identical. Pick on UX preference.

For unattended / batch use: Codex CLI is structurally safer because hooks survive auto mode. Claude Code’s --dangerously-skip-permissions is an audit-silence event you have to detect and mitigate (three mitigations here).

For multi-agent / delegation-heavy workflows: both are constrained by the hook pattern’s lack of orchestration visibility, same as decorator-pattern frameworks. Consider the proxy or handler-wrapper patterns instead.

What’s next in the series

  • Cursor scorecard — landing today, MCP pattern with a structural gap
  • Big reveal post — full scorecard, all seven frameworks, all four patterns
  • NIST methodology post — how the 48 scenarios map to AI RMF 1.0
  • Reproducibility — running the benchmark on your stack


More in AgentGovBench
  1. How we think about testing AI agent governance
  2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
  3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
  4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
  5. LangGraph's StateGraph checkpoints don't replay through governance
  6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
  7. Claude Code's --dangerously-skip-permissions disables every governance hook
  8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
  9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
  10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
  11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code. · you are here
  12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
  13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
  14. Reproduce AgentGovBench on your stack — full setup guide
  15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
  16. Recommended governance deployment patterns — pick the one that scores highest for your stack
  17. What our benchmark told us about our own product — six fixes we're shipping