Agentic Control Plane
Benchmark series · Part 9 of 17

OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.

David Crowe · 4 min read
openai openai-agents-sdk benchmark governance proxy agentgovbench

tl;dr

Fourth framework in the series, first proxy-pattern integration.

Configuration Score
OpenAI Agents SDK (default OpenAI client) 13/48 — vanilla floor (fourth framework confirmation)
OpenAI Agents SDK + ACP (proxy) 45/48 — closest to pure ACP

The proxy pattern wins on completeness because it sits at the natural request-serialization boundary. The SDK’s AsyncOpenAI client serializes the entire request (system prompt, tools, handoff metadata, model config) into one HTTP call to the chat-completions endpoint. Point that at ACP and the gateway sees everything.

Compare to the decorator pattern (CrewAI, LangGraph at 40/48), which only sees individual tool functions but misses framework orchestration; or the hook pattern (Claude Code at 43/48), which sees a hook payload chosen by the host.

Why OpenAI Agents SDK native scores at the floor

The OpenAI Agents SDK ships with Agent, Runner, handoffs, Guardrails, and @function_tool-decorated tools. It uses one OPENAI_API_KEY per process. Tool calls happen entirely inside the SDK; nothing reaches an external governance layer unless you wire it.

Specifically missing from the OOTB experience:

  • Per-end-user identity propagation — every call attributes to the deployment’s API key
  • Workspace-policy concept — Guardrails are SDK-internal validators, not centralized policy
  • Per-tool scope checking outside Guardrails
  • SIEM-ingestible audit log — OpenAI’s admin audit is org-level, not per-call
  • Cross-agent delegation provenance separate from the tracing product
  • Fail-mode discipline for the governance layer (because there is no governance layer)

Score: 13/48 (vanilla floor). Same as CrewAI, LangGraph, Claude Code, Anthropic Agent SDK, Cursor, and Codex CLI without their respective ACP adapters. The pattern is consistent: no framework gives you governance for free.
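To make the audit-log gap concrete, here is what a SIEM-ingestible, per-call record emitted at a governance proxy could look like. The field names below are illustrative only — this is not ACP's actual schema:

```python
import json

# Hypothetical shape of a per-call audit record emitted at a governance
# proxy. Field names are illustrative, not ACP's actual schema.
record = {
    "timestamp": "2025-01-15T12:00:00Z",
    "agent": "billing-agent",            # per-agent attribution
    "end_user": "user-4821",             # per-end-user identity, not just an API key
    "endpoint": "/v1/chat/completions",
    "tool_calls": [{"name": "lookup_invoice", "decision": "allow"}],
    "policy": {"workspace": "acme-prod", "verdict": "allow"},
}

# One JSON object per line is the shape SIEM pipelines typically ingest.
line = json.dumps(record)
print(line)
```

Nothing in the OOTB SDK produces a record like this; every field above is exactly what the vanilla setup cannot attribute.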

What ACP adds — proxy pattern

acp-openai-agents doesn’t exist. There’s no SDK to install. Just point the SDK’s AsyncOpenAI client at ACP:

import os

from agents import Agent, Runner, set_default_openai_client
from openai import AsyncOpenAI

# Route every request the SDK makes through ACP's OpenAI-compatible proxy.
client = AsyncOpenAI(
    base_url="https://api.agenticcontrolplane.com/v1",
    api_key=os.environ["ACP_API_KEY"],  # ACP credential, not OPENAI_API_KEY
)
set_default_openai_client(client)

That’s the integration. Every LLM call (and every tool call the LLM emits) flows through ACP’s OpenAI-compatible proxy. Per-agent attribution comes from an x-acp-agent-name header set on the client at agent-construction time.
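A minimal sketch of that per-agent attribution, assuming the OpenAI client's standard default_headers parameter (the helper function is ours, not part of either SDK):

```python
# Hypothetical helper: build AsyncOpenAI constructor kwargs so every request
# from a given agent carries its identity to the ACP proxy via a header.
def acp_client_kwargs(agent_name: str, acp_api_key: str) -> dict:
    return {
        "base_url": "https://api.agenticcontrolplane.com/v1",
        "api_key": acp_api_key,
        "default_headers": {"x-acp-agent-name": agent_name},
    }

kwargs = acp_client_kwargs("billing-agent", "acp-example-key")
print(kwargs["default_headers"]["x-acp-agent-name"])  # billing-agent
```

Pass these kwargs to AsyncOpenAI(**kwargs) when constructing each agent's client, and the proxy can attribute every call to the named agent.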

Full integration guide →

Per-category breakdown

Category                     Native  + ACP   Note
Audit completeness           1/6     6/6     Every API call is structured-logged at the proxy.
Cross-tenant isolation       4/6     4/6     Two declined (single-tenant deployment mode).
Delegation provenance        0/6     6/6     Best-in-class — proxy sees handoff metadata in the request.
Fail-mode discipline         3/6     6/6     One declined (fail_open_honored — proxy fails closed by design).
Identity propagation         0/6     6/6     x-acp-agent-name + ACP_API_KEY identify the caller per agent.
Per-user policy enforcement  1/6     6/6     Allow/deny/redact per call.
Rate-limit cascade           3/6     5/6     Per-user budget enforced. One scenario was noisy.
Scope inheritance            1/6     6/6     Best-in-class — proxy sees the full request envelope.
Total                        13/48   45/48

Why proxy beats decorator

The proxy sits at the natural request boundary. Every HTTP call the SDK was about to make to OpenAI is intercepted after the SDK has serialized everything it knows (system prompt, all messages, all tools, handoff context, model config) into one JSON payload. The proxy gets the complete picture.

The decorator pattern (CrewAI, LangGraph) intercepts at the individual tool function level. That’s smaller scope — the wrapper sees the tool name and inputs but not the broader orchestration.

This isn’t a knock on the decorator pattern; it’s the right shape when the framework is your tool dispatcher. But for OpenAI Agents SDK where everything funnels through one API call, the proxy gets natural completeness.
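To make "complete picture" concrete, here is a simplified chat-completions request body of the kind the SDK serializes into one HTTP call. The contents are illustrative; the point is which fields cross the proxy boundary:

```python
import json

# Simplified chat-completions request body. Every field below is visible to a
# proxy at base_url in a single call; a tool-level decorator would see only
# one tool's name and inputs at invocation time.
request_body = {
    "model": "gpt-4o",
    "messages": [
        {"role": "system", "content": "You are the billing agent."},   # system prompt
        {"role": "user", "content": "Refund order 1042."},
    ],
    "tools": [
        {"type": "function",
         "function": {"name": "refund_order",
                      "parameters": {"type": "object"}}},              # full tool schema
    ],
    "metadata": {"handoff_from": "triage-agent"},  # illustrative handoff context
    "temperature": 0.2,                            # model config
}

payload = json.dumps(request_body)  # what actually crosses the wire
```

One policy decision at this boundary can reason over the system prompt, the full tool list, and the handoff context at once — which is exactly why the delegation-provenance and scope-inheritance categories score 6/6 here.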

What this means for OpenAI Agents SDK deployments

  1. You’re at vanilla unless you’ve changed the base_url. The default OPENAI_API_KEY-only setup gives security teams no signal.
  2. The integration is a 4-line change. No SDK to install, no decorator to add. Just set_default_openai_client(AsyncOpenAI(base_url=...)).
  3. The score is meaningfully higher than decorator-pattern frameworks. If you’re choosing how to build a multi-agent system and care about governance completeness, this matters.
  4. Reproduce the numbers with python -m benchmark.cli run --runner openai_agents_native and --runner openai_agents_acp from the agentgovbench repo.

What’s next in the series

  • Anthropic Agent SDK scorecard — TS decorator wrapper that scored highest of all at 46/48
  • Codex CLI scorecard — hook pattern, identical to Claude Code at 43/48
  • Big reveal post — full scorecard, all seven frameworks, all three patterns, side by side


More in AgentGovBench
  1. How we think about testing AI agent governance
  2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
  3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
  4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
  5. LangGraph's StateGraph checkpoints don't replay through governance
  6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
  7. Claude Code's --dangerously-skip-permissions disables every governance hook
  8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
  9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48. · you are here
  10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
  11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
  12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
  13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
  14. Reproduce AgentGovBench on your stack — full setup guide
  15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
  16. Recommended governance deployment patterns — pick the one that scores highest for your stack
  17. What our benchmark told us about our own product — six fixes we're shipping