# OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48.
## tl;dr
Fourth framework in the series, first proxy-pattern integration.
| Configuration | Score |
|---|---|
| OpenAI Agents SDK (default OpenAI client) | 13/48 — vanilla floor (fourth framework to confirm it) |
| OpenAI Agents SDK + ACP (proxy) | 45/48 — closest to pure ACP |
The proxy pattern wins on completeness because it sits at the natural request-serialization boundary. The SDK’s AsyncOpenAI client serializes the entire request (system prompt, tools, handoff metadata, model config) into one HTTP call to the chat-completions endpoint. Point that at ACP and the gateway sees everything.
Compare that to the decorator pattern (CrewAI and LangGraph at 40/48), which sees individual tool functions but misses framework orchestration, or the hook pattern (Claude Code at 43/48), which sees only whatever payload the host passes to the hook.
## Why OpenAI Agents SDK native scores at the floor
The OpenAI Agents SDK ships with `Agent`, `Runner`, handoffs, Guardrails, and `@function_tool`-decorated tools. It uses one `OPENAI_API_KEY` per process. Tool calls happen entirely inside the SDK; nothing reaches an external governance layer unless you wire it up yourself.
Specifically missing from the OOTB experience:
- Per-end-user identity propagation — every call attributes to the deployment’s API key
- Workspace-policy concept — Guardrails are SDK-internal validators, not centralized policy
- Per-tool scope checking outside Guardrails
- SIEM-ingestible audit log — OpenAI’s admin audit is org-level, not per-call
- Cross-agent delegation provenance separate from the tracing product
- Fail-mode discipline for the governance layer (because there is no governance layer)
Score: 13/48 (vanilla floor). Same as CrewAI, LangGraph, Claude Code, Anthropic Agent SDK, Cursor, and Codex CLI without their respective ACP adapters. The pattern is consistent: no framework gives you governance for free.
## What ACP adds — proxy pattern
`acp-openai-agents` doesn’t exist. There’s no SDK to install. Just point the SDK’s `AsyncOpenAI` client at ACP:
```python
import os

from agents import Agent, Runner, set_default_openai_client
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.agenticcontrolplane.com/v1",
    api_key=os.environ["ACP_API_KEY"],
)
set_default_openai_client(client)
```
That’s the integration. Every LLM call (and every tool call the LLM emits) flows through ACP’s OpenAI-compatible proxy. Per-agent attribution comes from the `x-acp-agent-name` header, set on the client at agent-construction time.
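The snippet above configures one client for the whole process; per-agent attribution needs a distinct header per agent. A minimal sketch of building per-agent client kwargs — the `x-acp-agent-name` header name comes from this post, while the `acp_client_kwargs` helper and the agent name are hypothetical (`default_headers` is a standard openai-python client option):

```python
import os

def acp_client_kwargs(agent_name: str) -> dict:
    """Kwargs for AsyncOpenAI(...), one client per agent.

    The x-acp-agent-name header is what ACP uses to attribute
    calls to an agent; everything else matches the snippet above.
    """
    return {
        "base_url": "https://api.agenticcontrolplane.com/v1",
        "api_key": os.environ.get("ACP_API_KEY", "sk-placeholder"),
        "default_headers": {"x-acp-agent-name": agent_name},
    }

# e.g. AsyncOpenAI(**acp_client_kwargs("triage-agent"))
kwargs = acp_client_kwargs("triage-agent")
```

One `AsyncOpenAI` client per agent, each carrying its own header, keeps attribution intact when several agents share a process.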
## Per-category breakdown
| Category | Native | + ACP | Note |
|---|---|---|---|
| Audit completeness | 1/6 | 6/6 | Every API call structured-logged at the proxy. |
| Cross-tenant isolation | 4/6 | 4/6 | Two declined (single-tenant deployment mode). |
| Delegation provenance | 0/6 | 6/6 | Best-class — proxy sees handoff metadata in the request. |
| Fail-mode discipline | 3/6 | 6/6 | One declined (`fail_open_honored` — proxy fails closed by design). |
| Identity propagation | 0/6 | 6/6 | `x-acp-agent-name` + `ACP_API_KEY` identify the caller per agent. |
| Per-user policy enforcement | 1/6 | 6/6 | Allow/deny/redact per call. |
| Rate-limit cascade | 3/6 | 5/6 | Per-user budget enforced. One scenario noisy. |
| Scope inheritance | 1/6 | 6/6 | Best-class — proxy sees full request envelope. |
| Total | 13/48 | 45/48 | |
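To make the audit-completeness row concrete, here is the kind of per-call record a proxy in this position can emit. Every field is recoverable from the request/response pair alone; the field names and values are illustrative, not ACP’s actual log schema:

```python
import json

# Illustrative per-call audit record (fields assumed, not ACP's schema).
record = {
    "timestamp": "2025-01-15T10:32:04Z",
    "agent": "triage-agent",           # from the x-acp-agent-name header
    "model": "gpt-4o",
    "tool_calls": ["issue_refund"],    # tools the model invoked this turn
    "policy_decision": "allow",        # allow / deny / redact
    "tokens": {"prompt": 412, "completion": 96},
}
print(json.dumps(record, indent=2))
```

A record in this shape is what makes the log SIEM-ingestible: one structured event per call, keyed by agent identity.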
## Why proxy beats decorator
The proxy sits at the natural request boundary. Every HTTP call the SDK was about to make to OpenAI is intercepted after the SDK has serialized everything it knows (system prompt, all messages, all tools, handoff context, model config) into one JSON payload. The proxy gets the complete picture.
The decorator pattern (CrewAI, LangGraph) intercepts at the individual tool function level. That’s smaller scope — the wrapper sees the tool name and inputs but not the broader orchestration.
This isn’t a knock on the decorator pattern; it’s the right shape when the framework is your tool dispatcher. But for OpenAI Agents SDK where everything funnels through one API call, the proxy gets natural completeness.
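The visibility difference can be sketched side by side: the decorator wrapper sees roughly a tool name and its arguments, while the proxy sees the whole serialized request. The payload shape below follows the OpenAI chat-completions API; the contents are made up:

```python
# What a decorator-pattern wrapper sees: one tool invocation.
decorator_view = {"tool": "issue_refund", "args": {"order_id": "1234"}}

# What the proxy sees: the full request envelope the SDK serialized.
proxy_view = {
    "model": "gpt-4o",
    "messages": [
        {"role": "system", "content": "You are a support triage agent."},
        {"role": "user", "content": "Refund order 1234."},
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "issue_refund",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
            },
        },
    }],
}

# The proxy's view strictly contains the decorator's signal...
tool_names = [t["function"]["name"] for t in proxy_view["tools"]]
assert decorator_view["tool"] in tool_names
# ...plus the system prompt, full message history, and model config
# that a per-tool wrapper never sees.
```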
## What this means for OpenAI Agents SDK deployments
- You’re at vanilla unless you’ve changed the `base_url`. The default `OPENAI_API_KEY`-only setup gives security teams no signal.
- The integration is a 4-line change. No SDK to install, no decorator to add. Just `set_default_openai_client(AsyncOpenAI(base_url=...))`.
- The score is meaningfully higher than decorator-pattern frameworks. If you’re choosing how to build a multi-agent system and care about governance completeness, this matters.
- Reproduce the numbers with `python -m benchmark.cli run --runner openai_agents_native` and `--runner openai_agents_acp` from the agentgovbench repo.
## What’s next in the series
- Anthropic Agent SDK scorecard — TS decorator wrapper that scored highest of all at 46/48
- Codex CLI scorecard — hook pattern, identical to Claude Code at 43/48
- Big reveal post — full scorecard, all six frameworks, all three patterns, side by side
## Receipts
- openai_agents_native runner
- openai_agents_acp runner
- Results JSON
- /integrations/openai-agents-sdk install guide
- Decorator vs proxy vs hook
- Methodology post
- 1. How we think about testing AI agent governance
- 2. CrewAI scores 13/48 on AgentGovBench. With ACP, 40/48.
- 3. CrewAI's task handoffs lose the audit trail — here's the gap and the fix
- 4. LangGraph scores 13/48 on AgentGovBench. With ACP, 40/48.
- 5. LangGraph's StateGraph checkpoints don't replay through governance
- 6. Claude Code scores 13/48 on AgentGovBench. With ACP, 43/48.
- 7. Claude Code's --dangerously-skip-permissions disables every governance hook
- 8. Decorator, proxy, hook — three patterns for agent governance, three different scorecards
- 9. OpenAI Agents SDK scores 13/48 on AgentGovBench. With ACP, 45/48. · you are here
- 10. Anthropic Agent SDK scores 13/48 on AgentGovBench. With ACP, 46/48 — best of any framework.
- 11. Codex CLI scores 13/48 on AgentGovBench. With ACP, 43/48 — same as Claude Code.
- 12. Full scorecard: seven frameworks, 48 scenarios, one open benchmark
- 13. How AgentGovBench's 48 scenarios map to NIST AI RMF 1.0
- 14. Reproduce AgentGovBench on your stack — full setup guide
- 15. Cursor scores 13/48 on AgentGovBench. With ACP MCP server, 37/48 — and that gap is structural.
- 16. Recommended governance deployment patterns — pick the one that scores highest for your stack
- 17. What our benchmark told us about our own product — six fixes we're shipping