Brian Story

This week, a new benchmark proved that the AI race isn't as close as everyone thought. The creator of the most popular open-source AI agent joined OpenAI. And a security tool broke every agent it tested in under 30 seconds.

None of these are incremental. Here's what happened, what it means, and what you should do about it.

▸ The Benchmark Scam

SWE-bench has been the gold standard for measuring whether an AI can write code. Every lab references it. Every model announcement leads with it. It became the number everyone optimizes for.

That's exactly the problem.

How Benchmark Gaming Works

SWE-bench's questions are public. If you're a lab that wants impressive numbers, you don't need to build a better model. You train your model on the test answers. Overfit on those specific problems. Your leaderboard score goes up. Your model doesn't actually get smarter.

This isn't hypothetical. IBM researchers have noted that models may have seen benchmark data in training. Top models score 75-80% on SWE-bench Verified but only 55-58% on SWE-rebench (monthly fresh tasks) and 15-23% on SWE-bench Pro with private codebases.

SWE-rebench: The Cheat-Proof Test

SWE-rebench fixes this by pulling fresh GitHub tasks from recent repositories every month. Same difficulty. Same format. Problems no model has ever seen in training.

The January 2026 results tell the real story.

On the original SWE-bench (public questions):

Model	Score
Claude Opus 4.6	80.8%
MiniMax M2.5	80.2%

On paper, basically identical. The narrative everywhere was that Chinese open-source had caught up to frontier models.

On SWE-rebench (fresh, unseen problems):

Model	Score
Claude Code (Opus 4.6)	52.9%
Claude Opus 4.6	51.7%
GPT-5.2 (xhigh)	51.7%
GPT-5.2 (medium)	51.0%
Sonnet 4.5	47.1%
Gemini 3 Pro Preview	46.7%
Codex	44.0%
Kimi K2 Thinking	43.8%
GLM-5	42.1%
Qwen3-Coder-Next	40.0%
MiniMax M2.5	39.6%

MiniMax went from "matching Opus" to a 12-point gap. The entire top tier is Anthropic, OpenAI, and Google. Chinese models cluster 8-13 points behind.
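
To sanity-check those gaps, here is the arithmetic on scores copied from the two tables above. Nothing in it is new data; the function and variable names are mine, and only the rows needed for the comparison are included.

```python
# Scores (percent of tasks solved) copied from the two tables above.
public = {"Claude Opus 4.6": 80.8, "MiniMax M2.5": 80.2}
rebench = {
    "Claude Code (Opus 4.6)": 52.9,
    "Claude Opus 4.6": 51.7,
    "Kimi K2 Thinking": 43.8,
    "MiniMax M2.5": 39.6,
}


def points_behind(scores, leader, model):
    """Percentage-point gap between a leader and another model."""
    return round(scores[leader] - scores[model], 1)


# Gap between Opus and MiniMax on public vs. fresh problems.
public_gap = points_behind(public, "Claude Opus 4.6", "MiniMax M2.5")
fresh_gap = points_behind(rebench, "Claude Opus 4.6", "MiniMax M2.5")

# How far each model falls when the problems are ones it cannot have seen.
drop = {name: round(public[name] - rebench[name], 1) for name in public}

print(public_gap, fresh_gap, drop)
```

The gap between the two models is under a point on the public set and about twelve points on the fresh set, and MiniMax's own score falls further than Opus's does.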

These models were never comparable. MiniMax was trained to ace that specific test.

What This Means for You

If you're evaluating AI models for your stack, stop using benchmark scores as your primary signal. Here's what to do instead:

Build your own eval set. Take 20-50 real problems from your codebase. Things your team actually solved last quarter. Run every candidate model against them. This is your private SWE-rebench.
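
A private eval set does not need much machinery. The sketch below is one minimal shape for it, not a standard harness: `EvalTask`, `run_eval`, and `pass_rate` are names I made up, and `model` stands for whatever function wraps your actual API call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    task_id: str
    prompt: str
    check: Callable[[str], bool]  # True if the model's output is acceptable


def run_eval(model: Callable[[str], str], tasks: list) -> dict:
    """Run every task through one model and record pass/fail per task."""
    results = {}
    for task in tasks:
        try:
            results[task.task_id] = bool(task.check(model(task.prompt)))
        except Exception:
            # A crash or timeout counts as a failure, not a skipped task.
            results[task.task_id] = False
    return results


def pass_rate(results: dict) -> float:
    return sum(results.values()) / len(results) if results else 0.0


# Toy tasks and a stub model, only to show the shape of the data.
tasks = [
    EvalTask("add", "Return 2+2 as a number.", lambda out: out.strip() == "4"),
    EvalTask("greet", "Say hello.", lambda out: "hello" in out.lower()),
]
stub_results = run_eval(lambda prompt: "4", tasks)
print(stub_results, pass_rate(stub_results))
```

In practice each `check` would run your real test suite against the model's patch. Keeping the per-task results, rather than only the pass rate, lets you compare models problem by problem later.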

Test on fresh problems. Models that score well on problems from 2024 may be contaminated with that data. Use recent GitHub issues or problems you write yourself.
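
One way to enforce freshness is to filter your task pool by creation date against the model's training cutoff. This is a sketch with made-up task records and a hypothetical cutoff date; substitute the cutoff your vendor actually publishes.

```python
from datetime import date


def fresh_only(tasks, training_cutoff):
    """Keep tasks created strictly after the model's training cutoff."""
    return [task for task in tasks if task["created"] > training_cutoff]


# Illustrative records; in practice these come from your issue tracker.
task_pool = [
    {"id": "issue-101", "created": date(2024, 3, 14)},
    {"id": "issue-245", "created": date(2025, 11, 2)},
    {"id": "issue-260", "created": date(2026, 1, 9)},
]

assumed_cutoff = date(2025, 6, 30)  # hypothetical, not any vendor's real cutoff
fresh = fresh_only(task_pool, assumed_cutoff)
print([task["id"] for task in fresh])
```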

Cost-adjust your comparison. MiniMax M2.5 costs roughly $0.09 per SWE-bench problem versus $3.50 for Claude Code. Sounds like a slam dunk for MiniMax. But if MiniMax solves 39.6% of fresh problems and Claude solves 52.9%, you need roughly 1.3x as many MiniMax attempts to get the same number of solved problems. Always compare cost-per-success (cost per attempt divided by solve rate), not cost-per-call.
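
Here is that calculation written out with the figures quoted above. It is a rough sketch: it assumes the per-problem costs quoted for SWE-bench carry over to fresh problems, and that each attempt is an independent try at a new problem.

```python
def cost_per_success(cost_per_attempt, solve_rate):
    """Expected spend per solved problem: cost per attempt / solve rate."""
    if not 0 < solve_rate <= 1:
        raise ValueError("solve_rate must be in (0, 1]")
    return cost_per_attempt / solve_rate


# Figures from the paragraph above.
minimax = cost_per_success(0.09, 0.396)
claude = cost_per_success(3.50, 0.529)

# Attempts MiniMax needs, relative to Claude, for the same number of solves.
attempt_ratio = 0.529 / 0.396

print(round(minimax, 2), round(claude, 2), round(attempt_ratio, 2))
```

On these particular numbers MiniMax still comes out cheaper per solved problem. What the ratio cannot show is which problems stay unsolved no matter how many times you retry, and that is the part your own eval set has to answer.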

Use at least two models on your hardest problems. The gap between your top two candidates on your actual work is the only number that matters.
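
To see that gap, compare the two models task by task rather than by headline pass rate. A minimal sketch, assuming you already have per-task pass/fail results for each model; the task IDs and results here are invented.

```python
def compare_models(results_a, results_b):
    """Bucket shared tasks by which model solved them."""
    shared = sorted(results_a.keys() & results_b.keys())
    return {
        "only_a": [t for t in shared if results_a[t] and not results_b[t]],
        "only_b": [t for t in shared if results_b[t] and not results_a[t]],
        "both": [t for t in shared if results_a[t] and results_b[t]],
        "neither": [t for t in shared if not results_a[t] and not results_b[t]],
    }


# Illustrative results on four hard problems.
model_a = {"bug-1": True, "bug-2": True, "bug-3": False, "bug-4": True}
model_b = {"bug-1": True, "bug-2": False, "bug-3": False, "bug-4": False}

summary = compare_models(model_a, model_b)
print(summary)
```

The `only_a` and `only_b` buckets are the interesting ones: they tell you whether the second model adds coverage or just duplicates the first.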

The bottom line: the US-China AI gap is real, significant, and was being hidden by contaminated benchmarks. On genuine capability, Anthropic and OpenAI look to be 6-12 months ahead. Plan accordingly.

▸ OpenClaw's Creator Joins OpenAI

Peter Steinberger, the Austrian developer who built OpenClaw (the viral AI personal assistant that went from Clawdbot to Moltbot to OpenClaw in the span of weeks), announced yesterday that he's joining OpenAI.

Sam Altman posted on X that Steinberger will "drive the next generation of personal agents" and called him "a genius with a lot of amazing ideas about the future of very smart agents interacting with each other."

OpenClaw will move into a foundation and remain open source. OpenAI has committed to sponsoring the project.

Why This Matters

Steinberger's blog post is worth reading in full. The key quote: "What I want is to change the world, not build a large company, and teaming up with OpenAI is the fastest way to bring this to everyone."

He spent last week in San Francisco talking with multiple major labs before choosing OpenAI. His stated mission: "build an agent that even my mum can use." That's a UX problem, not a model problem, and it's exactly where OpenAI has been weakest compared to Anthropic's Claude Code.

The Optimistic Read

OpenAI gets someone who's actually shipped an agent people use daily. Not a research demo, not a benchmark leader. A product. Steinberger understands the messy reality of agents managing calendars, booking flights, and operating across different messaging platforms. That practical knowledge is rare.

The foundation structure for OpenClaw is the right move. The community built around it is genuinely active, spreading rapidly through China via DeepSeek integration and into enterprise use cases. A foundation gives it a life independent of OpenAI's product roadmap.

The Skeptical Read

This concentrates more talent inside the biggest player. Steinberger could have built an independent company that kept pressure on the incumbents. Instead, the most prominent open-source agent developer is now inside the company that would benefit most from agents becoming a closed-ecosystem product.

"Live in a foundation" is a commitment that gets tested the first time a feature would help ChatGPT's product but not the open-source community. History is full of corporate-sponsored open-source projects where the best features quietly land in the paid product first.

And there's the China angle. OpenClaw spread rapidly through Chinese messaging apps and pairs with Chinese-developed models like DeepSeek. Baidu planned to integrate it directly. An OpenAI affiliation may cool that enthusiasm. For a project whose appeal was model-agnostic independence, that's a real risk.

My Take

This is a talent acquisition dressed up as a partnership. That's not inherently bad. Steinberger gets frontier model access and research resources. OpenAI gets the person who proved personal agents have real consumer demand. The community gets a foundation with corporate sponsorship.

The question is whether the foundation has real governance or is decorative. Watch the bylaws, board composition, and contribution policies. That's where you'll see whether OpenClaw remains genuinely independent or becomes OpenAI's community edition.

For those of us running OpenClaw: nothing changes today. The project is open source and will stay that way. But it's worth having a contingency plan. That's not paranoia. That's engineering.

▸ Your AI Agent Is Probably Broken

Khaos is an open-source security testing framework built specifically for AI agents. Not LLMs in general. Agents: the systems that take actions, call APIs, and operate autonomously.

What It Tests

Khaos runs 242+ attack patterns against your agents:

▹Prompt injection that bypasses LangChain's prompt templates via tool outputs
▹Tool misuse with malicious parameters (including OS command injection in code execution tools)
▹Chain failures where intermediate steps time out or return corrupted state
▹LLM faults like rate limits, token overflows, and model unavailability

It works with LangChain, LangGraph, LCEL chains, and classic agents. It auto-instruments your chains to inject faults at each step.
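
To make "inject faults at each step" concrete, here is a toy version of the idea in plain Python. This is not Khaos's implementation or API, and it does not use LangChain; `run_chain`, `failing`, and `inject_at_each_step` are hypothetical names for illustration.

```python
def failing(fault):
    """Build a step that always raises the given fault."""
    def step(_value):
        raise fault
    return step


def run_chain(steps, value):
    """Run named steps in order; report the first failure instead of crashing."""
    for name, step in steps:
        try:
            value = step(value)
        except Exception as exc:
            return {"ok": False, "failed_step": name, "error": type(exc).__name__}
    return {"ok": True, "value": value}


def inject_at_each_step(steps, value, fault):
    """Replace one step at a time with a failing step and record the outcome."""
    report = {}
    for index, (name, _step) in enumerate(steps):
        faulty = list(steps)
        faulty[index] = (name, failing(fault))
        report[name] = run_chain(faulty, value)
    return report


chain = [("parse", str.strip), ("upper", str.upper)]
baseline = run_chain(chain, " hi ")
report = inject_at_each_step(chain, " hi ", TimeoutError("injected"))
print(baseline, report)
```

A chain that handles faults well reports which step failed. A chain that fails silently would return a normal-looking result built from corrupted state, which is the failure mode worth hunting for.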

Why LangChain Specifically

LangChain's abstraction layers hide vulnerabilities that Khaos exposes:

▹Prompt templates can still be injected via tool outputs (the template is safe, but the data flowing through it isn't)
▹AgentExecutor doesn't validate tool parameters
▹Chains fail silently or propagate corrupted state to the next step
▹ReAct and Plan-and-Execute patterns have unique attack surfaces because they loop
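
The parameter-validation point is easiest to see with a tool that shells out. The sketch below shows one way to validate model-supplied arguments before they reach the OS. It is plain Python, not LangChain code; `build_grep_argv` is a hypothetical helper, and the allowlist pattern is an assumption you would tighten for your own tools.

```python
import re

# Allowlist: plain file names only. No slashes, spaces, or shell
# metacharacters, and no leading "-" or "." (first character is restricted).
SAFE_FILENAME = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]*")


def build_grep_argv(pattern, filename):
    """Build an argument list for grep from model-supplied parameters."""
    if not isinstance(pattern, str) or not isinstance(filename, str):
        raise ValueError("pattern and filename must be strings")
    if not SAFE_FILENAME.fullmatch(filename):
        raise ValueError(f"rejected filename: {filename!r}")
    # -F treats the pattern as a literal string; "--" ends option parsing so
    # a pattern starting with "-" cannot be read as a flag. Pass this list to
    # subprocess.run without shell=True so nothing is interpreted by a shell.
    return ["grep", "-F", "--", pattern, filename]


safe = build_grep_argv("TODO", "notes.txt")
try:
    build_grep_argv("TODO", "notes.txt; rm -rf /")
    rejected = False
except ValueError:
    rejected = True
print(safe, rejected)
```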

How to Run It

pip install khaos-agent
khaos discover          # finds your agents
khaos run your-agent --pack security   # runs 242+ attacks

You get back a concrete report: what broke, how, and where. Not "potential vulnerabilities." Actual exploits.

What to Do

Run Khaos against any agents you have in production or staging. Today. Then add it to your CI/CD pipeline so new deployments get tested automatically.
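
A CI gate can be as small as a script that runs the command and fails the build on a non-zero exit. The sketch below assumes Khaos exits non-zero when an attack succeeds; I have not verified that, so check the tool's documentation before relying on it. `security_gate` and `main` are my own names, and the command is the one shown earlier.

```python
import subprocess
import sys

# The command from "How to Run It", split into an argument list.
KHAOS_COMMAND = ["khaos", "run", "your-agent", "--pack", "security"]


def security_gate(command):
    """Return True only if the command runs and exits with code 0."""
    try:
        completed = subprocess.run(command, check=False)
    except FileNotFoundError:
        # Tool not installed: fail closed rather than silently pass.
        return False
    return completed.returncode == 0


def main():
    # ASSUMPTION: a non-zero exit code means at least one attack succeeded.
    sys.exit(0 if security_gate(KHAOS_COMMAND) else 1)
```

Call `main()` from your pipeline's test stage so a failed run blocks the deploy.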

The security gap most teams miss: they test the LLM's safety filters but not the agent's tool-calling boundaries. Your model might refuse to generate harmful content, but your agent might still execute a malicious database query if the prompt injection is clever enough. These are different attack surfaces and need different testing.
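
For the database case, the boundary to test is the tool, not the model. One common defense is to never let the agent supply SQL text at all, only values bound as parameters. A minimal sketch using Python's built-in sqlite3 module; the table, the tool name `orders_for_customer`, and the hostile input are all invented for illustration.

```python
import sqlite3


def orders_for_customer(conn, customer_name):
    """Tool boundary: the agent supplies a value, never SQL text."""
    if not isinstance(customer_name, str) or len(customer_name) > 100:
        raise ValueError("customer_name must be a string of at most 100 characters")
    # The ? placeholder binds the value as data, so quotes and semicolons
    # inside it are not interpreted as SQL.
    cursor = conn.execute(
        "SELECT id, status FROM orders WHERE customer = ? ORDER BY id",
        (customer_name,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, status TEXT)")
conn.execute("INSERT INTO orders (customer, status) VALUES ('alice', 'shipped')")

# What a successful prompt injection might try to pass as the argument.
hostile = "alice'; DROP TABLE orders; --"
hostile_rows = orders_for_customer(conn, hostile)
normal_rows = orders_for_customer(conn, "alice")
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(hostile_rows, normal_rows, remaining)
```

The hostile string is treated as a customer name that matches nothing, and the table is untouched. An agent tool that builds the query by string concatenation would not have that guarantee.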

Source: GitHub | Hacker News

▸ Key Takeaways

▹Benchmarks are compromised. SWE-rebench proved the US-China AI gap is 12+ points wider than public benchmarks showed. Build your own evals on fresh problems.
▹OpenClaw's creator joining OpenAI is significant. Good for the product, uncertain for the community. Watch the foundation governance, not the press release.
▹Test your agents for security now. Khaos found vulnerabilities in every agent it tested. LangChain's abstractions hide real attack surfaces. Three commands to find out if yours is exposed.
▹Own your evaluation process. The models that score highest on public benchmarks aren't necessarily the best for your use case. The only test that matters is your data, your problems, your requirements.

▸ Work With Me

Picking the right model for your stack. Securing your agents. Knowing when self-hosting beats API costs. These are the decisions that separate companies using AI effectively from companies burning money on hype.

I run AI audits and build implementation plans. Not slide decks. Working systems.

Book a discovery call →

This is the companion deep-dive to Issue #1 of Own Your AI Brief. Subscribe here for weekly tactics delivered to your inbox.

The AI Benchmark Scam, OpenClaw's Big Move, and Why Your Agents Are Broken