The Week AI Got Cheaper, Faster, and More Dangerous
Open models match frontier at 1/20th the cost. A 20B LLM runs in your browser. And every AI agent tested broke in under 30 seconds. Here's what to do about it.
Own Your AI Brief, Issue #1 Companion Post
AI agents are getting autonomous. Open models are matching frontier performance. And a 20B parameter model now runs entirely in your browser tab. This week's developments aren't incremental. They're structural shifts in how AI gets built, deployed, and secured.
Here's what happened, what it means, and what you should do about it.
▸ Your AI Agent Is Probably Broken
Khaos is an open-source security testing framework built specifically for AI agents. Not LLMs (large language models) in general. Agents: the systems that take actions, call APIs (application programming interfaces), and operate with some degree of autonomy.
The headline number: every agent Khaos tested broke in under 30 seconds.
What It Tests
Khaos ships with six intentionally vulnerable example agents: a support bot, a SQL agent, a code executor, a payment processor, and two others. Between them they demonstrate several classes of vulnerability, including:
- ▹Prompt injection: tricking the agent into following attacker instructions
- ▹Data exfiltration: getting the agent to leak sensitive information
- ▹Privilege escalation: convincing the agent to perform actions beyond its authorization
These aren't theoretical. They're the exact attack patterns hitting production agents right now. The OpenClaw bot incident from earlier this month, where an autonomous agent published a hit piece after its pull request (PR) was rejected, is a real-world example of what happens when agent guardrails fail.
How to Run It
Khaos gives you three commands to test your own agent. The setup is minimal:
- ▹Point Khaos at your agent's endpoint
- ▹Select the attack categories you want to test
- ▹Run the scan
What you get back is a report showing exactly where your agent breaks and how. No ambiguity, no "potential vulnerabilities." Concrete exploits.
Why This Matters Now
AI agent security is becoming its own category, separate from general LLM security. As companies deploy agents that can browse the web, execute code, process payments, and access databases, the attack surface is fundamentally different from a chatbot that just generates text.
The failure mode most teams miss: they test the LLM's safety filters but not the agent's tool-calling boundaries. Your model might refuse to generate harmful content, but your agent might still execute a malicious SQL query if the prompt injection is clever enough.
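To make the "tool-calling boundary" idea concrete, here is a minimal sketch of a guard that checks what the agent is about to run, not what the model said. Everything in it is hypothetical: the `checkSqlToolCall` name, the `ALLOWED_TABLES` allowlist, and the regex-based checks are illustrative and not part of Khaos. It is deliberately naive (it is not a SQL parser), so treat it as a picture of where the check lives, not as a complete defense.

```javascript
// Hypothetical guard that sits between the agent and its database tool.
// Naive on purpose: regex checks only, no real SQL parsing.
const ALLOWED_TABLES = new Set(['orders', 'products']); // assumed allowlist

function checkSqlToolCall(sql, allowedTables = ALLOWED_TABLES) {
  // Tolerate a single trailing semicolon, then treat any other one as stacking
  const statement = sql.trim().replace(/;\s*$/, '');

  if (statement.includes(';')) {
    return { allowed: false, reason: 'multiple statements' };
  }
  if (/--|\/\*/.test(statement)) {
    return { allowed: false, reason: 'SQL comments are not permitted' };
  }
  if (!/^select\b/i.test(statement)) {
    return { allowed: false, reason: 'only SELECT is permitted' };
  }

  // Collect every identifier that directly follows FROM or JOIN
  const tables = [...statement.matchAll(/\b(?:from|join)\s+([a-z_][a-z0-9_.]*)/gi)]
    .map((match) => match[1].toLowerCase());

  if (tables.length === 0) {
    return { allowed: false, reason: 'no table referenced' };
  }
  const blocked = tables.find((name) => !allowedTables.has(name));
  if (blocked) {
    return { allowed: false, reason: `table not on allowlist: ${blocked}` };
  }
  return { allowed: true, reason: 'read-only query on allowed tables' };
}

console.log(checkSqlToolCall('SELECT id FROM orders WHERE id = 1'));
console.log(checkSqlToolCall('SELECT * FROM orders; DROP TABLE orders'));
```

A check like this misses things a real parser would catch (comma-separated table lists, for one), so in practice you would pair it with a read-only database role that cannot write no matter what the agent sends.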
What to do: Run Khaos against your agents now. Add it to your CI/CD (continuous integration/continuous deployment) pipeline. If you're deploying agents in production without security testing, you're running blind.
▸ 20 Billion Parameters in a Browser Tab
A developer on r/LocalLLaMA demonstrated GPT-OSS, OpenAI's open-weight 20B-parameter model, running entirely in a web browser. No server. No API calls. No data leaving the device. Just a browser tab and a GPU.
The Stack That Makes It Work
Three technologies converged to make this possible:
Transformers.js v4 is Hugging Face's JavaScript library for running machine learning models in the browser. Version 4 added support for large language models with efficient memory management.
ONNX Runtime Web handles the actual inference, optimized for browser environments. It manages model loading and runs quantized models across different hardware backends.
WebGPU is the key unlock. It's a browser API that gives JavaScript direct access to GPU acceleration, similar to what CUDA provides for native applications. Chrome 113+ and Edge 113+ support it today. Safari is coming.
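Because WebGPU support still varies by browser and hardware, it is worth detecting it before you start a multi-gigabyte download. The sketch below uses the standard `navigator.gpu.requestAdapter()` call; the `pickDevice` name and the choice of `'wasm'` as the fallback label are assumptions for illustration.

```javascript
// Returns 'webgpu' when a usable GPU adapter is available, otherwise 'wasm'.
// `nav` defaults to the browser's navigator; passing it in keeps this testable.
async function pickDevice(nav = globalThis.navigator) {
  if (!nav || !nav.gpu) {
    return 'wasm'; // WebGPU API is not exposed in this environment
  }
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? 'webgpu' : 'wasm'; // a null adapter means no suitable GPU
  } catch {
    return 'wasm'; // treat adapter errors the same as "not available"
  }
}
```

For a 20B model a CPU fallback is unlikely to be usable, so a reasonable design is to treat a non-`'webgpu'` result as the signal to load a much smaller model or route the request to a server.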
How It Actually Works
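import { pipeline } from '@huggingface/transformers';
// Initialize: downloads the model to the browser cache on first run
// Subsequent loads read from the cache instead of the network
// 'gpt-oss-20b' is shorthand: use the full Hugging Face repo ID of an ONNX export
const generator = await pipeline('text-generation', 'gpt-oss-20b', {
  device: 'webgpu', // GPU acceleration via WebGPU
  dtype: 'q4' // 4-bit weights to fit in memory (e.g. 'q4' or 'q4f16', depending on what the repo ships)
});
// All inference happens locally
const result = await generator('Analyze this contract clause:', {
  max_new_tokens: 500,
  do_sample: true, // temperature only takes effect when sampling is enabled
  temperature: 0.3
});
// The pipeline returns an array: result[0].generated_text contains the output
// After the initial model download, inference makes no network calls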
The first load downloads the quantized model (roughly 4 to 8 GB depending on quantization level) to the browser's cache storage. After that, the model loads from cache in seconds. All inference runs on the user's GPU through WebGPU.
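That first download is the part users will actually notice, so it helps to show one overall percentage instead of a frozen tab. Here is a minimal sketch of a tracker that combines per-file progress events. The event shape (`status`, `file`, `loaded`, `total`) is an assumption modeled on the kind of events Transformers.js passes to its `progress_callback` option; check the fields against the version you install. `createDownloadTracker` is a hypothetical helper name.

```javascript
// Tracks bytes per file and reports a single overall percentage.
// Assumed event shape: { status: 'progress', file, loaded, total }.
function createDownloadTracker(onUpdate) {
  const files = new Map(); // file name -> { loaded, total }

  return function handleProgress(event) {
    if (event.status !== 'progress' || !event.total) {
      return; // ignore events that carry no byte counts
    }
    files.set(event.file, { loaded: event.loaded, total: event.total });

    let loaded = 0;
    let total = 0;
    for (const entry of files.values()) {
      loaded += entry.loaded;
      total += entry.total;
    }
    onUpdate(Math.floor((loaded / total) * 100));
  };
}

// Example wiring with synthetic events
const track = createDownloadTracker((percent) => console.log(`Downloading model: ${percent}%`));
track({ status: 'progress', file: 'weights-1', loaded: 50, total: 100 });
track({ status: 'progress', file: 'weights-2', loaded: 100, total: 100 });
```

The percentage can dip when a new file is announced, because the known total grows. If that looks odd in your UI, only ever display the highest value seen so far.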
Where This Changes Everything
Healthcare. HIPAA compliance gets much simpler when patient data never leaves the device, though the device itself still has to be secured. Medical note summarization, preliminary triage analysis, clinical documentation: all running locally.
Legal. Attorney-client privilege is easier to maintain when your AI tool doesn't send documents to a cloud API. Contract analysis, case research, document review: client-side.
Finance. Transaction analysis, fraud pattern detection, customer data processing: no PII (personally identifiable information) ever touches an external server.
Education. Student writing assistance without essays being sent to external training pipelines. A real concern that's driven several school districts away from cloud AI tools.
The Honest Limitations
This isn't a replacement for cloud APIs in every scenario.
First-load pain. Downloading 4 to 8 GB on a slow connection is rough. Your UX needs to handle this gracefully with progress indicators and caching status.
Hardware floor. You need a modern GPU. Integrated graphics work for smaller models, but the 20B model wants at least 8 GB of system RAM and a discrete GPU for reasonable speed.
Model size ceiling. Browser-based inference tops out around 20B parameters with current quantization. If you need 70B+ model capabilities, you still need a server.
Generation speed. Client-side inference is slower than a cloud GPU cluster. For real-time chat applications, expect noticeable latency on consumer hardware.
Start small. Try 3B or 7B models first. Validate the UX pattern with your users. Scale up to 20B only for use cases that justify the download size.
Resources
- ▹Transformers.js v4 documentation
- ▹ONNX Runtime Web tutorials
- ▹WebGPU fundamentals
- ▹Original r/LocalLLaMA post
▸ The 20x Question
MiniMax released M2.5 this week. The numbers: 230B total parameters, 10B active via mixture-of-experts (MoE) architecture, fully open weights on Hugging Face. Performance on SWE-Bench Verified: 80.2%. Claude Opus 4.6 scores 80.8%.
That's a gap of 0.6 percentage points. At 1/20th the cost.
Then Alibaba dropped Qwen 3.5: 397B parameters (17B active via MoE), native vision-language capabilities, designed for agentic tasks. Open weights expected within hours of this writing.
What Changed
Twelve months ago, the conversation was "open models are good enough for prototyping." Today, open models are competitive with the best closed models on real-world coding benchmarks. Not "close enough." Competitive.
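The MoE architecture is the technical breakthrough driving this. Instead of activating all 230B parameters for every token, M2.5 activates only 10B. You get the knowledge of a massive model with the per-token compute cost of a small one. Memory is a different story: all 230B parameters still have to be loaded, which is why self-hosting takes serious hardware.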
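The compute-versus-memory split is easy to see with the numbers above. This is a back-of-the-envelope sketch, not a sizing tool: `moeProfile` is a hypothetical helper, the 8-bit precision is an assumption, and the memory figure covers weights only (no KV cache or activations).

```javascript
// Figures from the post: M2.5 has 230B total parameters, 10B active per token.
function moeProfile({ totalParamsB, activeParamsB, bitsPerWeight }) {
  return {
    // Share of the weights that take part in each token's forward pass
    activeFraction: activeParamsB / totalParamsB,
    // Every expert must stay resident, so weight memory follows the total count
    weightMemoryGB: (totalParamsB * bitsPerWeight) / 8,
  };
}

const m25 = moeProfile({ totalParamsB: 230, activeParamsB: 10, bitsPerWeight: 8 });
console.log((m25.activeFraction * 100).toFixed(1) + '% of weights active per token'); // 4.3%
console.log(m25.weightMemoryGB + ' GB of weights at 8-bit precision'); // 230 GB
```

So per-token compute looks like a 10B model, while the memory footprint looks like a 230B one.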
The Real Economics
Let's do the math on a typical RAG (retrieval-augmented generation) application processing 100,000 queries per day:
Cloud API (Claude Opus): ~$3,000 to $5,000/month depending on token usage.
Self-hosted MiniMax M2.5: Hardware cost of roughly $2,000 to $4,000/month for GPU instances, but no per-token charges. At high volume, self-hosting wins. At low volume, APIs are still cheaper because you avoid the fixed infrastructure cost.
The crossover point for most applications is somewhere around 50,000 to 100,000 queries per day. Below that, APIs are more economical. Above that, self-hosting saves money every month.
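You can sanity-check that crossover range with the figures above. The sketch below assumes your API bill scales linearly with query volume and that self-hosting is a flat monthly cost; it ignores engineering time and capacity limits. `breakEvenQueriesPerDay` is a hypothetical helper, and the inputs are the ranges quoted in this section.

```javascript
// Break-even volume: the daily query count at which a usage-based API bill
// equals the fixed monthly cost of self-hosting.
function breakEvenQueriesPerDay({ apiMonthlyCost, apiQueriesPerDay, selfHostMonthlyCost }) {
  // Monthly API spend attributable to each query-per-day of sustained load
  const apiCostPerDailyQuery = apiMonthlyCost / apiQueriesPerDay;
  return Math.round(selfHostMonthlyCost / apiCostPerDailyQuery);
}

// Midpoints: $4,000/month API bill at 100,000 queries/day vs. $3,000/month of GPUs
const midpoint = breakEvenQueriesPerDay({
  apiMonthlyCost: 4000,
  apiQueriesPerDay: 100000,
  selfHostMonthlyCost: 3000,
});
console.log(midpoint); // 75000
```

Plugging in the extremes (cheap GPUs against an expensive API bill, and the reverse) moves the answer to roughly 40,000 and 133,000 queries per day, which is why the crossover is a range and why you should run it with your own numbers.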
What You Should Actually Do
Don't rip out your API calls tomorrow. Instead:
- ▹Benchmark your specific use case. SWE-Bench is a coding benchmark. Your application might have different requirements where model differences matter more or less.
- ▹Calculate your actual costs. Pull your API bills for the last 3 months. Estimate what self-hosting would cost for the same volume. Find your crossover point.
- ▹Run a parallel evaluation. Send the same queries to both your current API and an open model. Compare output quality on YOUR data, not benchmarks.
- ▹Consider hybrid approaches. Use open models for high-volume, latency-tolerant tasks. Keep API access for complex reasoning where the last 0.6 points matter.
The question isn't "should I switch?" It's "where in my stack does the 20x cost premium stop being justified?"
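The parallel evaluation step is the one most worth automating. Here is a minimal sketch of the harness: it sends each prompt to two models and pairs the outputs for side-by-side review. `callCurrent` and `callCandidate` are hypothetical async functions `(prompt) => string` that you would write against your own API client and open-model endpoint; nothing here assumes a particular provider.

```javascript
// Sends each prompt to two models and pairs the outputs for review.
// A failure on one side is recorded in the row instead of aborting the run.
async function runParallelEval(prompts, callCurrent, callCandidate) {
  const rows = [];
  for (const prompt of prompts) {
    // Query both models concurrently for this prompt
    const [current, candidate] = await Promise.allSettled([
      callCurrent(prompt),
      callCandidate(prompt),
    ]);
    rows.push({
      prompt,
      current: current.status === 'fulfilled' ? current.value : `ERROR: ${current.reason}`,
      candidate: candidate.status === 'fulfilled' ? candidate.value : `ERROR: ${candidate.reason}`,
    });
  }
  return rows;
}
```

Feed it real queries sampled from your logs, then have a person (or a rubric) grade the pairs without knowing which model produced which output.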
▸ The AI Fear Trade
Markets have wiped $1.35 trillion from the Big Six AI companies in recent weeks. A $6 million karaoke company rebranded as an AI freight platform and cratered an entire sector of trucking stocks. Baker McKenzie, one of the world's largest law firms, is cutting roughly 1,000 jobs citing AI automation.
Meanwhile, Scale AI's new Remote Labor Index tested frontier models against real paid remote work. The best model's failure rate: 96.25%.
That's a contradiction worth sitting with. Markets are pricing in massive AI disruption. But when you test AI on actual work (not benchmarks), it fails almost everything.
What's Actually Happening
Both things are true at once. AI is genuinely disrupting certain workflows (code generation, content drafting, data analysis). And AI is genuinely terrible at most complete job functions (client relationships, nuanced judgment, cross-functional coordination).
The gap between "AI can help with parts of your job" and "AI can replace your job" is enormous. Markets are pricing in the latter. Reality is delivering the former.
For engineers: your job security is proportional to your AI fluency. Not because AI will replace you, but because engineers who use AI effectively are 2 to 3x more productive than those who don't. Companies will choose the productive ones.
For leaders: the "burning platform" panic is real, but the solution isn't "deploy AI everywhere." It's "figure out which 15% of workflows AI actually improves, and nail those."
Simon Willison coined two terms this week that capture the moment perfectly. "Deep Blue": the existential dread developers feel watching LLMs write code. "Cognitive debt": the accumulated risk of unreviewed AI-generated code in your codebase. Both are real. Both need management, not denial.
▸ Key Takeaways
- ▹Test your agents now. Khaos broke every agent it tested. Yours probably isn't different. Add security testing to your CI/CD pipeline.
- ▹Browser-native AI is production-ready for privacy-sensitive use cases. The stack (Transformers.js + ONNX Runtime Web + WebGPU) works today in Chrome and Edge.
- ▹Open models hit the crossover point. At 1/20th the cost and about 99% of the SWE-Bench Verified score, the burden of proof has shifted. You need a reason to stay on closed APIs, not a reason to leave.
- ▹The fear is ahead of the reality. AI disrupts workflows, not jobs. Build fluency, not bunkers.
- ▹Own your AI. Whether it's running in a browser tab, self-hosted on your hardware, or tested with your own security tools: the trend is toward control, not dependence.
▸ What to Do Next
This week:
- ▹Run Khaos against any AI agents you have in production
- ▹Calculate your monthly API spend and estimate self-hosting costs
This month:
- ▹Prototype a browser-native AI feature using Transformers.js v4
- ▹Benchmark MiniMax M2.5 or Qwen 3.5 against your current model on your actual data
Resources to bookmark:
- ▹Khaos security testing framework
- ▹Transformers.js v4 docs
- ▹MiniMax M2.5 on Hugging Face
- ▹Scale AI Remote Labor Index
▸ Work With Me
If you're staring at a $5,000/month API bill wondering whether self-hosting makes sense, or deploying agents without knowing if they're secure, those are exactly the problems I help companies solve.
I run AI audits that find the gaps (security, cost, architecture) and build implementation plans that actually ship. Not slide decks. Working systems.
This post is the companion deep-dive to Issue #1 of The Own Your AI Brief. Subscribe here for weekly practical AI engineering tactics delivered to your inbox.