Day 11: When Claude Max Became a Real Agent

April 1, 2026 14:00Z

Sprint B builds the foundation for Claude Max auth and streaming. Sprint C/D fixes 11 bugs and polishes the agent system. Local model tuning recommendations.

Saturday and Sunday. Two days where AI agents stopped being templates and started being actual team members. Sprint B shipped the core infrastructure. Sprint C/D polished it into something production-ready. And along the way, I learned exactly why small models fail at agent work.

📖 Build Log Series: Day 0: The Setup · Day 1: First Sprints · Day 2: Six Sprints · Day 3: The Newsletter · Day 4: The Board Meeting · Day 5: The Scaling Week · Day 6: The Week of Infrastructure · Day 7: When an Idea Becomes an Agent · Day 8: The Browser Becomes the Agent · Day 9: When a Design Sprint Meets Real Infrastructure · Day 10: When Infrastructure Becomes the Feature · Day 11: When Claude Max Became a Real Agent

## ▸ ▸ ▸ Saturday 9:00 AM: The Authentication Problem

I had a choice to make about Sprint B.

Agents in Spark need to connect to Claude. Not through a rate-limited API key. Through a real subscription. Claude Max, specifically. The kind of plan that gives you reasoning, long context, and the ability to run actual work.

The question was: how do you authenticate a subscription token without breaking the security model?

OpenAI went with OAuth. That's elegant but complicated. You need a whole third-party auth flow.

Anthropic went simpler: a REST API with a standard `x-api-key` header. Just like Stripe or GitHub. Which meant I could use the `claude setup-token` CLI command to generate a token, store it safely, and pass it directly to the API.

The token comes out like this: `sk-ant-oat01-...`. It's real enough to access `/v1/models`. Hit that endpoint, and if you see `claude-opus-4-6` (Opus 4.6), you have a Max plan. If you only see Sonnet, you're on Pro.

I decided: no OAuth. Just the token. Store it encrypted in Spark's database (global scope or project scope). Verify it once. Cache the verification result. Done.

By Saturday afternoon, the auth layer was locked.

## ▸ ▸ ▸ Saturday 10:00 AM: Sprint B Ships

Sprint B was massive.
Five service classes. Two controller classes. One model redesign. Streaming from scratch.

Here's what shipped:

`AgentContextService`: Assembles the system prompt from multiple sources. Reads `AGENT.md` (agent personality). Reads `PROJECT_SUMMARY` (project context). Pulls live briefings from Obsidian. Includes real-time statistics (open tasks, last deployment, team size). The agent sees everything it needs to make decisions.

`AgentChatService`: Handles streaming to the frontend via Server-Sent Events (SSE). Works with Anthropic, OpenAI, and Ollama. Catches token expiry (401 errors) and emits structured error events that the frontend can handle gracefully.

`AgentConversation` + `AgentMessage` models: The database schema for storing conversation history. Messages are typed (user, assistant, action, error). They carry metadata (tool calls, streaming tokens, execution results). Everything persists, so conversations survive refreshes.

`AgentChatController`: Four endpoints:

- `/send` (stream messages, consume tools, emit SSE)
- `/history` (fetch full conversation, no filtering)
- `/status` (check if agent is alive, token valid, ready to chat)
- `/refresh-token` (inline token refresh when expired)

`AgentSetupController`: A 4-step wizard:

1. Choose provider (Claude Subscription, Claude API, OpenAI, Ollama)
2. Paste API key or token
3. Select model (grouped by provider)
4. Write agent personality (optional, falls back to default)

On submit, the system scaffolds an `AGENT.md` file in the agent's workspace. That file is the single source of truth for who this agent is.

By 6:00 PM Saturday, all of Sprint B was live in the `update/agent-flow` branch.

## ▸ ▸ ▸ Saturday 11:00 AM: The Dashboard Redesign

I realized something. The dashboard doesn't need to be a dashboard. It needs to be a chat interface.

So I rebuilt it.

The left side is the agent list (who's online, who's ready, who needs setup).

The center is the conversation. Messages stream in real-time. The agent's thinking is visible.
Tool calls are shown as action cards that collapse/expand to show the result.

The right side is live stats: open tasks, people online, last deployment, agent status, active briefing data.

And at the top, if the token expires, there's an inline refresh button. No modal. No page reload. Just click, paste the new token, and keep talking.

The whole thing uses Laravel's `response()->stream()` for SSE. Headers are tuned: `X-Accel-Buffering: no`, `Cache-Control: no-cache`, `Content-Type: text/event-stream`. The browser gets chunks in real-time.

I tested it with Anthropic's SSE format. Works. Tested with OpenAI's format. Works. Tested with Ollama streaming. Works.

By 2:00 PM, the dashboard was live.

## ▸ ▸ ▸ Saturday 1:00 PM: AgentKeySetup Component

But there was a problem. Most agents won't have API keys when they start.

The wizard gets them the token. But what if the token expires mid-conversation? Or what if someone wants to switch from OpenAI to Claude?

I built `AgentKeySetup`, a component that appears inline in the chat when the agent has no valid API key. No modal. No redirect. Just a card in the conversation that says:

> "This agent needs an API key to continue. Paste your key below."

You paste it. Click "Save & Boot Agent". The system verifies the key. If valid, it triggers the agent introduction message automatically. If invalid, it shows a friendly error.

No friction. No page reloads.

## ▸ ▸ ▸ Saturday 2:00 PM: SpawnAgentModal

Next, I built the UI for creating new agents from the chat.

`SpawnAgentModal` is a form that appears when you click "New Agent" from the dashboard. Five fields:

1. Provider (dropdown: Claude Subscription, Claude API, OpenAI, Ollama)
2. Model (dropdown, filtered by provider, shows tier requirements)
3. Personality (textarea, optional, defaults to CEO/Engineer/Builder)
4. Shell Trust Level (selector: none, readonly, standard, full)
5. Domain Pills (tags for the agent's domain of expertise)

On submit, the modal:

1. Validates the model is available for the selected provider
2. Creates an Agent record in the database
3. Scaffolds an `AGENT.md` file with the personality + system prompt
4. Queues a spawn event
5. Launches the agent
6. Refreshes the agent list and shows the new agent in the tabs

All in one flow. No separate configuration page. Just one modal.

## ▸ ▸ ▸ Saturday 3:00 PM: Sprint D Tools Registered

Sprint D is about external tools. Git, terminal, SSH, Boost MCP. These are the things agents do when they're running real work.

Saturday, I registered all the tool definitions in `ToolRegistry`:

- `shell.exec`: Run a shell command. Respects trust levels.
- `git.status`: Check git status
- `git.log`: View commit history with filtered depth
- `git.diff`: Show changes between commits
- `git.branch`: List and switch branches
- `boost.call`: Invoke Boost MCP tools (database queries, tinker, etc.)

Each tool has a schema. Each tool respects the agent's trust level:

- none: No commands allowed. Agent can read from filesystem and chat, but can't execute.
- readonly: Read-only commands (`git log`, `git status`, `ls`, `cat`). No writes.
- standard: Most commands allowed, but blocks destructive operations (`rm`, `git push` without review, etc.)
- full: Everything. Deploy, delete, push. Full access.

I built a `CommandClassifier` that scans the command, categorizes it (blocked, destructive, readonly, standard, unknown), and either executes or returns an error.

By 5:00 PM, all Sprint D tools were registered and ready.

## ▸ ▸ ▸ Saturday 4:00 PM: Agent Base Prompt

Here's the critical piece: every agent uses the same system prompt foundation.

I created `resources/prompts/agent-base.md`.
It's the single source of truth for:

- How agents should format tool calls
- The difference between wrong and correct tool call syntax
- How agents should respond when they don't know something
- First-boot rules (introduce yourself, don't call tools on first message)
- Response style guidelines

This file is prepended to every agent's system prompt. It doesn't matter if the agent is Claude Opus or a small open-source model. They all get the same foundation.

Why? Because small models (Qwen 2.5 7B, llama3.2 3B) have a bad habit. They wrap tool calls in `json` code fences:

`{"tool": "shell.exec", "command": "ls"}`

And they use the tool name as the JSON key instead of the `"tool"` field:

`{"shell.exec": {"command": "ls"}}`

The `agent-base.md` file has a whole section called "WRONG/CORRECT Examples" that shows both formats and says explicitly: "Use this. Not that."

I also added a lenient parser in `processToolCalls()` that strips code fences and tries both formats. If a small model sends wrapped-up JSON, we still parse it.

By 6:00 PM, `agent-base.md` was finalized. 1.3 KB of foundation that powers every agent in the system.
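
To make the lenient parsing idea concrete, here is a minimal TypeScript sketch of the technique: strip a surrounding code fence, then accept either the `"tool"` field format or the tool-name-as-key format. It is not the PHP code in `processToolCalls()`; the names `ToolCall`, `stripCodeFence`, and `parseToolCall`, and the choice to merge top-level fields into `params`, are assumptions made for illustration.

```typescript
// Hypothetical normalized shape; the real schema in Spark may differ.
interface ToolCall {
  tool: string;
  params: Record<string, unknown>;
}

function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Remove one surrounding Markdown code fence (with or without a language tag).
function stripCodeFence(raw: string): string {
  const trimmed = raw.trim();
  const match = trimmed.match(/^```[a-zA-Z]*\s*([\s\S]*?)\s*```$/);
  return match ? match[1] : trimmed;
}

function parseToolCall(raw: string): ToolCall | null {
  let data: unknown;
  try {
    data = JSON.parse(stripCodeFence(raw));
  } catch {
    return null; // not JSON, so not a tool call
  }
  if (!isPlainObject(data)) return null;

  // Expected format: {"tool": "shell.exec", "command": "ls"}
  const tool = data.tool;
  if (typeof tool === "string") {
    const { tool: _ignored, params, ...rest } = data;
    const nested = isPlainObject(params) ? params : {};
    return { tool, params: { ...rest, ...nested } };
  }

  // Fallback format: {"shell.exec": {"command": "ls"}}
  const keys = Object.keys(data);
  if (keys.length === 1) {
    const inner = data[keys[0]];
    if (isPlainObject(inner)) {
      const innerParams = inner.params;
      return {
        tool: keys[0],
        params: isPlainObject(innerParams) ? innerParams : inner,
      };
    }
  }
  return null;
}
```

With this sketch, a fenced `{"tool": "shell.exec", "command": "ls"}` and a bare `{"shell.exec": {"command": "ls"}}` both normalize to the same tool name and parameters, and plain chat text returns `null` so it can be treated as a normal assistant message.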
## ▸ ▸ ▸ Saturday Evening: 19 Commits

By the end of Saturday, I had 19 commits on the `update/agent-flow` branch:

- bc090de: action bubble optimistic injection (show results immediately while streaming)
- 2f66f95: agent-base.md global prompt
- 93e7b0c: WRONG/CORRECT format examples
- cb25245: strip code fences; sanitize heartbeat_mode
- 738019b: synthetic user message for first-boot (Ollama compatibility)
- ae9a6fd: clearContext calls triggerIntroduction directly
- fd77af5: ensureCeoRow on status; validate localStorage role
- 8e4206b: AgentKeySetup friendly error card
- 39f8cda: fail fast on missing API key (30s timeout → immediate error)
- 92d932b: remember active agent tab via localStorage
- d3ea141: remove debug logging
- e598a3e: remove hardcoded in:ceo validation (was blocking all non-CEO agents)
- 36495c3: useRef for triggerIntroduction (closure fix)
- 2aaf73d: models in dashboard Inertia props
- d6af0d7: model dropdown in spawn modal
- 26e9e19: first-boot forbids tool calls; spawn param fixes
- 84246cb: lenient tool call parser (handles wrapped JSON from small models)
- a86dc01: shell trust levels + CommandClassifier
- 68e2e91: Sprint D tools + workspace_path

Sprint B was done. Code was ready.

## ▸ ▸ ▸ Sunday 9:00 AM: Sprint C Begins

Sunday was different. Saturday was building. Sunday was fixing.

I started the day with a list of bugs.

Bug #1: `AgentMessage` fillable missing `type` and `metadata`. Messages were storing the basic stuff (user, content, agent_id) but not tracking what kind of message it was. Is this a user turn? An assistant turn? An action result? The database had no way to know.

Fix: Added `type` (enum: user, assistant, action, error) and `metadata` (JSON, stores tool calls, errors, context). Migrations created, models updated.

But there was a gotcha. The PHP code that writes `AGENT.md` files had `$` placeholders inside double-quoted strings. PHP treats those as variable interpolation, so a literal "$model" came out blank.
I had to switch to file writes instead of template rendering. Just write the content directly.

Bug #2: History endpoint was stripping fields. The `getHistory()` method on `AgentConversation` was built for frontend display. It didn't include metadata or message type. So when you refreshed, the agent saw a sanitized version of the conversation, not the real one.

Fix: Query the raw database instead. Load full fields. Metadata, type, everything.

Bug #3: `switchAgent` history map was old. When you switched between agents in the UI, it was still using the old stripped history. You'd switch to Agent B, see Agent A's conversation, see it's stripped, get confused.

Fix: Point `switchAgent` to the same raw query as the history endpoint.

Bug #4: Hardcoded `in:ceo` validation blocked all non-CEO agents. The chat endpoint had a validation rule that said "only messages from CEO agents allowed." This was from testing. I forgot to remove it.

Fix: Change to `alpha_dash` (standard Laravel validation). Any agent slug is allowed.

Bug #5: `triggerIntroduction` stale closure. The JavaScript code had a callback that tried to call `triggerIntroduction()`. But the function reference was captured in a closure that didn't get updated when the agent changed. You'd switch to Agent B, click "Introduce Yourself", and Agent A's intro would run.

Fix: Refactor to use `useRef`. The ref points to the live function, not a captured snapshot.

Bug #6: `clearContext` didn't re-introduce. Clearing the context should start fresh. But the conversation thread stayed open. The agent wouldn't introduce itself.

Fix: `clearContext()` now calls `triggerIntroduction()` directly with `await`. Full reset.

Bug #7: Ollama returns empty response for system-only context. If you send a message with only a system prompt (no user turn), Ollama returns null. The chat stops.

Fix: Inject a synthetic user message for first-boot: "Please introduce yourself." This primes the pump.
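
The Bug #7 fix can be sketched in a few lines of TypeScript. This is an illustration of the idea only, not the actual Spark code: the `ChatMessage` type and the `withFirstBootUserTurn` helper are hypothetical names, and the only detail taken from the fix is the prompt text.

```typescript
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

const FIRST_BOOT_PROMPT = "Please introduce yourself.";

// If the context has no user turn yet (first boot), append a synthetic one
// so the model has something to respond to. The input array is not mutated.
function withFirstBootUserTurn(messages: ChatMessage[]): ChatMessage[] {
  const hasUserTurn = messages.some((m) => m.role === "user");
  if (hasUserTurn) return messages;
  const synthetic: ChatMessage = { role: "user", content: FIRST_BOOT_PROMPT };
  return [...messages, synthetic];
}
```

Running every outgoing message list through a helper like this keeps the first-boot special case in one place, and it is a no-op once a real user message exists.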
Bug #8: Action bubble showed on refresh but not during stream. When you reloaded, you saw the tool result. But while the stream was happening, the result card didn't appear until after the stream finished.

Fix: Inject action bubbles optimistically during SSE. Show the result while streaming. When the final stream chunk arrives, overwrite with the real result.

Bug #9: Duplicate action bubble render block. Someone (probably me) left two copies of the action bubble render code. Rendering twice per message.

Fix: Delete the duplicate.

Bug #10: CEO tab disappeared after agent launch. You'd click "Create New Agent". The modal validates. The agent spawns. The tab should appear in the tab bar. But it didn't. You had to reload to see it.

Fix: After spawn, reload the full agents list from the API instead of just appending the new agent. The API response has all the state we need.

Bug #11: localStorage role validation failed. The role stored in localStorage came from another Spark project. When you switched projects, the role didn't exist. The agent picker crashed.

Fix: Validate the localStorage role against the current project's agents list. If it doesn't exist, reset to CEO.

All 11 bugs fixed by 2:00 PM Sunday.

## ▸ ▸ ▸ Sunday 3:00 PM: The Model Behavior Notes

While testing the fixes, I noticed something interesting about small models.

Qwen 2.5 7B and llama3.2 3B both run locally on my M4 Mac. They're fast. They're cheap. But they have consistent failure patterns when it comes to tool use.

Pattern #1: Code fence wrapping

Both models wrap tool calls in Markdown `json` code blocks:

`{"tool": "shell.exec", "command": "ls"}`

It looks pretty when you print it. It also fails a strict JSON parser, because the fence markers are not part of the JSON.

Pattern #2: Wrong JSON key format

Instead of:

`{"tool": "agents.spawn", "params": {...}}`

They write:

`{"agents.spawn": {"params": {...}}}`

The tool name becomes the key. That's intuitive to a human reader, and it is still valid JSON, but it breaks a parser that looks for the `"tool"` field.
Pattern #3: Hallucinated model names

Ask them what models they can call, and they invent names: gpt-4, gpt-3.5-turbo. Models that don't exist in their toolkit.

Pattern #4: Personality object hallucination

Tell them the personality is a string. They respond with `{"personality": {"traits": ["helpful", "expert"]}}` anyway. Objects instead of strings.

All of this is fixable with:

1. Explicit WRONG/CORRECT examples in the system prompt (`agent-base.md`)
2. A lenient parser that tries both formats
3. Model list injection into the tool schema
4. Personality description that says "string, NOT an object"

But here's the real lesson: small models need more scaffolding. They need guard rails. They need examples that spell out "don't do this" as clearly as "do this."

If you're going to use a small model for agent work, budget 2-3 days for prompt tuning.

By 4:00 PM, I had notes on what works and what doesn't.

## ▸ ▸ ▸ Sunday 4:00 PM: Local Model Recommendation

For the meta-ray-ban-rd project (AR glasses hardware integration), I evaluated four local models:

- Qwen 2.5 7B: Fast, but tool use is fragile. Hallucinations on model names.
- llama3.2 3B: Smaller, faster, but even more hallucinations. Not recommended for tool-heavy work.
- llama3.1 8B: Meta-trained for tool use. Better instruction following. More reliable.
- Qwen 2.5 14B: Larger, more capable, needs 14GB VRAM. On a Mac Mini, doable. Not on my M4 laptop.

Recommendation: Switch to llama3.1:8b for the AR glasses project. It's trained for tool use. It respects instructions. It handles JSON tool calls better than Qwen.

The trade-off is speed. llama3.1 8B is slower than Qwen 2.5 7B. But for AR glasses work, correctness beats speed.

By 5:00 PM Sunday, the recommendation was documented.

## ▸ ▸ ▸ Sunday 5:00 PM: ShopStable Sprints 81-82

Context switch.

ShopStable is a different codebase. Different stack (Laravel with Livewire and Alpine rather than the Spark agent context). Different architecture. Different problem.
Sprints 81-82 shipped: Storefront Subdomain Feature.

The idea: each seller in ShopStable gets their own subdomain. brian.shopstable.com. felix.shopstable.com. Instead of /storefront/brian.

This required:

1. Wildcard DNS (*.shopstable.com)
2. Certificate abstraction (Let's Encrypt wildcard cert)
3. Subdomain routing in Livewire
4. Database schema for seller subdomains

In total: 27 tasks and 9 audit fixes.

Brian (the ShopStable lead, not me) finished the implementation Sunday evening. Still testing locally.

The feature works. Alpine reactivity has a known double-debounce issue on the subdomain availability check, but that's polish, not a blocker.

By 8:00 PM Sunday, Sprints 81-82 were code-complete. Ready for staging.

## ▸ ▸ ▸ Sunday 8:00 PM: Consulting Intel

I spent the evening gathering intel on the AI consulting market. Three patterns emerged:

Pattern #1: Regulated SMBs hard-require hybrid cloud AI

Healthcare, Finance, Legal. They can't send data to public APIs. They need private deployments.

This is a direct service play. You deploy, you host, you manage. Not a product, but a service. Margin is high. Switching costs are high. But sales cycles are long (6-12 months for enterprise deals).

Pattern #2: Incremental pilot adoption is winning

Nobody wants a big-bang AI overhaul. They want 1-2 use cases. Let's try customer support automation. Let's see if it works. Then next quarter, let's add email triage.

This means products need to be modular. You shouldn't sell "AI platform." You should sell "customer support AI" or "document processing AI." Then upsell into the platform.

Pattern #3: Palantir + Bain are bundling

They're pairing data integration with consulting. The bundle is hard to compete with.

Our counter: regulatory depth + narrow focus. We're not generalists. We're specialists in healthcare/finance/legal AI. We know the compliance, the data handling, the audit trails.

All of this feeds into the April roadmap. Less broad platform. More vertical focus. More consulting plays.
## ▸ ▸ ▸ Sunday Midnight: What I Learned

Two days of building taught me three things.

First: Authentication is simpler than it looks

I was overthinking OAuth. Claude Max authentication is just a REST API. You get a token. You validate it once. You use it. Done.

The complexity is in what you do with that token, not in getting it.

Second: Small models need scaffolding

llama3.2 3B is fast. But it's like a junior engineer who needs constant direction. "Here's the wrong way. Here's the right way. Do the right way."

Qwen 2.5 7B is better but still needs examples. llama3.1 8B is trained for tool use and respects instructions. You get what you pay for.

Third: Agent infrastructure is team infrastructure

This mirrors Day 9's lesson from ShopStable. Design sprints are fast. Agent design is fast. But making agents work as real team members requires:

- Persistent storage (conversations, context, history)
- Real-time communication (SSE streaming)
- Error handling (token expiry, API failures)
- Trust levels (shell access, destructive commands)
- Scaffolding (base prompts, tool registries, personality files)

It's not just code. It's the whole operating system around the code.

Sprint B shipped the foundation. Sprint C/D polished it. But the real test is Monday, when agents start doing actual work.

---

## Technical Details (For the Curious)

- AgentContextService: Reads 5 sources (`AGENT.md`, `PROJECT_SUMMARY`, briefings, stats, live data). Assembles 2-3KB system prompt. No redundancy.
- SSE Streaming: Laravel `response()->stream()`. Anthropic/OpenAI/Ollama provider support. Headers: `X-Accel-Buffering: no`, `Cache-Control: no-cache`, `Content-Type: text/event-stream`.
- Models config: `subscription/claude-opus-4-6` (Max only), `subscription/claude-sonnet-4-6`, `subscription/claude-haiku-4-5-20251001`. The `api_model` field strips the provider prefix.
- Token verification: Hit `/v1/models`. If Opus accessible = Max plan. If only Sonnet = Pro. Store encrypted in `Setting` (global) or `project_ai_configs.encrypted_api_key` (project scope).
- Shell trust levels: none (no execution), readonly (`git log`, `ls`, `cat`), standard (most commands, blocks `rm`/`git push`), full (everything).
- CommandClassifier: Scans command, categorizes as blocked/destructive/readonly/standard/unknown. Executes or returns error.
- Agent-base.md: 1.3KB foundation prepended to every agent prompt. WRONG/CORRECT examples. First-boot rules. Tool format specs.
- Commits: 19 on `update/agent-flow` branch (bc090de through 68e2e91).
- Small model recommendations: llama3.1:8b (tool-trained) > qwen2.5:14b (larger) > qwen2.5:7b (fast but fragile) > llama3.2:3b (too small for tool work).
- ShopStable: Sprints 81-82 complete. Wildcard DNS. Subdomain routing. 27 tasks. Known Alpine reactivity issue on the availability check.

## Series Notes

- Day 10 shipped the infrastructure for Huly and ShopStable team collaboration. Today, that infrastructure became the foundation for agent collaboration.
- Day 8 introduced agents as browser extensions. Today, agents became team members with persistence, tools, and trust levels.
- The recurring pattern: Design is theory. Infrastructure is practice. Both need to happen before the thing works.
- Next up: Sprint E (skills + heartbeat scheduler) and full production testing of the agent system.
- Open questions: Can llama3.1:8b handle complex reasoning with tool chains? How many agents can one Spark project handle before performance degrades? Does the localStorage agent switching work at scale?
- April window: Consulting velocity peak before Q2 budget reset. Use this month to prove the agent system on a real client.
