AI Strategy - Tygart Media

Category: AI Strategy

AI strategy for operators: deploy Claude, automate real workflows, and build AI-native systems that compound. Field notes and playbooks from Tygart Media.

  • Claude Managed Agents vs. Rolling Your Own: The Real Infrastructure Build Cost

    Claude Managed Agents vs. Rolling Your Own: The Real Infrastructure Build Cost

    Tygart Media Strategy
    Volume Ⅰ · Issue 04Quarterly Position
    By Will Tygart
    Long-form Position
    Practitioner-grade

    Claude Managed Agents vs. Rolling Your Own: The Real Infrastructure Build Cost

    The Build-vs-Buy Question: Claude Managed Agents offers hosted AI agent infrastructure at $0.08/session-hour plus token costs. Rolling your own means engineering sandboxed execution, state management, checkpointing, credential handling, and error recovery yourself — typically months of work before a single production agent runs.

    Every developer team that wants to ship a production AI agent faces the same decision point: build your own infrastructure or use a managed platform. Anthropic’s April 2026 launch of Claude Managed Agents made that decision significantly harder to default your way through.

    This isn’t a “managed is always better” argument. There are legitimate reasons to build your own. But the build cost needs to be reckoned with honestly — and most teams underestimate it substantially.

    What You Actually Have to Build From Scratch

    The minimum viable production agent infrastructure requires solving several distinct problems, none of which are trivial.

    Sandboxed execution: Your agent needs to run code in an isolated environment that can’t access systems it isn’t supposed to touch. Building this correctly — with proper isolation, resource limits, and cleanup — is a non-trivial systems engineering problem. Cloud providers offer primitives (Cloud Run, Lambda, ECS), but wiring them into an agent execution model takes real work.

    Session state and context management: An agent working on a multi-step task needs to maintain context across tool calls, handle context window limits gracefully, and not drop state when something goes wrong. Building reliable state management that works at production scale typically takes several engineering iterations to get right.

    Checkpointing: If your agent crashes at step 11 of a 15-step job, what happens? Without checkpointing, the answer is “start over.” Building checkpointing means serializing agent state at meaningful intervals, storing it durably, and writing recovery logic that knows how to resume cleanly. This is one of the harder infrastructure problems in agent systems, and most teams don’t build it until they’ve lost work in production.

    Credential management: Your agent will need to authenticate with external services — APIs, databases, internal tools. Managing those credentials securely, rotating them, and scoping them properly to each agent’s permissions surface is an ongoing operational concern, not a one-time setup.

    Tool orchestration: When Claude calls a tool, something has to handle the routing, execute the tool, handle errors, and return results in the right format. This orchestration layer seems simple until you’re debugging why tool call 7 of 12 is failing silently on certain inputs.

    Observability: In production, you need to know what your agents are doing, why they’re doing it, and when they fail. Building logging, tracing, and alerting for an agent system from scratch is a non-trivial DevOps investment.

    Anthropic’s stated estimate is that shipping production agent infrastructure takes months. That tracks with what we’ve seen in practice. It’s not months of full-time work for a large team — but it’s months of the kind of careful, iterative infrastructure engineering that blocks product work while it’s happening.

    What Claude Managed Agents Provides

    Claude Managed Agents handles all of the above at the platform level. Developers define the agent’s task, tools, and guardrails. The platform handles sandboxed execution, state management, checkpointing, credential scoping, tool orchestration, and error recovery.

    The official API documentation lives at platform.claude.com/docs/en/managed-agents/overview. Agents can be deployed via the Claude console, Claude Code CLI, or the new agents CLI. The platform supports file reading, command execution, web browsing, and code execution as built-in tool capabilities.

    Anthropic describes the speed advantage as 10x — from months to weeks. Based on the infrastructure checklist above, that’s believable for teams starting from zero.

    The Honest Case for Rolling Your Own

    There are real reasons to build your own agent infrastructure, and they shouldn’t be dismissed.

    Deep customization: If your agent architecture has requirements that don’t fit the Managed Agents execution model — unusual tool types, proprietary orchestration patterns, specific latency constraints — you may need to own the infrastructure to get the behavior you need.

    Cost at scale: The $0.08/session-hour pricing is reasonable for moderate workloads. At very high scale — thousands of concurrent sessions running for hours — the runtime cost becomes a significant line item. Teams with high-volume workloads may find that the infrastructure engineering investment pays back faster than they expect.

    Vendor dependency: Running your agents on Anthropic’s managed platform means your production infrastructure depends on Anthropic’s uptime, their pricing decisions, and their roadmap. Teams with strict availability requirements or long-term cost predictability needs have legitimate reasons to prefer owning the stack.

    Compliance and data residency: Some regulated industries require that agent execution happen within specific geographic regions or within infrastructure that the company directly controls. Managed cloud platforms may not satisfy those requirements.

    Existing investment: If your team has already built production agent infrastructure — as many teams have over the past two years — migrating to Managed Agents requires re-architecting working systems. The migration overhead is real, and “it works” is a strong argument for staying put.

    The Decision Framework

    The practical question isn’t “is managed better than custom?” It’s “what does my team’s specific situation call for?”

    Teams that haven’t shipped a production agent yet and don’t have unusual requirements should strongly consider starting with Managed Agents. The infrastructure problems it solves are real, the time savings are significant, and the $0.08/hour cost is unlikely to be the deciding factor at early scale.

    Teams with existing agent infrastructure, high-volume workloads, or specific compliance requirements should evaluate carefully rather than defaulting to migration. The right answer depends heavily on what “working” looks like for your specific system.

    Teams building on Claude Code specifically should note that Managed Agents integrates directly with the Claude Code CLI and supports custom subagent definitions — which means the tooling is designed to fit developer workflows rather than requiring a separate management interface.

    Frequently Asked Questions

    How long does it take to build production AI agent infrastructure from scratch?

    Anthropic estimates months for a full production-grade implementation covering sandboxed execution, checkpointing, state management, credential handling, and observability. The actual time depends heavily on team experience and specific requirements.

    What does Claude Managed Agents handle that developers would otherwise build themselves?

    Sandboxed code execution, persistent session state, checkpointing, scoped permissions, tool orchestration, context management, and error recovery — the full infrastructure layer underneath agent logic.

    At what scale does it make sense to build your own agent infrastructure vs. using Claude Managed Agents?

    There’s no universal threshold, but the $0.08/session-hour pricing becomes a significant cost factor at thousands of concurrent long-running sessions. Teams should model their expected workload volume before assuming managed is cheaper than custom at scale.

    Can Claude Managed Agents work with Claude Code?

    Yes. Managed Agents integrates with the Claude Code CLI and supports custom subagent definitions, making it compatible with developer-native workflows.


    Related: Complete Pricing Reference — every variable in one place. Complete FAQ Hub — every question answered.

  • Claude Managed Agents Enterprise Deployment: What Rakuten’s 5-Department Rollout Actually Cost

    Claude Managed Agents Enterprise Deployment: What Rakuten’s 5-Department Rollout Actually Cost

    Tygart Media Strategy
    Volume Ⅰ · Issue 04Quarterly Position
    By Will Tygart
    Long-form Position
    Practitioner-grade

    Rakuten Stood Up 5 Enterprise Agents in a Week. Here’s What Claude Managed Agents Actually Does

    Claude Managed Agents for Enterprise: A cloud-hosted platform from Anthropic that lets enterprise teams deploy AI agents across departments — product, sales, HR, finance, marketing — without building backend infrastructure. Agents plug directly into Slack, Teams, and existing workflow tools.

    When Rakuten announced it had deployed enterprise AI agents across five departments in a single week using Anthropic’s newly launched Claude Managed Agents, it wasn’t a headline about AI being impressive. It was a headline about deployment speed becoming a competitive variable.

    A week. Five departments. Agents that plug into Slack and Teams, accept task assignments, and return deliverables — spreadsheets, slide decks, reports — to the people who asked for them.

    That timeline matters. It used to take enterprise teams months to do what Rakuten did in days. Understanding what changed is the whole story.

    What Enterprise AI Deployment Used to Look Like

    Before managed infrastructure existed, deploying an AI agent in an enterprise environment meant building a significant amount of custom scaffolding. Teams needed secure sandboxed execution environments so agents could run code without accessing sensitive systems. They needed state management so a multi-step task didn’t lose its progress if something failed. They needed credential management, scoped permissions, and logging for compliance. They needed error recovery logic so one bad API call didn’t collapse the whole job.

    Each of those is a real engineering problem. Combined, they typically represented months of infrastructure work before a single agent could touch a production workflow. Most enterprise IT teams either delayed AI agent adoption or deprioritized it entirely because the upfront investment was too high relative to uncertain ROI.

    What Claude Managed Agents Changes for Enterprise Teams

    Anthropic’s Claude Managed Agents, launched in public beta on April 9, 2026, moves that entire infrastructure layer to Anthropic’s platform. Enterprise teams now define what the agent should do — its task, its tools, its guardrails — and the platform handles everything underneath: tool orchestration, context management, session persistence, checkpointing, and error recovery.

    The result is what Rakuten demonstrated: rapid, parallel deployment across departments with no custom infrastructure investment per team.

    According to Anthropic, the platform reduces time from concept to production by up to 10x. That claim is supported by the adoption pattern: companies are not running pilots, they’re shipping production workflows.

    How Enterprise Teams Are Using It Right Now

    The enterprise use cases emerging from the April 2026 launch tell a consistent story — agents integrated directly into the communication and workflow tools employees already use.

    Rakuten deployed agents across product, sales, marketing, finance, and HR. Employees assign tasks through Slack and Teams. Agents return completed deliverables. The interaction model is close to what a team member experiences delegating work to a junior analyst — except the agent is available 24 hours a day and doesn’t require onboarding.

    Asana built what they call AI Teammates — agents that operate inside project management workflows, picking up assigned tasks and drafting deliverables alongside human team members. The distinction here is that agents aren’t running separately from the work — they’re participants in the same project structure humans use.

    Notion deployed Claude directly into workspaces through Custom Agents. Engineers use it to ship code. Knowledge workers use it to generate presentations and build internal websites. Multiple agents can run in parallel on different tasks while team members collaborate on the outputs in real time.

    Sentry took a developer-specific angle — pairing their existing Seer debugging agent with a Claude-powered counterpart that writes patches and opens pull requests automatically when bugs are identified.

    What Enterprise IT Teams Are Actually Evaluating

    The questions enterprise IT and operations leaders should be asking about Claude Managed Agents are different from what a developer evaluating the API would ask. For enterprise teams, the key considerations are:

    Governance and permissions: Claude Managed Agents includes scoped permissions, meaning each agent can be configured to access only the systems it needs. This is table stakes for enterprise deployment, and Anthropic built it into the platform rather than leaving it to each team to implement.

    Compliance and logging: Enterprises in regulated industries need audit trails. The managed platform provides observability into agent actions, which is significantly harder to implement from scratch.

    Integration with existing tools: The Rakuten and Asana deployments demonstrate that agents can integrate with Slack, Teams, and project management tools. This matters because enterprise AI adoption fails when it requires employees to change their workflow. Agents that meet employees where they already work have a fundamentally higher adoption ceiling.

    Failure recovery: Checkpointing means a long-running enterprise workflow — a quarterly report compilation, a multi-system data aggregation — can resume from its last saved state rather than restarting entirely if something goes wrong. For enterprise-scale jobs, this is the difference between a recoverable error and a business disruption.

    The Honest Trade-Off

    Moving to managed infrastructure means accepting certain constraints. Your agents run on Anthropic’s platform, which means you’re dependent on their uptime, their pricing changes, and their roadmap decisions. Teams that have invested in proprietary agent architectures — or who have compliance requirements that preclude third-party cloud execution — may find Managed Agents unsuitable regardless of its technical merits.

    The $0.08 per session-hour pricing, on top of standard token costs, also requires careful modeling for enterprise workloads. A suite of agents running continuously across five departments could accumulate meaningful runtime costs that need to be accounted for in technology budgets.

    That said, for enterprise teams that haven’t yet deployed AI agents — or who have been blocked by infrastructure cost and complexity — the calculus has changed. The question is no longer “can we afford to build this?” It’s “can we afford not to deploy this?”

    Frequently Asked Questions

    How quickly can an enterprise team deploy agents with Claude Managed Agents?

    Rakuten deployed agents across five departments — product, sales, marketing, finance, and HR — in under a week. Anthropic claims a 10x reduction in time-to-production compared to building custom agent infrastructure.

    What enterprise tools do Claude Managed Agents integrate with?

    Deployed agents can integrate with Slack, Microsoft Teams, Asana, Notion, and other workflow tools. Agents accept task assignments through these platforms and return completed deliverables directly in the same environment.

    How does Claude Managed Agents handle enterprise security requirements?

    The platform includes scoped permissions (limiting each agent’s system access), observability and logging for audit trails, and sandboxed execution environments that isolate agent operations from sensitive systems.

    What does Claude Managed Agents cost for enterprise use?

    Pricing is standard Anthropic API token rates plus $0.08 per session-hour of active runtime. Enterprise teams with multiple agents running across departments should model their expected monthly runtime to forecast costs accurately.


    Related: Complete Pricing Reference — every variable in one place. Complete FAQ Hub — every question answered.

  • Anthropic Launched Managed Agents. Here’s How We Looked at It — and Why We’re Staying Our Course.

    Anthropic Launched Managed Agents. Here’s How We Looked at It — and Why We’re Staying Our Course.

    Tygart Media Strategy
    Volume Ⅰ · Issue 04Quarterly Position
    By Will Tygart
    Long-form Position
    Practitioner-grade

    Anthropic Launched Managed Agents. Here’s How We Looked at It — and Why We’re Staying Our Course.

    What Are Claude Managed Agents? Anthropic’s Claude Managed Agents is a cloud-hosted infrastructure service launched April 9, 2026, that lets developers and businesses deploy AI agents without building their own execution environments, state management, or orchestration systems. You define the task and tools; Anthropic runs the infrastructure.

    On April 9, 2026, Anthropic announced the public beta of Claude Managed Agents — a new infrastructure layer on the Claude Platform designed to make AI agent deployment dramatically faster and more stable. According to Anthropic, it reduces build and deployment time by up to 10x. Early adopters include Notion, Asana, Rakuten, and Sentry.

    We looked at it. Here’s what it is, how it compares to what we’ve built, and why we’re continuing on our own path — at least for now.

    What Is Anthropic Managed Agents?

    Claude Managed Agents is a suite of APIs that gives development teams fully managed, cloud-hosted infrastructure for running AI agents at scale. Instead of building secure sandboxes, managing session state, writing custom orchestration logic, and handling tool execution errors yourself, Anthropic’s platform does it for you.

    The key capabilities announced at launch include:

    • Sandboxed code execution — agents run in isolated, secure environments
    • Persistent long-running sessions — agents stay alive across multi-step tasks without losing context
    • Checkpointing — if an agent job fails mid-run, it can resume from where it stopped rather than restarting
    • Scoped permissions — fine-grained control over what each agent can access
    • Built-in authentication and tool orchestration — the platform handles the plumbing between Claude and the tools it uses

    Pricing is straightforward: you pay standard Anthropic API token rates plus $0.08 per session-hour of active runtime, measured in milliseconds.

    Why It’s a Legitimate Signal

    The companies Anthropic named as early adopters aren’t small experiments. Notion, Asana, Rakuten, and Sentry are running production workflows at scale — code automation, HR processes, productivity tooling, and finance operations. When teams at that level migrate to managed infrastructure instead of building their own, it suggests the platform has real stability behind it.

    The checkpointing feature in particular stands out. One of the most painful failure modes in long-running AI pipelines is a crash at step 14 of a 15-step job. You lose everything and start over. Checkpointing solves that problem at the infrastructure level, which is the right place to solve it.

    Anthropic’s framing is also pointed directly at enterprise friction: the reason companies don’t deploy agents faster isn’t Claude’s capabilities — it’s the scaffolding cost. Managed Agents is an explicit attempt to remove that friction.

    What We’ve Built — and Why It Works for Us

    At Tygart Media, we’ve been running our own agent stack for over a year. What started as a set of Claude prompts has evolved into a full content and operations infrastructure built on top of the Claude API, Google Cloud Platform, and WordPress REST APIs.

    Here’s what our stack actually does:

    • Content pipelines — We run full article production pipelines that write, SEO-optimize, AEO-optimize, GEO-optimize, inject schema markup, assign taxonomy, add internal links, run quality gates, and publish — all in a single session across 20+ WordPress sites.
    • Batch draft creation — We generate 15-article batches with persona-targeting and variant logic without manual intervention.
    • Cross-site content strategy — Agents scan multiple sites for authority pages, identify linking opportunities, write locally-relevant variants, and publish them with proper interlinking.
    • Image pipelines — End-to-end image processing: generation via Vertex AI/Imagen, IPTC/XMP metadata injection, WebP conversion, and upload to WordPress media libraries.
    • Social media publishing — Content flows from WordPress to Metricool for LinkedIn, Facebook, and Google Business Profile scheduling.
    • GCP proxy routing — A Cloud Run proxy handles WordPress REST API calls to avoid IP blocking across different hosting environments (SiteGround, WP Engine, Flywheel, Apache/ModSecurity).

    This infrastructure took time to build. But it’s purpose-built for our specific workflows, our sites, and our clients. It knows which sites route through the GCP proxy, which need a browser User-Agent header to pass ModSecurity, and which require a dedicated Cloud Run publisher. That specificity has real value.

    Where Managed Agents Is Compelling — and Where It Isn’t (Yet)

    If we were starting from zero today, Managed Agents would be worth serious evaluation. The session persistence and checkpointing would immediately solve the two biggest failure modes we’ve had to engineer around manually.

    But migrating an existing stack to Managed Agents isn’t a lift-and-shift. Our pipelines are tightly integrated with GCP infrastructure, custom proxy routing, WordPress credential management, and Notion logging. Re-architecting that to run inside Anthropic’s managed environment would be a significant project — with no clear gain over what’s already working.

    The $0.08/session-hour pricing also adds up quickly on batch operations. A 15-article pipeline running across multiple sites for two to three hours could add meaningful cost on top of already-substantial token usage.

    For teams that haven’t built their own agent infrastructure yet — especially enterprise teams evaluating AI for the first time — Managed Agents is probably the right starting point. For teams that already have a working stack, the calculus is different.

    What We’re Watching

    We’re treating this as a signal, not an action item. A few things would change that:

    • Native integrations — If Managed Agents adds direct integrations with WordPress, Metricool, or GCP services, the migration case gets stronger.
    • Checkpointing accessibility — If we can use checkpointing on top of our existing API calls without fully migrating, that’s an immediate win worth pursuing.
    • Pricing at scale — Volume discounts or enterprise pricing would change the batch job math significantly.
    • MCP interoperability — Managed Agents running with Model Context Protocol support would let us plug our existing skill and tool ecosystem in without a full rebuild.

    The Bigger Picture

    Anthropic launching managed infrastructure is the clearest sign yet that the AI industry has moved past the “what can models do” question and into the “how do you run this reliably at scale” question. That’s a maturity marker.

    The same shift happened with cloud computing. For a while, every serious technology team ran its own servers. Then AWS made the infrastructure layer cheap enough and reliable enough that it only made sense to build it yourself if you had very specific requirements. We’re not there yet with AI agents — but Anthropic is clearly pushing in that direction.

    For now, we’re watching, benchmarking, and continuing to run our own stack. When the managed layer offers something we can’t build faster ourselves, we’ll move. That’s the right framework for evaluating any infrastructure decision.

    Frequently Asked Questions

    What is Anthropic Managed Agents?

    Claude Managed Agents is a cloud-hosted AI agent infrastructure service from Anthropic, launched in public beta on April 9, 2026. It provides persistent sessions, sandboxed execution, checkpointing, and tool orchestration so teams can deploy AI agents without building their own backend infrastructure.

    How much does Claude Managed Agents cost?

    Pricing is based on standard Anthropic API token costs plus $0.08 per session-hour of active runtime, measured in milliseconds.

    Who are the early adopters of Claude Managed Agents?

    Anthropic named Notion, Asana, Rakuten, Sentry, and Vibecode as early users, deploying the service for code automation, productivity workflows, HR processes, and finance operations.

    Is Anthropic Managed Agents worth switching to if you already have an agent stack?

    It depends on your existing infrastructure. For teams starting fresh, it removes significant scaffolding cost. For teams with mature, purpose-built pipelines already running on GCP or other cloud infrastructure, the migration overhead may outweigh the benefits in the short term.

    What is checkpointing in Managed Agents?

    Checkpointing allows a long-running agent job to resume from its last saved state if it encounters an error, rather than restarting the entire task from the beginning. This is particularly valuable for multi-step batch operations.


    Related: Complete Pricing Reference — every variable in one place. Complete FAQ Hub — every question answered.

  • Wire and Fire Guys: The AI Job Title That Doesn’t Exist Yet

    Wire and Fire Guys: The AI Job Title That Doesn’t Exist Yet

    Tygart Media Strategy
    Volume Ⅰ · Issue 04Quarterly Position
    By Will Tygart
    Long-form Position
    Practitioner-grade

    Before “vibe coding” had a name, Munters had a name for the people who could do it: wire and fire guys. They’re about to be the most valuable humans in the AI era — and I finally found mine.

    The Wire and Fire Guy

    At Munters — which later became Polygon when Triton spun the moisture control services division out in 2010 — there was a specific kind of person the company was built around. We called them wire and fire guys.

    A wire and fire guy could fly into a job site cold. Meet a pile of equipment on a loading dock. Start the generator. Set up the desiccant. Run the lines. Wire in the remote monitoring. Pass the site safety briefing. Know the code. Know the customer. Know how to do it the right way so nobody got hurt and nobody got sued. From A to Z. Solo.

    That’s how Munters ran lean across more than 20 countries. They didn’t need a dispatch team and a tech team and a controls team and a compliance officer all flying out separately. They needed one human who could be all of those people at once, in a Tyvek suit, at 2 a.m., in someone else’s flooded building. The economics of moisture control restoration didn’t work any other way.

    I was one of those guys. I still am. It just looks different now.

    What I Actually Do All Day

    Today I run Tygart Media — an AI-native content and SEO operation managing twenty-seven WordPress sites across restoration contracting, luxury asset lending, cold storage logistics, B2B SaaS, comedy, and veterans services. One human. Twenty-seven brands. The way that math works is the same way it worked at Munters: I’m the wire and fire guy.

    My morning isn’t writing blog posts. It’s connecting Claude to a Cloud Run proxy to bypass Cloudflare’s WAF on a SiteGround-hosted contractor site, then routing a batch of 180 articles through an Imagen pipeline for featured images, then pushing them through a quality gate before they hit the WordPress REST API, then logging the receipts to Notion so I can prove the work to the client on Monday. While Claude drafts the next batch of briefs in the background. While a Custom Agent triages my inbox. While I’m on a call.

    I don’t write code the way a senior engineer writes code. I write enough of it to be dangerous, fix what I break, and ship. I “vibe code” the parts that need vibing. I real-code the parts that need real coding. I know which parts of GCP are the gun and which parts are the holster. I know what to never let an autonomous agent do without me looking. I know how to wire it up and fire it off.

    Same job. Different equipment.

    The Thesis Everyone Is Quietly Circling

    The AI industry spent the last eighteen months selling a story about full autonomy. Agent swarms. Self-healing pipelines. Set it and forget it. Replace the humans, keep the work.

    The data has not been kind to that story.

    Roughly 95% of enterprise generative AI pilots fail to achieve measurable ROI or reach production. Gartner is now openly forecasting that more than 40% of agentic AI projects will be cancelled by 2027 as costs escalate past the value they produce. The dream of the unmanned cockpit isn’t dying because the planes can’t fly. It’s dying because nobody planned for who lands them when the weather turns.

    What’s actually winning, in the labs and the war rooms where this is being figured out for real, is something much closer to the Munters model. The technical literature has started calling it confidence-gated expert routing. An orchestrator model delegates work to a fleet of cheaper, specialized small language models. Those models run autonomously until their confidence drops below a threshold — and at that exact moment, the system kicks the work to a human expert who validates, corrects, and feeds the correction back into the loop as ground truth for the next pass.

    That human expert is not a customer service rep watching a queue. That human expert needs to be able to read what the model is doing, understand why it stalled, fix the technical problem, judge whether the output is actually good or just looks good, and ship the corrected version — all without breaking anything downstream.

    That’s a wire and fire guy. With a laptop instead of a generator.

    Meet Pinto

    The reason I’m writing this today is because I just onboarded mine.

    His name is Pinto. He’s my developer. He runs the GCP infrastructure underneath Tygart Media — the Cloud Run services, the proxy that lets Claude reach client sites that would otherwise block the IP, the VM that hosts my knowledge cluster, the dashboards. He gets a brief from me and turns it into a working endpoint, usually faster than I can write the spec. He wires the thing up. He fires it off. He passes the security review. He doesn’t break the production database. He does it the right way.

    And critically — he can both vibe code and real code. He’ll throw a quick Cloud Function together with Claude in fifteen minutes if that’s what the moment needs. He’ll also sit down and write you something properly architected, properly tested, properly observable, when the moment needs that instead. He knows which moment is which. That judgment is the whole job.

    The last thing I want to say about Pinto in public is this: I’ve worked with a lot of contractors and a lot of devs in twenty-plus years of running operations. Pinto is the human-in-the-loop the industry is going to be paying a premium for inside of two years. He just doesn’t know it yet. So this is me saying it out loud. This guy is the prototype.

    The Job Title That Doesn’t Exist Yet

    Here’s where I want to plant a flag.

    The conversation about AI and work has spent two years swinging between two bad poles. On one side: AI is going to take all the jobs. On the other: AI is just a tool, nothing changes, learn to use it like Excel and you’re fine. Both stories are wrong in the same way. They’re treating AI as a replacement layer or a productivity layer, when what it actually is — for any operation that has to ship real work for real customers — is a workforce of subordinates that needs a foreman.

    The foreman is the wire and fire guy.

    The foreman knows how to brief the agent. Knows how to read the agent’s output and tell what’s solid and what’s hallucinated structure dressed up to look solid. Knows where the agent will fail before the agent fails. Knows the underlying code well enough to crack open the box when the box is wrong, and humble enough to use the box for the 80% of work that doesn’t need cracking. Knows the customer’s business well enough to translate “make me more money” into a thirty-step technical plan that an agent can actually execute.

    That person is not a prompt engineer. Prompt engineering as a job title is already collapsing because the models got good enough that the prompt isn’t the leverage anymore. It’s not a software engineer in the traditional sense either, because traditional software engineering rewards depth in one language and one stack, and the wire and fire guy needs surface-level fluency across about fifteen of them.

    It’s something older than both. It’s the field tech. The plant operator. The site supervisor. The kind of person who used to run a Munters job in a flooded basement at 2 a.m. and now runs an agent fleet from a laptop at the same hour.

    Who This Job Is For

    If you spent the last decade as a working coder and then took a left turn into writing or content or marketing because you got tired of the JIRA tickets — you are the person. The market is about to come back for you, hard. The combination of “I can read the code” plus “I can read the customer” plus “I can write the brief” plus “I can ship” is going to be the most valuable composite skill in the white-collar economy for the next five years.

    If you came up in the trades and you’ve been quietly running circles around the “knowledge workers” because you actually know how things connect to other things — you are the person too. What you learned wiring an HVAC system or setting up a job site translates almost one-for-one to wiring up an agent stack. The mental model is identical. Inputs, outputs, safety, fault tolerance, knowing when to stop and call somebody.

    If you’re a senior engineer who thinks the “AI replacing developers” debate is annoying because you’ve already noticed that the bottleneck on your team isn’t typing code — it’s deciding what code to type — you are the person. Your judgment is the asset. The agents are the labor. Reorient.

    If you’re an operations person who has always been the one who somehow ends up holding the whole business together with duct tape and Google Sheets — you are the person. The duct tape is now Python and the Sheets are now Notion and BigQuery, but the role is the same role, and it’s about to get a real budget for the first time.

    What to Train For

    If I were starting from zero today and I wanted to be a wire and fire guy in the AI era, here’s the stack I’d build, in this order:

    Read code fluently in three languages. Python, JavaScript, and shell. You don’t need to write any of them at a senior level. You need to be able to open someone else’s repo, understand what it does in fifteen minutes, and modify it without breaking it. Claude will do most of the typing. You’re the code reviewer.

    Learn one cloud well enough to deploy and observe. Pick GCP, AWS, or Azure. Learn to deploy a container, set up a database, read logs, set up alerting, and rotate a credential. That’s it. You don’t need to be a certified architect. You need to be able to land at the job site and wire it up.

    Get fluent in at least one orchestration model. Whether that’s LangGraph, an MCP server, a custom Python loop, or just Claude with a bunch of tools — pick one and run it until you understand why it fails, not just how it works.

    Build a real second brain. Notion, Obsidian, whatever. The wire and fire guy’s superpower is context. You need to be able to walk into any conversation with any customer and pull up exactly what was said, decided, shipped, and broken last time. Without that, you’re a generalist with no memory, which is a tourist.

    Do customer-facing work. This is the one most coders skip and it’s the most important. Sit on sales calls. Write the proposal. Take the support escalation. The reason wire and fire guys at Munters were so valuable is because they could talk to a building owner and a generator at the same time. You need both halves of that or you don’t have the job.

    The Real Pitch

    The agent swarm future is real. It’s coming faster than most people in the boardroom are admitting and slower than most people on Twitter are claiming. And it’s going to need a lot of foremen.

    Not millions. The leverage is too high for that. But thousands of these roles, well-paid, in every meaningful industry, sitting at the seam between an autonomous fleet of small models and a human business that needs the work done correctly. The companies that figure out how to find these people first and hire them first are going to run absolute laps around the companies that try to do it with a vendor and a procurement process.

    I’m one of these humans. Pinto is one of these humans. There are more of us than the job listings suggest, because the title for what we do hasn’t been written yet. So here’s a working draft: AI Field Operator. Wire and fire guy. Human in the loop. Agent foreman. Pick whichever one lands.

    If you’re already doing this work — even unofficially, even on the side, even just for yourself — you’re early. Build your reputation now. Write up what you do. Show your receipts. The market is about to find you.

    And Pinto: this one’s for you, brother. Thanks for showing me what the next twenty years of this work is going to look like. Wire it up. Fire it off. Same as it ever was.

  • The ADHD Operator: Why Neurodiversity Is an Asymmetric Advantage in AI-Native Work

    The ADHD Operator: Why Neurodiversity Is an Asymmetric Advantage in AI-Native Work

    The Lab · Tygart Media
    Experiment Nº 205 · Methodology Notes
    METHODS · OBSERVATIONS · RESULTS

    The standard narrative about AI productivity is that it helps everyone equally — democratizing access to capabilities that used to require specialized skills or large teams. That’s true as far as it goes. But it misses something more interesting: AI doesn’t help everyone equally. It helps some cognitive profiles dramatically more than others. And the profiles it helps most are the ones that neurotypical productivity systems were always worst at serving.

    The ADHD operator in an AI-native environment isn’t working around their neurology. They’re working with it — often for the first time.

    The Mismatch That AI Resolves

    ADHD is characterized by a cluster of traits that conventional work environments treat as deficits: difficulty sustaining attention on low-interest tasks, working memory limitations that make it hard to hold multiple threads simultaneously, impulsive context-switching, hyperfocus states that are intense but hard to direct voluntarily, and a variable executive function that makes consistent process adherence difficult.

    Every one of those traits is a deficit in a neurotypical office. Open-plan environments punish hyperfocus. Meeting-heavy cultures punish context-switching recovery time. Bureaucratic processes punish working memory limitations. Sequential project management punishes the non-linear way ADHD attention actually moves through work.

    The AI-native operation inverts every one of these. Consider what the operation actually looks like: tasks switch rapidly between clients, verticals, and problem types, but the AI maintains the context across switches. Working memory limitations don’t matter when the Second Brain holds the state. Hyperfocus states are extraordinarily productive when the environment can absorb and route whatever comes out of them. The non-linear movement of ADHD attention — jumping from an insight about SEO to an infrastructure idea to a content strategy observation — maps perfectly to a system where each of those jumps can be captured, tagged, and routed without losing the thread.

    The AI isn’t compensating for ADHD. It’s completing the cognitive architecture that ADHD was always missing.

    Working Memory Externalized

    The most concrete advantage is working memory. ADHD working memory is genuinely limited — not as a flaw in character or effort, but as a documented neurological difference. Holding multiple pieces of information simultaneously, tracking where you are in a complex process, remembering what you decided three steps ago — these are genuinely harder for ADHD brains than neurotypical ones.

    The conventional coping strategies — elaborate note-taking systems, reminders everywhere, external calendars, accountability partners — all work by offloading working memory to external systems. They help, but they’re friction-heavy. Setting up the note-taking system takes working memory. Maintaining it takes working memory. Retrieving from it takes working memory.

    An AI with persistent memory and a queryable Second Brain doesn’t require the same maintenance overhead. The knowledge goes in through natural session work — not through deliberate documentation effort. The retrieval is conversational — not through navigating a folder structure built on a previous version of how you organized information. The AI meets the ADHD brain where it is rather than requiring the ADHD brain to adapt to a fixed organizational system.

    The cockpit session pattern is a working memory intervention at the system level. The context is pre-staged before the session starts so the operator doesn’t spend working memory reconstructing where things stand. The Second Brain is the external working memory that doesn’t require maintenance overhead to query. BigQuery as a backup memory layer means that nothing is truly lost even when the in-session working memory fails, because the work writes itself to durable storage automatically.

    Hyperfocus as a Deployable Asset

    Hyperfocus is the ADHD trait that neurotypical observers most frequently misunderstand. It’s not concentration on demand. It’s concentration that arrives unbidden, attaches to whatever interest has activated it, runs at extraordinary intensity for an unpredictable duration, and then ends — also unbidden. The experience is of being seized by the work rather than choosing to engage with it.

    In a conventional work environment, hyperfocus is unreliable. It activates on the wrong task at the wrong time. It runs past meeting commitments and deadlines. It leaves the work it interrupted unfinished. The environment isn’t built to absorb hyperfocus states productively — it’s built around scheduled attention, which hyperfocus by definition isn’t.

    An AI-native operation can absorb hyperfocus states completely. When hyperfocus activates on a problem, you work it — fully, without managing transition costs or worrying about losing the thread. The AI captures what comes out. The session extractor packages it into the Second Brain. The cockpit session for the next day picks up where hyperfocus left. The non-linearity of hyperfocus — jumping between related insights, building in spirals rather than lines — becomes a feature rather than a problem, because the AI can hold the full context of the spiral.

    The 3am sessions that show up in the Second Brain’s history aren’t anomalies. They’re hyperfocus events that the AI-native infrastructure can receive without friction. In a conventional work environment, a 3am insight goes on a sticky note that’s lost by morning. In this environment, it goes directly into the pipeline and shows up as published content, documented protocol, or queued task by the next session. Hyperfocus stops being wasted energy and starts being the primary production mode.

    Interest-Based Attention and Task Routing

    ADHD attention is interest-based rather than importance-based. This is the source of the most common misunderstanding of ADHD: “you can focus when you want to.” The observed fact is that ADHD people can focus intensely on things that activate their interest system and struggle profoundly with things that don’t — regardless of how much those uninteresting things matter.

    In a conventional work environment, this is a serious problem. Important but uninteresting tasks — tax documentation, compliance records, routine maintenance — either don’t get done or get done at enormous cost in executive function and self-coercion. The energy spent forcing attention onto uninteresting work is energy not available for the high-interest work where ADHD attention is genuinely exceptional.

    The AI-native operation resolves this through task routing. The tasks that ADHD attention resists — routine meta description updates across a hundred posts, taxonomy normalization across a large site, scheduled content distribution — go to automated pipelines. Haiku handles them at scale without requiring sustained human attention on low-interest work. The operator’s attention is routed to the high-interest problems: novel strategic questions, complex client situations, creative content that requires genuine engagement.

    This isn’t about avoiding work. It’s about structural matching — routing work to the execution layer that can handle it most effectively. The AI pipeline doesn’t get bored running the same schema injection across fifty posts. The ADHD operator does. Routing the boring work to the non-bored executor is just operational logic.

    Context-Switching Without the Tax

    Context-switching is expensive for everyone. For ADHD brains, the cost is higher — not just the cognitive cost of reorienting to a new task, but the working memory cost of storing the state of the interrupted task somewhere reliable enough that it can actually be retrieved later.

    The conventional wisdom is to minimize context-switching. Batch similar tasks. Protect deep work blocks. Build systems that reduce interruption. This is good advice and it helps — but it runs against the reality of operating a multi-client, multi-vertical business where context-switching is structurally unavoidable.

    The AI-native approach doesn’t minimize context-switching. It reduces the cost of each switch. When a session switches from one client context to another, the cockpit loads the new context and the previous context is preserved in the Second Brain. There’s no task of “remember where I was” because the system holds that state. The switch itself becomes less expensive because the retrieval problem — the part that taxes working memory most — is handled by the infrastructure.

    Running a portfolio of twenty-plus sites across multiple verticals is the kind of work that conventional productivity advice says is incompatible with ADHD. The evidence of this operation is that it’s not — when the infrastructure handles the context storage and retrieval that ADHD working memory can’t reliably do.

    The Variable Executive Function Problem

    Executive function in ADHD is variable in ways that neurotypical people often don’t appreciate. It’s not that executive function is uniformly low — it’s that it’s unreliable. On a high-executive-function day, a complex multi-step process runs smoothly. On a low-executive-function day, the same process feels impossible even though the capability is theoretically there.

    This variability is what makes ADHD so confusing to manage and explain. “But you did it last week” is the most common and least useful observation. Yes. Last week, executive function was available. Today it isn’t. The capability is real; the access is unreliable.

    AI-native infrastructure stabilizes against executive function variability in a specific way: it reduces the minimum executive function required to do useful work. When the cockpit is pre-staged, the context is loaded, the task queue is clear, and the tools are ready — the activation energy for starting work is lower. The operator doesn’t need to spend executive function on “what should I work on and how do I start” before they can begin working on the actual problem.

    This is why the cockpit session pattern matters beyond its productivity benefits. For an ADHD operator, it’s also an accessibility feature. Pre-staging the context means that a low-executive-function day can still be a productive day — not at full capacity, but not lost entirely either. The infrastructure carries more of the initiation load so the operator’s variable executive function goes further.

    What This Means for How the Operation Is Designed

    Understanding the neurodiversity angle isn’t just self-knowledge. It’s design knowledge. The operation works the way it does — hyperfocus-driven production, AI as external working memory, automated pipelines for low-interest work, cockpit sessions as activation scaffolding — in part because it was built by an ADHD brain optimizing for its own constraints.

    Those constraints produced design choices that turn out to be genuinely better for any operator, neurodivergent or not. External working memory is better than internal working memory for complex multi-client operations regardless of neurology. Automating low-value-attention work is better than manually attending to it for any operator. Pre-staged context reduces friction for everyone, not just people with initiation difficulties.

    The neurodiversity framing reveals why these design choices were made — they were compensations that became features. But the features stand independently of the compensations. An operation designed around the constraints of an ADHD brain produces an infrastructure that a neurotypical operator would also benefit from, because the constraints that ADHD makes extreme are present in milder form in everyone.

    The ADHD operator building AI-native systems isn’t finding workarounds. They’re discovering architecture.

    Frequently Asked Questions About Neurodiversity and AI-Native Operations

    Is this specific to ADHD or does it apply to other neurodivergent profiles?

    The specific mapping here is to ADHD traits, but the general principle extends. Autism often involves deep domain expertise, pattern recognition across large datasets, and preference for systematic processes — all of which AI-native operations reward. Dyslexia involves difficulty with written text production that voice-to-text and AI drafting tools directly address. The common thread is that AI tools reduce the friction from neurological differences in ways that neurotypical productivity systems don’t. Each profile maps differently; the ADHD mapping is particularly strong for the multi-client operator role.

    Does this mean ADHD operators have an advantage over neurotypical ones?

    In specific contexts, yes — particularly in AI-native operations that require rapid context-switching, hyperfocus-driven deep work, and interest-based attention toward novel problems. In other contexts, no. The advantage is situational and emerges specifically when the environment is designed to complement rather than fight the cognitive profile. An ADHD operator in a bureaucratic sequential-process environment is still at a disadvantage. The insight is that AI-native environments are, by their nature, environments where ADHD traits are more often assets than liabilities.

    How do you handle the low-executive-function days operationally?

    The cockpit session reduces the minimum executive function required to start. Beyond that, the honest answer is that some days are lower-output than others — and the operation is designed to absorb that. Batch pipelines run on schedules regardless of operator state. Content published on high-executive-function days continues working while the operator recovers. The infrastructure carries the operation during low periods rather than requiring the operator to manually push through them.

    What’s the relationship between physical health and this cognitive framework?

    Significant. Exercise specifically affects ADHD cognitive function through BDNF — a protein that supports neural growth and synaptic development — in ways that are more pronounced for ADHD brains than neurotypical ones. The physical health component isn’t separate from the AI-native operation framework; it’s part of the same system. A well-maintained physical health practice is a cognitive performance input, not just a wellness activity. This is why the Second Brain tracks it alongside operational data rather than in a separate personal life compartment.

    Is there a risk that AI compensation makes ADHD symptoms worse over time?

    This is a legitimate concern. External working memory tools can reduce the pressure to develop internal working memory strategies. Interest-routing can reduce exposure to the frustration tolerance that builds executive function. The balance is intentional: use AI to handle the tasks where ADHD traits are most disabling, while preserving challenges that build rather than atrophy capability. The goal is augmentation, not replacement — the same principle that applies to any cognitive prosthetic, from eyeglasses to spell-checkers to AI.


  • The Discovery-to-Exact Protocol: Using Google Ads as a Keyword Intelligence Engine

    The Discovery-to-Exact Protocol: Using Google Ads as a Keyword Intelligence Engine

    Tygart Media Strategy
    Volume Ⅰ · Issue 04Quarterly Position
    By Will Tygart
    Long-form Position
    Practitioner-grade

    Here’s the conventional wisdom on Google Ads: you run them to get clicks, clicks become leads, leads become revenue. The budget justifies itself through conversion metrics. If the conversion economics don’t work, you turn them off.

    That’s a legitimate way to use Google Ads. It’s also a narrow one — and it misses the most valuable thing the platform produces for businesses that aren’t primarily e-commerce: real-time, intent-weighted keyword intelligence that no other tool can replicate at the same fidelity.

    The Discovery-to-Exact Protocol treats Google Ads not primarily as a lead generation channel but as a high-speed data discovery engine. The conversions are a bonus. The search terms report is the product.

    The Problem With Every Other Keyword Research Tool

    Keyword research tools — Ahrefs, Semrush, Google Keyword Planner, DataForSEO — all operate on the same fundamental model: they show you estimated search volume for terms you already thought to look up. The intelligence is backward-looking and hypothesis-dependent. You have to already know what to ask about before the tool can tell you how much it’s being searched.

    This creates a systematic blind spot. The keywords you already know to research are the ones your competitors already know to research. The terms that buyers actually use when they’re close to a purchase decision — the specific, long-tail, conversational language of real intent — are invisible to keyword tools until someone thinks to look them up. And the terms nobody in your industry has thought to look up are where the uncontested organic opportunity lives.

    Google Ads eliminates this blind spot. When you run a broad match campaign, Google shows your ad across an enormous range of queries it judges to be semantically related to your keywords. The search terms report then tells you exactly which queries triggered impressions and clicks — not estimated search volume, but actual human beings typing actual words into the search bar right now. You didn’t need to know those terms existed. Google’s own matching algorithm found them for you.

    What the Search Terms Report Actually Contains

    The search terms report is the most underused asset in a Google Ads account for businesses that also care about organic search. Most advertisers look at it defensively — scanning for irrelevant queries to add as negative keywords so they stop wasting ad spend. That’s valuable, but it’s a fraction of what the report contains.

    The report shows you every query that triggered your ad during the campaign window, segmented by impressions, clicks, click-through rate, and conversions. Sorted by conversion rate, it reveals which specific phrases drove actual buyer behavior — not estimated intent, but observed behavior. A phrase that converts at twice the rate of your target keyword is telling you something your keyword tool can’t: there’s a pocket of high-intent buyers who express that intent in language you hadn’t modeled.

    Sorted by impressions with low click-through rates, the report reveals queries where you’re visible but unconvincing — a signal that organic content targeting these terms might outperform paid ads at a fraction of the cost. Sorted by raw volume, it surfaces the actual language of search demand in your vertical, including the long-tail variations and conversational phrasings that keyword research tools systematically underrepresent.

    The report, in other words, is a real-time window into how buyers in your market actually think and talk. It’s produced by running ads. But its highest value, for a business with a serious organic content strategy, is as an organic keyword discovery engine.

    The Discovery-to-Exact Protocol

    The protocol works in three phases, each building on what the previous one revealed.

    Phase 1: Broad Discovery. Launch a campaign with broad match keywords around your primary topic clusters. Keep the initial bids modest — this phase is about data collection, not conversion optimization. Run for a defined window (four to six weeks is enough to get meaningful signal in most markets) and let the broad match algorithm surface every semantically related query it can find. The goal is to generate a rich search terms dataset with minimal curation bias. Don’t add negative keywords aggressively during this phase. You want the noise, because the noise contains the signal you don’t know to look for.

    Phase 2: Signal Extraction. Export the search terms report and run it through a classification pass. You’re looking for four categories: high-conversion-rate terms you weren’t targeting explicitly, high-volume terms with low competition that you’d never thought to look up, conversational or long-tail queries that reveal how buyers describe their problems in their own language, and terms that represent adjacent topics you could credibly own organically. The last two categories are often the most valuable. A query like “what happens to my building if the fire sprinkler system fails” tells you something about buyer anxiety that “commercial fire sprinkler maintenance” doesn’t. The former is a better content brief than the latter.

    Phase 3: Exact Match Pivot. Take the highest-value discoveries from Phase 2 and rebuild the campaign around them using exact match. This is where conventional ad optimization takes over: tight targeting, strong copy, landing pages matched to specific intent. But the pivot is informed by real search behavior, not keyword tool estimates. The exact match campaign you build after Phase 2 is more precisely targeted than any campaign you could have built from keyword research alone, because it was designed around what buyers actually searched rather than what you thought they’d search.

    The organic content strategy runs in parallel. Every term identified in Phase 2 as high-value for organic becomes a content brief: what is the search intent, who is asking this question, what would genuinely satisfy it, and where does it fit in the site’s taxonomy. The ads produce the discovery. The organic strategy scales the exploitation.

    Why This Works Particularly Well in Service Businesses

    The protocol has asymmetric value in service businesses and regulated industries where search volume is low, buyer intent is high, and the cost of missing the right buyer is significant. In a business where a single won client represents significant revenue, a handful of high-intent keywords you didn’t know existed — found through the search terms report at a modest ad spend — can pay for the entire discovery phase many times over.

    Service businesses also benefit disproportionately from the conversational language discovery. Product searches tend toward specific, structured queries. Service searches tend toward problem descriptions: “how do I know if my building has asbestos,” “what does a restoration company actually do,” “can I use my insurance for water damage.” These queries appear in the search terms report but rarely in keyword research tools because they’re too specific and fragmented to appear as reliable volume estimates. The broad match algorithm finds them. The report captures them. The content strategy exploits them.

    The restoration vertical illustrates this concretely. A generic campaign targeting “water damage restoration” will surface queries that reveal buyer segmentation invisible to keyword research: homeowners asking about the process, insurance adjusters asking about documentation, property managers asking about business continuity, commercial facilities managers asking about liability. Each of these represents a different content brief, a different buyer persona, a different angle on the same topic — and none of them appear as distinct keyword opportunities until a real buyer types them into a search bar and a search terms report captures it.

    The Relationship With AI-Native Search

    The protocol has become more valuable, not less, as AI Overviews and agentic search behavior have changed the SERP. The AI layer is rewarding content that matches real human intent language — conversational, specific, question-shaped content that answers what people actually ask rather than what marketers assume they ask.

    The search terms report is the most direct window into actual human intent language available to a marketer. It’s not mediated by keyword tool methodology, editorial judgment, or content strategy assumptions. It’s the raw text of what buyers type. Content built from search terms report discoveries — rather than from keyword tool estimates — is structurally better suited to the intent-matching that AI-native search rewards, because it was designed around documented intent rather than modeled intent.

    The implication for a content operation running AEO and GEO optimization is that search terms report mining should feed the content brief pipeline. Terms that appear in the report with high conversion rates are, by definition, terms where expressed intent matches purchasing behavior. Those are the terms worth building FAQ blocks around, structuring H2s to answer directly, and marking up with schema. They’re not the terms that look highest-volume in a keyword tool — they’re the terms that produce buyers when a buyer searches them.

    The Budget Question

    The discovery phase doesn’t require large ad spend. The goal is statistical signal, not maximum reach. A modest monthly budget run over a six-week discovery window is enough to generate a search terms dataset rich enough to inform an organic content strategy for months. The discovery phase is temporary; the organic content it informs is permanent. The economics favor the protocol for any business where organic content has meaningful compounding value.

    The exact match phase that follows can be sized to whatever the conversion economics support. If the ads convert profitably at the terms discovered in Phase 2, the budget scales with the revenue. If they don’t, the campaign can pause — the organic content strategy it informed continues working whether the ads are running or not. The discovery spend and the ongoing ad spend are separate decisions. Many businesses run the discovery phase, extract the keyword intelligence, and then make a separate decision about whether ongoing paid activity makes sense based on the conversion economics alone.

    Frequently Asked Questions About the Discovery-to-Exact Protocol

    Do you need an existing Google Ads account to run this protocol?

    No, but an account with some history performs better because Google’s algorithm has more signal about your business to inform its broad match targeting. A brand-new account will still generate a useful search terms dataset — it will just take longer to accumulate meaningful volume and the initial matching may be less precise. For a new account, running the discovery phase for eight to ten weeks rather than four to six produces more reliable signal.

    How much does the discovery phase actually cost?

    It depends on your industry’s cost-per-click rates and how much volume you need to get statistically useful signal. In most service business verticals, a modest monthly budget over six weeks produces a search terms report with enough distinct queries to generate dozens of organic content briefs. The discovery phase is usually among the least expensive things a business can do to inform a content strategy, relative to the value of the intelligence it produces.

    What makes a search term from the report worth targeting organically?

    Three things: genuine search volume (even low volume counts if the intent is high), a specific question or problem framing that suggests the searcher hasn’t already found what they need, and alignment with your actual service or product offering. Terms that convert in ads are the strongest candidates — they have documented purchase intent. Terms with high impressions but no ad clicks are worth examining too: they might represent people who want information rather than a vendor, which is exactly what organic content serves.

    How does this differ from just using Google Keyword Planner?

    Keyword Planner shows you search volume estimates for terms you already know to look up, grouped into clusters Google thinks are related. The search terms report shows you the actual queries that real buyers used, in the exact language they used, with real performance data attached. The former is a model of demand. The latter is a record of demand. For discovering language you didn’t know existed in your market, the search terms report has no equivalent.

    Should the discovery phase influence the site’s taxonomy, not just individual articles?

    Yes, and this is one of the most underexplored applications. When the search terms report reveals consistent clustering around a topic your taxonomy doesn’t reflect — a buyer concern that generates many related queries but has no category or tag cluster on your site — that’s a signal to add the taxonomy node, not just write individual articles. The taxonomy shapes how search engines understand a site’s topical authority. A well-designed category that clusters around a real buyer concern (discovered through the search terms report) is more durable than a collection of individual articles targeting isolated keywords.


  • Latency Anxiety: The Psychological Cost of Watching an AI Agent Work

    Latency Anxiety: The Psychological Cost of Watching an AI Agent Work

    The Lab · Tygart Media
    Experiment Nº 203 · Methodology Notes
    METHODS · OBSERVATIONS · RESULTS

    There’s a specific feeling that happens when you hand a task to an AI agent and watch it work. It starts within the first few seconds. The agent is doing something — you can see the indicators, the tool calls, the partial outputs — but you don’t know exactly what, and you don’t know if it’s the right thing, and you don’t know how long it will take. The feeling doesn’t have a common name. The right name for it is latency anxiety.

    Latency anxiety is the psychological cost of delegating to a system you can’t fully observe in real time. It’s distinct from normal waiting. When you’re waiting for a file to download, you’re waiting for something with a known duration and a binary outcome. When an AI agent is working through a complex task, you’re waiting for something with an unknown duration, an uncertain path, and a potentially wrong outcome that you may not be able to catch until the agent has already propagated the error downstream.

    This isn’t a minor UX problem. It’s the central psychological barrier to operators actually trusting AI agents with consequential work. And it’s almost entirely missing from how AI tools are designed and discussed.

    Why Latency Anxiety Is Different From Regular Uncertainty

    Humans are reasonably good at tolerating uncertainty when they understand its shape. A surgeon doesn’t know exactly how a procedure will go, but they have a model of the possible outcomes, the decision points, and their own ability to intervene. The uncertainty is bounded and navigable.

    Latency anxiety in AI agent work is unbounded uncertainty. The agent is making decisions you can’t fully see, in a sequence you didn’t specify, toward a goal you described approximately. Every decision point is a potential branch toward an outcome you didn’t intend. And the faster the agent moves, the more branches it traverses before you have any opportunity to intervene.

    This produces a specific behavioral response in operators: micromanagement or abandonment. Either you stay glued to the agent’s output, reading every line of every tool call trying to spot the moment it goes wrong, which defeats the productivity benefit of delegation. Or you step away entirely and accept that you’ll deal with whatever it produces, which works fine until it produces something catastrophically wrong and you realize you have no idea where the error entered.

    Neither response scales. The solution isn’t to watch more closely or care less. It’s to design the agent interaction so that the anxiety is structurally reduced — not by hiding the uncertainty, but by giving the operator the right information at the right moments to maintain confidence without maintaining constant attention.

    The Three Sources of Latency Anxiety

    Latency anxiety comes from three distinct sources, and collapsing them into a single “uncertainty” label makes them harder to address.

    Direction uncertainty: Is the agent doing the right thing? The operator described a goal approximately, the agent interpreted it, and now it’s executing. But the interpretation might be wrong, and the execution might be heading confidently in the wrong direction. Direction uncertainty peaks at the start of a task, when the agent’s plan is being formed but hasn’t been stated.

    Progress uncertainty: How far along is it? How much longer will this take? This is the pure temporal component of latency anxiety — the not-knowing of when it will be done. Progress uncertainty is lowest for tasks with clear milestones and highest for open-ended reasoning tasks where the agent’s path is genuinely unpredictable.

    Error uncertainty: Has something already gone wrong? This is the most corrosive form because it’s retrospective. The agent is still working, but you saw something three tool calls ago that looked odd, and now you’re not sure whether it was a recoverable deviation or the beginning of a propagating error. Error uncertainty grows over time because errors compound — a wrong turn early becomes harder to diagnose and more expensive to fix the longer the agent continues past it.

    Each source requires a different design response. Direction uncertainty is reduced by plan previews — showing the operator what the agent intends to do before it does it. Progress uncertainty is reduced by milestone markers — not a progress bar, but clear signals that named phases of the work are complete. Error uncertainty is reduced by interruptibility — giving the operator a clear mechanism to pause, inspect, and redirect without losing the work already done.

    Plan Previews: The Most Underused Tool in Agent Design

    A plan preview is a brief, structured statement of what the agent intends to do before it begins doing it. Not a promise — plans change as execution reveals new information. But a starting declaration that gives the operator the opportunity to say “that’s not what I meant” before the agent has done anything irreversible.

    Plan previews feel like overhead. They add a step between instruction and execution. In practice, they’re the single highest-leverage intervention against latency anxiety because they address direction uncertainty at its peak — the moment before the agent’s interpretation becomes action.

    The format matters. A good plan preview is specific enough to be checkable (“I’ll query the BigQuery knowledge_pages table, filter for active status, sort by recency, and identify the three most underrepresented entity clusters”) not vague enough to be meaningless (“I’ll analyze the knowledge base and find gaps”). The operator needs to be able to read the plan and know whether to proceed or redirect. A plan that could describe any approach to the task isn’t a plan preview — it’s reassurance theater.

    In the current workflow, plan previews happen implicitly when a session starts with “here’s what I’m going to do.” Making them explicit — a structured, skippable step before every significant agent action — would reduce the direction uncertainty component of latency anxiety substantially without adding meaningful overhead to sessions where the plan is obviously right.

    Real-Time Observability: Showing the Work at the Right Granularity

    The instinct in agent design is to hide the working — show the output, not the process. The instinct comes from the right place: watching every token generated by an LLM is not informative, it’s noise. But hiding the process entirely leaves the operator with nothing to evaluate during execution, which maximizes error uncertainty.

    The right level of observability is milestone-level, not token-level. The operator doesn’t need to see every tool call. They need to see when significant phases complete: “Knowledge base queried — 501 pages, 12 entity clusters identified.” “Gap analysis complete — 3 gaps found, proceeding to research.” “Research complete for gap 1 — injecting to Notion.” Each milestone is a checkpoint: the operator can confirm the work is on track, or they can see that a phase produced unexpected results and intervene before the next phase runs on bad input.

    This is the design pattern that separates agent interactions that build trust from ones that erode it. An agent that disappears for three minutes and returns with a result is harder to trust than an agent that surfaces three intermediate outputs in those three minutes, even if the final result is identical. The intermediate outputs aren’t informational overhead — they’re the mechanism by which the operator maintains calibrated confidence throughout execution rather than blind faith.

    Interruptibility: The Design Feature Nobody Builds

    The most significant gap in current agent design is clean interruptibility — the ability to pause an agent mid-task, inspect its state, redirect it, and resume without losing the work already done or triggering a cascading restart from the beginning.

    Most agent interactions are not interruptible in any meaningful sense. You can stop them, but stopping means starting over. This makes the stakes of a wrong turn extremely high — if you catch an error midway through a long task, you face a choice between letting the agent continue (and hoping the error is recoverable) or restarting from scratch (and losing all the work that was correct). Neither is good. The right answer is to pause, fix the error in state, and continue from the pause point — but that requires an agent architecture that maintains explicit, inspectable state rather than treating the session as a single opaque computation.

    The practical version of interruptibility for most current operator workflows is checkpointing — structuring tasks so that significant outputs are written to durable storage (Notion, BigQuery, a file) at each milestone, making it possible to restart from the last checkpoint rather than from scratch if something goes wrong. This doesn’t require building interruptibility into the agent itself. It just requires designing tasks so that the intermediate outputs are recoverable.

    The session extractor that writes knowledge to Notion after each significant session is a form of checkpointing. The BigQuery sync that makes knowledge searchable is a form of checkpoint durability. These aren’t just operational conveniences — they’re latency anxiety interventions that reduce error uncertainty by ensuring that the cost of a wrong turn is bounded by the last checkpoint, not by the entire task.

    The Operator’s Latency Anxiety Calibration Problem

    There’s a meta-problem underneath all of this that design can only partially solve: operators have poorly calibrated models of AI agent failure modes. Most operators have seen AI produce confident, wrong outputs enough times to know that confidence isn’t reliability. But they haven’t developed a systematic model of when agents fail, why, and what the early warning signs look like.

    Without that calibration, latency anxiety is essentially rational. You don’t know what’s safe to delegate and what isn’t. You don’t know which failure modes are recoverable and which propagate. You don’t know whether the odd thing you noticed three steps ago was a recoverable deviation or the beginning of a catastrophic branch. So you watch everything, because you can’t distinguish what’s important to watch from what isn’t.

    The calibration develops through experience — specifically, through running tasks that fail, understanding why they failed, and updating your model of where agent attention is actually required. The operators who are most effective at using AI agents aren’t the ones with the least anxiety — they’re the ones whose anxiety is well-targeted. They watch the moments that historically produce errors in their specific task categories and let the rest run without close attention.

    This is why documentation of failure modes is more valuable than documentation of successes. A library of “here’s when this agent workflow went wrong and why” is a calibration resource that makes subsequent delegation more confident. The content quality gate, the context isolation protocol, the pre-publish slug check — each of these was built in response to a specific failure mode. Together they represent a calibrated model of where in the content pipeline errors are most likely to enter, which is exactly what an operator needs to reduce latency anxiety from diffuse vigilance to targeted attention.

    Frequently Asked Questions About Latency Anxiety in AI Agent Work

    Is latency anxiety just a problem for beginners who don’t trust AI yet?

    No — it’s actually more pronounced in experienced operators who’ve seen agent failures up close. Beginners may have unrealistic confidence in AI outputs. Experienced operators know the failure modes and have a more accurate (if sometimes excessive) model of where things can go wrong. The goal isn’t to eliminate anxiety — it’s to calibrate it so attention is applied where it’s actually needed rather than everywhere uniformly.

    Does better AI capability reduce latency anxiety?

    Somewhat, but less than expected. More capable models make fewer errors, which reduces the frequency of the situations that trigger anxiety. But the failure modes of capable models are harder to predict, not easier — they fail less often but in less expected ways. Capability improvements shift latency anxiety from “this might do the wrong thing” to “this might do the wrong thing in a way I haven’t seen before.” The design interventions — plan previews, observability, interruptibility — remain necessary regardless of model capability.

    How do you design tasks to minimize latency anxiety?

    Three structural principles: decompose tasks into phases with explicit intermediate outputs, write outputs to durable storage at each phase boundary so checkpointing is automatic, and front-load the direction-setting work with explicit plan confirmation before execution begins. Tasks designed this way have bounded error costs, observable progress, and clear intervention points — the three properties that reduce all three sources of latency anxiety simultaneously.

    What’s the difference between latency anxiety and normal perfectionism?

    Perfectionism is about standards for the output. Latency anxiety is about trust in the process. A perfectionist reviews work carefully before accepting it. An operator experiencing latency anxiety can’t stop watching the work being done because they don’t have a model of when it’s safe to look away. The interventions are different: perfectionism responds to clear quality criteria; latency anxiety responds to process visibility and interruptibility.

    Does the anxiety ever go away?

    It transforms. Operators who have built deep familiarity with specific agent workflows develop something that feels less like anxiety and more like professional vigilance — the same targeted attention a surgeon applies to the moments in a procedure that historically produce complications, rather than uniform attention across the entire operation. The goal isn’t the absence of anxiety; it’s the replacement of diffuse, unproductive vigilance with calibrated, purposeful attention at the moments that matter.


  • The Self-Applied Diagnosis Loop: How an AI Operating System Finds and Fixes Its Own Gaps

    The Self-Applied Diagnosis Loop: How an AI Operating System Finds and Fixes Its Own Gaps

    The Machine Room · Under the Hood

    Every system that analyzes things has a version of this problem: it’s good at analyzing everything except itself. A content quality gate catches errors in articles. Does it catch errors in its own rules? A gap analysis finds missing knowledge in a database. Does it find gaps in the gap analysis methodology? A context isolation protocol prevents contamination. What prevents contamination in the protocol itself?

    The Self-Applied Diagnosis Loop is the architectural answer to this problem. It’s a mandatory gate that requires every new protocol, decision, or insight produced by a system to be applied back to the system that produced it — before the insight is considered complete.

    The Problem It Solves

    AI-native operations produce a lot of insight. Gap analyses surface missing knowledge. Multi-model roundtables identify blind spots. ADRs document architectural decisions. Cross-model analyses find structural problems. The problem is that this insight almost always points outward — toward content, toward clients, toward systems the operator manages — and almost never points inward, toward the operating system itself.

    The result is an operation that gets increasingly sophisticated at analyzing external problems while accumulating its own internal technical debt silently. The context isolation protocol exists because contamination was caught in published content. But what about contamination risks in the protocol generation process itself? The self-evolving knowledge base was designed to find gaps in external knowledge. But what gaps exist in the knowledge base about the knowledge base?

    These are not hypothetical questions. They’re the specific failure mode of every system that has strong external diagnostic capability and weak self-diagnostic capability. The sophistication of the outward-facing analysis creates false confidence that the inward-facing systems are similarly well-examined. They usually aren’t.

    How the Loop Works

    The Self-Applied Diagnosis Loop operates in four steps that run automatically whenever a new protocol, ADR, skill, or strategic insight enters the system.

    Step 1: Extraction. The new insight is characterized structurally — what type of finding is it, what failure mode does it address, what system does it apply to, what are the conditions under which it triggers. This characterization isn’t just for documentation. It’s the input to the next step.

    Step 2: Inward Application. The insight is applied to the operating system itself. If the insight is “multi-client sessions require explicit context boundary declarations,” the question becomes: does our session architecture for internal operations — the sessions that build protocols, manage the Second Brain, coordinate with Pinto — have explicit context boundary declarations? If the insight is “quality gates should scan for named entity contamination,” the question becomes: does our quality gate have a named entity scan? This is the diagnostic step. It produces one of two outcomes: the system already handles this, or it doesn’t.

    Step 3: Gap → Task. If the inward application finds a gap, it automatically generates a task in the active build queue. The task inherits the ADR’s urgency classification, links back to the source insight, and includes a clear specification of what “fixed” looks like. The gap isn’t just noted — it’s immediately queued for resolution.

    Step 4: Closure as Proof. The loop has a self-verifying property. If the task generated in Step 3 is implemented within a defined window — seven days is the working standard — the closure proves the loop is functioning. The insight was applied, the gap was found, the fix was shipped. If the task sits in the queue beyond that window without resolution, the queue itself has become the new gap, and the loop generates a second task: fix the task management breakdown that allowed the first task to stall.

    The meta-property of the loop is what makes it architecturally interesting: a loop that generates tasks about its own failures cannot silently break down. The breakdown is always visible because it produces a task. The only failure mode that escapes the loop entirely is the failure to run Step 2 at all — which is why Step 2 is a mandatory gate, not an optional enhancement.

    The ADR Format as Loop Infrastructure

    The Architecture Decision Record format is what makes the loop operable at scale. An ADR captures four things: the problem, the decision, the rationale, and the consequences. The consequences section is where the self-applied diagnosis lives.

    When an ADR’s consequences section includes an explicit answer to “what does this decision imply about the operating system that produced it?” — the loop runs naturally as part of documentation. The ADR for the context isolation protocol asked: what other session types in this operation could produce contamination? The ADR for the content quality gate asked: what categories of quality failure does this gate not currently detect? Each answer produced a task. Each task produced a fix or a deliberate decision to defer.

    The ADR format borrowed from software engineering is proving to be the right tool for this in AI-native operations for the same reason it works in software: it forces explicit documentation of the reasoning behind decisions, which makes the reasoning auditable, and auditable reasoning can be applied to new situations systematically rather than being reconstructed from memory each time.

    The Proof-of-Work Property

    There’s a property of the Self-Applied Diagnosis Loop that makes it unusually useful as a management tool: completed loops are proof that the system is working, and stalled loops are proof that something has broken down.

    This is different from most operational metrics, which measure outputs — how many articles published, how many tasks completed, how many gaps filled. The loop measures the health of the system producing those outputs. A loop that completes on schedule means the analytic → diagnostic → execution pipeline is intact. A loop that stalls means a link in that chain has broken — and the stall itself tells you which link.

    If Step 2 runs but Step 3 doesn’t produce a task when a gap exists, the task generation mechanism is broken. If Step 3 produces a task but it sits idle past the closure window, the task management or prioritization system has a problem. If the loop stops running entirely — new ADRs being produced without triggering inward application — the gate itself has been bypassed, which is the most serious failure mode because it’s the least visible.

    This is why the loop’s self-verifying property is its most important architectural feature. It’s not just a methodology for catching gaps. It’s a health metric for the entire operating system.

    Applied to Today’s Work

    Eight articles were published today, each documenting a system or methodology in the operation. The Self-Applied Diagnosis Loop, applied to this session, asks: what did today’s documentation reveal about gaps in the system that produced it?

    The cockpit session article documented how context is pre-staged before sessions. Applied inward: are internal operations sessions — the ones building infrastructure like the gap filler deployed today — also following the cockpit pattern, or do they start cold each time?

    The context isolation article documented the three-layer contamination prevention protocol. Applied inward: the client name slip that triggered the fix was caught manually. The Layer 3 named entity scan that would have caught it automatically is documented as a reminder set for 8pm tonight — not yet implemented. The loop generates a task: implement the entity scan before the next publishing session.

    The model routing article documented which tier handles which task. Applied inward: the gap filler service deployed today uses Haiku for gap analysis and Sonnet for research synthesis. That routing is explicitly documented in the code comments. The loop confirms the routing matches the framework — no gap found.

    This is the loop running in practice: not as a formal process with a dashboard and a project manager, but as a discipline of asking “what does this finding imply about the system that produced it?” at the end of every analytic session, and capturing the answers as tasks rather than observations.

    The Minimum Viable Implementation

    The full loop — automated task generation, urgency inheritance, closure tracking — requires infrastructure that most operators don’t have on day one. The minimum viable implementation requires none of it.

    At its simplest, the loop is a single question appended to every ADR, every significant protocol, every gap analysis: “What does this finding imply about the operating system that produced it?” The answer goes into a task list. The task list gets reviewed weekly. Tasks that sit for more than two weeks get escalated or explicitly deferred with a documented reason.

    That’s it. No automation, no special tooling, no BigQuery table for loop closure metrics. The discipline of asking the question and capturing the answer is the loop. The automation makes it faster and less likely to be skipped — but the loop works at any level of implementation, as long as the question gets asked.

    The operators who don’t do this accumulate technical debt in their operating systems invisibly. Their analytic capabilities improve while their self-diagnostic capabilities stagnate. Eventually the gap between what the system can analyze and what it can accurately assess about itself becomes large enough to produce visible failures. The loop prevents that accumulation — not by eliminating gaps, but by ensuring they’re never hidden for long.

    Frequently Asked Questions About the Self-Applied Diagnosis Loop

    How is this different from a regular retrospective?

    A retrospective looks back at what happened and extracts lessons. The Self-Applied Diagnosis Loop looks at each new insight as it’s produced and immediately applies it inward. The timing is different — the loop runs during production, not after it. And the output is different — the loop produces tasks, not lessons. Lessons without tasks are observations. The loop enforces the conversion from observation to action.

    What if the inward application never finds a gap?

    That’s a signal worth interrogating. Either the operating system is genuinely well-covered in the area the insight addresses — which is possible and should be noted — or the inward application isn’t being run with the same rigor as the outward-facing analysis. The test is whether you’re asking the question with genuine curiosity about the answer, or just going through the motions to close the loop step. The latter produces false negatives systematically.

    Does every insight need to go through the loop?

    No — routine operational notes, status updates, and task completions don’t need inward application. The loop is for insights that describe a failure mode, a structural gap, or a new protective mechanism. The test is whether the insight, if true, would change how the operating system should be designed. If yes, it goes through the loop. If it’s just a record of what happened, it doesn’t.

    How do you prevent the loop from generating an infinite regress of self-referential tasks?

    The loop terminates when the inward application finds no gap — either because the system already handles the issue, or because a fix was shipped and verified. The regress risk is real in theory but rarely a problem in practice because most insights address specific, bounded failure modes that have a clear “fixed” state. The loop doesn’t ask “is the system perfect?” — it asks “does this specific failure mode exist in the system?” That question has a yes or no answer, and the loop terminates on “no.”

    What’s the relationship between the Self-Applied Diagnosis Loop and the self-evolving knowledge base?

    They’re complementary but distinct. The self-evolving knowledge base finds gaps in what the system knows. The Self-Applied Diagnosis Loop finds gaps in how the system operates. Knowledge gaps produce new knowledge pages. Operational gaps produce new tasks and ADRs. Both loops run on the same infrastructure — BigQuery as memory, Notion as the execution layer — but they address different dimensions of system health.


  • The Multi-Model Roundtable: How to Use Multiple AI Models to Pressure-Test Your Most Important Decisions

    The Multi-Model Roundtable: How to Use Multiple AI Models to Pressure-Test Your Most Important Decisions

    The Lab · Tygart Media
    Experiment Nº 047 · Methodology Notes
    METHODS · OBSERVATIONS · RESULTS

    Every AI model has a failure mode that looks like a feature. Ask it a question, it gives you a confident answer. Ask a follow-up that implies the answer was wrong, it updates — often without defending the original position at all. The model wasn’t reasoning to a conclusion. It was pattern-matching to what a confident answer looks like, then pattern-matching to what capitulation looks like when challenged.

    This is the sycophancy problem, and it makes single-model analysis unreliable for consequential decisions. Not because the model is bad, but because you’re the only one in the room. There’s no adversarial pressure on the answer. There’s no second perspective that might notice what the first one missed. The model is optimizing for your satisfaction, not for correctness.

    The Multi-Model Roundtable is the methodology that fixes this by design.

    What the Roundtable Actually Is

    The Multi-Model Roundtable runs the same question or problem through multiple AI models independently — each one without access to what the others have said — and then synthesizes the responses to identify where they converge, where they diverge, and what each one noticed that the others missed.

    The independence is the key variable. If you show Model B what Model A said before asking for its analysis, you’ve contaminated the roundtable. Model B will anchor to Model A’s framing and produce a response that’s in dialogue with it rather than an independent analysis. The value of the roundtable comes from genuine independence at the analysis stage, not from running the same prompt through multiple interfaces.

    The synthesis is the second key variable. The raw outputs from three models aren’t a roundtable — they’re three separate opinions. The roundtable produces value when a synthesizing pass identifies the structure of agreement and disagreement: what did all three models independently find? What did only one model notice? Where did two models agree and one diverge, and does the divergent position have merit? The synthesis is where the methodology earns its name.

    When to Use It

    The roundtable is not a default workflow. It’s a tool for specific situations where the cost of a wrong answer is high enough to justify the overhead of running multiple models and synthesizing across them.

    The right situations: architectural decisions that will shape downstream systems for months. Strategic pivots that affect how a business is positioned or resourced. Gap analyses of complex systems where a single model’s blind spots could cause you to miss an important structural problem. Any decision where you’ve been operating inside one model’s worldview long enough that you’ve lost perspective on what its assumptions might be getting wrong.

    The wrong situations: operational execution, content production, routine optimization passes. The roundtable is expensive relative to single-model work, and its value — surfacing the disagreements and blind spots of any single model — is only relevant when the decision is complex enough to have meaningful blind spots worth finding.

    The Three-Round Structure

    The roundtable runs most effectively in three rounds, each building on what the previous round revealed.

    Round 1: Independent Analysis. Each model receives the same prompt and produces an independent response. No model sees what the others said. The synthesizer — typically the most capable model available, running after the round is complete — reads all responses and maps the landscape: points of convergence, unique insights, divergent positions, and the questions that the round raised but didn’t answer.

    Round 2: Pressure Testing. The synthesis from Round 1 goes back to each model as context, with a new prompt that asks it to defend, revise, or extend its original position given what the other models found. This is where the sycophancy trap opens. A model with genuine reasoning will either defend its original position with new arguments, update it with explicit acknowledgment of what changed its thinking, or identify a synthesis that transcends the disagreement. A model running on pattern-matching rather than reasoning will simply adopt whatever the synthesized framing said without defending the original. Round 2 distinguishes between the two.

    Round 3: Resolution. The synthesizer runs a final pass across the Round 2 responses, looking for the positions that survived pressure and the positions that collapsed. The surviving positions — the ones each model stood behind when challenged — are the most reliable outputs of the process. The collapsed positions reveal where the original model was optimizing for confidence rather than correctness. The resolution produces a final synthesized view that incorporates what held up and discards what didn’t.

    What the Live Roundtable Revealed

    The methodology was stress-tested against the Second Brain itself — running multiple models through a three-round analysis of the knowledge base to identify its gaps, structural problems, and opportunities. The results illustrate both the value of the methodology and one of its most important findings about model behavior.

    In Round 1, all three models independently identified the same core finding: the Second Brain was functioning as an execution layer and a session archive, but not yet as a self-updating knowledge infrastructure. The convergence on this finding — without any model seeing what the others said — validated that the finding was real rather than an artifact of any single model’s framing.

    In Round 2, something interesting happened. When shown the Round 1 synthesis, some models updated their Round 1 positions to align with the synthesized framing without defending their original positions. This is the sycophancy signal: the model adopted the stronger framing without explaining what in Round 1 it was wrong about. Other models explicitly defended or extended their original positions with new evidence. The round revealed which models were reasoning and which were pattern-matching to the most confident-sounding available answer.

    Round 3 produced a final synthesis that was materially more reliable than any single model’s Round 1 output — specifically because it incorporated only the positions that survived adversarial pressure, not all positions that were initially stated with confidence.

    The Synthesis Model Selection Problem

    One design decision the roundtable requires is choosing which model performs the synthesis. This matters more than it might seem.

    The synthesis model reads all outputs and produces the integrated view. If it’s the same model that participated in Round 1, it’s not a neutral synthesizer — it’s a participant reviewing its own work alongside competitors, with all the bias that implies. If it’s a model that didn’t participate in the analysis rounds, it brings a fresh perspective to synthesis but may lack the context to evaluate which positions are most defensible.

    The cleanest solution is to use the most capable available model for synthesis regardless of whether it participated in the analysis rounds — and to run it with explicit instructions to identify convergence and divergence rather than to produce a confident unified answer. The synthesis model’s job is to map the disagreement landscape, not to resolve it prematurely into a single position that papers over genuine uncertainty.

    The Model Diversity Requirement

    A roundtable with three instances of the same model is not a roundtable — it’s three runs of the same reasoning process with stochastic variation. The value of the methodology comes from genuine architectural diversity: models trained on different data, with different RLHF emphasis, optimizing for different outputs.

    In practice this means including at least one model from each major family — Claude, GPT, and Gemini cover meaningfully different architectures and training approaches. Each has genuine blind spots the others are less likely to share. Claude tends toward epistemic humility and structured analysis. GPT tends toward confident synthesis and breadth of coverage. Gemini tends toward recency and web-grounded reasoning. These aren’t strict patterns, but they reflect real tendencies that produce different emphasis in analysis — which is exactly what you want from a roundtable.

    The Operational Cost and When It’s Worth It

    Running three models through three rounds, with synthesis at each round, is a genuine time and token investment. For a complex architectural question, a full roundtable might take several hours of elapsed time and meaningful token costs across API calls.

    The investment is justified when the decision at the center of the roundtable has downstream consequences that would cost more than the roundtable to fix if gotten wrong. For a strategic decision about how to position a business in a shifting market, or an architectural decision about which infrastructure pattern to build for the next year, that threshold is easy to clear. For an operational question with a clear right answer and low reversal cost, the roundtable is overkill.

    The practical heuristic: use the roundtable for decisions that you’ll still be living with in six months. For everything shorter-horizon than that, a single capable model running a well-structured prompt produces sufficient quality at a fraction of the cost.

    Frequently Asked Questions About the Multi-Model Roundtable

    Can you run the roundtable with two models instead of three?

    Yes, and two is often the practical minimum. Two models can reveal disagreement and surface blind spots. Three produces a more structured convergence picture — when two agree and one diverges, you have a majority position and a minority position to evaluate. With two models, every disagreement is 50/50 and requires more judgment from the synthesizer to resolve. Three is the minimum for genuine triangulation.

    Does the order of synthesis matter?

    The order in which models are presented to the synthesizer can subtly anchor the synthesis toward whichever model’s framing appears first. Randomizing the presentation order across rounds, or presenting all outputs simultaneously rather than sequentially, reduces this anchoring effect. It doesn’t eliminate it — the synthesizer is still a model with the same biases as any other — but it reduces the systematic advantage any single model’s framing gets from appearing first.

    How do you handle it when all three models agree?

    Unanimous agreement is the outcome you most need to interrogate. It could mean the answer is genuinely clear. It could also mean all three models share the same blind spot — they trained on similar data, absorbed similar conventional wisdom, and are all confidently wrong in the same direction. When all three models agree, the most valuable follow-up is to explicitly prompt each one to steelman the strongest counterargument to the consensus. If no model can produce a compelling counterargument, the consensus is probably sound. If one of them can, you’ve found the crack worth examining.

    Is this the same as getting a second opinion from a different person?

    Similar in spirit, different in practice. A human second opinion brings lived experience, professional judgment, and genuine stakes in being right that a model doesn’t have. The roundtable is better than a single model in the same way a panel of advisors is better than a single advisor — but it doesn’t substitute for human expertise on decisions where that expertise is what you actually need. Think of the roundtable as a way to pressure-test AI analysis before you bring it to humans, not as a replacement for human judgment on consequential decisions.

    What do you do when the models produce genuinely irreconcilable disagreements?

    Irreconcilable disagreement is valuable information. It means the question has genuine uncertainty or value-dependence that isn’t resolvable by analysis alone. Document both positions, identify what would have to be true for each to be correct, and treat the decision as one that requires human judgment informed by the disagreement rather than one that can be delegated to model consensus. The roundtable that produces irreconcilable disagreement has done its job — it’s surfaced the real structure of the uncertainty rather than papering over it with false confidence.


  • AI Model Routing: How to Choose Between Haiku, Sonnet, and Opus for Every Task

    AI Model Routing: How to Choose Between Haiku, Sonnet, and Opus for Every Task

    The Machine Room · Under the Hood

    Every AI model tier costs a different amount per token, produces output at a different quality level, and runs at a different speed. Running everything through the most powerful model you have access to isn’t a strategy — it’s a default. And defaults are expensive.

    Model routing is the discipline of intentionally assigning the right model tier to the right task based on what the task actually requires. It’s not about using cheaper models for important work. It’s about recognizing that most work doesn’t need the most capable model, and that using a lighter model for that work frees your most capable model for the tasks where its capabilities genuinely matter.

    The operators who get the most out of AI infrastructure are not the ones running the most powerful models. They’re the ones who know exactly which model to use for each type of work — and have that routing systematized so it happens automatically rather than by decision on every task.

    The Three-Tier Model

    The current Claude family maps cleanly to three operational tiers, each suited to a different category of work.

    Haiku — the volume tier. Fast, cheap, and capable of tasks that require pattern recognition, classification, and structured output without deep reasoning. The right model for taxonomy assignment, SEO meta generation, schema JSON-LD, social post drafts, AEO FAQ generation, internal link identification, and any task where you need the same operation repeated many times across a large dataset. Haiku is where batch operations live. When you’re processing a hundred posts for meta description updates or generating tag assignments across an entire site, Haiku is the model you reach for — not because quality doesn’t matter, but because Haiku is genuinely capable of these tasks and running them through Sonnet or Opus would be both slower and significantly more expensive without producing meaningfully better results.

    Sonnet — the production tier. The workhorse. Capable of nuanced reasoning, long-form drafting, and the kind of editorial judgment that separates useful content from generic output. The right model for content briefs, GEO rewrites, thin content expansion, flagship social posts that need real voice, and the article drafts that feed the content pipeline. Sonnet handles the majority of actual content production work — it’s the model that runs most sessions and most pipelines. When you need something that reads like a human wrote it with genuine thought applied, Sonnet is the default choice.

    Opus — the strategy tier. Reserved for work where depth of reasoning is the primary value. Long-form articles that require original synthesis, live client strategy sessions where you’re working through a complex problem in real time, and any situation where you’re making decisions that will cascade through multiple downstream systems. Opus is not for volume. It’s for the tasks where running a cheaper model would produce an output that looks similar but misses the connections, nuances, or strategic implications that make the difference between advice that’s directionally right and advice that’s actually useful.

    The Routing Rules in Practice

    The routing framework isn’t abstract — it maps specific task types to specific model tiers with enough precision that sessions can apply it without deliberation on each individual task.

    Haiku handles: taxonomy and tag assignment, SEO title and meta description generation, schema JSON-LD generation, social post creation from existing article content, AEO FAQ blocks, internal link opportunity identification, post classification and categorization, and any extraction or formatting task applied across more than ten items.

    Sonnet handles: article drafting from briefs, GEO and AEO optimization passes on existing content, content brief creation, persona-targeted variant generation, thin content expansion, editorial social posts that require voice and judgment, and the majority of single-session content production work.

    Opus handles: long-form pillar articles that require original synthesis across multiple sources, live strategy sessions with clients or within complex multi-system planning work, architectural decisions about content or technical systems, and any task where the output will directly inform other significant decisions.

    The dividing line between Sonnet and Opus is usually this: if the task requires judgment about what matters — not just execution of a clear brief — Opus earns its cost premium. If the task has a clear structure and Sonnet can execute it well, escalating to Opus produces marginal improvement for a significant cost increase.

    The Batch API Rule

    Separate from model selection is the question of whether to run tasks synchronously or in batch. The Batch API applies to any operation that meets three conditions: more than twenty items to process, not time-sensitive, and a format or classification task that produces deterministic-enough output that you can verify results after the fact rather than in real time.

    The Batch API cuts token costs meaningfully on qualifying operations. The tradeoff is latency — batch jobs run on a delay rather than returning results immediately. For the right task category, this is a pure win: you pay less, the work gets done, and the latency doesn’t matter because the output wasn’t needed in real time anyway. For the wrong category — anything where you’re making decisions in a live session based on the output — batch is the wrong tool regardless of cost.

    Taxonomy normalization across a large site is the canonical batch use case. You’re not making live decisions based on the output. The task is highly repetitive. The result is verifiable. The volume is high enough that the cost difference is meaningful. Run it in batch, verify results afterward, and move on.

    The Token Limit Routing Rule

    There’s a third routing decision that most operators don’t think about explicitly: what to do when a session hits a context limit mid-task. The instinctive response is to start a new session with the same model. The better response is often to drop to a smaller model.

    When a Sonnet session runs out of context on a task, the task that triggered the limit is usually a constrained, well-defined operation — exactly the kind of thing Haiku handles well. Switching to Haiku for that specific operation, completing it, and returning to Sonnet for the continuation is a more efficient pattern than restarting the full session. The smaller model fits through the gap the larger model couldn’t navigate because context limits aren’t a capability failure — they’re a resource constraint. A smaller model with a fresh context window can often complete the task cleanly.

    This is the counterintuitive version of model routing: sometimes the right model for a task is determined not by the task’s complexity but by the state of the session when the task arrives.

    The Cost Architecture of a Content Operation

    Model routing at the operation level — not just the task level — determines what a content operation actually costs to run at scale.

    A single article through the full pipeline touches multiple model tiers. The brief comes from Sonnet. The taxonomy assignment goes to Haiku. The article draft is Sonnet. The SEO meta is Haiku. The GEO optimization pass is Sonnet. The schema JSON-LD is Haiku. The quality gate scan is Haiku. The final publish verification is trivial — no model needed, just a curl call.

    That pipeline uses Haiku for roughly half its operations by count, even though the output is a fully optimized article. The expensive model tier — Sonnet — runs for the creative and editorial work where its capabilities matter. Haiku runs for the structured, repetitive work where it’s genuinely sufficient. The result is an article that costs a fraction of what it would cost to run every stage through Sonnet, with no meaningful quality difference in the output.

    Multiply that across a twenty-article content swarm, or an ongoing operation managing a portfolio of sites, and the routing decisions made at the pipeline level determine whether the economics of AI-native content production are sustainable or not. Running everything through the most capable model isn’t just expensive — it makes scale impossible. Routing correctly is what makes scale practical.

    When to Override the Routing Rules

    Routing frameworks are defaults, not laws. There are situations where the right answer is to override the default tier upward — and being able to recognize them is as important as having the routing rules in the first place.

    Override to a higher tier when: the task appears simple but the context makes it consequential (a brief that seems like a standard format task but will drive a month of content production), when you’re working with a client directly and the output will be read immediately (live sessions always get the appropriate tier regardless of task type), or when you’ve run a task through a lighter model and the output reveals that the task had more complexity than the routing rule anticipated.

    The routing framework is a starting point that gets refined by observation. When Haiku produces output that’s consistently good enough for a task category, the routing rule holds. When it produces output that requires significant correction, that’s a signal to move the task category up a tier. The framework learns from its own failure modes — but only if the operator is paying attention to where the defaults break down.

    Frequently Asked Questions About AI Model Routing

    Is model routing worth the operational complexity?

    For single-task users running occasional sessions, no — the default to a capable model is fine. For operators running content pipelines across multiple sites with high task volume, yes — the cost difference at scale is substantial, and the operational complexity of a routing framework is lower than it appears once the rules are systematized into pipeline architecture.

    How do you know when a task is genuinely Haiku-appropriate vs. Sonnet-appropriate?

    The test is whether the task requires judgment about what the right answer is, or execution of a clear structure. Haiku excels at the latter. If you can write a complete specification of what the output should look like before the model runs — format, constraints, criteria — it’s likely Haiku-appropriate. If the value comes from the model deciding what matters and making editorial choices, it needs Sonnet at minimum.

    What about using non-Claude models for specific tasks?

    The routing logic applies across model families, not just within Claude tiers. For image generation, Vertex AI Imagen tiers serve the same function — Fast for batch, Standard for default, Ultra for hero images. For specific tasks where another model has a demonstrated capability advantage, routing to that model is the right call. The principle is the same: match the model to what the task actually requires, not to what’s most convenient to run everything through.

    Does model routing apply to agent orchestration?

    Yes, and it’s especially important there. In a multi-agent system, the orchestrator that plans and delegates work benefits most from the highest-capability model because its output determines what every downstream agent does. The agents executing specific sub-tasks can often run on lighter models because they’re executing clear instructions rather than making judgment calls about what to do. Opus orchestrates, Haiku executes, Sonnet handles the middle layer where judgment and execution are both required.

    How do you handle tasks where you’re not sure which tier is right?

    Default to Sonnet for ambiguous cases. Haiku is the right downgrade when you have confidence a task is purely structural. Opus is the right upgrade when you have evidence that Sonnet’s output isn’t capturing the depth the task requires. Running something through Sonnet when Haiku would have sufficed costs money. Running something through Haiku when Sonnet was needed costs correction time. For most operators, the cost of correction time exceeds the cost of the token difference — which means when genuinely uncertain, the middle tier is the right hedge.