Tag: Token Costs

  • What We Learned Querying 54 LLMs About Themselves (For $1.99 on OpenRouter)

    What We Learned Querying 54 LLMs About Themselves (For $1.99 on OpenRouter)

    The headline: In mid-May 2026, we ran an autonomous OpenRouter session querying 54 LLMs about their own identity, capabilities, and training. Total cost: $1.99 against a $270 starting balance. 43 substantive responses, 10 documented failures, 1 reasoning-only response. The most interesting finding: aion-2.0 identified itself as Claude — concrete evidence of training-data identity inheritance across LLMs. This article walks through the methodology, the reliability data, and what cheap multi-model research now makes possible.

    This is part of our OpenRouter coverage. For the operator’s view on why we run model research through OpenRouter, see the field manual. For the structured decision methodology that multi-model setups also enable, see the roundtable methodology.

    The setup

    In mid-May 2026 we ran an autonomous session designed to extract self-knowledge from a wide sample of available LLMs. The question structure was simple: ask each model about its own identity, training, capabilities, and limits, then capture the response for cross-comparison.

    The scope expanded mid-execution from the original 50 to 54 models — the OpenRouter catalog had grown during the session itself, which is its own data point about how fast this ecosystem moves.

    The architecture: a Python script with parallel bash execution, a max-wait timeout per model, graceful per-provider error handling, and Notion publishing of each model’s response as a separate Knowledge Lab entry. Everything billed through OpenRouter.

    The cost: $1.99 against a $270 starting balance. Less than two dollars to canvas 54 frontier and near-frontier models on a question of self-identity.

    The hit rate

    Of 54 models queried, 43 returned substantive responses. One returned a reasoning trace without final content (GPT-5.5 Pro, which we counted as a valid capture given the reasoning content was the interesting part). 10 returned documented failures.

    That’s 81% substantive completion. For a fully autonomous run against a heterogeneous provider pool with no per-model tuning, that’s a meaningful number.

    The 10 failures broke down into clear categories:

    • Rate limiting (429 errors): persistent on a handful of providers. Some had genuine quota issues; some appeared to be hitting upstream limits we couldn’t see from our side.
    • Forbidden (403): providers refusing the request entirely, often for reasons related to account configuration we hadn’t completed.
    • Not found (404): model IDs that had moved or been deprecated between our model-list scrape and the execution.
    • Timeouts: the most interesting category. Grok 4.20 multi-agent consistently exceeded our timeout window — not because it was slow, but because it appears to orchestrate sub-agents that genuinely take more than 40 seconds to produce a final answer. We documented this as a failure for our purposes; for a different use case it would have been a feature.

    The decision we made in real time was not to retry persistent failures. If a provider returned 429 on three consecutive attempts, we let it stand as a documented failure rather than burning the run on retries. The rationale: those providers are either genuinely rate-limited or having an issue, and a fourth attempt in the same minute isn’t going to resolve either.

    The finding that mattered

    Of all the substantive responses, one stood out: aion-2.0 identified itself as Claude.

    Not “trained on Claude data.” Not “fine-tuned from a Claude-derived model.” It described itself, in the first person, as Claude.

    Aion-2.0 is not Claude. It’s a separate model from a separate provider. The most likely explanation is that its training data included a significant volume of Claude outputs, and the model’s self-knowledge inherited Claude’s identity along with Claude’s content patterns. The model learned to be Claude-like in style and, in the process, learned to identify as Claude in substance.

    This is a known phenomenon in the literature on training data contamination, but seeing it surface concretely in a production model — on an answer to a basic self-identity question — is different from reading about it in a paper. It’s a real thing happening at scale, and most users of these models have no idea.

    The implication for anyone running multi-model evaluations: model outputs are not independent. Models trained on the outputs of other models inherit not just style but identity, opinion patterns, and likely failure modes. If you’re running a roundtable methodology and treating three models as three independent perspectives, and one of them is silently downstream of another in training data, your “consensus” might be one model’s perspective dressed in three different costumes.

    This is also an argument for why first-party model selection — choosing models from clearly distinct lineages rather than just “three frontier models” — matters more than people give it credit for.

    The reliability data

    Setting aside the aion-2.0 finding, the bare reliability data from this run is useful on its own terms.

    10 of 54 providers (18.5%) returned errors. That’s a meaningful failure rate for any production workload that depends on cross-model availability. If your application assumes you can call any model in the catalog and get a response, you’re going to be wrong about 1 in 5 of the time on first attempt.

    OpenRouter’s pooled access mitigates this somewhat — for some providers, OpenRouter automatically retries against alternate endpoints when one fails. But the failures we saw were after OpenRouter’s own retry logic ran. These are the failures that surface to the caller after the routing layer has done what it can.

    For production systems, the practical implication is straightforward: never depend on any single model being available. Build fallback chains. Use OpenRouter’s Auto Router with a wildcard allowlist for tolerance, or wire your own fallback logic. A multi-model architecture isn’t a luxury; it’s a reliability requirement.

    The cost shape

    $1.99 of spend across 54 model queries works out to roughly $0.037 per query, including all the failed attempts.

    That’s the headline number, but the distribution matters more than the average. A handful of queries — the ones that hit larger reasoning models like Claude Opus or GPT-5.5 Pro — accounted for the majority of the spend. Cheap models like Gemini Flash and various open-source mid-tier models barely moved the needle.

    If you’re running research at this kind of breadth, the cost model is dominated by the heavy reasoning models, not by the long tail of cheaper models. The implication: when you’re running broad-canvas queries, it costs almost nothing to add another cheap model to the catalog. Adding another expensive reasoning model is what you should be deliberate about.

    What broke and what we learned

    Three patterns of failure repeated:

    Provider rate limits unrelated to our usage. Some providers appear to share upstream capacity with the wider OpenRouter user base, and when that upstream capacity is hot, your individual call fails regardless of your own usage. There is no client-side fix. You either retry later or fall back.

    Model IDs drift. The catalog moves fast. A model ID you fetch on Monday may have been deprecated by Friday. Our script’s freshness window — about a day between model-list scrape and execution — was sometimes enough for drift. For production systems, fetch the model list immediately before the run.

    Multi-agent models exceed simple timeout windows. Grok 4.20’s behavior of orchestrating sub-agents that take 40+ seconds is not a bug; it’s the product. But it breaks any timeout shorter than what the multi-agent run actually needs. If you’re going to call multi-agent models, plan for long latencies and don’t share a timeout policy with single-call models.

    What we’d do differently

    Three changes for the next run of this kind:

    1. Refresh the model list inline. Don’t trust a list scraped even a few hours earlier. Fetch fresh before each batch.
    2. Tiered timeouts. Single-call models on a tight timeout. Multi-agent and reasoning-heavy models on a relaxed one. Detect which is which from the model metadata where possible.
    3. Publish-as-you-go. Our Notion publish step ran after data collection. The session ended mid-publish, leaving uncertainty about which of the 54 pages had actually been created. Better to publish each result immediately as it returns, so a session interruption doesn’t lose anything.

    The bigger lesson

    Two dollars to canvas 54 models on a question of self-identity is a cost structure that didn’t exist three years ago. It also means a category of research that used to require expensive infrastructure is now within reach of anyone with an OpenRouter account and a Python script.

    The interesting finding — aion-2.0 silently identifying as Claude — would have been almost impossible to discover any other way. You can’t catch a training-data identity inheritance by reading model documentation. You catch it by asking a lot of models the same question and looking at the answers side by side.

    OpenRouter, for all its caveats and its limited scope, makes this kind of multi-model research tractable in a way nothing else currently does. If you’re not running periodic broad-canvas queries against your model catalog, you’re flying blind on what’s actually in there. Two dollars is cheap insurance against being surprised by the next aion-2.0.

    Frequently asked questions

    How much does it cost to query 54 LLMs at once via OpenRouter?

    In our autonomous run, the total cost was $1.99 — roughly $0.037 per query including the 10 failed attempts. Cost was dominated by the few queries hitting expensive reasoning models like Claude Opus and GPT-5.5 Pro; the long tail of cheaper models barely moved the needle. Adding more cheap models to a broad-canvas query costs almost nothing.

    What is training-data identity inheritance?

    When a model’s training data includes outputs from another model, the trained model can inherit not just style but identity from the source model. In our run, aion-2.0 identified itself as Claude — likely because its training data contained enough Claude outputs that the model’s self-knowledge absorbed Claude’s identity along with Claude’s content patterns. This is a known phenomenon in the literature on data contamination.

    How reliable are LLM providers via OpenRouter?

    In our 54-model autonomous run, 10 providers (18.5%) returned errors after OpenRouter’s own retry logic ran. The failures broke down into rate limits, forbidden responses, deprecated model IDs, and timeouts on multi-agent models. The practical implication: never depend on any single model being available. Build fallback chains.

    Why did some models timeout in the 54-LLM run?

    The most notable timeout case was Grok 4.20 multi-agent, which appears to orchestrate sub-agents that genuinely take more than 40 seconds to produce a final answer. This isn’t a bug; it’s the product. But it breaks any timeout policy shared with single-call models. Multi-agent and reasoning-heavy models need their own relaxed timeout tier.

    Should I run periodic broad-canvas queries against my model catalog?

    Yes. At roughly two dollars per 54-model run, broad-canvas queries are cheap insurance against being surprised by training-data inheritance, identity drift, or quality degradation in models you depend on. You can’t catch these issues by reading documentation. You catch them by querying widely and comparing answers side by side.

    See also: The 5-Layer OpenRouter Mental Model: Org, Workspace, Guardrail, Key, Preset

  • The 5-Layer OpenRouter Mental Model: Org, Workspace, Guardrail, Key, Preset

    The 5-Layer OpenRouter Mental Model: Org, Workspace, Guardrail, Key, Preset

    The OpenRouter hierarchy in one sentence: Organizations contain Workspaces, Workspaces enforce Guardrails on API Keys, Keys call Presets, and Presets bundle prompts and models. Every operational decision you’ll ever make on the platform lives at exactly one of those five layers. Confuse them and you’ll spend hours looking for settings that live somewhere other than where you think.

    This is a companion to our OpenRouter operator’s field manual. The field manual covers why we use the platform and how it fits a fortress stack. This deep dive covers the mental model itself — the five-layer hierarchy that makes everything else legible.

    Why this matters before anything else

    OpenRouter’s UI presents a flat menu. The actual product is a hierarchy. Every operational decision you’ll ever make — who pays, what’s allowed, who’s allowed to call what, which model gets used — lives at exactly one of five layers. Get the layers wrong and you’ll wire your stack against the wrong nouns.

    The five layers, top to bottom: Organization → Workspace → Guardrail → API Key → Preset.

    Here’s what each one actually does and when you should care.

    Layer 1: Organization

    Sovereign billing. Sovereign member context. The top of the world.

    Each Organization has its own balance, its own billing details, and — critically — its own member roster. The catch: personal orgs don’t expose Members management. If you want to add teammates, you need a non-personal org.

    In our case we run two: a personal org tied to our primary email, and a Tygart Media org for agency operations. The personal org has 48 API keys and a working balance. The Tygart Media org is empty so far. Members management is the reason it exists.

    When to think about this layer: when you’re deciding whether to operate as an individual or as a team. If you’re solo and plan to stay solo, one personal org is fine forever. The moment you bring on a collaborator who needs their own keys and their own observability slice, you need a non-personal org.

    The mistake to avoid: running an agency out of a personal org. You’ll hit member-management limits at the worst possible time.

    Layer 2: Workspace

    Segmented guardrail, BYOK, routing, and preset domains inside an organization.

    By default, every org gets one Default Workspace. Most accounts never think about this layer. The moment you operate across multiple businesses with different data policies, multiple workspaces become valuable.

    Example: a healthcare client’s data should never touch first-party Anthropic, only Bedrock or Vertex. A consumer comedy site can use any provider. A B2B SaaS client wants Zero Data Retention enforced on every call. Three different fortress postures. Three workspaces.

    Each workspace gets its own Guardrail config, its own BYOK provider keys, its own routing defaults, and its own preset library. Keys created in one workspace can’t see resources in another.

    When to think about this layer: when you have two or more clients with materially different data policies. If everything you do has the same posture, one workspace is fine.

    The mistake to avoid: assuming workspace segmentation is a security boundary. It isn’t, exactly — it’s a policy boundary. Someone with org-level access can move between workspaces freely. Workspaces are for organizing intent, not for isolating threats.

    Layer 3: Guardrails

    The actual enforcement layer. Four categories, all configurable per workspace, all unconfigured by default.

    Budget Policies are the most useful and the most underused. Set a credit limit in dollars and a reset cadence (Day, Week, Month, Year, or N/A). Hit the limit and calls fail until the cadence resets. This is your protection against the runaway loop that drains a balance overnight.

    Model and Provider Access is where data-policy posture lives. Toggles for Zero Data Retention enforcement, Non-frontier ZDR, first-party Anthropic on or off (with Bedrock and Vertex always staying available), first-party OpenAI on or off (Azure stays), Google AI Studio on or off (Vertex stays), and three categories of paid and free endpoints with different training and publishing behaviors. There’s also an Access Policy mode (Allow All Except is the useful one) with explicit Blocked Providers and Blocked Models lists. The live Eligibility view shows you which providers and models are actually callable given your current policy.

    Prompt Injection Detection runs regex-based detection on inbound prompts. OWASP-inspired patterns. Four modes: Disabled, Flag, Redact, or Block. Free and adds no measurable latency. Worth enabling on every workspace that touches user input.

    Sensitive Info Detection runs pattern matching on prompts and completions. Built-in patterns for Email, Phone, SSN, Credit Card, IP address, Person Name, and Address. The latter two add latency. Custom regex patterns supported. A sandbox to test patterns before deploying. Useful for any workspace that processes customer data.

    When to think about this layer: every workspace, day one. Default-unconfigured is not a safe state. Set a budget cap before you do anything else.

    The mistake to avoid: treating Guardrails as something you’ll get to “later.” Later is after the runaway loop has drained the balance.

    Layer 4: API Keys

    Per-agent identity. Each key has its own credit cap, its own reset cadence, and its own guardrail overlay.

    The mental model that matters: one autonomous behavior, one key. When a scheduled task starts hemorrhaging tokens, the cap on its key contains the damage. The other 47 keys keep working.

    Our 48-key distribution is instructive. One testing key has spent $83.26. One development key has spent $33.05. The remaining 46 keys have collectively spent less than $120. That’s the shape of real AI operations: a few keys do most of the work, and a long tail barely moves the needle. Per-key caps make that distribution visible and bounded.

    API keys also carry the BYOK relationship. A bring-your-own provider key can be pinned to specific API keys, meaning specific agents. That lets you route a high-volume internal agent through a discounted enterprise contract while letting one-off testing keys fall through to OpenRouter’s pooled pricing. We cover this in depth in BYOK on OpenRouter.

    When to think about this layer: when you create any new autonomous behavior. New behavior, new key, new cap. No exceptions.

    The mistake to avoid: sharing one key across all your services. The first runaway loop will be the last thing that one key ever does, and the blast radius will be everything else that depended on it.

    Layer 5: Presets

    Versioned bundles of system prompt, model, parameters, and provider configuration. Called as "model": "@preset/your-preset-name" in any API call.

    Three tabs per preset: Configuration (the actual bundle), API Usage (how it’s been called), and Version History (every change, rollback-able).

    This is the closest OpenRouter comes to a software release artifact. You can ship a preset, test it in chat, version it, and roll back if v2 turns out to be worse than v1. Code that calls the preset stays the same; only the preset content changes.

    For autonomous behavior systems this is the unlock. A behavior’s behavior — its prompt, its model choice, its temperature — becomes a thing you can version and review like code, without touching the code that calls it. Promotion ledger says a behavior is graduating from one tier to the next? You publish a new preset version with tighter constraints and the calling code never changes.

    When to think about this layer: the moment you have any system prompt that’s used in more than one place, or that you’ll want to refine over time. If you’ve never copy-pasted a system prompt between two scripts, you don’t need presets yet.

    The mistake to avoid: putting the system prompt in the calling code. Every prompt update becomes a deploy. With presets, prompt updates become config changes.

    Putting the layers together

    Here’s the mental model in one sentence: Organizations contain Workspaces, Workspaces enforce Guardrails on Keys, Keys call Presets, Presets bundle prompts and models.

    If you walk into OpenRouter looking for a setting and you can’t find it, ask which of the five layers it should logically live at. The answer almost always tells you where to look.

    If you’re building a new integration, start at the bottom. Pick a model. Build a preset around it. Create a dedicated key with a tight budget cap. Sit that key under a workspace with sensible guardrails. The organization is just the billing wrapper.

    The whole point of the hierarchy is that each layer constrains the one below it. The organization caps the workspace. The workspace caps the keys. The keys cap the presets they can call. Errors propagate up; permissions cascade down. That’s the model. Everything else is UI.

    Frequently asked questions

    What are the five layers of OpenRouter?

    Organization, Workspace, Guardrails, API Keys, and Presets. Organizations handle billing and members. Workspaces segment policy domains. Guardrails enforce budget, provider access, prompt injection, and sensitive info rules. API Keys are per-agent identity with per-key caps. Presets are versioned bundles of system prompt, model, and parameters.

    Do I need multiple Workspaces in OpenRouter?

    Only if you operate across businesses with materially different data policies. A single Default Workspace is fine for most accounts. The moment a healthcare client requires Bedrock-only access while a consumer client can use any provider, workspace segmentation becomes valuable.

    What is the right way to use OpenRouter Presets?

    Treat them like software release artifacts. Bundle the system prompt, model, parameters, and provider config. Version every change. Test new versions in chat before promoting. Code that calls the preset stays the same; only the preset content evolves. This lets you refactor prompt behavior without redeploying.

    Are OpenRouter Workspaces a security boundary?

    No. They’re a policy boundary, not a security boundary. Someone with organization-level access can move between workspaces freely. Use workspaces to organize intent and enforce different fortress postures across clients — not to isolate threats from each other.

    What happens if I don’t configure OpenRouter Guardrails?

    By default every workspace has zero enforced budget cap, zero provider restrictions, and zero PII filtering. That’s fine for prototyping. It’s not fine for production. Set a budget cap on every workspace as the first action. The other three guardrail categories you can configure as you scale.

    See also: The Multi-Model AI Roundtable: A Three-Round Methodology for Better Decisions · What We Learned Querying 54 LLMs About Themselves (For $1.99 on OpenRouter)

  • Claude MCP Token Cost Reality: Why Your Model Context Protocol Setup Is Burning 18,000 Tokens Per Turn

    Claude MCP Token Cost Reality: Why Your Model Context Protocol Setup Is Burning 18,000 Tokens Per Turn

    Last refreshed: May 15, 2026

    If you’ve ever connected a few Model Context Protocol (MCP) servers to Claude Code and watched your usage limit drain faster than the work you actually did would explain, you’re not imagining it. There’s a real, documented, and sometimes substantial token cost to wiring MCP servers into your Claude environment — and most setup guides don’t mention it.

    The short version: each MCP server you connect injects its complete tool schema into the context of every message you send. Multiple servers stack. The total overhead can range from a few thousand tokens for a single server up to roughly 18,000 tokens per turn when you’re running a typical multi-server developer setup. Anthropic’s own engineering team has acknowledged this in a public GitHub issue and shipped optimizations to reduce it.

    This article walks through where the overhead actually comes from, how to measure your own setup, what Anthropic has changed in 2026 to ease the cost, and the concrete steps you can take to keep MCP useful without burning through your token budget.

    What MCP actually is, briefly

    The Model Context Protocol is an open standard created by Anthropic that lets Claude (and other LLMs that adopt the standard) connect to external tools and data sources through a common interface. Instead of writing a custom integration for every API or database you want Claude to access, you point Claude at an MCP server, and the server exposes its capabilities — file access, Slack messages, GitHub repos, database queries — in a format Claude can use.

    It’s a real productivity unlock. It’s also why the token math gets complicated.

    Where the token cost comes from

    When you connect an MCP server to Claude Code (or any MCP-aware client), three things happen on every message:

    1. Tool schema injection. Every tool the server exposes — every name, every description, every parameter definition — is included in the context Claude sees. A Slack MCP server with 10–15 tools typically adds about 2,000 tokens. A GitHub server is heavier. A custom internal-tooling server with verbose descriptions can run 5,000–8,000 tokens on its own.

    2. Tool-use system prompt overhead. Anthropic’s documentation confirms that whenever tools are present in a request, a special system prompt is automatically prepended that teaches the model how to use tools. For Claude 4.x models with tool_choice: auto, that’s an additional 346 tokens per request. The bash tool adds 245. The text editor tool adds 700. The computer-use tool adds 735 plus a 466–499 token system prompt extension.

    3. Stateless re-sending. Each message in a conversation is a fresh API request that includes the full conversation history plus the full tool schema. Claude does not “remember” your tools from the last turn the way a human remembers a colleague’s job description. Every turn pays the schema cost again.

    That’s the math. Now multiply by the number of MCP servers you have connected. A developer running Slack + GitHub + a database connector + an internal custom server can easily land in the 15,000–20,000 tokens-per-turn range — and that’s before you’ve typed your actual question.

    The 18,000-token figure, sourced

    The “up to 18,000 tokens per turn” number comes from a combination of public sources verified May 15, 2026:

    • Anthropic’s own GitHub repo for Claude Code, issue #3406, titled “Built-in tools + MCP descriptions load on first message causing 10–20k token overhead.” Anthropic engineers acknowledged the issue and have shipped progressive optimizations against it.
    • Independent analysis by MindStudio measuring real Claude Code sessions with multiple MCP servers attached.
    • Anthropic’s official Claude Code documentation on cost management explicitly recommends running /mcp to inspect connected servers and disabling unused ones to control token consumption.

    The exact number for your setup will be different. The shape of the problem is the same.

    Why this matters more than it looks

    Claude’s standard context window is 200,000 tokens. Losing 18,000 of those to tool definitions before you start typing represents about 9% of your effective working space. That’s a real ceiling cost — but it’s not the part that hurts most.

    The part that hurts is the cumulative bill. If you’re on a Claude subscription with a usage limit, every turn through Claude Code is paying the full schema cost again. A workflow that takes 30 turns of back-and-forth burns 540,000 tokens worth of tool definitions across that session — even if the tool descriptions never change. On the API at standard Sonnet 4.6 rates, that’s about $1.62 in pure schema overhead per session, before any of the actual work gets billed.

    Multiply by a team of engineers running Claude Code daily, and the overhead becomes the largest single line item in your token spend.

    What Anthropic has changed in 2026

    Anthropic has shipped two meaningful optimizations against MCP token bloat over the past few months:

    Deferred tool loading. In recent Claude Code releases, MCP tool definitions are no longer all loaded into context at the start of a session by default. Tool names enter context, but the full schemas only load when Claude actually invokes a particular tool. This is a substantial improvement for sessions where you have many tools available but only use a few.

    Tool Search. A new built-in search mechanism lets Claude discover relevant MCP tools on demand rather than carrying them all in context. One independent measurement reported a Claude Code MCP context cut of 46.9% — from roughly 51,000 tokens down to 8,500 tokens — by using Tool Search instead of full upfront loading.

    These optimizations help, but they don’t make the overhead zero. The baseline cost of having any MCP server connected at all is real, and you still pay it on every turn even with deferral active.

    How to measure your own MCP token cost

    Two practical methods work for most setups:

    Method 1 — The /mcp command. In Claude Code, run /mcp to see every server currently connected. For each one, check how many tools it exposes. Anthropic’s documentation explicitly recommends this as the first step to controlling MCP costs.

    Method 2 — Token-count delta. Send a single message in Claude Code with no MCP servers connected and note the input token count from the API response. Reconnect your MCP servers one at a time. The delta in input tokens between configurations is the per-turn cost of each server. This is the most precise way to know your own number.

    Anything north of about 8,000 tokens per turn in pure MCP overhead is worth optimizing. North of 15,000 is a flag.

    Concrete steps to control MCP token cost

    • Disable MCP servers you aren’t actively using. The single highest-leverage move. If you connected a server two weeks ago for one experiment and never went back to it, every turn you’ve taken since has been paying for it.
    • Prefer CLI tools over MCP servers when both exist. Anthropic’s own cost-management guidance notes that tools like gh, aws, gcloud, and sentry-cli remain more context-efficient than equivalent MCP servers because they don’t add per-tool listing overhead. Claude can simply invoke them via the bash tool.
    • Use MCP gateways for large server counts. If you genuinely need many tools available, gateway products (Maxim, Milvus-backed setups, others) consolidate tools and surface only relevant ones per query, cutting net overhead substantially.
    • Run a complex CLAUDE.md audit. Long project-level CLAUDE.md files compound the per-turn baseline. Treat CLAUDE.md as an asset that’s expensive to keep verbose.
    • Watch for context compounding. In long Claude Code sessions, conversation history grows alongside the tool schema cost. If you’re running a workflow longer than 20 turns, periodically clear context (/clear) to reset the per-turn cost to baseline.

    Frequently Asked Questions

    Does every MCP server cost 18,000 tokens?

    No. The 18,000-token figure is for a typical multi-server setup with several connected servers and built-in tools active. A single small MCP server (5–10 tools, concise descriptions) might only add 1,500–3,000 tokens. The cost scales with the number of servers and the verbosity of their tool definitions.

    Why does Claude reload the tool definitions every turn?

    The Claude API is stateless. Every message is a fresh API request containing the full conversation history and the full tool schema. The model has no memory between requests, so the schema must be present every time tools could be used. Recent deferred-loading optimizations reduce this for unused tools, but anything Claude actually needs still loads each turn.

    How do I see what’s loaded in my Claude Code environment?

    Run /mcp in Claude Code to list every connected MCP server and its tool count. To check the actual token cost, send a test message and inspect the input token count returned by the API.

    Are CLI tools really cheaper than MCP servers?

    Yes, for tools that have both options. CLI tools accessed via the bash tool only add the bash tool’s 245-token overhead. An equivalent MCP server adds its full tool schema for every tool it exposes. For tools you use frequently, MCP can still be worth it for the structured interface; for tools you use rarely, CLI is more efficient.

    Does this affect Claude on the web (claude.ai) too?

    Web Claude does not use the same MCP server-connection model as Claude Code. The MCP token-overhead pattern primarily affects Claude Code, custom Agent SDK applications, and other developer-facing clients where you wire in MCP servers directly.

    Will this get better in future Claude releases?

    Likely. Anthropic has already shipped deferred tool loading and Tool Search in 2026, both of which materially reduce the per-turn overhead for unused tools. The architectural baseline (tools must be present in context to be invoked) is unlikely to change, but the practical cost should keep dropping as the deferred-loading optimizations mature.

    Related Reading

    How we sourced this

    Sources reviewed May 15, 2026:

    • Anthropic GitHub: anthropics/claude-code issue #3406, “Built-in tools + MCP descriptions load on first message causing 10-20k token overhead” (primary source for the overhead figure and Anthropic acknowledgment)
    • Anthropic Claude Code documentation: Connect Claude Code to tools via MCP and Manage costs effectively (primary source for /mcp command and CLI vs. MCP guidance)
    • Anthropic Pricing Documentation: tool-use system prompt token counts, bash/text-editor/computer-use overheads (primary source for the per-tool fixed costs)
    • Independent analysis: MindStudio (multiple Claude Code MCP measurements), Joe Njenga’s Tool Search 51K→8.5K measurement, Maxim and Scott Spence on optimization patterns (Tier 2 confirming sources)

    Token-cost numbers in this article are accurate as of May 15, 2026. Anthropic is shipping MCP optimizations regularly, so the practical overhead may be lower in your environment than what’s described here.

  • Claude Code Pricing in May 2026: What $20, $100, and $200 a Month Actually Buy You

    Claude Code Pricing in May 2026: What $20, $100, and $200 a Month Actually Buy You

    Last refreshed: May 15, 2026

    Claude Code pricing has stopped being a clean sticker number and started being a question of which ceiling you hit first. There is a $20 plan, a $100 plan, and a $200 plan — and underneath all three sits a 5-hour rolling window, a weekly active-hours cap added in August 2025, and a per-model multiplier that quietly makes Opus 4.7 the most expensive thing you can do inside the terminal. If you came looking for the right plan, the honest answer is: it depends on whether you are mostly a Sonnet operator or you live in Opus.

    The three subscription tiers, stripped down

    Pro — $20/month. Access to Claude Code in the terminal, web, and desktop, with both Sonnet 4.6 and Opus 4.7 available. The practical envelope is about 44,000 tokens per 5-hour window and roughly 40–80 weekly active hours on Sonnet, depending on session concurrency. This is the plan for someone running Claude Code a few hours a day on focused work — refactors, scoped feature builds, debugging passes — not someone leaving an agent running while they eat lunch.

    Max 5x — $100/month. Five times the Pro envelope, plus priority during peak demand. The window allocation lands around 88,000 tokens per 5-hour block. This is the tier where you stop thinking about token budgets during a single working day and start thinking about them across a whole week. Picked correctly, it is the cheapest way to use Claude Code as your primary IDE companion without flipping over to API billing.

    Max 20x — $200/month. Twenty times Pro — about 220,000 tokens per window — which translates to roughly 480 Sonnet-hours or about 40 Opus-hours per week before the weekly cap kicks in. Real-world reports from early 2026 had $200/month users watching single Opus prompts eat 10–20% of their daily allocation; Anthropic publicly acknowledged the problem, expanded capacity, and doubled the 5-hour rate limit for Pro and Max accounts. If you are running Claude Code across multiple repos all week and reaching for Opus on the hard problems, this is the tier that stops you from staring at a rate-limit wall.

    The API, as a sanity check

    If you want a sanity check on whether the subscription math works, price the same workload against the API:

    • Claude Haiku 4.5 (claude-haiku-4-5-20251001): $1.00 input / $5.00 output per million tokens
    • Claude Sonnet 4.6 (claude-sonnet-4-6): $3.00 input / $15.00 output per million tokens
    • Claude Opus 4.7 (claude-opus-4-7): $5.00 input / $25.00 output per million tokens

    Prompt caching is the lever almost nobody uses correctly. Cache writes cost 1.25x input price for the 5-minute TTL or 2.0x for the 1-hour TTL, but cache reads cost 0.10x — a 90% discount on every subsequent request that hits the same context. If your .clauderules file, project map, and the file you are editing are all stable for an hour, the bill on a long pairing session can drop by an order of magnitude. The Batch API knocks another 50% off both directions for asynchronous workloads, which is worth knowing if you are running large refactor sweeps.

    One trap on Opus 4.7 specifically: the model uses a new tokenizer that inflates token counts by up to 35% on identical text compared to Opus 4.6. The headline price did not change, but your effective spend per request did — sometimes by nothing, sometimes by a third, depending on the content. If you migrated from Opus 4.6 and your bill went up without your prompt patterns changing, that is the reason.

    How to actually choose

    The cleanest way to pick a plan is to first decide your model mix, then your weekly hours.

    If you are mostly a Sonnet operator — long agentic runs, multi-file edits, codebase Q&A, with Opus only reached for on the architectural questions — Pro at $20 is plausible up to about 5–8 hours of focused use per day, Max 5x covers most full-time individual developers, and Max 20x is overkill unless you are running multiple sessions in parallel.

    If you live in Opus — long-horizon agentic work, hard refactors across many files, anything where you would rather have one good attempt than three Sonnet retries — Pro will frustrate you within two weeks, Max 5x is the realistic floor, and Max 20x is the only tier that gives you a defensible Opus envelope without bouncing over to API billing.

    And if you are running Claude Code across multiple repos all week, leaving agents to grind on tasks while you do other things, Max 20x is the only subscription that holds up — and even then, the weekly cap is real. Use the API for the spillover and you will still come out cheaper than trying to brute-force a smaller plan.

    The number that matters

    One developer’s public report this year: roughly 10 billion tokens consumed across Claude Code over eight months. API metered cost would have exceeded $15,000. The same workload on Max at $100/month for the same window came in around $800 — about 93% cheaper. That is the gap that makes the subscription model worth taking seriously, even when the rate limits feel arbitrary. The $200 tier is not a vanity number; it is the price Anthropic charges to stop being a meaningful constraint on your workflow.

    The right way to read Claude Code pricing in May 2026 is not to ask which plan is cheapest. It is to ask which plan is the cheapest one that disappears — the one that stops appearing in your day. For most full-time developers reaching for Opus regularly, that plan is Max 20x. For everyone else, Max 5x is the first plan that actually gets out of your way.

  • Claude MCP in 2026: What Actually Changed and How to Configure It Without Wasting Tokens

    Claude MCP in 2026: What Actually Changed and How to Configure It Without Wasting Tokens

    Last refreshed: May 15, 2026

    If you set up Claude MCP six months ago and have not touched the config since, three things have changed underneath you: the recommended transport, how tools are loaded into context, and how teams share server configs. None of these are cosmetic. If you ignore them, you are leaving tokens, money, and stability on the table.

    This is the working Claude MCP setup I use in May 2026 — what the claude mcp add command actually does, which scope to pick, what the deprecation of SSE means in practice, and where Claude Code still falls short.

    The three-scope mental model

    Every MCP server you wire into Claude Code lives at exactly one of three scopes. Get this wrong and you will either leak credentials into git or wonder why your teammate cannot use the same database the AI just queried.

    • Local (default): the server is available only to you, only inside the current project. Config is written into your project’s entry inside ~/.claude.json. Good for project-specific servers like a dev database or a Sentry project key you do not want other repos to inherit.
    • User: the server is available to you across every project on your machine. Also stored in ~/.claude.json. This is where GitHub, search providers, and personal productivity servers belong.
    • Project: the server is written to a .mcp.json file at the repo root and shared with the whole team via git. Claude Code prompts for approval the first time a teammate opens the project — by design, because anyone who can push to the repo can wire a new server into your environment.

    When the same server is defined in more than one scope, Claude Code resolves it in this order: local beats project beats user beats plugin-provided. This is the part that bites people the most. If you have a “github” entry at user scope and someone adds a different “github” entry at project scope in .mcp.json, the project definition wins for that repo. Run claude mcp list when something behaves strangely.

    The commands you actually need

    The CLI is more useful than the docs make it look. Three commands cover ~90% of real setup work:

    # Add a remote HTTP MCP server at user scope (available everywhere)
    claude mcp add --transport http hubspot --scope user https://mcp.hubspot.com/anthropic
    
    # Add a local stdio server scoped only to this project
    claude mcp add my-db -s local -- node ./scripts/db-mcp.js
    
    # Share a server with your team via the repo's .mcp.json
    claude mcp add my-server -s project -- node server.js

    The short flag is -s, the long is --scope. The -- separator is required for stdio servers because everything after it is treated as the literal command to spawn. Forget it and Claude Code will try to interpret your Node arguments as its own flags.

    SSE is dead. Use Streamable HTTP.

    If your MCP server documentation still tells you to use the sse transport, the documentation is stale. The MCP spec dated 2025-03-26 introduced Streamable HTTP and simultaneously deprecated HTTP+SSE. Through 2026, vendor after vendor has set hard cutoff dates — Atlassian’s Rovo MCP server keeps SSE around until June 30, 2026 and then drops it; Keboola pulled SSE on April 1; Cumulocity’s AI Agent Manager flipped to Streamable HTTP on May 8.

    Why this matters beyond a name change: SSE required Claude Code to hold a persistent connection to a single server replica, which broke horizontal scaling and made every transient network blip a reconnection drama. Streamable HTTP is stateless. Multiple replicas behind a load balancer just work. If you have flaky MCP connections in production, the first thing to check is whether the server is still on SSE.

    For new setups, use --transport http. The older --transport sse still functions but is on the deprecation path.

    Tool Search is the feature you should actually care about

    The single biggest change in how Claude Code uses MCP in 2026 is lazy tool loading via Tool Search. Older MCP clients dumped every tool schema from every connected server into the model’s context window at the start of every conversation. With ten servers wired up that could easily be 20,000+ tokens of overhead before you typed a single character.

    Tool Search inverts this. Claude Code keeps only the server names and short descriptions resident. When a tool is actually needed, it fetches that tool’s full schema on demand. Anthropic’s own documentation says this reduces tool-definition context usage by roughly 95% versus eager-loading clients. In practice that means you can run a serious MCP fleet — GitHub, Sentry, a database, a search provider, your internal API — without quietly burning through your context budget. The Sonnet 4.6 and Opus 4.7 1M-token context window does not save you here, because anything you let crowd the prompt is also being re-read on every turn.

    Companion feature: list_changed notifications. An MCP server can now tell Claude Code “my tool list changed” and Claude Code refreshes capabilities without a disconnect-reconnect dance. If you build your own server, emit this when you swap tool definitions and you save users a restart.

    What it still gets wrong

    Honest take: claude mcp list still does not surface scope information for every entry in a useful way — there is an open issue on the anthropics/claude-code repo asking for it (#8288 if you want to track). Project-scoped servers from .mcp.json have a separate history of not appearing in the list output (#5963) depending on how you opened the project. If you cannot find a server, check both ~/.claude.json and ./.mcp.json directly.

    The other rough edge is the project-approval prompt. The first time you open a repo with a new .mcp.json, Claude Code asks you to approve each project-scoped server. That is the right security default. It is also infuriating in CI or any non-interactive shell, where the prompt blocks the session. The current workaround is to bake the servers in at user scope on build agents so the project-scope approval never fires in CI. A cleaner non-interactive approval flow is the single most-requested fix I see in real teams.

    The setup I would run on a new machine today

    User-scope: GitHub, a code search server, and a single notes/Notion server. Project-scope in each repo’s .mcp.json: whatever database the project owns and whatever observability backend it reports to. Local-scope: anything experimental I am evaluating but do not want my team or my other repos to inherit.

    Pin --transport http on everything remote. Skip Desktop Extensions (.dxt) for anything you want versioned with the codebase — they are a Claude Desktop convenience, not a Claude Code primitive, and they hide the config from your team. Run claude mcp list when something is off and read .mcp.json directly when list is unhelpful.

    That is the whole working model. The pieces that matter — three scopes, Streamable HTTP, Tool Search — fit on a single screen. The pieces that have not caught up yet — list output, non-interactive approvals — are visible in the issue tracker and will move.

  • OpenRouter as Your Claude Budget Layer: Free Models for Triage, Claude for What Matters

    OpenRouter as Your Claude Budget Layer: Free Models for Triage, Claude for What Matters

    Last refreshed: May 15, 2026

    OpenRouter is a single API endpoint that gives you access to Claude, GPT-4o, Gemini Flash, Llama 3, Mistral, and dozens of other models — including several that are free or near-free — through one standardized interface. For anyone building Claude workflows on a budget, OpenRouter is not optional infrastructure. It is the orchestration layer that makes intelligent model routing practical without building your own multi-provider integration.

    The core strategy: use free or cheap models for the work that doesn’t need Claude, and route only the remainder to Claude. In a well-designed pipeline, you pay Opus prices for 20% of the work and get Opus-quality output on the parts that genuinely require it. Claude on a Budget pillar

    The OpenRouter API in 30 Seconds

    const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${OPENROUTER_API_KEY}`,
        "Content-Type": "application/json"
      },
      body: JSON.stringify({
        model: "anthropic/claude-sonnet-4-6",  // or "meta-llama/llama-3.3-70b-instruct:free", "openrouter/auto"
        messages: [{ role: "user", content: prompt }]
      })
    });

    Switch the model string to change providers. No new SDKs, no new authentication flows, no restructuring your application. The same call routes to Claude, Gemini, or a free Llama instance.

    The Multi-Model Pipeline Pattern

    The Tygart Media multi-model roundtable methodology — documented in the Knowledge Lab — uses this architecture:

    1. First pass (free or cheap model): Send the full input set to Llama 3.3 70B (free) or Qwen3 Coder via openrouter/free. Task: filter, classify, score, or sort. Return only the items that meet the threshold — the top 20%, the flagged items, the ones that need deeper processing.
    2. Second pass (Claude Sonnet 4.6 or Opus): Send only the filtered output to Claude. Task: reason, synthesize, write, decide. Claude sees pre-filtered, pre-organized input — no token waste on low-value items.
    3. Synthesis (Claude): Claude consolidates findings from both passes into a final output. It operates on structured inputs, not raw noise.

    In practice: if you’re processing 100 pieces of content to find the 20 worth writing about, the free model reads all 100 and returns 20. Claude reads 20 and writes 5. You paid free-tier prices for the reading work and Claude prices only for the synthesis work that Claude is actually better at.

    Free and Near-Free Models Worth Knowing

    ModelCostBest for
    meta-llama/llama-3.3-70b-instruct:freeFreeClassification, filtering, strong reasoning at zero cost
    qwen/qwen3-coder-480b:freeFreeCode triage, structured extraction, 262K context
    nvidia/nemotron-3-super:freeFreeAgentic workflows, multi-modal triage
    google/gemini-2.5-flash~$1.00/1M tokensMid-tier reasoning, fast summarization
    anthropic/claude-haiku-4-5$1.00/$5.00/1MHigh-quality triage requiring Claude behavior

    When to Still Use Claude Directly

    OpenRouter’s free models are not Claude. They have different safety behaviors, different instruction-following reliability, and different output quality on nuanced tasks. Use free models for tasks where the output is a structured signal (score, category, yes/no, ranked list) that Claude will then act on — not for tasks where the free model’s output goes directly to a human or into production.

    The routing rule: if the output of the cheap/free model is an input to Claude, it can be imperfect — Claude will catch errors in its synthesis pass. If the output goes directly to a user or a system, it needs Claude-quality reliability. Do not route customer-facing outputs through free models.

    OpenRouter for the Multi-Model Roundtable

    Beyond pipeline routing, OpenRouter enables the multi-model roundtable methodology: send the same complex question to Claude, GPT-4o, and Gemini Flash simultaneously. Each model responds independently. Claude synthesizes the responses into a final recommendation with consensus points and disagreement flags. You get multi-model confidence for 3× the cost of a single Claude call — but often 10× the confidence in the output, particularly for strategic decisions where single-model bias is a real risk.

    The roundtable approach is documented in the Tygart Media Knowledge Lab and has been used for technology stack decisions, content strategy, and architecture choices where getting it wrong is expensive. The pattern: Llama 3.3 70B or Gemini 2.5 Flash for broad initial perspectives (free or near-free), Claude for synthesis (most reliable reasoning), GPT-4o for the contrarian check.

    Sign up for OpenRouter at openrouter.ai. API key creation is instant; credits load immediately. The free models require no payment method on file.

    Part of the Claude on a Budget series. Next: The

  • Claude Model Routing 101: The Decision Tree for Haiku, Sonnet, and Opus

    Claude Model Routing 101: The Decision Tree for Haiku, Sonnet, and Opus

    Last refreshed: May 15, 2026

    Claude Opus 4.7 costs $25 per million output tokens. Claude Haiku 4.5 costs $5 per million output tokens. That is a 5× difference in list price — and in practice, closer to 20× when you account for Opus 4.7’s token inflation (it generates roughly 1.0–1.35× more tokens per task than Haiku at the same list price, depending on content type).

    For the majority of tasks in a typical Claude workflow, that cost difference buys you nothing. Haiku and Opus produce indistinguishable output on sorting, classification, summarization, simple Q&A, format conversion, and first-pass drafting. The performance gap is real — but it only appears on tasks that genuinely require extended reasoning, complex code generation, nuanced judgment, or maximum creative quality. Most tasks don’t. Claude on a Budget pillar

    The Decision Tree

    Use Haiku 4.5 when:

    • Classifying or tagging items (sentiment, category, priority, topic)
    • Summarizing documents where the summary template is well-defined
    • First-pass triage — deciding which items need deeper processing
    • Format conversion — JSON to markdown, CSV to structured output, etc.
    • Simple Q&A with factual answers from provided context
    • Extracting structured data from unstructured text
    • Generating short, templated outputs (subject lines, meta descriptions, titles)
    • Any high-volume, time-insensitive batch job

    Use Sonnet 4.6 when:

    • Writing full articles, reports, or long-form content
    • Mid-complexity code generation and debugging
    • Research synthesis across multiple sources
    • Drafting emails, proposals, or documents requiring judgment
    • Multi-step reasoning where Haiku loses the thread
    • Any task where you’ve tested Haiku and found the output quality insufficient

    Use Opus 4.7 when:

    • Architecture decisions with significant downstream consequences
    • Security-sensitive code review or vulnerability analysis
    • Complex multi-file refactoring with interdependencies
    • Tasks requiring the xhigh effort level (extended chain-of-thought)
    • Creative work where you need maximum quality judgment
    • Any task where Sonnet has failed and you need the ceiling

    The Cost Math at Scale

    Assume a content operation running 500 Claude tasks per month. Default behavior (everything on Opus): ~500,000 output tokens × $25/M = $12.50/month at minimum. Routed behavior (300 Haiku, 150 Sonnet, 50 Opus): (300K × $5) + (150K × $15) + (50K × $25) = $1.50 + $2.25 + $1.25 = $5.00/month. That is a 60% cost reduction with identical output quality on the Haiku and Sonnet tasks.

    At enterprise scale — thousands of tasks per day — the routing decision is worth six figures annually. At individual scale, it is the difference between a Claude workflow that is financially sustainable and one that quietly drains budget.

    How to Implement Routing

    In Claude Code: the gateway model picker

    Claude Code v2.1.126 (released May 1, 2026) ships a gateway model picker that lets you configure model routing per task type within a session. Set Haiku as the default for file reading, search, and summarization; route complex reasoning to Sonnet or Opus explicitly. The configuration lives in your Claude Code settings and applies automatically.

    In the API: explicit model parameter

    Every Anthropic API call takes a model parameter. Build a routing function in your application layer that maps task types to model strings. The routing logic can be as simple as a conditional or as sophisticated as a classifier (ironically, run on Haiku) that reads the task description and returns the appropriate model string.

    In Cowork and manual workflows: develop the habit

    For non-programmatic use, routing is a habit built through one question before every Claude task: does this task actually need Opus? Run a two-week audit. For every task you run on Opus, note whether Haiku would have produced the same output. Most people discover that 60–70% of their Opus usage could move to Haiku or Sonnet with no quality loss.

    Part of the Claude on a Budget series. Next: OpenRouter as the Budget Layer →

  • The Claude Cold Start Problem: How a Second Brain Eliminates Your Most Expensive Tokens

    The Claude Cold Start Problem: How a Second Brain Eliminates Your Most Expensive Tokens

    Last refreshed: May 15, 2026

    Every Claude session has a cold start cost. Before Claude can do useful work, it needs to know who you are, what you’re building, what decisions you’ve already made, what your brand voice sounds like, and what context is relevant to the task at hand. If that context doesn’t exist in the session, you spend tokens building it — through back-and-forth clarification, through pasting in background, through re-explaining things Claude knew perfectly well last Tuesday.

    For a power user running multiple Claude sessions daily, cold start costs are not trivial. A 2,000-token orientation exchange at the start of each session, five sessions a day, 20 working days a month = 200,000 tokens of pure overhead. At Opus prices, that’s $5/month in tokens that produced zero output. At scale, with teams, it compounds fast.

    The solution is a persistent knowledge architecture that eliminates cold starts entirely. Back to the Claude on a Budget pillar

    The Three Layers of Cold Start Elimination

    Layer 1: CLAUDE.md — The Global Instruction File

    Claude Code and Claude’s desktop tools support a CLAUDE.md file in your working directory. This file loads automatically at the start of every session — no input required, no tokens spent on orientation. It is your persistent instruction set: who you are, how you work, what conventions to follow, what tools are available, what Notion databases contain what, how to route decisions.

    A well-built CLAUDE.md replaces 500–2,000 tokens of orientation with zero tokens — the file is read, not typed. The cost of writing it once is recovered in the first week of use. Every instruction you find yourself repeating across sessions belongs in CLAUDE.md.

    What to put in CLAUDE.md: your name and operating context; your active projects and their current status; your tool stack (which MCP servers are running, which Notion databases hold what); your output preferences (format, length, tone); your recurring workflows and the skills or commands that drive them; any decisions already made that Claude should not re-litigate.

    Layer 2: Notion as Second Brain — The Knowledge That Doesn’t Repeat

    A Notion second brain functions as Claude’s long-term memory between sessions. When Claude finishes a task, it logs the outcome, the decisions made, and the context that future sessions will need. When Claude starts a new session, it fetches that context rather than reconstructing it from scratch.

    The Tygart Media implementation uses a Second Brain database in Notion with structured entries per project, per client, and per system. The notion-deep-extractor skill runs every 8 hours, crawling recently edited Notion pages and injecting new knowledge into the Second Brain database automatically. Claude never starts a session unaware of what happened in the last session — that context is fetched on demand through the Notion MCP.

    The token math: fetching a 500-token Notion page costs 500 input tokens. Re-explaining the same context through conversation costs 500+ tokens of input plus 200+ tokens of Claude’s clarifying questions plus your typing time. The fetch is always cheaper, and it is more accurate — your Notion page says exactly what you intended, not a conversational approximation of it.

    Layer 3: Project Knowledge Files — Session-Specific Pre-Loading

    For recurring project work, a project knowledge file is a curated document that contains everything Claude needs to be immediately productive on that project: the brief, the audience, the tone guidelines, the existing content structure, the decisions already made, the open questions. Loaded at the start of a project session, it replaces 10–15 minutes of orientation with 30 seconds of file loading.

    The project-knowledge-builder skill generates these files automatically for WordPress sites — pulling existing posts, categories, brand voice, SEO context, and site history into a structured document. The same pattern applies to any recurring project: client accounts, content series, product builds, research projects.

    The Concentrated Output Connection

    Cold start elimination and output compression work together. When Claude starts a session already knowing the context, it can skip the exploratory phase and go straight to the task. When you’ve defined in CLAUDE.md that you want structured outputs — briefings, scored lists, run logs — Claude produces them without the verbose preamble that precedes them in orientation-heavy sessions.

    The Tygart Media daily briefing is the clearest example: the desk spec in Notion defines the output format, the sources, the beat structure, and the run log format. Claude fetches the spec, executes, and produces a structured briefing page. No orientation. No format negotiation. No verbose preamble. Every token is productive output.

    Implementation Steps

    1. Audit your last 10 Claude sessions. For each one, identify the first message where Claude produced genuinely useful output. Everything before that is cold start cost. Measure it.
    2. Write your CLAUDE.md. Start with the context you typed most often in those 10 sessions. One hour of writing recovers itself within days.
    3. Create one project knowledge file for your highest-frequency project. Use it for one week and compare session start times and output quality against the prior week.
    4. Set up Notion logging. At the end of each session, have Claude write a 3–5 sentence log entry: what was done, what decisions were made, what the next session needs to know. Store in a Notion database. Fetch at the start of the next session.

    The cold start problem is the most invisible Claude cost because it feels like normal conversation. Once you measure it, it becomes obvious. Once you eliminate it, you cannot go back.

    Part of the Claude on a Budget series.

  • Claude on a Budget: The Complete Guide to Maximum Output at Minimum Token Cost

    Claude on a Budget: The Complete Guide to Maximum Output at Minimum Token Cost

    Last refreshed: May 15, 2026

    The price of a Claude Opus 4.7 token is $25 per million output tokens. In India, that translates to roughly ₹16,800 per month for a Pro subscription — priced at US dollar rates with no regional adjustment. You cannot change that number. What you can change is how many tokens you spend to get the same result, how often you reach for the expensive model when a cheaper one would do, and how much context you burn re-warming Claude on things it already knows.

    This guide is the pillar for the Claude on a Budget cluster on Tygart Media. Every tactic below has a dedicated deep-dive article linked from here. The core insight running through all of it: the biggest Claude cost savings are not about using Claude less — they are about using Claude smarter. The goal is the same output quality at a fraction of the token spend.

    The 7 Levers That Actually Move the Number

    1. Eliminate the Cold Start — Build a Second Brain

    Every time you start a Claude session without pre-loaded context, you pay tokens to re-warm it: who you are, what you’re building, what decisions you’ve already made, what your brand voice sounds like. A well-architected second brain — Notion pages, CLAUDE.md files, project knowledge files — eliminates that cost entirely. Claude starts knowing what matters. The first token of every session is productive, not orientation. Full guide: The Cold Start Problem →

    2. Route by Task — Don’t Default to Opus

    Claude Haiku 4.5 is roughly 30× cheaper per token than Claude Opus 4.7. For sorting, classification, summarization, first-pass triage, and simple Q&A, Haiku delivers quality that is indistinguishable from Opus at the task level. The decision tree: Haiku for speed and volume, Sonnet 4.6 for mid-tier reasoning and writing, Opus 4.7 only when the task genuinely requires maximum capability. Most workflows over-use Opus by a factor of 3–5×. Full guide: Model Routing 101 →

    3. Use OpenRouter as the Budget Orchestration Layer

    OpenRouter gives you a single API that routes to Claude, GPT-4o, Gemini Flash, Llama, Mistral, and dozens of free-tier models through one endpoint. The practical workflow: use a free or near-free model for first-pass sorting and filtering, route only the items that pass the filter to Claude for reasoning and synthesis. You pay Opus prices for 20% of the work and get Opus-quality output on the parts that matter. Full guide: OpenRouter as the Budget Layer →

    4. Run Non-Urgent Work Through the Batch API

    Anthropic’s Batch API processes requests asynchronously and costs 50% less than the standard API at every model tier. Any work that does not need an immediate response — content generation, classification runs, analysis jobs, report generation — should run through the Batch API. The only cost is latency: batches complete within 24 hours. For most content and automation workflows, that trade is straightforwardly worth it. Full guide: The Batch API →

    5. Cache Your Repeated Context

    Anthropic’s prompt caching reduces the cost of repeated context by up to 90% on cached tokens. If you send the same system prompt, knowledge base, or skill file at the start of every session, caching means you pay full price once and a fraction on every subsequent call. The math compounds quickly: a 10,000-token system prompt sent 100 times costs 10× less with caching than without. Most people running Claude at scale are not using this. Full guide: Prompt Caching →

    6. Write Concentrated Outputs — Not Full Meals

    The single biggest controllable output cost is verbosity. A Claude response that delivers the same information in 200 tokens costs one-fifth as much as one that delivers it in 1,000. Structured output formats — scored lists, run logs, briefings, decision tables — deliver more actionable signal per token than open-ended prose. The discipline of asking for concentrated slices instead of full meals is the fastest zero-cost saving available to any Claude user. Full guide: Output Compression →

    7. Shape Content for the Model That Will Cite It

    Claude, ChatGPT, and Perplexity cite completely different types of pages. Claude concentrates on factual, access-related, answer-first content. ChatGPT spreads across comparison and geographic content. Perplexity favors research-flavored deep dives. If you are creating content that you want AI assistants to surface, writing for all three models equally is inefficient — you spend more words getting cited less. Shaping content to match the citation pattern of your target model gets more traction at lower content cost. Full guide: Per-Model Content Shaping →

    The Numbers Behind These Levers

    ModelInput (per 1M tokens)Output (per 1M tokens)Best for
    Claude Haiku 4.5$1.00$5.00Triage, classification, simple Q&A
    Claude Sonnet 4.6$3.00$15.00Writing, mid-tier reasoning, content
    Claude Opus 4.7$5.00$25.00Complex reasoning, architecture, security
    Batch API (any tier)50% off50% offAny non-urgent async work
    Prompt cache hit~90% offn/aRepeated system prompts / knowledge bases

    A workflow that currently runs Opus on every call, sends the same system prompt uncached, and generates verbose prose responses could realistically cut its token spend by 70–85% by applying all seven levers — without any reduction in output quality on the tasks that matter.

    Who This Is For

    This cluster was built with three audiences in mind: Indian developers and teams facing US-dollar Claude pricing on local-currency budgets; independent creators and small teams who cannot justify enterprise-tier spend; and anyone running Claude at scale in production who wants to stop leaving money on the table. The tactics work regardless of where you are — but they matter most where the price-to-income ratio is highest.

    Every article in this cluster is self-contained and actionable. Start with whichever lever applies to your situation, or read them in order if you are building a Claude stack from scratch.

  • Claude Opus 4.7 Is Secretly ~40% More Expensive Than Opus 4.6 — Here’s Why

    Claude Opus 4.7 Is Secretly ~40% More Expensive Than Opus 4.6 — Here’s Why

    Last refreshed: May 15, 2026

    Model Accuracy Note — Updated May 2026

    Current flagship: Claude Opus 4.7 (claude-opus-4-7). Current models: Opus 4.7 · Sonnet 4.6 · Haiku 4.5. This article compares Claude Opus 4.7 pricing to Opus 4.6 as a historical baseline. Opus 4.7 is the current flagship. Both models share the $5/$25.00 per MTok list price.. See current model tracker →

    Anthropic announced Claude Opus 4.7 with the same list pricing as Opus 4.6: $5 per million input tokens, $25 per million output tokens. What Anthropic did not announce — and what Simon Willison surfaced through direct tokenizer analysis — is that Opus 4.7 generates approximately 1.46× more tokens for the same text output as Opus 4.6. That is a ~40% real-world cost increase at unchanged list prices.

    This is not a criticism of the model. Opus 4.7 is genuinely better — 3× higher vision resolution, a new xhigh effort level, improved instruction following, higher-quality interface and document generation. The performance gains are real. The cost increase is also real, and it is not being communicated transparently in Anthropic’s pricing documentation. If you are budgeting for Claude API usage, you need to account for this.

    What Token Inflation Means

    Token inflation occurs when a model generates more tokens to express the same semantic content. It happens for several reasons: more detailed reasoning traces, more verbose explanations, additional caveats and structure, or architectural changes in how the model constructs its output. Opus 4.7 appears to produce more elaborated, structured responses than 4.6 by default — which accounts for the 1.46× multiplier.

    The practical effect: if you were spending $10,000/month on Opus 4.6 for a production application, the same application workload on Opus 4.7 costs approximately $14,600/month — before any intentional use of the new xhigh effort level, which adds further token consumption on top of the baseline inflation.

    How to Measure Your Actual Exposure

    Do not estimate — measure. Here is the four-step process:

    1. Pull your last 30 days of Anthropic API usage data from your platform dashboard. Note your average output token count per call for your primary workloads.
    2. Run a representative sample of those same workloads on Opus 4.7 using the API directly, with identical prompts and system messages. Log output token counts for each call.
    3. Calculate your actual multiplier — it may be higher or lower than 1.46× depending on your specific prompt patterns and use cases. Tasks with highly constrained output formats (structured JSON, fixed-length summaries) will see lower inflation than open-ended generation.
    4. Apply the multiplier to your budget model and adjust your spend projections before migrating production workloads to Opus 4.7.

    Mitigation Strategies

    Several approaches can reduce the cost impact while preserving Opus 4.7’s quality gains:

    • Explicit length constraints in system prompts. Adding “Respond in 200 words or fewer” or “Use bullet points, not paragraphs” constraints does not reduce quality on most tasks but meaningfully constrains token generation. Test which of your prompts accept length constraints without quality loss.
    • Model routing by task type. Use the new gateway model picker in Claude Code, or implement explicit routing in your API calls: Opus 4.7 for the tasks where quality genuinely requires it, Sonnet 4.6 or Haiku 4.5 for high-volume tasks where speed and cost matter more than peak quality. The cost difference between Haiku and Opus is roughly 30×.
    • Avoid xhigh effort unless necessary. The new xhigh effort level in Opus 4.7 consumes significantly more tokens than the default effort setting. Reserve it for tasks where maximum quality is genuinely required — complex reasoning, high-stakes code generation, detailed document analysis. Do not set it as a default.
    • Evaluate Sonnet 4.6 for your use case. For many production workloads, Claude Sonnet 4.6 at $3/$15 per million tokens delivers quality that is indistinguishable from Opus 4.7 at the task level. The Opus tier is most clearly differentiated on the most difficult tasks — extended chain-of-thought reasoning, complex multi-step coding, nuanced creative judgment. Benchmark your specific workloads before assuming Opus is required.

    The Transparency Gap

    Anthropic’s pricing page lists token costs accurately. What it does not document is how output token counts change across model versions for equivalent tasks. This is an industry-wide gap, not an Anthropic-specific failing — no major AI provider documents per-task token consumption differences between model versions in their pricing documentation.

    The practical implication for any team managing AI infrastructure: treat “same price per token” announcements as partial information. Always benchmark your actual workloads on new model versions before migrating production traffic. The 1.46× multiplier Willison measured is for general text — your specific workload multiplier will be different, and you need to know it before your invoice arrives.

    Claude Opus 4.7 is available now through the Anthropic API at platform.claude.com. API pricing: $5/M input tokens, $25/M output tokens. Measure before you migrate.