Tag: AI Comparison

  • What We Learned Querying 54 LLMs About Themselves (For $1.99 on OpenRouter)

    What We Learned Querying 54 LLMs About Themselves (For $1.99 on OpenRouter)

    The headline: In mid-May 2026, we ran an autonomous OpenRouter session querying 54 LLMs about their own identity, capabilities, and training. Total cost: $1.99 against a $270 starting balance. 43 substantive responses, 10 documented failures, 1 reasoning-only response. The most interesting finding: aion-2.0 identified itself as Claude — concrete evidence of training-data identity inheritance across LLMs. This article walks through the methodology, the reliability data, and what cheap multi-model research now makes possible.

    This is part of our OpenRouter coverage. For the operator’s view on why we run model research through OpenRouter, see the field manual. For the structured decision methodology that multi-model setups also enable, see the roundtable methodology.

    The setup

    In mid-May 2026 we ran an autonomous session designed to extract self-knowledge from a wide sample of available LLMs. The question structure was simple: ask each model about its own identity, training, capabilities, and limits, then capture the response for cross-comparison.

    The scope expanded mid-execution from the original 50 to 54 models — the OpenRouter catalog had grown during the session itself, which is its own data point about how fast this ecosystem moves.

    The architecture: a Python script with parallel bash execution, a max-wait timeout per model, graceful per-provider error handling, and Notion publishing of each model’s response as a separate Knowledge Lab entry. Everything billed through OpenRouter.

    The cost: $1.99 against a $270 starting balance. Less than two dollars to canvas 54 frontier and near-frontier models on a question of self-identity.

    The hit rate

    Of 54 models queried, 43 returned substantive responses. One returned a reasoning trace without final content (GPT-5.5 Pro, which we counted as a valid capture given the reasoning content was the interesting part). 10 returned documented failures.

    That’s 81% substantive completion. For a fully autonomous run against a heterogeneous provider pool with no per-model tuning, that’s a meaningful number.

    The 10 failures broke down into clear categories:

    • Rate limiting (429 errors): persistent on a handful of providers. Some had genuine quota issues; some appeared to be hitting upstream limits we couldn’t see from our side.
    • Forbidden (403): providers refusing the request entirely, often for reasons related to account configuration we hadn’t completed.
    • Not found (404): model IDs that had moved or been deprecated between our model-list scrape and the execution.
    • Timeouts: the most interesting category. Grok 4.20 multi-agent consistently exceeded our timeout window — not because it was slow, but because it appears to orchestrate sub-agents that genuinely take more than 40 seconds to produce a final answer. We documented this as a failure for our purposes; for a different use case it would have been a feature.

    The decision we made in real time was not to retry persistent failures. If a provider returned 429 on three consecutive attempts, we let it stand as a documented failure rather than burning the run on retries. The rationale: those providers are either genuinely rate-limited or having an issue, and a fourth attempt in the same minute isn’t going to resolve either.

    The finding that mattered

    Of all the substantive responses, one stood out: aion-2.0 identified itself as Claude.

    Not “trained on Claude data.” Not “fine-tuned from a Claude-derived model.” It described itself, in the first person, as Claude.

    Aion-2.0 is not Claude. It’s a separate model from a separate provider. The most likely explanation is that its training data included a significant volume of Claude outputs, and the model’s self-knowledge inherited Claude’s identity along with Claude’s content patterns. The model learned to be Claude-like in style and, in the process, learned to identify as Claude in substance.

    This is a known phenomenon in the literature on training data contamination, but seeing it surface concretely in a production model — on an answer to a basic self-identity question — is different from reading about it in a paper. It’s a real thing happening at scale, and most users of these models have no idea.

    The implication for anyone running multi-model evaluations: model outputs are not independent. Models trained on the outputs of other models inherit not just style but identity, opinion patterns, and likely failure modes. If you’re running a roundtable methodology and treating three models as three independent perspectives, and one of them is silently downstream of another in training data, your “consensus” might be one model’s perspective dressed in three different costumes.

    This is also an argument for why first-party model selection — choosing models from clearly distinct lineages rather than just “three frontier models” — matters more than people give it credit for.

    The reliability data

    Setting aside the aion-2.0 finding, the bare reliability data from this run is useful on its own terms.

    10 of 54 providers (18.5%) returned errors. That’s a meaningful failure rate for any production workload that depends on cross-model availability. If your application assumes you can call any model in the catalog and get a response, you’re going to be wrong about 1 in 5 of the time on first attempt.

    OpenRouter’s pooled access mitigates this somewhat — for some providers, OpenRouter automatically retries against alternate endpoints when one fails. But the failures we saw were after OpenRouter’s own retry logic ran. These are the failures that surface to the caller after the routing layer has done what it can.

    For production systems, the practical implication is straightforward: never depend on any single model being available. Build fallback chains. Use OpenRouter’s Auto Router with a wildcard allowlist for tolerance, or wire your own fallback logic. A multi-model architecture isn’t a luxury; it’s a reliability requirement.

    The cost shape

    $1.99 of spend across 54 model queries works out to roughly $0.037 per query, including all the failed attempts.

    That’s the headline number, but the distribution matters more than the average. A handful of queries — the ones that hit larger reasoning models like Claude Opus or GPT-5.5 Pro — accounted for the majority of the spend. Cheap models like Gemini Flash and various open-source mid-tier models barely moved the needle.

    If you’re running research at this kind of breadth, the cost model is dominated by the heavy reasoning models, not by the long tail of cheaper models. The implication: when you’re running broad-canvas queries, it costs almost nothing to add another cheap model to the catalog. Adding another expensive reasoning model is what you should be deliberate about.

    What broke and what we learned

    Three patterns of failure repeated:

    Provider rate limits unrelated to our usage. Some providers appear to share upstream capacity with the wider OpenRouter user base, and when that upstream capacity is hot, your individual call fails regardless of your own usage. There is no client-side fix. You either retry later or fall back.

    Model IDs drift. The catalog moves fast. A model ID you fetch on Monday may have been deprecated by Friday. Our script’s freshness window — about a day between model-list scrape and execution — was sometimes enough for drift. For production systems, fetch the model list immediately before the run.

    Multi-agent models exceed simple timeout windows. Grok 4.20’s behavior of orchestrating sub-agents that take 40+ seconds is not a bug; it’s the product. But it breaks any timeout shorter than what the multi-agent run actually needs. If you’re going to call multi-agent models, plan for long latencies and don’t share a timeout policy with single-call models.

    What we’d do differently

    Three changes for the next run of this kind:

    1. Refresh the model list inline. Don’t trust a list scraped even a few hours earlier. Fetch fresh before each batch.
    2. Tiered timeouts. Single-call models on a tight timeout. Multi-agent and reasoning-heavy models on a relaxed one. Detect which is which from the model metadata where possible.
    3. Publish-as-you-go. Our Notion publish step ran after data collection. The session ended mid-publish, leaving uncertainty about which of the 54 pages had actually been created. Better to publish each result immediately as it returns, so a session interruption doesn’t lose anything.

    The bigger lesson

    Two dollars to canvas 54 models on a question of self-identity is a cost structure that didn’t exist three years ago. It also means a category of research that used to require expensive infrastructure is now within reach of anyone with an OpenRouter account and a Python script.

    The interesting finding — aion-2.0 silently identifying as Claude — would have been almost impossible to discover any other way. You can’t catch a training-data identity inheritance by reading model documentation. You catch it by asking a lot of models the same question and looking at the answers side by side.

    OpenRouter, for all its caveats and its limited scope, makes this kind of multi-model research tractable in a way nothing else currently does. If you’re not running periodic broad-canvas queries against your model catalog, you’re flying blind on what’s actually in there. Two dollars is cheap insurance against being surprised by the next aion-2.0.

    Frequently asked questions

    How much does it cost to query 54 LLMs at once via OpenRouter?

    In our autonomous run, the total cost was $1.99 — roughly $0.037 per query including the 10 failed attempts. Cost was dominated by the few queries hitting expensive reasoning models like Claude Opus and GPT-5.5 Pro; the long tail of cheaper models barely moved the needle. Adding more cheap models to a broad-canvas query costs almost nothing.

    What is training-data identity inheritance?

    When a model’s training data includes outputs from another model, the trained model can inherit not just style but identity from the source model. In our run, aion-2.0 identified itself as Claude — likely because its training data contained enough Claude outputs that the model’s self-knowledge absorbed Claude’s identity along with Claude’s content patterns. This is a known phenomenon in the literature on data contamination.

    How reliable are LLM providers via OpenRouter?

    In our 54-model autonomous run, 10 providers (18.5%) returned errors after OpenRouter’s own retry logic ran. The failures broke down into rate limits, forbidden responses, deprecated model IDs, and timeouts on multi-agent models. The practical implication: never depend on any single model being available. Build fallback chains.

    Why did some models timeout in the 54-LLM run?

    The most notable timeout case was Grok 4.20 multi-agent, which appears to orchestrate sub-agents that genuinely take more than 40 seconds to produce a final answer. This isn’t a bug; it’s the product. But it breaks any timeout policy shared with single-call models. Multi-agent and reasoning-heavy models need their own relaxed timeout tier.

    Should I run periodic broad-canvas queries against my model catalog?

    Yes. At roughly two dollars per 54-model run, broad-canvas queries are cheap insurance against being surprised by training-data inheritance, identity drift, or quality degradation in models you depend on. You can’t catch these issues by reading documentation. You catch them by querying widely and comparing answers side by side.

    See also: The 5-Layer OpenRouter Mental Model: Org, Workspace, Guardrail, Key, Preset

  • The Multi-Model AI Roundtable: A Three-Round Methodology for Better Decisions

    The Multi-Model AI Roundtable: A Three-Round Methodology for Better Decisions

    The Multi-Model AI Roundtable is a three-round structured exchange where the same question is sent to three models from different lineages (typically Claude, GPT, and Gemini), cross-pollinated by sharing each model’s response with the others, and then synthesized into a final recommendation with explicit confidence calibration. Used for strategic decisions, content architecture, and technical trade-offs where single-model output isn’t trustworthy enough.

    This is part of our OpenRouter coverage. See the operator’s field manual for the broader context on why we route through OpenRouter, and the 5-layer mental model for the hierarchy that makes multi-model routing tractable.

    Why three models beat one

    Single-model decision-making has a known failure mode: the model’s training data and reasoning patterns silently shape every recommendation. The model doesn’t know what it doesn’t know. You don’t know what it doesn’t know. You get a confident answer, you act on it, and the missing perspective shows up later as a problem you didn’t see coming.

    Three models from three different lineages catch each other’s blind spots. Claude Opus 4.7 tends to over-index on safety considerations and structural rigor. GPT-5.5 tends to favor decisive, action-oriented framing. Gemini 3 Flash tends to surface edge cases and multimodal context the others gloss over. Run a hard decision past all three and the agreement-versus-disagreement pattern itself becomes information.

    The methodology we use is a three-round structured exchange. Same question, three responses, then cross-pollination, then synthesis. Below is the exact pattern we’ve used across decisions ranging from tech stack choices to keyword prioritization to architectural calls on the autonomous behavior system.

    The architecture

    OpenRouter makes this cheap to wire. One API endpoint, three different model identifiers, three parallel calls:

    const models = [
      "anthropic/claude-opus-4.7",
      "openai/gpt-5.5",
      "google/gemini-3-flash"
    ];
    
    const responses = await Promise.all(
      models.map(model =>
        fetch("https://openrouter.ai/api/v1/chat/completions", {
          method: "POST",
          headers: {
            "Authorization": `Bearer ${OPENROUTER_API_KEY}`,
            "Content-Type": "application/json"
          },
          body: JSON.stringify({
            model,
            messages: [{ role: "user", content: prompt }]
          })
        }).then(r => r.json())
      )
    );
    

    That’s the entire architectural surface. Three calls, three responses, parallel execution. Without OpenRouter you’d be juggling three separate API contracts. With it, one endpoint and a model parameter.

    Round 1: Individual perspectives

    Send the same question to all three models with no awareness that they’re part of a roundtable. Each responds independently.

    The prompt structure that works:

    We’re evaluating [decision]. Consider:

    1. The key factors to weigh
    2. Risks and mitigations
    3. Your recommendation, with reasoning
    4. What you might be missing

    The fourth bullet is the one that earns the cost of the call. Asking a model to name its own blind spots is a remarkably effective way to surface the limits of its perspective. Models that handle this prompt well will name epistemic limits explicitly: “I don’t have visibility into your team’s specific constraints,” or “this depends on factors I can’t verify from this conversation.”

    Collect all three Round 1 responses. Don’t synthesize yet.

    Round 2: Cross-pollination

    This is where the methodology earns its keep. Send each model the other two models’ Round 1 responses and ask:

    • Identify points of agreement
    • Challenge or refine the other perspectives
    • Update your own recommendation if warranted

    Most teams skip this round. They run Round 1, see agreement, ship a decision. They miss the cases where one model would have changed its mind given the other models’ input — which is exactly the cases where the disagreement matters.

    Round 2 also surfaces a pattern worth naming: model deference. Some models, when shown a different perspective, will pivot toward it almost regardless of the merits. Others hold their position too rigidly. Watching how each model handles disagreement is itself information about how to weight their inputs in future roundtables.

    Round 3: Synthesis

    One model — usually Claude in our case, because long-form reasoning is the job — gets all the Round 1 and Round 2 outputs and produces a final synthesis:

    • Consensus points (where all three models agreed, both rounds)
    • Remaining disagreements (where the models did not converge)
    • Confidence level (high if convergence, medium if mixed, low if persistent disagreement)
    • Suggested next steps

    The confidence calibration is the part that changes how decisions actually get made. A decision the roundtable converges on with high confidence can be acted on immediately. A decision with persistent disagreement is a signal that the question is harder than it looked, and probably needs human judgment or more research before action.

    When this is worth running

    The roundtable is not free. Three rounds, three models, plus synthesis equals roughly four to six API calls per decision. Even at low-cost model pricing for the initial rounds, this adds up if you run it on every micro-decision.

    Use it for:

    • Strategic decisions — tech stack selection, business model choices, pricing strategy
    • Content strategy at scale — keyword prioritization for a 50-article batch, topic cluster architecture, format decisions
    • Technical architecture — system design, security posture, performance trade-offs
    • Anything irreversible — moves that you’ll wear for months if they’re wrong

    Don’t use it for:

    • Day-to-day operational questions a single model can answer well
    • Decisions where you already know the answer and just want validation
    • Questions where the cost of being wrong is small

    Cost shape

    For an agency stack the cost-per-roundtable comes out roughly as follows when using a balanced model mix:

    • Round 1: three parallel calls. Use Gemini 3 Flash or DeepSeek V3.2 for breadth at low cost. Heavier models only when you need deeper reasoning in Round 1.
    • Round 2: three more calls with more context. Same models, larger context window.
    • Round 3: one synthesis call. Use the best reasoning model you have access to — Claude Opus 4.7 is our default for synthesis.

    Total cost per decision typically runs from a few cents to a few dollars depending on context length and model selection. For decisions worth running through the roundtable, that’s noise.

    An example output

    A real roundtable from our archive, on the question of where to start with Google Apps Script as a learning project:

    GPT-5.5: Start simple — a Google Sheets data retrieval script. Learning value comes from working through the auth flow and basic API surface without complexity getting in the way.

    Claude Opus 4.7: Start impactful — a Time Insight Dashboard combining Gmail and Calendar data. Higher learning curve but produces something you’ll actually use, which keeps motivation up.

    Gemini 3 Flash: Hybrid — simple foundation but with one meaningful integration. Lowers the activation energy while preserving the impact angle.

    Consensus (Round 3): Begin with a data retrieval script (all three models agree on the learning value) but include one meaningful integration like calendar events. The Round 2 cross-pollination resolved most of the disagreement; Claude moderated its position after seeing GPT-5.5’s argument about activation energy.

    Confidence: High. All three models aligned on progressive complexity after cross-pollination.

    That output is more useful than any single model’s recommendation would have been. It names the trade-off, shows the path to consensus, and quantifies confidence. That’s what you’re paying for.

    The variations worth knowing

    A few patterns we’ve adapted from the base methodology:

    Adversarial roundtable. Instead of asking each model the same question, assign roles. Model A argues for. Model B argues against. Model C judges. Useful for decisions where you suspect you’ve already made up your mind.

    Sequential expert chain. Skip parallel Round 1. Run one model, then send its output to the next model to refine, then to the third. Slower but useful when you need each step to build on the last.

    Domain-specialized roundtable. Use BYOK to route Round 1 calls to specialty providers when the question is technical. A legal question routes through a legal-specialized provider. A code question routes through a code-specialized provider. The synthesis still happens at Claude Opus 4.7 or GPT-5.5.

    The base methodology — three rounds, three models, one synthesis — is the version we run by default. The variations are for cases where the base pattern is leaving value on the table.

    What this unlocks

    Once the roundtable is wired into your stack, a category of decision that used to take a meeting becomes a 90-second API call. Not every meeting. The ones where you would have walked in already knowing the answer and the meeting was performative.

    The roundtable doesn’t replace human judgment. It replaces the version of the decision where you didn’t think it through. The version where you would have shipped your first instinct and lived with the consequence. That’s the win.

    Frequently asked questions

    What is a multi-model AI roundtable?

    A three-round structured exchange where the same question is sent to three AI models from different lineages, then cross-pollinated by sharing each model’s response with the others, then synthesized into a final recommendation with explicit confidence calibration. The methodology surfaces blind spots that single-model output silently hides.

    Why use Claude, GPT, and Gemini together instead of just one?

    Each model has different training data and reasoning patterns. Claude tends to emphasize safety and structural rigor. GPT tends to favor decisive action-oriented framing. Gemini tends to surface edge cases. Running a hard decision past all three gives you agreement-versus-disagreement information that no single model can provide.

    How much does a multi-model roundtable cost per decision?

    Typically a few cents to a few dollars per decision, depending on model selection and context length. Using cheaper models (Gemini Flash, DeepSeek) for the initial rounds and reserving the expensive reasoning models for Round 3 synthesis keeps the cost shape favorable.

    When is the multi-model roundtable not worth running?

    Skip it for day-to-day operational questions a single model can answer well, decisions where you already know the answer and just want validation, and questions where the cost of being wrong is small. Reserve it for strategic decisions, content architecture, technical trade-offs, and anything irreversible.

    What is the third round of the roundtable for?

    Synthesis. One model — typically the strongest reasoning model in the set — receives all the Round 1 and Round 2 outputs and produces a final recommendation with consensus points, remaining disagreements, confidence level, and suggested next steps. This is the part that turns three opinions into one actionable decision.

    See also: What We Learned Querying 54 LLMs About Themselves (For $1.99 on OpenRouter)

  • Notion AI vs Zapier AI: Which Automation Layer Wins For Your Use Case

    Notion AI vs Zapier AI: Which Automation Layer Wins For Your Use Case

    Notion AI vs Zapier AI: Which Automation Layer Wins For Your Use Case

    The 60-second version

    Zapier and Notion AI overlap in concept (automate routine work) but optimize for different operators. Zapier: massive integration catalog, no-code, simple triggers and actions, optimized for “if this, then that” patterns. Notion AI: AI reasoning native, deep workspace context, optimized for “decide what to do given context, then act.” Use Zapier for breadth of simple automations. Use Notion Agents for depth of reasoning. The two are complementary.

    When Zapier wins

    • You need many simple automations across many apps
    • Non-technical operators need to build automations themselves
    • The trigger logic is straightforward (if X, do Y)
    • You don’t have or want AI reasoning in the loop
    • You’re not heavily invested in Notion as a platform

    When Notion Agents win

    • The workflow requires understanding Notion workspace content
    • AI reasoning about whether and how to act matters
    • Schedule-driven autonomous work is the goal
    • The workflow output is in Notion or affects Notion data
    • You want agents that can compose multi-step reasoning

    What Zapier does that Notion Agents don’t

    • Thousands of app integrations out of the box
    • Visual no-code building accessible to non-developers
    • Flat-rate pricing easier to budget
    • Established for years; lots of recipes and patterns

    What Notion Agents do that Zapier doesn’t

    • AI reasoning native to the workflow
    • Workspace context understanding
    • Skills (natural-language workflow definitions)
    • Workers for custom code
    • Database fluency at the platform level

    The combined pattern

    Many operators use both:
    – Zapier for cross-app plumbing (lead from form → CRM → Slack → email)
    – Notion Agents for workspace reasoning (synthesize lead context, decide priority, draft response)
    – Sometimes Zapier triggers a Notion agent run
    Treat them as layers: Zapier moves data; Notion Agents make decisions about that data.

    Where this goes wrong

    1. Trying to use Zapier for AI reasoning. Zapier has AI features but they’re shallow compared to Notion Agents.
    2. Trying to use Notion Agents for cross-app plumbing. Possible via Workers/MCP, but Zapier’s integration catalog is broader.
    3. Picking based on price alone. The right tool for the job costs less than the wrong tool, even at higher per-task pricing.

    What to read next

    Notion Agents vs n8n Alone, n8n MCP Bridge, Workers + External APIs, AI-Native Company Patterns.

  • Notion AI vs Microsoft Copilot: Two Philosophies of Embedded AI

    Notion AI vs Microsoft Copilot: Two Philosophies of Embedded AI

    Notion AI vs Microsoft Copilot: Two Philosophies of Embedded AI

    The 60-second version

    The choice is philosophical, not feature-by-feature. Notion AI says: “build your work in one structured workspace and let AI flow through everything.” Microsoft Copilot says: “use the tools you already use and let AI sit inside each one.” Both are valid. Both work. Which fits depends on whether your team’s pattern is consolidated workspace or distributed productivity suite.

    When Notion AI wins

    • You want one unified workspace
    • Custom Agents and scheduled autonomous work matter
    • Database-driven workflows and Autofill are core
    • Smaller teams (under ~200) where Notion’s collaboration model fits
    • Teams that haven’t deeply invested in Microsoft 365

    When Microsoft Copilot wins

    • You’re already deep in Microsoft 365
    • Excel-heavy analysis is core to your workflow
    • Outlook + Teams is your primary collaboration surface
    • Enterprise IT requirements favor Microsoft (compliance, identity, security)
    • Larger orgs where Microsoft’s enterprise plumbing matters

    What Copilot does that Notion AI doesn’t

    • Native deep integration into Excel, Word, PowerPoint, Outlook, Teams
    • Enterprise identity and compliance posture (Azure AD, Purview)
    • Strong Excel-native data analysis with formula generation
    • Teams meeting transcription and recap as a primary surface

    What Notion AI does that Copilot doesn’t

    • Custom Agents running on schedules
    • Workers for code execution
    • The Notion-style structured knowledge graph
    • MCP and n8n integrations
    • More flexible workspace shape

    The IT-procurement layer

    Larger organizations often have IT and procurement preferences that drive this decision more than feature comparison. Microsoft enterprise contracts, identity integration, and compliance posture are real factors. Notion’s enterprise story is improving but Microsoft has decades of head start in that lane.

    Where comparisons go wrong

    1. Comparing feature lists in isolation. Real value is integration depth into the platform you actually use.
    2. Underestimating Microsoft’s enterprise plumbing. For large orgs, identity and compliance are not afterthoughts.
    3. Underestimating Notion’s flexibility. For smaller teams, Notion’s malleability beats Microsoft’s rigidity.

    What to read next

    Notion AI vs Gemini, Notion AI vs ChatGPT, Editorial Surface Area, AI-Native Company Patterns.

  • Notion AI vs Gemini for Workspaces: The Document AI Showdown

    Notion AI vs Gemini for Workspaces: The Document AI Showdown

    Notion AI vs Gemini for Workspaces: The Document AI Showdown

    The 60-second version

    Most “Notion AI vs Gemini” comparisons miss the actual decision: which platform does your work live in? If you’re a Notion-first team, Notion AI is the integrated answer. If you’re a Google Workspace team, Gemini integrates more deeply into Docs, Sheets, Slides, and Gmail than any third-party AI will. Trying to use both heavily creates context-splitting problems. Pick the platform first. The AI follows.

    When Notion AI wins

    • Your work lives in Notion (databases, pages, agents)
    • You use Custom Agents on schedules
    • Cross-source synthesis across Notion + connected sources matters
    • Database manipulation and Autofill is core to your workflow
    • Multi-app integration via MCP and Workers

    When Gemini for Workspace wins

    • Your work lives in Google Docs, Sheets, Slides
    • Real-time multi-user document collaboration is dominant
    • Email and calendar are the primary surfaces (Gemini’s Gmail integration is strong)
    • Sheets-heavy analysis benefits from Gemini’s native data understanding
    • You’re already paying for Google Workspace

    The stacking question

    Some teams run both. Three patterns that work:
    1. Notion as second brain, Google as collaboration layer. Notion holds structured knowledge; Google holds in-flight collaborative docs.
    2. Notion as agent layer, Google as document factory. Notion runs the agents and synthesis; Google produces the actual docs that get sent.
    3. Drive integration as the bridge. Notion AI reads Google Drive content via integration so the agent can synthesize across both surfaces.

    What Gemini does that Notion AI doesn’t

    • Real-time multi-user editing with AI assistance
    • Sheets-native analysis and chart generation
    • Deep Gmail integration
    • Slides-native design and image generation

    What Notion AI does that Gemini doesn’t

    • Scheduled autonomous agents (Custom Agents)
    • Database property Autofill at the workspace level
    • Workers for code execution
    • The Notion-style structured knowledge graph
    • MCP-based tool integration

    Where comparisons go wrong

    1. Treating raw model quality as the deciding factor. Both use strong models. Integration depth matters more.
    2. Underestimating switching costs. Moving an org for AI reasons is rarely worth it.
    3. Trying to use both heavily. Context splits. Synthesis suffers.

    What to read next

    Notion AI vs ChatGPT, Notion AI vs Microsoft Copilot, Editorial Surface Area, Google Drive Integration.

  • Notion AI vs ChatGPT for Daily Knowledge Work

    Notion AI vs ChatGPT for Daily Knowledge Work

    Notion AI vs ChatGPT for Daily Knowledge Work

    The 60-second version

    This isn’t a winner-take-all comparison. Notion AI and ChatGPT are different categories of tool that get incorrectly compared because they both use the word “AI.” Notion AI knows your workspace. ChatGPT knows the open web. The right operator stack uses both. The question isn’t which to pick; it’s how to route work between them.

    When Notion AI wins

    • Anything that requires knowing your specific content
    • Synthesis across your databases, pages, and connected sources
    • Document work where the doc lives in your workspace
    • Recurring tasks that benefit from agent automation
    • Mobile use where seamless integration matters

    When ChatGPT wins

    • Open-web research
    • Brainstorming on topics outside your workspace
    • Code generation (currently ChatGPT and Claude lead here)
    • General-purpose Q&A
    • Conversational exploration of ideas

    How they stack

    The pattern that works for most operators: ChatGPT for “thinking out loud” and external research; Notion AI for everything that touches your actual work. Use ChatGPT to draft an idea, then move the polished version into Notion where it joins your actual workspace and Notion AI takes over.

    What ChatGPT does that Notion doesn’t (yet)

    • Image generation
    • Voice conversations as a primary mode
    • Custom GPT marketplace
    • Data analysis on uploaded files at scale

    What Notion AI does that ChatGPT doesn’t

    • Persistent context across your workspace
    • Database manipulation and Autofill
    • Custom Agents running on schedules
    • Workers for code execution
    • Native integration with Slack, Mail, Calendar at the workspace level

    The pricing reality

    ChatGPT Plus is $20/month per user. Notion Business is $20/user/month annually with separate Custom Agent credits ($10/1000) starting May 4. For a team using both heavily, the combined cost is meaningful.

    Where comparisons go wrong

    1. Asking “which is smarter.” They use overlapping models. Raw model intelligence is similar; what differs is integration depth.
    2. Trying to pick one. The right answer is usually both, with clear use-case routing.
    3. Treating ChatGPT memory as equivalent to Notion’s workspace context. ChatGPT memory is conversational. Notion’s context is structured workspace data. Different categories.

    What to read next

    Notion AI vs Claude Projects, Notion AI vs Gemini, Editorial Surface Area, Auto Model Selection.

  • Notion AI vs Claude Projects: Which Belongs in Your Stack

    Notion AI vs Claude Projects: Which Belongs in Your Stack

    Last refreshed: May 15, 2026

    Update — May 15, 2026: Two things have shifted since this article was originally written. First, Claude Opus 4.7 (released April 2026) is now Anthropic’s most capable model with a 1M token context window at standard pricing — which changes the calculus for any task involving large documents or long-form reasoning, where Claude was already the stronger choice. Second, on May 13, 2026, Notion shipped the Notion Developer Platform with Claude as a launch partner, which means the comparison is no longer just “Notion AI vs Claude Projects” — Claude can now operate natively inside Notion via the External Agents API. For the platform launch breakdown, see Notion Developer Platform Launch (May 13, 2026). For the current Claude model lineup, see Claude Models Roadmap May 2026. For how this fits into a working stack, see The Three-Legged Stack.

    Notion AI vs Claude Projects: Which Belongs in Your Stack

    The 60-second version

    Notion AI and Claude Projects both let you bring custom context to AI. The difference is what surrounds the AI. Notion AI lives inside a workspace with databases, integrations, schedules, and a team. Claude Projects lives inside a conversation with files, instructions, and the conversation history. For ongoing operational work where the AI needs to be part of how you work, Notion AI fits. For deep focused work where conversation quality is the primary value, Claude Projects fits. Many operators use both.

    When Notion AI wins

    • Persistent operational context across the workspace
    • Custom Agents on schedules
    • Database fluency and Autofill
    • Native integrations (Slack, Mail, Calendar)
    • Team collaboration patterns
    • Mobile and cross-device access

    When Claude Projects wins

    • Deep, focused task work
    • Strong conversation continuity within a topic
    • Specific instruction sets per project
    • File-heavy reference contexts (code, research, large documents)
    • When conversation quality (Claude’s strength) matters more than integration

    The stacking pattern

    The pattern many operators use:
    Notion AI for the ongoing rhythm of work — agents, databases, daily operational synthesis
    Claude Projects for “I need to deeply work on X” sessions — heavy reasoning, complex code, large reference contexts
    The two don’t conflict; they cover different time horizons. Notion AI is always-on background. Claude Projects is intentional focused sessions.

    What Claude Projects does that Notion AI doesn’t

    • File upload context with longer effective memory in-conversation
    • More flexible custom instructions per project
    • Conversation continuity that’s purely Claude-native (no model-switching)

    What Notion AI does that Claude Projects doesn’t

    • Workspace databases and Autofill
    • Scheduled agent execution
    • Native integrations beyond conversation
    • Multi-user collaboration on the same context

    Where comparisons go wrong

    1. Treating them as direct substitutes. They overlap but serve different shapes of work.
    2. Picking based on raw conversation quality alone. That favors Claude. But conversation quality isn’t the whole product.
    3. Picking based on integration breadth alone. That favors Notion. But integration matters more for some workflows than others.

    What to read next

    Notion AI vs ChatGPT, Notion AI vs Gemini, Editorial Surface Area, Custom Agents vs Basic.

  • Auto Model Selection in Notion 3.2: Letting Notion Pick Claude, GPT, or Gemini For You

    Auto Model Selection in Notion 3.2: Letting Notion Pick Claude, GPT, or Gemini For You

    Auto Model Selection in Notion 3.2: Letting Notion Pick Claude, GPT, or Gemini For You

    The 60-second version

    You don’t have to pick the model anymore. Notion 3.2 added auto-selection, which routes each request to the best-fit model from the available pool — currently including Claude Opus 4.7, GPT-5.2, and Gemini 3. Simple tasks (rewrites, summaries, quick drafts) go to faster models. Complex tasks (multi-step reasoning, long-context analysis, tool-heavy agent runs) go to more capable ones. You can override the selection per request, but the default behavior is “let Notion pick” — and for most workflows, that’s the right call.

    Why auto-selection matters

    Three reasons it’s a meaningful shift:
    1. You stop being a model-picker. Before auto-selection, getting good output required knowing which model handled which task best. That’s expert knowledge most users don’t have. Auto-selection internalizes that knowledge.
    2. Cost-performance balance happens automatically. Faster models are cheaper to run; capable models are more expensive. Notion’s auto-selection routes simple work to cheap models and reserves expensive models for tasks that need them. After May 4, when credits start metering Custom Agent work, this matters financially.
    3. Model diversity becomes a feature, not friction. Different models have different strengths. Claude is consistently strong on long-form writing and tool use. GPT is strong on broad reasoning. Gemini is strong on multimodal and certain analytical tasks. Auto-selection uses the right tool without forcing you to know which is which.

    When to override the auto-selection

    Three cases where manual model choice still wins:
    1. You’ve measured a specific preference. If you’ve tested the same task across all three models and found one consistently better for your use case, lock to that one. Auto-selection optimizes for the average user; you may not be the average user.
    2. You’re working in a domain with a clear model strength. Long-form editorial work where Claude’s prose quality is meaningfully better. Code work where GPT’s tool use feels more natural. Visual analysis where Gemini’s multimodal handles your case better.
    3. Reproducibility matters. Auto-selection means today’s request might use Claude and tomorrow’s might use GPT. If you need consistent voice or behavior across runs, lock the model.
    For everything else, auto-selection is fine. Stop optimizing the optimizer.

    What auto-selection isn’t

    It isn’t infinite model access. The pool is curated by Notion. You don’t get every model on the market. You get the ones Notion has integrated and validated for the platform.
    It also isn’t a replacement for model expertise if you’re a developer building on the API. When you build with Workers or skills via the API, you may want explicit model selection because reproducibility matters more there than in interactive use.

    How to verify auto-selection is working

    A 5-minute test:
    1. Open a page with substantive content (a project doc, an article, a meeting transcript)
    2. Run three different prompts: a quick rewrite, a complex synthesis, and a multi-step extraction
    3. Look at the output quality for each
    4. If all three feel right for the task, auto-selection is doing its job
    5. If any feel off — outputs that are too brief or too verbose, missing the task’s complexity — that’s where to consider manual override

    Why Claude Opus 4.7 in particular matters

    The Claude Opus 4.7 addition is worth noting separately. Anthropic’s latest uses fewer tokens (cheaper to run), makes 3x fewer tool errors (more reliable for agents that call Workers), and handles complex workflows better. For Notion specifically, that means agents that previously hit edge cases when chaining multiple skills or Workers now have a more reliable backbone.
    If you’re heavy into Custom Agents and Workers, Opus 4.7 in the rotation is the quiet upgrade that makes everything more dependable.

    What to read next

    Corpus follow-ups: Mobile AI in Notion (where auto-selection also runs), Custom Agents foundation piece (where model selection has cost implications), and the comparison articles (Notion AI vs ChatGPT, Claude Projects, Gemini for Workspaces).

  • Books for Bots: What Happens When You Let Claude Interrogate Your GA4 Data

    Books for Bots: What Happens When You Let Claude Interrogate Your GA4 Data

    For the past several weeks I have been running a live experiment on helpnewyork.com: using Claude-in-Chrome to interrogate Google’s Analytics Advisor inside GA4, session by session, until I had a complete behavioral profile of every AI platform sending traffic to the site.

    What came out of it is not what I expected. I expected traffic data. I got a content strategy.

    The Setup

    Claude-in-Chrome is Anthropic’s browser extension that lets Claude operate directly inside your browser — reading pages, clicking elements, filling inputs, capturing output. Analytics Advisor is Google’s Gemini-powered chat interface built into GA4, available to English-language accounts since December 2025. It answers natural language questions about your property data with charts, tables, and narrative interpretation.

    The combination is unusual. You are using one AI (Claude) to systematically interrogate another AI (Gemini) about your site’s data, then synthesizing what comes back into strategy. The token budget for the heavy data reasoning stays inside Google’s infrastructure. Claude handles the query architecture, the capture protocol, and the synthesis.

    I ran four structured sessions across two sittings, using a specific sequence of queries built to extract progressively deeper signal. Session 1 established baseline traffic. Session 2 closed gaps and confirmed AI referral data existed. Session 3 was the AI deep dive. Session 4 was velocity and geography.

    What the Data Showed

    Three AI platforms were sending meaningful traffic to helpnewyork.com during the 28-day window: ChatGPT, Claude, and Copilot. The behavioral profiles were so different from each other that treating them as a single “AI traffic” segment would have produced wrong conclusions.

    Claude.ai traffic showed a 64% engagement rate and an average session duration of over 3 minutes. The dominant landing page was an NYC Summer Internships guide, accounting for over 60% of all Claude sessions. Geographic concentration was academic: Ithaca (Cornell), State College (Penn State), Washington DC. The users arriving from Claude were reading to act — they needed specific information, they found it, they stayed.

    ChatGPT traffic showed a 21% engagement rate and an average session of 24 seconds. The top landing page was a cherry blossom guide. The users were fact-grabbing: they asked ChatGPT where to see cherry blossoms in New York, got a citation, clicked through, confirmed the location, and left. The content served its purpose in under half a minute.

    Copilot traffic was between the two: 46% engagement, roughly 2-minute sessions, desktop-heavy, concentrated in New York’s suburbs. The top pages were civic services — SNAP benefits, tenant rights, transit discounts. These users were in planning mode, researching before they decided or applied.

    The Finding That Reframes GEO

    The cross-AI page overlap query was the most important one in the entire four-session arc. I asked Analytics Advisor which pages appeared in the top landing pages for more than one AI source. Only one real content page appeared in all three: the cherry blossom guide.

    The obvious interpretation is that the cherry blossom guide was “AI-optimized.” The actual interpretation, once you look at the full traffic breakdown, is the opposite. Bing drove 59 sessions to that page. Yahoo drove 16 at 75% engagement and a 3-minute 46-second average session. DuckDuckGo drove 35. The combined AI traffic to that page was 32 sessions — 17% of total. The AI platforms were citing it because traditional search engines had already validated it as the highest-quality answer in the index.

    AI citations are downstream of search quality, not upstream. The path to getting cited by ChatGPT, Claude, and Copilot is not to optimize for AI retrieval patterns. It is to build pages that win on Bing and Yahoo with enough depth that AI models treat them as authoritative sources. The GEO play is a traditional SEO play with better content.

    The Content Strategy That Follows

    Once you have the per-AI behavioral profiles, you have a content variant framework. The same article can be written in three structural architectures, each tuned to how one AI model retrieves and presents information.

    The Claude variant is dense and process-oriented. Headers, eligibility criteria, numbered steps, official program names. Built for the student or researcher who arrived with a specific question and needs a complete answer they can act on.

    The ChatGPT variant is a scannable list. Named items, one specific detail per item, direct answer in the first two sentences. Built for the user who will spend 24 seconds on the page and needs the answer immediately or they’re gone.

    The Copilot variant is comparison and planning framing. What to know before you go, Option A versus Option B, cost context, logistics. Built for the desktop user doing research before they make a decision.

    The core article is the same. The architecture is different. The AI that cites you depends on which structure you used.

    The Methodology Is the Product

    The query sequence I developed across these four sessions is a repeatable extraction methodology. It works on any GA4 property with Analytics Advisor enabled. The intelligence it produces — per-AI audience profiles, geographic signals, velocity trends, cross-AI content overlap — is not available through DataForSEO, SpyFu, or GSC. It requires Gemini’s reasoning layer operating on top of your property data, orchestrated by a structured query architecture.

    I have packaged the complete methodology as a downloadable kit: the full query architecture across all four sessions, the capture protocol, the content variant framework, and the flags to escalate before your next content sprint. It is called Books for Bots: GA4 AI Referral Audit Kit.

    The free version covers Session 3 alone — the AI deep dive queries that surface your ChatGPT, Claude, and Copilot traffic split. That alone will show you something most site owners have never seen: which AI is sending them traffic, to which pages, and how engaged those users actually are.

    The full kit covers all four sessions and includes the content variant framework that translates the behavioral data into a writing system.

    Both are available at tygartmedia.com. What you do with the data after that is yours.

  • Custom Agents vs Basic Notion AI: When You Actually Need the Upgrade

    Custom Agents vs Basic Notion AI: When You Actually Need the Upgrade

    Anchor fact: Custom Agents are available on Business and Enterprise plans only. They run autonomously on triggers or schedules, can work for up to 20 minutes per task across hundreds of pages, and starting May 4, 2026, consume Notion Credits at $10 per 1,000.

    Do you need Notion Custom Agents or is basic Notion AI enough?

    Basic Notion AI handles inline drafting, summaries, and reactive prompts within a page. Custom Agents add proactive execution — running on schedules or triggers, working autonomously for up to 20 minutes, and using skills and Workers. Choose Custom Agents only if you have recurring autonomous workflows that justify Business-plan pricing and Notion Credit consumption.

    The 60-second version

    Most operators don’t need Custom Agents. They think they do because the marketing makes Custom Agents sound essential, but the honest answer is that basic Notion AI plus standard agent prompts cover most knowledge-work needs. Custom Agents earn their cost only when you have specific, repeating, autonomous work — things that run on a schedule or trigger without you starting them. If you don’t have that pattern in your workflow, you’re paying for capability you won’t use.

    The honest comparison

    Basic Notion AI (included on Plus, Business, Enterprise plans):

    • Inline writing assistance — draft, rewrite, summarize, translate
    • Q&A over your workspace content
    • Standard AI Autofill on databases
    • Meeting notes summarization
    • Reactive: you prompt, it responds

    Custom Agents (Business and Enterprise plans only):

    • Everything above, plus:
    • Runs on schedules or triggers without prompting
    • Can work autonomously for up to 20 minutes per task
    • Spans hundreds of pages in a single run
    • Skills can be attached for repeatable workflows
    • Workers integration (developer preview) for code execution
    • Can integrate with Calendar, Mail, Slack at agent level
    • After May 4, 2026: consumes Notion Credits at $10/1000

    When Custom Agents are worth it

    Five workflow patterns where Custom Agents pay off:

    1. Recurring deliverables. Weekly status reports, monthly board prep, daily standups. If you produce the same shape of document on a schedule, an agent that runs Friday at 4 PM and drops the draft in your inbox is worth real money in time saved.

    2. Continuous database enrichment. A CRM that needs new leads scored, categorized, and routed within minutes of arrival. A content database that needs incoming articles tagged and summarized. An ops database that needs items checked for SLA breaches.

    3. Cross-source synthesis on demand. “Pull everything from the last two weeks across Slack, Calendar, and our project pages and tell me what’s at risk.” This is a 20-minute autonomous task that would take a human two hours.

    4. Multi-step workflows with handoffs. Triage incoming → route to owner → draft response → flag exceptions. The chain is what makes it agent work, not assistant work.

    5. Off-hours and overnight work. If you’d benefit from work happening while you sleep, agents are the only Notion layer that can do it. Reactive AI sits idle until you arrive.

    When basic Notion AI is enough

    Most knowledge workers fit here:

    • Solo writers and researchers who need help drafting and summarizing
    • Teams of fewer than 10 where work is mostly real-time collaborative
    • Workflows where the AI is occasional, not scheduled
    • Anyone on Plus plan (Custom Agents aren’t available anyway)
    • Anyone whose AI usage is “I ask, it answers” — that’s reactive, not agentic

    If you’re in this group, upgrading to Business for Custom Agents is paying for capacity you won’t use. Stay with basic AI and revisit when the workflow pattern changes.

    The cost calculus after May 4

    Before May 4, 2026, Custom Agents are free to try on Business and Enterprise. After, every run consumes credits at $10 per 1,000. Real numbers:

    • A simple agent run (single-page summary): typically a handful of credits — pennies
    • A complex multi-step run (synthesis across many pages, multiple skills chained): can run into the dozens or hundreds of credits — measurable dollars
    • A daily scheduled agent that runs 30 days/month at moderate complexity: budget low tens of dollars per agent per month

    Math gets serious when you have many agents running daily. A workspace with 10 active Custom Agents can easily consume hundreds of dollars per month in credits on top of Business-plan seat fees. That’s the ROI conversation that turns “I’m experimenting with agents” into “I run a small fleet on a budget.”

    The decision framework

    Walk yourself through these four questions:

    1. Do you have recurring work on a schedule? No → basic AI is fine.
    2. Are you on Business or Enterprise? No → Custom Agents aren’t available. Upgrade or stay with basic.
    3. Does the time saved per agent run, multiplied by frequency, exceed the credit cost? No → basic AI plus manual prompts is cheaper.
    4. Are you willing to manage the credit pool monthly? No → don’t take on the operational overhead.

    If all four are yes, Custom Agents earn their place. If any is no, basic Notion AI is the right call.

    Reactive AI sits idle until you arrive.

    Sources

    • Notion 3.3 Custom Agents release notes (February 24, 2026)
    • Notion Help Center — Custom Agent pricing
    • Notion Pricing page (April 2026)

    Continue the journey

    This article is part of the May 3 Cliff Decision journey-pack on Tygart Media. Here’s where to go next: