Category: Tygart Media Editorial

Tygart Media’s core editorial publication — AI implementation, content strategy, SEO, agency operations, and case studies.

  • How to Wire Claude Into Your Notion Workspace (Without Giving It the Keys to Everything)

    The step most tutorials skip is the one that actually matters.

    Every guide to connecting Claude to Notion walks you through the same mechanical sequence — OAuth flow, authentication, running claude mcp add, and done. It works. The connection lights up, Claude can read your pages, write to your databases, and suddenly your AI has the run of your workspace. The tutorials stop there and congratulate you.

    Here’s the part they don’t mention: according to Notion’s own documentation, MCP tools act with your full Notion permissions — they can access everything you can access. Not the pages you meant to share. Everything. Every client folder. Every private note. Every credential you ever pasted into a page. Every weird thing you wrote about a coworker in 2022 and forgot was there.

    In most setups the blast radius is enormous, the visibility is low, and the decision to lock it down happens after something goes wrong instead of before.

    This is the guide that takes the extra hour. Wiring Claude into your Notion workspace is straightforward. Wiring Claude into your Notion workspace without giving it the keys to everything takes a few additional decisions, a handful of specific configuration choices, and a mental model for what should and shouldn’t flow across the connection. That’s the hour worth spending.

    I run this setup across a real production workspace with dozens of active properties, real client work, and data I genuinely don’t want an AI to have unbounded access to. The pattern below is what works. It is also honest about what doesn’t.


    Why Notion + Claude is worth doing carefully

    Before the mechanics, it’s worth being clear about what you get when you wire this up correctly.

    Claude with access to Notion is not Claude with a better search function. It is a Claude that can read the state of your business — briefs, decisions, project status, open loops — and reason across them to help you run the operation. It can draft follow-ups to conversations it finds in your notes. It can pull together summaries across projects. It can take a decision you’re weighing, find every related piece of context in the workspace, and give you a grounded opinion instead of a generic one.

    That’s the version most operator-grade users want. And it’s only valuable if the trust boundary is drawn correctly. A Claude that has access to your relevant context is a superpower. A Claude that has access to everything you’ve ever written is a liability waiting to catch up with you.

    This whole article is about drawing that boundary on purpose.


    The two connection options (and which one you actually want)

    There are two ways to connect Claude to Notion in April 2026, and the right one depends on what you’re doing.

    Option 1: Remote MCP (Notion’s hosted server). You connect Claude — whether that’s Claude Desktop, Claude Code, or Claude.ai — to Notion’s hosted MCP endpoint at https://mcp.notion.com/mcp. You authenticate through OAuth: a browser window opens, you approve the connection, and it’s live. Claude can now read from and write to your workspace based on your access and permissions.

    This is the officially supported path. Notion’s own documentation explicitly calls remote MCP the preferred option, and the older open-source local server package is being deprecated in favor of it. For most operators, this is the right answer.

    Option 2: Local MCP (the legacy / open-source package). You install @notionhq/notion-mcp-server locally via npm, create an internal Notion integration to get an API token, and configure Claude to talk to the local server with your token. You then have to manually share each Notion page with the integration one by one — the integration only sees pages you explicitly grant access to.

    This path is more work and is being phased out. But there’s one genuine reason to still use it: authentication. Remote Notion MCP requires user-based OAuth and does not support bearer tokens, which means a human has to complete the OAuth flow to authorize access. The local path authenticates with a static integration token instead, so it works for headless automation where nobody is around to click OAuth buttons.
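
    If you do go local, the setup is a config-file stanza rather than an OAuth click-through. A minimal sketch of the claude_desktop_config.json entry — the token env variable name depends on the package version (recent releases read NOTION_TOKEN; older ones used a JSON headers blob), so check the package README:

        {
          "mcpServers": {
            "notion": {
              "command": "npx",
              "args": ["-y", "@notionhq/notion-mcp-server"],
              "env": {
                "NOTION_TOKEN": "ntn_your_integration_token"
              }
            }
          }
        }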

    For 95% of setups, remote MCP is the right answer. For the 5% running true headless agents, the local package is still the pragmatic choice even though it’s on its way out.

    The rest of this guide assumes remote MCP. I’ll flag the places the advice differs for local.


    The quiet part Notion tells you out loud

    Before we get to the setup, one more thing you need to internalize because it shapes every decision below.

    From Notion’s own help center: MCP tools act with your full Notion permissions — they can access everything you can access.

    Read that sentence twice.

    If you are a workspace member with access to 140 pages across 12 databases, your Claude connection can access 140 pages across 12 databases. Not the 15 you’re working on today. All of them. OAuth doesn’t scope you down to “this project.” It says yes or no to “can Claude see your workspace.”

    This is fine when your workspace is already organized the way you’d want an AI to see it. It is catastrophic when it isn’t, because most workspaces have accumulated years of drift, private notes, credential-adjacent content, sensitive client data, and old experiments that nobody bothered to clean up.

    So before you connect anything, you do the workspace audit. Not because Notion says so. Because your future self will thank you.


    The pre-connection audit (the step tutorials skip)

    Fifteen minutes with the workspace, before you click the OAuth button. Here’s the checklist I run through:

    Find anything that looks like a credential. Search your workspace for the words: password, API key, token, secret, bearer, private key, credentials. Read the results. Move anything sensitive to a credential manager (1Password, Bitwarden, a password-protected vault — not Notion). Delete the Notion copies.

    Find anything you wouldn’t want an AI to read. Search for: divorce, legal, lawsuit, personal, venting, complaint, therapist. Yes, really. People put things in Notion they’ve forgotten are in Notion. An AI that has access to everything you can access will find those things and occasionally surface them in responses. This is embarrassing at best and career-ending at worst.

    Look at your database of clients or contacts. Is there anything in there that shouldn’t travel through an AI provider’s servers? Notion processes MCP requests through Notion’s infrastructure, not yours. Sensitive legal matters, medical information, financial details about third parties — these may deserve a separate workspace that stays entirely outside what Claude is allowed to see.

    Identify what Claude actually needs. Make a short list: your active projects, your working databases, your briefs page, your daily/weekly notes. This is what you actually want Claude to have context on. The rest is noise.

    Decide your posture. Two options here. You can run Claude against your main workspace and accept the blast radius, or you can create a separate workspace (or a teamspace) that contains only the pages and databases you want Claude to see, and connect Claude to that one. The second option is more work upfront. It is also the only version that actually draws the boundary.

    I run the second option. My Claude-facing workspace is genuinely a subset of what I work with, and the rest of my Notion is on a different membership. It took an hour to set up. It was worth it.


    Connecting remote MCP to Claude Desktop

    Now the mechanics. Starting with Claude Desktop because it’s the simplest.

    Claude Desktop gets Notion MCP through Settings → Connectors (not the older claude_desktop_config.json file, which is being phased out for remote MCP). This is available on Pro, Max, Team, and Enterprise plans.

    Open Claude Desktop. Settings → Connectors. Find Notion (or add a custom MCP server with the URL https://mcp.notion.com/mcp). Click Connect. A browser window opens, Notion asks you to authenticate, you approve. Done.

    The connection now lives in your Claude Desktop. You can start a new conversation and ask Claude to read a specific page, summarize a database, or draft something based on workspace content, and it will.

    One hygiene note: Claude Desktop connections are per-account. If you have multiple Claude accounts (say, a personal Pro and a work Max), each one needs its own connection to Notion. The good news is you can point each one at a different Notion workspace — personal Claude at personal Notion, work Claude at work Notion. This is the operator pattern I recommend for anyone running more than one business context through Claude.


    Connecting remote MCP to Claude Code

    Claude Code is the path most operators actually run at depth, because it’s the version of Claude that lives in your terminal and can compose MCP calls into real workflows.

    The command is one line:

    claude mcp add --transport http notion https://mcp.notion.com/mcp


    Then authenticate by running /mcp inside Claude Code and following the OAuth flow. Browser opens, Notion asks you to authorize, you approve, and the connection is live.

    A few options worth knowing about at setup time:

    Scope. The --scope flag controls who gets access to the MCP server on your machine. Three options: local (default, just you in the current project), project (shared with your team via a .mcp.json file), and user (available to you across all projects). For Notion, user scope is usually right — you’ll want Claude to reach Notion from any project you’re working in, not just the current one.
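
    Concretely, that’s the same one-liner from above with the flag added:

    claude mcp add --transport http --scope user notion https://mcp.notion.com/mcp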

    The richer integration. Notion also ships a plugin for Claude Code that bundles the MCP server along with pre-built Skills and slash commands for common Notion workflows. If you’re doing this seriously, install the plugin. It adds commands like generating briefs from templates and opening pages by name, and saves you from writing your own.

    Checking what’s connected. Inside Claude Code, /mcp lists every MCP server you’ve configured. /context tells you how many tokens each one is consuming in your current session. For Notion specifically, this is useful because MCP servers have non-zero context cost even when you’re not actively using them — every tool exposed by the server sits in Claude’s context, eating tokens. Running /context occasionally is how you notice when an MCP connection is heavier than you expected.


    The permissions pattern that actually protects you

    Now we’re past the mechanics and into the hygiene layer — the part that most guides don’t cover.

    Once Claude is connected to your Notion workspace, there are three specific configuration moves worth making. None of them are hard. All of them pay rent.

    1. Scope the workspace, don’t scope the connection

    The OAuth connection doesn’t let you say “Claude can see these pages but not those.” It lets you say “Claude can see this workspace.” So the place to draw the boundary is at the workspace level, not at the connection level.

    If you have sensitive content in your main workspace, move it. Create a separate workspace for Claude-facing content and keep the sensitive stuff out. Or use Notion’s teamspace feature (Business and Enterprise) to isolate access at the teamspace level.

    This feels like over-engineering until the first time Claude surfaces something in a response that you had forgotten was in your workspace. After that, it doesn’t feel like over-engineering.

    2. For Enterprise: turn on MCP Governance

    If you’re on the Enterprise plan, there’s an admin-level control worth enabling even if you trust your team. From Notion’s docs: with MCP Governance, Enterprise admins can approve specific AI tools and MCP clients that can connect to Notion MCP — for example Cursor, Claude, or ChatGPT. The approved-list pattern is opt-in: Settings → Connections → Permissions tab, set “Restrict AI tools members can connect to” to “Only from approved list.”

    Even if you only approve Claude today, the control gives you the ability to see every AI tool anyone on your team has connected, and to disconnect everything at once with the “Disconnect All Users” button if you ever need to. That’s the kind of control you want to have configured before you need it, not after.

    3. For local MCP: use a read-only integration token

    If you’re using the local path (the open-source @notionhq/notion-mcp-server), you have more granular control than the remote path gives you. Specifically: when you create the integration in Notion’s developer settings, you can set it to “Read content” only — no write access, no comment access, nothing but reads.

    A read-only integration is the right default for anything exploratory. If you want Claude to be able to write too, enable write access later when you’ve decided you trust the specific workflow. Don’t give write access by default just because the integration setup screen presents it as an option.

    This is the one place the local path is actually stronger than remote — you can shape the integration’s capabilities before you grant it access, and the integration only sees the specific pages you share with it. For high-sensitivity setups, this granularity is worth the tradeoff of running the legacy package.


    Prompt injection: the risk nobody wants to talk about

    One more thing before we leave the hygiene section. It’s the thing the industry is least comfortable being direct about.

    When Claude has access to your Notion workspace, Claude also reads whatever is in your Notion workspace. Including pages that came from outside. Including meeting notes that were imported from a transcript service. Including documents shared with you by clients. Including anything you pasted from the web.

    Every one of those is a potential vector for prompt injection — hidden instructions buried in content that, when Claude reads the content, hijack what Claude does next.

    This is not theoretical. Anthropic itself flags prompt injection risk in the MCP documentation: be especially careful when using MCP servers that could fetch untrusted content, as these can expose you to prompt injection risk. Notion has shipped detection for hidden instructions in uploaded files and flags suspicious links for user approval, but the attack surface is larger than any detection system can fully cover.

    The practical operator response is three-part:

    Don’t give Claude access to content you didn’t write, without reading it first. If a client sends you a document and you paste it into Notion and Claude has access to that database, you have effectively given Claude the ability to be instructed by your client’s document. This might be fine. It might be a problem. Read the document before it goes into a Claude-accessible location.

    Be suspicious of workflows that chain untrusted content into actions. A workflow where Claude reads a web-scraped summary and then uses that summary to decide which database row to update is a prompt injection target. If the scraped content can shape Claude’s action, the scraped content can be weaponized.

    Use write protections for anything consequential. Anything where the cost of Claude doing the wrong thing is real — sending an email, deleting a record, updating a client-facing page — belongs behind a human-approval gate. Claude Code supports “Always Ask” behavior per-tool; use it for writes.
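
    A minimal sketch of that gate in Claude Code’s settings file (.claude/settings.json for a project, or ~/.claude/settings.json for your user), assuming the server was added under the name notion. The pattern mcp__notion matches every tool the server exposes; once you’ve seen the individual tool names via /mcp, you can narrow the list to just the write tools:

        {
          "permissions": {
            "ask": ["mcp__notion"]
          }
        }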

    This sounds paranoid. It’s not paranoid. It’s the appropriate level of caution for a class of attack that is genuinely live and that the industry has not yet figured out how to fully defend against.


    What this actually enables (the payoff section)

    Once you’ve done the setup and the hygiene work, here’s what you now have.

    You can sit down at Claude and ask it questions that require real workspace context. What’s the status of the three projects I touched last week? Pull together everything we’ve decided about pricing across the client work this quarter. Draft a response to this incoming email using context from our ongoing conversation with this client. Claude reads the relevant pages, synthesizes across them, and responds with actual grounding — not a generic answer shaped by whatever prompt you happen to type.

    You can run Claude Code against your workspace for development-adjacent operations. Generate a technical spec from our product page notes. Create release notes from the changelog and feature pages. Find every page where we’ve documented this API endpoint and reconcile the inconsistencies.

    You can set up workflows that flow across tools. Claude reads from Notion, acts on another system via a different MCP server, writes results back to Notion. This is the agentic pattern the industry keeps talking about — and with the right permissions hygiene, it actually becomes usable instead of scary.

    None of this is theoretical. I use this pattern every working day. The value is real. The hygiene discipline is what keeps the value from turning into a liability.


    When this setup goes wrong (troubleshooting honestly)

    Five failure modes I’ve seen, in order of frequency.

    Claude doesn’t see the page you asked about. For remote MCP, this almost always means the page is in a workspace you’re not a member of, or in a teamspace you don’t have access to. For local MCP, it means the integration hasn’t been granted access to that specific page — you have to go to the page, click the three-dot menu, and add the integration manually.

    OAuth flow doesn’t complete. Usually a browser issue — popup blocker, wrong Notion account signed in, session expired. Clear the auth state and try again. In Claude Desktop, disconnect the connector entirely and re-add it.

    The connection succeeds but Claude doesn’t seem to be using it. Run /mcp in Claude Code to verify the server is listed and connected. If it’s there and Claude still isn’t invoking it, the issue is usually in how you’re asking — Claude won’t reach for MCP tools just because they exist; you need to phrase the request in a way that makes it obvious the tool is relevant. “Find the page about X in Notion” works better than “tell me about X.”

    MCP server crashes or returns errors. For remote, this is rare and usually resolves itself — Notion’s hosted server has the standard cloud-reliability profile. For local, check your Node version (the server requires Node 18 or later), your config file syntax (JSON is unforgiving about trailing commas), and your token format.

    Context token budget goes through the roof. Every MCP server in your connected list contributes tools to Claude’s context on every request. If you have five MCP servers configured, that’s five sets of tool descriptions being loaded into every conversation. Run /context in Claude Code to see the cost. If it’s painful, disconnect the servers you’re not actively using.
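
    Listing and pruning are one-liners from the shell, assuming the server was added under the name notion:

    claude mcp list
    claude mcp remove notion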


    The mental model that keeps you sane

    Here’s the mental model I use for the whole setup. It’s short.

    Claude plus Notion is like giving a new, very capable employee access to your business. You wouldn’t hand a new hire every password, every file, every client record, every private note on day one. You’d give them access to the specific things they need to do the job, watch how they use that access, and expand trust over time based on track record.

    The MCP connection works exactly that way. You decide what Claude gets to see. You decide what Claude gets to write. You watch how it uses that access. You expand the boundary as trust earns itself.

    The operators who get hurt by this kind of setup are the ones who skip the first step and give Claude everything on day one. The operators who get the real value out of it are the ones who treat the connection the way they’d treat any other employee — with deliberate scope, real oversight, and the willingness to revoke access if something goes wrong.

    That’s the discipline. That’s the whole thing.


    FAQ

    Do I need to install anything to connect Claude to Notion? For remote MCP (the recommended path), no installation is required — you connect via OAuth through Claude Desktop’s Settings → Connectors or Claude Code’s claude mcp add command. For local MCP (legacy), you install @notionhq/notion-mcp-server via npm and create an internal Notion integration.

    What’s the URL for Notion’s remote MCP server? https://mcp.notion.com/mcp. Use HTTP transport (not the deprecated SSE transport).

    Can Claude see my entire Notion workspace by default? Yes. MCP tools act with your full Notion permissions — they can access everything you can access. The boundary is set by your workspace membership and teamspace access, not by the MCP connection itself. If you need finer-grained control, isolate Claude-facing content into a separate workspace or teamspace.

    Can I use Notion MCP with automated, headless agents? Remote Notion MCP requires OAuth authentication and doesn’t support bearer tokens, which makes it unsuitable for fully automated or headless workflows. For those cases, the legacy @notionhq/notion-mcp-server with an API token still works, but it’s being phased out.

    What plans support Notion MCP? Connecting AI tools to Notion via MCP is available on all Notion plans. Enterprise plans add admin-level MCP Governance controls (approved AI tool list, disconnect-all). Claude Desktop MCP connectors are available on Pro, Max, Team, and Enterprise plans.

    Can my company’s admins control which AI tools connect to our Notion workspace? Yes, on the Enterprise plan. Admins can restrict AI tool connections to an approved list through Settings → Connections → Permissions tab. Only admin-approved tools can connect.

    Is Notion MCP secure for confidential business data? The MCP protocol itself respects Notion’s permissions — it can’t bypass what you have access to. However, content flowing through MCP is processed by the AI tool you’ve connected (Claude, ChatGPT, etc.), which has its own data handling policies. For highly sensitive content, the right move is to isolate it in a workspace that Claude doesn’t have access to, rather than relying on the protocol alone to contain it.

    What about prompt injection attacks through Notion content? Real risk. Anthropic explicitly flags it in their MCP documentation. Notion has shipped detection for hidden instructions and flags suspicious links, but no detection system catches everything. The operator response: don’t give Claude access to content you didn’t write without reviewing it first, be suspicious of workflows where untrusted content shapes Claude’s actions, and put human-approval gates on anything consequential.

    What’s the difference between Notion’s built-in AI and connecting Claude via MCP? Notion’s built-in AI (Notion Agent and Custom Agents) runs inside Notion and uses Notion’s integration with frontier models. Connecting Claude via MCP brings Claude — your chosen model, in your chosen interface, with its full capability — to your workspace as an external client. The built-in option is simpler; the MCP option is more powerful and composable across other tools.


    Closing note

    Most tutorials treat the connection as the goal. The connection is the easy part. The hygiene is the part that matters.

    If you wire Claude into your Notion workspace thoughtlessly, you’ve given a capable AI access to every corner of your operational history, and you’ll be surprised how much of what’s in there you’d forgotten. If you wire it in deliberately — with a scoped workspace, with the permissions you’ve thought about, with the posture of giving a new employee measured access — you’ve built something that pays rent every day without ever becoming the liability it could have been.

    One hour of setup. One hour of cleanup. And then one of the most useful AI configurations currently possible in April 2026.

    The intersection of Notion and Claude is where the operator work actually happens now. Worth setting up right.



  • The Soda Machine Thesis: A Mental Model for Running an AI-Native Business on Notion

    The hardest part of running an AI-native business on Notion in 2026 isn’t the tools. The tools are fine. The tools ship regularly and they work. The hard part is that the vocabulary hasn’t caught up with the reality, and when the vocabulary is wrong, your design choices get wrong too.

    Here’s what I mean. When I started seriously composing Workers, Agents, and Triggers in Notion, I found I was making the same kinds of mistakes over and over. Building a worker for something an agent could have handled with good instructions. Attaching five tools to an agent that only needed two. Setting up a scheduled trigger for something that should have fired on an event. After the third or fourth time, I realized the mistakes had a common source: I didn’t have a mental model for when to reach for which piece.

    Notion doesn’t give you one. The documentation is accurate but it’s a list of capabilities. Vendor-shaped — here is what Custom Agents can do, here is what Workers do, here are your trigger types. All true. All useless for the question I actually had, which was: given a job I want done, which piece do I build?

    So I made a mental model. It’s imperfect and it’s mine, but it has survived a few months of real use and it has saved me from a dozen architecture mistakes I would have otherwise made. This article is the model.

    I call it the Soda Machine Thesis. It might sound silly. It works.


    The core analogy

    Workers are syrups. Agents are soda fountain machines. Triggers are how the machine dispenses.

    When someone asks for a custom soda fountain — a Custom Agent — three decisions get made, in order:

    1. Which syrups (workers and tools) load into this machine? What capabilities does it need access to? What external services does it need to reach? What deterministic operations does it need to perform?
    2. How is the machine programmed? What are its instructions? What’s its job description? How does it think about what it’s doing? (This is the part where agents diverge most — two machines with identical syrups behave completely differently based on instructions.)
    3. How does it dispense? Does it pour when someone presses a button (manual trigger)? Does it pour on a schedule (timer)? Does it pour when the environment changes — a page gets created, a status flips, a comment gets added (event sensor)?

    That’s the whole model. Three questions, in that order. If you can answer all three cleanly, you have a working agent. If you can’t answer one of them, you have an agent that is going to produce noise and frustrate you.

    I have watched this analogy clarify a dozen conversations that were going nowhere. “I want an agent that…” — and then I ask the three questions, and halfway through the answers it becomes obvious what the person actually wants is a simpler thing. Sometimes they don’t need an agent at all, they need a template with a database automation. Sometimes they need a worker, not an agent. Sometimes they need an agent with zero workers and better instructions.

    The analogy does real work. That’s the whole point of a mental model.


    Where the analogy holds

    The map is cleaner than you’d expect.

    Workers are syrups. Stateless, parameterized, reusable. The same worker — fetch-url, summarize, post-to-channel, whatever — can power a dozen agents. You build it once, you use it everywhere. A worker that sends an email works the same way whether it’s being called by a triage agent, a brief-writer, or a customer-response agent. That’s what syrup means: the ingredient doesn’t care which drink it’s going into.
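
    To make “stateless, parameterized, reusable” concrete, here’s a generic TypeScript sketch of a fetch-url capability. This is the shape of a syrup, not Notion’s actual Worker API — treat the names as illustrative:

        // A syrup: stateless and parameterized. Same inputs, same behavior,
        // no memory of which machine (agent) called it.
        type FetchUrlInput = { url: string; maxChars?: number };
        type FetchUrlOutput = { status: number; text: string };

        export async function fetchUrl(
          { url, maxChars = 5000 }: FetchUrlInput
        ): Promise<FetchUrlOutput> {
          const res = await fetch(url); // global fetch, Node 18+
          const body = await res.text();
          return { status: res.status, text: body.slice(0, maxChars) };
        }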

    Agents are machines. They select, sequence, and orchestrate. An agent knows when to reach for which worker. An agent knows what the job is and reasons about how to do it. An agent can read a database, synthesize what it finds, reach for a tool to do a specific deterministic step, synthesize again, and return a result. An agent is a little piece of judgment on top of a set of capabilities.

    Triggers are how the machine dispenses. This is the cleanest part of the map because Notion’s own trigger types map almost 1:1 onto the analogy:

    • Button press or @mention → manual dispatch (“I’m pressing the button for a Coke”)
    • Schedule → timer (“pour me a drink at 7am every day”)
    • Database event → sensor (“someone just put a cup under the dispenser; fill it”)

    You don’t need to memorize trigger type names. You need to ask “how should this machine know it’s time to pour?” Once you know the answer, the trigger type follows.


    Where the analogy leaks (and what to do about it)

    No analogy is perfect. This one has four honest leaks that are worth knowing before you rely on the model.

    1. Agents have native hands, not just syrups

    A Custom Agent can read pages, search the workspace, write to databases, and send notifications without a single worker attached. Workers are specialty syrups for the things the base machine can’t do natively — external APIs, deterministic writes to strict database schemas, code execution, anything requiring exact outputs every time.

    This means not every agent needs workers. In fact, my highest-leverage agents often have zero workers. They use the base machine’s native capabilities, combined with strong instructions, to do the job.

    The practical consequence: don’t reach for a worker reflexively. Start by asking what the agent can do with just its native hands and good instructions. Only add workers when the agent genuinely needs capability it doesn’t have.

    2. Machine programming matters as much as syrup selection

    The instructions you give an agent — its system prompt, its job description, its operating rules — are doing as much work as the workers you attach. Two agents with identical workers will behave completely differently based on how they’re instructed.

    People tend to under-invest here. They attach five workers, write three sentences of instruction, and wonder why the agent is flaky. The fix is not more workers. The fix is writing instructions the way you’d write onboarding docs for a new employee — specific, scoped, honest about edge cases, clear about what the agent should do when it’s uncertain.

    My rule: if I’m about to attach a worker because the agent “keeps getting it wrong,” I first check whether better instructions would fix the problem. Nine times out of ten they would.

    3. Workers aren’t a single thing

    This is the leak that surprised me when I learned it. There are actually three kinds of worker, and they behave differently:

    • Tools — on-demand capabilities. The classic syrup. An agent calls them when it needs them. Example: a worker that fetches a URL and returns the text.
    • Syncs — background data pipelines that run on a schedule and write to a database. Not dispensed by an agent. These are more like an ice maker — it runs on the building’s infrastructure, keeping the bin stocked, and the machines use what it produces.
    • Automations — event handlers that fire when something happens in the workspace. Like a building’s fire suppression — nobody’s pressing a button; the environment triggers it.

    This matters because syncs and automations don’t need an agent to dispatch them. They run autonomously. If you’re building something that feeds a database on a schedule, that’s a sync, not a tool, and it doesn’t need an agent. If you’re building something that reacts to a page being updated, that’s an automation, not a tool.

    Getting this wrong is one of the most common architecture mistakes. People build an agent to dispatch a sync because they think everything has to flow through an agent. It doesn’t. Let the infrastructure do the infrastructure’s job.

    4. Determinism vs. judgment is the design axis

    The thing the soda analogy doesn’t capture well is that workers and agents are not just interchangeable building blocks. They serve fundamentally different purposes:

    • Workers shine when you want deterministic behavior. Same input, same output, every time. Schema-strict writes. External API calls where the shape of the request and response are fixed.
    • Agents shine when you want judgment, composition, and natural-language reasoning. Variable inputs. Fuzzy requirements. Synthesis across multiple sources.

    The red flag: building a worker for something an agent could do reliably with good instructions. You’re over-engineering.

    The green flag: an agent keeps being flaky at a specific operation. Harden that operation into a worker. Now the agent handles the judgment part, and the worker handles the reliable part.


    The “should this be a worker?” test

    When I’m trying to decide whether to build a worker or let an agent handle something, I run a five-point checklist. If two or more are true, build a worker. If fewer than two are true, stay manual or solve it with agent instructions.

    1. You’ve done the manual thing three or more times. The third time is the signal. First time is discovery, second time is coincidence, third time is a pattern worth capturing.
    2. The steps are stable. If you’re still figuring out how to do the thing, don’t codify it yet. You’ll codify the wrong version and have to rewrite.
    3. You need deterministic schema compliance. Writes that must fit a database schema exactly are worker territory (see the sketch below). Agents can write to databases, but if the schema has strict requirements, a worker is more reliable.
    4. You’re calling an external service Notion can’t reach natively. This is often the clearest signal. If it’s outside Notion and needs to be reached programmatically, it’s a worker.
    5. The judgment required is minimal or already encoded in rules. If the decisions are simple enough to express as code, a worker is fine. If the decisions need real reasoning, it’s agent territory.

    This test is not a strict algorithm. It’s a gut-check that catches the most common over-engineering mistakes before they happen.
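
    Here’s what point 3 looks like in practice, sketched with the official @notionhq/client SDK — the database ID and property names are hypothetical stand-ins for your own schema:

        import { Client } from "@notionhq/client";

        const notion = new Client({ auth: process.env.NOTION_TOKEN });

        // Deterministic, schema-strict write: property names and types must
        // match the target database exactly, every time. LEADS_DB_ID and the
        // property names here are hypothetical.
        export async function createLead(name: string, source: string) {
          return notion.pages.create({
            parent: { database_id: process.env.LEADS_DB_ID! },
            properties: {
              Name: { title: [{ text: { content: name } }] },
              Source: { select: { name: source } },
              Captured: { date: { start: new Date().toISOString() } },
            },
          });
        }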


    The roles matter more than the technology

    Here’s the extension of the analogy that actually made the whole thing click for me.

    Every construction project has four roles. The Soda Machine Thesis as I originally described it has three of them. The one I hadn’t named — and the one you’re probably missing in your own workspace — is the Architect.

    Construction role → your system:

    • Owner / Developer → The human in the chair. Commissions work, approves output, holds the keys.
    • Architect → The AI-in-conversation. Claude, Notion Agent in chat, whatever model you’re actively designing with.
    • General Contractor → A Custom Agent running in production.
    • Subcontractor → A Worker. Called in for specialty work.

    The distinction that matters: the Architect and the General Contractor are the same technology, playing different roles. When you’re chatting with a model about how to design a system, that model is acting as Architect — designing the thing before it gets built. When a Custom Agent runs autonomously against your databases overnight, it’s acting as General Contractor — executing the design.

    Same underlying AI. Completely different role.

    Getting this distinction wrong is how operators end up either (a) over-trusting autonomous agents with design decisions they shouldn’t be making, or (b) under-using conversational AI for the system-design work it’s actually best at. Chat with the Architect. Deploy the GC. Don’t confuse them.


    Levels of automation (what you’re actually doing at each stage)

    Most operators cycle through these levels as they get deeper into the pattern. Knowing which level you’re currently at — and which level a specific problem actually needs — prevents a lot of wasted effort.

    Level 0: The Owner does it. You manually do the thing. This is fine. Everything starts here. Some things should stay here.

    Level 1: Handyman. You’ve built a template, a button, a saved view. No AI involvement. Native Notion helps you do it faster. Still you doing the work.

    Level 2: Standard Build. Notion’s native automations handle it. Database triggers fire on status changes. Templates get applied automatically. Still deterministic, still no AI.

    Level 3: Self-Performing GC. A Custom Agent does the work natively — reading and writing inside Notion, reasoning about context, no workers attached. This is where agents earn their keep for the first time.

    Level 4: GC + One Trade. An agent with one specialized worker. The agent handles judgment; the worker handles a single deterministic step. This is the most common production pattern.

    Level 5: Full Project Team. An agent orchestrating multiple workers in sequence. Real project coordination. A brief-writer agent that calls a URL-capture worker, then a summarization worker, then a publishing worker, all in order.

    Level 6: Program Management. Multiple agents coordinated by an overarching structure. One agent that dispatches to specialist agents. Portfolio-level orchestration. This is where it gets complicated and where most operators don’t need to go.

    The mistake I made early on, and watch other operators make, is jumping to Level 5 when Level 3 would have worked. More pieces means more failure points. Solve it at the lowest level that works.


    Governance: permits, inspections, and change orders

    The analogy extends further than I expected into governance — which is the unsexy part of running real agents in production, but it’s the part that separates operators who keep their agents working from operators whose agents quietly stop working without them noticing.

    • Pulling a permit = Attaching a worker to an agent. You’re granting that specialty trade permission to work on your job. This is not a nothing decision. Be deliberate.
    • Building inspection = Setting a worker tool to “Always Ask” mode. Before the work ships, the human reviews it. For any worker that does something consequential, this is the default.
    • Certificate of Occupancy = The moment a capability graduates from Building to Active status in your catalog. Before that moment, treat it as construction. After, treat it as load-bearing.
    • Change Order = Editing an agent’s instructions mid-project. The scope changed. Document it.
    • Punch List = The run report every worker should write on every execution — success and failure. No silent runs. If you can’t see what your agent did, you don’t know what it did.
    • Warranty work = Iterative fixes after a worker is deployed. v0.1 to v0.2 to v0.3. This never stops.

    The governance layer sounds boring but it’s what makes agents run for months instead of days. An agent without run reports eventually drifts, fails silently, and leaves you discovering the failure weeks later when the downstream thing it was supposed to do quietly stopped happening. The governance rituals — inspections, change orders, punch lists — are not overhead. They’re what makes the system durable.


    The revised one-sentence summary

    Putting it all together, here is the whole thesis in one sentence:

    Notion is the building. Databases are the floors. The Owner runs the project. Architects design in conversation. General Contractors (agents) execute on-site. Subcontractors (workers) run specialty trades. Syncs are maintenance contracts. Triggers are permits, sensors, and dispatch radios.

    If you can hold that sentence in your head, you can design automation in Notion without getting lost in the vocabulary. When you’re about to build something, ask: which role am I playing right now? Which role does this piece need to play? Who’s the Owner, who’s the Architect, who’s the GC, who’s the sub? If you can answer, the architecture writes itself.


    Practical takeaways

    If you made it this far, here are the five things I’d want you to walk away with:

    1. Not every agent needs workers. Start with native capabilities and strong instructions. Add workers only when the agent can’t do the thing otherwise.
    2. The third time is the signal. Don’t build infrastructure for something you’ve only done twice. You’ll build the wrong version. The third time is when the pattern has stabilized enough to capture.
    3. Syncs and automations don’t need an agent. If you’re feeding a database on a schedule, or reacting to a workspace event, let the infrastructure do it. Don’t wrap it in an agent for no reason.
    4. Separate the Architect from the GC. Use conversational AI to design the system. Use Custom Agents to run the system. Don’t let an autonomous agent make design decisions that should be made in conversation.
    5. Write run reports for everything. Silent success is worse than loud failure, because silent success is indistinguishable from silent failure until weeks later. Every agent, every worker, every run — writes a report somewhere readable.

    That’s the model. It is imperfect and it is mine. If you adopt it, make it your own. If you have a better one, I’d honestly like to hear about it.


    FAQ

    What’s the difference between a Notion Worker and a Custom Agent? A Worker is a coded capability — deterministic, reusable, typically written in TypeScript — that a Custom Agent can call. A Custom Agent is an autonomous AI teammate that lives in your workspace, has instructions, runs on triggers, and can optionally use Workers to do specialized tasks. Workers are capabilities. Agents are operators that can use those capabilities.

    Do I need Workers to use Custom Agents? No. Many Custom Agents run perfectly well with zero Workers attached, using only Notion’s native capabilities (reading pages, writing to databases, searching, sending notifications) plus well-written instructions. Workers become necessary when you need to reach external services or enforce strict deterministic behavior.

    What are the three trigger types for Custom Agents? Manual (button press, @mention, or direct invocation), scheduled (recurring on a timer), and event-based (a database page is created, updated, deleted, or commented on). Pick the one that matches how the agent should know it’s time to act.

    When should I build a Worker versus letting an Agent handle something? Build a Worker when at least two of these are true: you’ve done the manual thing three or more times, the steps are stable, you need deterministic schema compliance, you’re calling an external service Notion can’t reach, or the judgment required is minimal. If fewer than two are true, stay manual or solve it with agent instructions.

    What’s the difference between a Tool, a Sync, and an Automation? A Tool is an on-demand capability that an agent calls when needed. A Sync is a background pipeline that runs on a schedule and writes to a database — no agent required. An Automation is an event handler that fires when something changes in the workspace — also no agent required. Tools are dispatched by agents; syncs and automations run on the infrastructure.

    What’s the Architect/GC distinction? When you chat with AI to design a system, the AI is playing Architect — thinking about what should be built. When a Custom Agent runs autonomously in your workspace, it’s playing General Contractor — executing the design. Same technology, different role. Don’t confuse them: let Architects design, let GCs execute.

    Does this apply outside of Notion? The Soda Machine Thesis is written around Notion’s specific implementation of Workers, Agents, and Triggers, but the underlying pattern (deterministic capabilities + judgment layer + trigger mechanism) applies to most modern agent frameworks. The vocabulary may differ. The architecture is the same.


    Closing note

    Mental models earn their place by changing the decisions you make. If the Soda Machine Thesis changes how you decide what to build next in your Notion workspace, it has done its job. If it doesn’t, discard it and find one that does.

    The reason I wrote it down is that the vocabulary available for thinking about AI-native workspaces in 2026 is still mostly vendor vocabulary, and vendor vocabulary optimizes for describing what a product can do rather than helping operators make good choices. The operator vocabulary has to come from operators. This is mine, offered in that spirit.

    If you’re running this pattern and have refinements, they’re welcome. The thesis is a living document in my own workspace. It gets smarter every time someone pushes back.


    Sources and further reading

    This mental model builds on earlier conceptual work across multiple AI tools (Notion Agent, Claude, GPT) contributing to the same thesis over a series of architecture conversations. The framing evolved through disagreement more than consensus, which is how mental models usually get better.

  • The Notion Operating Company: How to Actually Run a Business on a Workspace in 2026

    There is a version of Notion most people use, and there is the version a small number of operators have quietly built — and in April 2026 those two versions are now so far apart that they’re barely the same product.

    The version most people use is a wiki. It is a place you put information you intend to come back to, and most of the time you don’t. Pages go stale. Databases grow faster than they get organized. The search gets worse as the content gets larger. You know this because you have seen your own Notion and felt the tug of guilt when you open it, the small calculation of whether it is worth the effort to fix any of this versus just writing the thing you need to write in a fresh page and adding it to the pile.

    The version a smaller number of people have built is an operating company. It runs on Notion. The human in the chair reads briefs written by AI, approves work, watches reports come back, adjusts priorities, and hands the next job out — and the human never leaves Notion. Everything that is expensive to move between tools does not move. The work comes to them.

    Those aren’t the same product anymore. They used to be. Notion was, for years, fundamentally a block editor with databases bolted on. What changed — what actually changed, not what the vendor said changed — is that over the last six months Notion stopped being a place you put things and started being a place you run things. Custom Agents shipped in late February. The Workers framework followed. MCP support matured. The Skills layer made repeatable workflows into commandable capabilities. What used to be a workspace is now closer to an operating system for a small business.

    Most coverage of this shift is either vendor-positive cheerleading or a product tour disguised as a guide. This is neither. This is how an actual operator runs a real, unglamorous business — dozens of properties, content production cycles, client work, all of it — out of Notion in 2026. The shape, the databases, the ritual, what goes inside the workspace and what stays outside, and where it still breaks.

    If you want a product tour you can find one on Notion’s own blog. If you want the honest operator version, keep reading.


    What “operating company” actually means

    The frame matters, so let’s be concrete about what it is.

    An operating company, in the sense I mean it, is the set of decisions, assets, people, and ongoing commitments that make a business actually go. Not the legal entity. The operating layer. In a traditional small business, that operating company lives in someone’s head, a few spreadsheets, a calendar, a CRM, an email inbox, a project tool, a file drive, a Slack, a billing system, and the recurring pain of trying to hold all of it in mind at once.

    Running a business on Notion in 2026 means collapsing as much of that operating layer as possible into a single workspace that knows what it is. Not a place where you write things down. A place where the work is actually happening, where the state of the business is legible at a glance, where a decision made on Monday shows up in Thursday’s automatically-generated brief without anyone having to remember to copy it forward.

    The term I have started using is the Notion Operating Company. It captures the thing correctly: Notion is not the tool you use to run the company, it is the operating layer of the company. The humans make the calls, set the priorities, and absorb the parts that cannot be delegated. Everything else lives in the workspace and operates against the workspace.

    If that sounds like a personal productivity system scaled up, it is not. Personal productivity systems are closed loops. The Notion Operating Company is an open system that other humans, AI teammates, and external services read from and write to. The difference is legibility and composability, and in 2026 those are the qualities that separate a workspace that earns its place from a workspace that is a second pile.


    Why this suddenly works in 2026 (and didn’t in 2024)

    A few things had to be true at the same time for this pattern to become reliably available to small teams and solo operators. None of them were true two years ago.

    Custom Agents shipped. On February 24, 2026, Notion released Custom Agents as part of Notion 3.3. These are autonomous AI teammates that live inside your Notion workspace and handle recurring workflows on your behalf, 24 hours a day, 7 days a week. They do not wait for you to prompt them. You give them a job description, a trigger or schedule, and the data they need, and they run. That one change is the hinge the whole operating-company pattern swings on. Before Custom Agents, automation inside Notion was cosmetic — property updates, templated pages, simple reminders. After Custom Agents, a workspace can actually operate itself between human check-ins.

    The pricing makes it viable. Custom Agents are free to try through May 3, 2026, so teams have time to explore and see what works. Starting May 4, 2026, they use Notion Credits, available as an add-on for Business and Enterprise plans. The pricing matters because it turns out many workflows are cheap enough to run continuously, and the ones that aren’t are easy to audit once the dashboards shipped. Custom Agents are now 35–50% cheaper to run across the board, especially ones with repetitive tasks like email triage. They’re even more cost efficient when you pick new models like GPT-5.4 Mini & Nano, Haiku 4.5, and MiniMax M2.5 that use up to 10× fewer credits. The 10× model-routing move means a well-designed agent for an operator’s workspace costs real-world pennies to run daily.

    MCP connects the workspace to everything else. The Model Context Protocol, the open standard introduced by Anthropic, gives the workspace a standardized way to reach external tools and services. Notion ships MCP support; most serious AI tools do. The practical consequence: a Custom Agent inside Notion can reach into a source-control system, post to a messaging tool, query a database, or trigger an external worker, without anyone writing glue code. Not every integration is seamless, but the floor has lifted.

    Skills turned workflows into commandable capabilities. Skills turn “that thing you always ask Notion Agent to do” into something it can do on command. Save your best workflows as skills like drafting weekly updates, reshaping a doc in your team’s format, or prepping briefs before a meeting. That matters because the skills layer is where institutional pattern-capture lives. The first time you solve a problem in your workspace, you solve it. The second time, you turn it into a skill. The third time, you invoke it by name. A workspace that accumulates skills gets faster over time instead of slower.

    Autofill became real. Use Autofill to keep your data fresh and up to date, now with all the power and intelligence of Custom Agents. Continuously enrich, extract, and categorize information across every row, so your database stays trustworthy without manual review. That changes what a Notion database is. Databases used to rot without manual maintenance. A self-maintaining database is a different kind of object.

    None of these individually would have tipped Notion from workspace to operating system. All of them together, shipped inside a twelve-month window, did.


    The shape of an operating company in Notion

    Let me describe the actual shape. This is not theoretical. This is the operational pattern that works, stripped of the specifics that would identify any one business.

    The Control Center

    At the root of the workspace is a single page called the Control Center. It is the first page you see when you open Notion. It is the page an AI teammate is told to read first when it is helping you with anything. It is the page a new human teammate reads on day one before they read anything else.

    The Control Center does not contain content. It contains pointers. Specifically:

    • Today — a surfaced view of whatever is actively happening today, pulled from the Tasks database, filtered to today or overdue
    • The live business state — three to five sentences updated continuously (by a Custom Agent, actually) describing where the business is, what is being worked on, what is on fire
    • The database index — a linked block for each operational database, in order of how often you touch them
    • The active projects list — rolled up from the Projects database, filtered to in-flight
    • The week — the current week’s focus, the working theme, what “winning the week” looks like
    • Open loops — the short list of unresolved decisions currently parked waiting for input

    The Control Center is roughly two screens long. It tells you what is happening and gives you the jumping-off points to go deeper. Anything that belongs on the Control Center is either updated automatically or so critical that manual maintenance is worth it.
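
    The Today view itself is just a linked view with a filter, but if you ever want the same query programmatically — from a script, an external dashboard, or an MCP call — it’s one request with the official @notionhq/client SDK. A sketch, with a hypothetical database ID and property names:

        import { Client } from "@notionhq/client";

        const notion = new Client({ auth: process.env.NOTION_TOKEN });

        // “Today or overdue”: due on or before today and not yet done.
        // TASKS_DB_ID and the Due/Status property names are hypothetical.
        export async function todayView() {
          const today = new Date().toISOString().slice(0, 10);
          return notion.databases.query({
            database_id: process.env.TASKS_DB_ID!,
            filter: {
              and: [
                { property: "Due", date: { on_or_before: today } },
                { property: "Status", status: { does_not_equal: "Done" } },
              ],
            },
          });
        }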

    The database spine

    Under the Control Center live the operational databases. In a functioning operating company, these map directly to the actual entities the business deals with, not to organizational categories.

    For a service business, the spine typically includes: Clients, Projects, Tasks, Leads, Decisions, People (the humans you interact with externally), Assets, and a catch-all Inbox.

    For a content business, the spine typically includes: Properties (the things you publish on), Briefs, Drafts, Published, Distribution, Ideas, and Performance.

    For a product business, the spine looks different again: Features, Customers, Feedback, Roadmap, Releases, Incidents.

    The exact databases depend on the business. The pattern does not. Each database represents a real operational object. Each relation represents a real dependency. Each view answers a question someone actually asks regularly.

    The test for whether a database belongs on the spine is simple: can you describe, in one sentence, what decision this database helps someone make? If the answer is yes, it belongs. If the answer is “it’s where I put stuff about X,” it doesn’t.

    The agents layer

    Running on top of the database spine is the agents layer. This is the part that would not have existed in 2024.

    The operational pattern, in the workspace I actually run, has a handful of agents that each do one job and do it well.

    • The Triage Agent watches the Inbox database. Anything that lands there gets a priority, a category, and a pointer to the database it actually belongs in. It does not make big decisions. It takes the pile and turns it into a sorted pile.
    • The Morning Brief Agent runs once a day. It reads the Control Center state, the active projects, the top of the Tasks database, the calendar, and the unresolved Decisions, and writes a three-paragraph brief at the top of today’s Daily page. You wake up and the state of the business is already synthesized.
    • The Review Agent runs weekly on Fridays. It pulls what was completed, what stalled, and what slipped, and writes the weekly retro. It is not asking you to fill in a form. It is writing the retro and handing it to you to review.
    • The Enrichment Agent runs on database writes. When something new lands in a key database — a lead, a project, a decision — the agent fills in the fields that would otherwise require manual data entry. Research, links, categorization.
    • The Escalation Agent watches for states that require human attention. A project stalled for too long, a task with no owner, a decision parked past its decide-by date. It surfaces them on the Control Center.

That’s five agents. Some workspaces I’ve seen run more. Most run fewer. The number is not the point; the pattern is. Each agent has one job, one data source, one output surface, and a clear signal for when it should run.

    The constraint that keeps this from sprawling into chaos is a rule I’ve internalized: one agent, one job. The moment an agent tries to do three things, it does none of them well.
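
To make “one agent, one job” concrete, here is a sketch of the instruction block a Triage Agent might run on. The wording and field names (Priority, Category, Belongs In, Triage Note) are illustrative assumptions, not a copy of any real workspace; swap in your own schema.

    You are the Triage Agent. You run when a new item lands in the Inbox database.

    For each new item:
    1. Assign a Priority: P1 (today), P2 (this week), P3 (someday).
    2. Assign a Category: Client, Content, Ops, Personal.
    3. Set Belongs In to the database the item should move to
       (Tasks, Leads, Decisions, or Ideas).
    4. Write one sentence in Triage Note explaining the call.

    Do not move, edit, or delete anything outside the Inbox.
    Do not make decisions. Sort the pile; a human dispatches it.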

    The skills layer

Beneath the agents, you accumulate skills over time. These are not agents; they’re invoked capabilities. “Generate a weekly client report in this format.” “Convert this meeting transcript into tasks.” “Draft a response to this inbound email in my voice.” Skills are the pattern-capture layer — the place where solved problems become invocable capabilities.

    The skills layer grows by a specific rule: the third time you notice yourself doing the same thing manually, you turn it into a skill. Not the first time, not the second. The third time is the signal that it’s going to happen again, and the cost of capturing it is less than the cost of doing it manually from here forward.
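
What a captured skill looks like in practice, sketched as a reusable instruction page. The exact format matters less than writing down the inputs, steps, and output shape once; the steps below are invented for illustration:

    Skill: Meeting transcript → tasks
    Input: a raw transcript, pasted or linked
    Steps:
    1. Extract every commitment that has an owner and a date.
    2. Create one Tasks entry per commitment, linked to the relevant Project.
    3. Flag anything with no owner as an Open Loop on the Control Center.
    Output: a bulleted list of the tasks created, with links.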

    The source-of-truth boundary

    Here is where most Notion-as-OS writeups go silent, and it’s actually the most important thing in the whole pattern.

    Notion is not the source of truth for everything. It is the source of truth for the operational state of the business — what’s happening, what’s decided, what’s being worked on, what’s next. It is not the source of truth for code, for financial transactions, for legal documents, for anything that needs to survive an outage of Notion itself.

    Code lives in a source-control system. Money data lives in whatever financial system the business uses. Legal artifacts live in signed-document storage. Heavy compute runs outside Notion and reports back. The operating company is inside Notion; the substrate is not.

    The mental model I use: Notion is the bridge of the ship. The bridge runs the ship. The ship is not inside the bridge.

    This distinction is what prevents the whole pattern from collapsing. A workspace that tries to be the whole business eventually becomes unusable because it is bloated with content that doesn’t belong in a control plane. A workspace that is a control plane stays light, stays fast, and stays legible.


    The daily ritual (what it actually looks like)

    The pattern lives or dies in daily use. Let me describe what a normal working day looks like for an operator running on this pattern — the actual sequence, not the aspirational version.

    Open Notion. The Control Center loads. The Morning Brief Agent has already run; the top of today’s Daily page has a three-paragraph synthesis of the state of the business: what’s on fire, what’s progressing, what requires a decision today. Reading that takes ninety seconds.

    Scan the Inbox. The Triage Agent has already sorted whatever landed overnight. Each item has a category, a priority, and a pointer. You’re not doing the sort. You’re spot-checking the sort — agreeing, disagreeing, occasionally fixing, and dispatching the important items into their real databases.

    Check Escalations. The Escalation Agent has flagged the three things that need attention. You make the decisions. This is the part where being a human matters.

    Open today’s active project. Whatever you are actually working on is linked from the Control Center. You go there and do the work. Sometimes the work is writing in Notion. Sometimes the work is in an IDE, a chat window, a document, a call — Notion is where you come back to log what happened and what comes next.

    At a natural stopping point, log. The log is short. Two sentences on what just got done. Notion captures the timestamp. Over time the log becomes the actual record of how the business moves.

Evening wrap. Five minutes. The day’s work closes out. Anything that didn’t get done gets re-dated. Tomorrow’s active page is pre-staged.

    That’s the ritual. It takes under twenty minutes of overhead per day and gives you a fully legible operating record. The agents do the work that would otherwise be overhead. The human does the work that requires a human.

The difference between an operator running this pattern and an operator running without it is not productivity on any individual task. It is the absence of the context-loss tax — the tax you pay every time you sit down and have to remember where you left off, what’s happening, what’s next. Pay that tax once a day, in the ninety seconds it takes to read the brief, and the rest of the day runs on continuous context.


    Where it still breaks (the honest part)

    This pattern is not finished. There are specific places where running a real operating company on Notion still hits walls, and pretending otherwise is the kind of dishonesty that catches up to you when the tool fails you at a bad moment.

    Heavy write workloads. Notion is not a database in the performance sense. If you are trying to push hundreds of updates per minute through the API, you are going to hit rate limits and you are going to have a bad time. The operational pattern is aware of this: heavy writes go to a real database first and are reflected into Notion in summary form.
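
If you want the shape of that workaround in code: a minimal sketch in TypeScript, assuming the official @notionhq/client SDK and an app database that absorbs the raw writes. The page ID and property names are placeholders; the point is the pattern, one summarized Notion write per interval instead of hundreds per minute.

    import { Client } from "@notionhq/client";

    const notion = new Client({ auth: process.env.NOTION_TOKEN });

    // Raw events land in the real database at full speed (not shown).
    // On a schedule — say every 15 minutes — summarize and reflect into Notion.
    async function reflectSummaryToNotion(summary: { open: number; doneToday: number }) {
      await notion.pages.update({
        page_id: process.env.STATUS_PAGE_ID!, // placeholder: your status page
        properties: {
          "Open Items": { number: summary.open },
          "Done Today": { number: summary.doneToday },
        },
      });
    }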

    Reliable external integration. Custom Agents’ ability to reach external systems via MCP has improved a lot in 2026, but it is not ironclad. Agents that must succeed — send this email, charge this card, update this record — still belong in a purpose-built service, not in a Custom Agent. The rule I use: if the cost of the agent silently failing is real money or real trust, it doesn’t belong in Notion.

Mobile agent management. Building, editing, and configuring Custom Agents require the Notion desktop or web app. Mobile access for viewing and interacting with existing agents is supported, but agent creation and configuration are desktop/web only. This is fine, but worth knowing: operators who work primarily from a phone can interact with agents but cannot build them on the go.

    Prompt injection. Custom Agents can encounter “prompt injection” attempts — when someone tries to manipulate an agent through hidden instructions in content it reads. This risk exists across connected tools, uploaded documents, and even internal communications. Notion has shipped detection, but the attack surface is real and growing. The practical operator response: don’t give agents access to anything they don’t strictly need, and review any external content an agent will read before granting access.

    The shape of the workspace matters more than it used to. A messy Notion workspace was merely annoying in 2024. A messy Notion workspace in 2026 makes your agents worse, because the agents are navigating the same structure you are. Disorganized databases produce disorganized agent outputs. The cost of workspace hygiene used to be cosmetic. It’s now functional.

    Credit economics at scale. Starting May 4, 2026, Custom Agents run on Notion Credits, a usage-based add-on available for Business and Enterprise plans. The pricing is $10 per 1,000 credits. Credits are shared across the workspace and reset monthly. Unused credits do not roll over to the following month. For a small operator, this is fine. Most workflows are cheap. For larger teams running many agents, credit consumption becomes a line item worth watching. Notion has shipped a credits dashboard to help, but budget discipline is a new muscle for Notion-native teams.
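
For a rough feel of the math, with a hypothetical per-run cost (actual credit consumption varies by agent, model, and workload):

    5 agents × 30 runs/month × 20 credits/run = 3,000 credits/month
    3,000 credits × $10 / 1,000 credits      = $30/month

At solo-operator scale that is a rounding error. Multiply the agent count and run frequency across a team of forty and it stops being one.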

    None of these are dealbreakers. All of them are things the pattern has to work around. The honest version of this article tells you that up front.


    Notion Agent vs Custom Agents (the distinction that matters)

    One clarification because the terminology can confuse newcomers to the pattern.

    Custom Agents are team-wide AI teammates that run automatically on schedules or triggers. Notion Agent is a personal AI assistant that works on-demand when you ask. All Notion users get Notion Agent. Business and Enterprise customers get Custom Agents, priced under the Notion credit system.

    The operating-company pattern uses both. Notion Agent is the on-demand assistant — the one you invoke for “rewrite this paragraph” or “summarize this doc” or “find me every page that mentions X.” Custom Agents are the autonomous teammates that run the background rhythms.

    The mistake to avoid: trying to use Notion Agent for the background rhythms. It is not built for that. It runs when you ask. Custom Agents run when the world changes or when a schedule says so. Those are different tools for different jobs.


    Who this pattern is for

    To be clear about who gets the most out of the Notion Operating Company pattern:

    • Solo operators running real businesses. The leverage is highest here because there is no team to argue with about conventions. You decide the shape, you live in it.
    • Small teams (3–15 people) with a strong operational function. The pattern works if one person owns workspace architecture. It breaks if everyone can add databases and pages ad hoc with no one maintaining the structure.
    • Agencies and consultancies running multi-property operations. Anywhere you need to coordinate lots of parallel work and keep the whole portfolio legible to one or two humans.
    • Knowledge-heavy businesses. Law firms, research shops, content operations, advisory services. The operating company pattern rewards businesses where the value is produced by synthesis across prior work.

    Where the pattern fits less well: businesses where most of the work happens outside any tool (field services, physical retail, manufacturing floors). Notion can still run the management layer, but most of the actual operational data lives elsewhere.


    How to start without building a cathedral

The pattern I’ve described can sound like a project. It isn’t. Or rather, it can be — people build beautiful, elaborate versions for a year and never actually use them. The better path is embarrassingly small steps.

    Week one: build the Control Center. Just that page. Two screens long. Link to the databases you already have, even if they’re messy. The Control Center is the anchor; everything else will build against it.

    Week two: add one Custom Agent. Pick the simplest high-frequency job you do manually. The Triage Agent is a good first choice. Let it run for a week. Watch what it gets right. Adjust.

    Week three: add the Morning Brief Agent. This is the one that changes how your days open. If it works, you will know because opening Notion will stop feeling like work and start feeling like a starting line.

    Week four: look at your databases. The ones that matter will be obvious because the agents will be using them. The ones that don’t matter will be collecting dust. Delete or archive the dead ones. Formalize the live ones.

    After that, the pattern compounds. Each thing you do manually three times becomes a skill. Each repeated workflow becomes an agent. Each messy database gets cleaned when an agent trips on it. The workspace gets smarter as a function of use, not as a function of a weekend rebuild project.

    The operators I’ve seen succeed with this pattern have a specific characteristic in common: they started small and kept going. The operators I’ve seen fail had grand plans and never got to week four.


    What “AI-native business” actually means (if we have to use the phrase)

    The term “AI-native” gets thrown around enough to lose meaning. Inside this pattern, it means something specific.

    An AI-native business is one where AI is not a tool you pick up to accomplish a task. It is a teammate that is already in the workspace, already reading the state, already surfacing what matters, already handling the rhythms. The human is not using AI. The human is working with an operating company that has AI embedded into its substrate.

    That is what the Notion Operating Company pattern produces. Not a workspace that is faster because AI is speeding things up. A workspace that operates continuously because the AI is running inside it, and the human shows up to make the calls that only a human can make.

    This is why I wrote at the beginning that the version of Notion most people use and the version a smaller number have built are barely the same product anymore. They are not. They are two different conceptions of what a workspace is for, and in April 2026, one of them is still a place you put things, and the other is a place you run things.

    The whole game is picking the second one on purpose.


    FAQ

    What’s the difference between using Notion as a wiki and running an operating company on Notion? A wiki is where information lives after you’re done with it. An operating company is where the work actually happens — briefs, decisions, run reports, active projects, agents handling recurring rhythms. The operating company pattern treats Notion as a control plane, not an archive.

Do I need a Business or Enterprise plan? For Custom Agents, yes. Custom Agents require Notion’s Business or Enterprise plan. Notion Agent (the on-demand personal AI) is available to all Notion users. The operating-company pattern benefits substantially from Custom Agents, so most serious implementations are on Business or higher.

    How much does this cost to run? Custom Agents are free to try through May 3, 2026. Starting May 4, 2026, they use Notion Credits, available as an add-on for Business and Enterprise plans — $10 per 1,000 credits, shared across the workspace, reset monthly, no rollover. In practice, for a solo operator or small team running five or so agents, credit costs are modest. Budget discipline becomes relevant at larger scale.

    What AI models can the agents use? Currently available: Auto (Notion selects), Claude Sonnet, Claude Opus, and GPT-5. Notion regularly adds new models, so expect this list to evolve. Recent additions include cost-efficient models like Haiku 4.5 and GPT-5.4 Mini/Nano that can cut credit usage significantly.

    How secure is it? Custom Agents inherit your permissions, so they can see what you see. They offer page-level access control. Every agent run is logged with full audit trails. Notion has implemented guardrails to automatically detect potential prompt injection, and has built controls for admins and workspace owners to monitor connections and restrict what agents can access. The honest answer: reasonable security defaults, real attack surface, practical precautions apply (scope agents narrowly, audit connected sources).

    Can I run this pattern solo? Yes. Solo operators get the highest leverage from the operating-company pattern because there’s no team coordination overhead. The pattern scales down cleanly.

    What if I don’t want to use Custom Agents? Does the pattern still work? The database spine and Control Center work without agents. You’ll be doing manually what the agents would be doing — daily briefs, triage, weekly reviews. The pattern is still more legible than a traditional Notion setup; you just don’t get the “workspace operates itself between check-ins” effect.

    How long does it take to build? The honest answer is you never stop building. You never should. A workspace that stops evolving is a workspace that is about to stop working. But the minimum viable version — Control Center, one agent, a handful of databases — is a week of part-time work, not a project.


    A closing observation

    The reason this pattern is worth writing about now, in April 2026, is that the window where it is a genuine edge is probably short. Two years from now, some version of this will be the default way Notion is used, and the advantage will compress. Today, most workspaces are still wikis. The operators who make the switch to operating-company now are buying a year or two of operational leverage that becomes the baseline eventually.

    But for right now, this works, it is real, and almost nobody is doing it. That gap is the thing.

    If you are already running something like this, you know. If you are reading about it for the first time, the starting point is the Control Center and one agent. Build the Control Center this week. Add the agent next week. In a month, you’ll have a workspace that is a different kind of object than the one you started with.

    That’s what we mean by an operating company.



  • The CLAUDE.md Playbook: How to Actually Guide Claude Code Across a Real Project (2026)

    The CLAUDE.md Playbook: How to Actually Guide Claude Code Across a Real Project (2026)

    Most writing about CLAUDE.md gets one thing wrong in the first paragraph, and once you notice it, you can’t unsee it. People describe it as configuration. A “project constitution.” Rules Claude has to follow.

    It isn’t any of those things, and Anthropic is explicit about it.

    CLAUDE.md content is delivered as a user message after the system prompt, not as part of the system prompt itself. Claude reads it and tries to follow it, but there’s no guarantee of strict compliance, especially for vague or conflicting instructions. — Anthropic, Claude Code memory docs

    That one sentence is the whole game. If you write a CLAUDE.md as if you’re programming a machine, you’ll get frustrated when the machine doesn’t comply. If you write it as context — the thing a thoughtful new teammate would want to read on day one — you’ll get something that works.

    This is the playbook I wish someone had handed me the first time I set one up across a real codebase. It’s grounded in Anthropic’s current documentation (linked throughout), layered with patterns I’ve used across a network of production repos, and honest about where community practice has outrun official guidance.

    If any of this ages out, the docs are the source of truth. Start there, come back here for the operator layer.


    The memory stack in 2026 (what CLAUDE.md actually is, and isn’t)

    Claude Code’s memory system has three parts. Most people know one of them, and the other two change how you use the first.

    CLAUDE.md files are markdown files you write by hand. Claude reads them at the start of every session. They contain instructions you want Claude to carry across conversations — build commands, coding standards, architectural decisions, “always do X” rules. This is the part people know.

    Auto memory is something Claude writes for itself. Introduced in Claude Code v2.1.59, it lets Claude save notes across sessions based on your corrections — build commands it discovered, debugging insights, preferences you kept restating. It lives at ~/.claude/projects/<project>/memory/ with a MEMORY.md entrypoint. You can audit it with /memory, edit it, or delete it. It’s on by default. (Anthropic docs.)
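
Concretely, the layout looks something like this (the topic file is a hypothetical example; Claude names its own):

    ~/.claude/projects/my-app/memory/
    ├── MEMORY.md      # entrypoint; the first 200 lines load at every session start
    └── debugging.md   # a topic file Claude wrote for itself (hypothetical)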

    .claude/rules/ is a directory of smaller, topic-scoped markdown files — code-style.md, testing.md, security.md — that can optionally be scoped to specific file paths via YAML frontmatter. A rule with paths: ["src/api/**/*.ts"] only loads when Claude is working with files matching that pattern. (Anthropic docs.)

    The reason this matters for how you write CLAUDE.md: once you understand what the other two are for, you stop stuffing CLAUDE.md with things that belong somewhere else. A 600-line CLAUDE.md isn’t a sign of thoroughness. It’s usually a sign the rules directory doesn’t exist yet and auto memory is disabled.

    Anthropic’s own guidance is explicit: target under 200 lines per CLAUDE.md file. Longer files consume more context and reduce adherence.

    Hold that number. We’ll come back to it.


    Where CLAUDE.md lives (and why scope matters)

    CLAUDE.md files can live in four different scopes, each with a different purpose. More specific scopes take precedence over broader ones. (Full precedence table in Anthropic docs.)

    Managed policy CLAUDE.md lives at the OS level — /Library/Application Support/ClaudeCode/CLAUDE.md on macOS, /etc/claude-code/CLAUDE.md on Linux and WSL, C:\Program Files\ClaudeCode\CLAUDE.md on Windows. Organizations deploy it via MDM, Group Policy, or Ansible. It applies to every user on every machine it’s pushed to, and individual settings cannot exclude it. Use it for company-wide coding standards, security posture, and compliance reminders.

    Project CLAUDE.md lives at ./CLAUDE.md or ./.claude/CLAUDE.md. It’s checked into source control and shared with the team. This is the one you’re writing when someone says “set up CLAUDE.md for this repo.”

    User CLAUDE.md lives at ~/.claude/CLAUDE.md. It’s your personal preferences across every project on your machine — favorite tooling shortcuts, how you like code styled, patterns you want applied everywhere.

    Local CLAUDE.md lives at ./CLAUDE.local.md in the project root. It’s personal-to-this-project and gitignored. Your sandbox URLs, preferred test data, notes Claude should know that your teammates shouldn’t see.

    Claude walks up the directory tree from wherever you launched it, concatenating every CLAUDE.md and CLAUDE.local.md it finds. Subdirectories load on demand — they don’t hit context at launch, but get pulled in when Claude reads files in those subdirectories. (Anthropic docs.)

    A practical consequence most teams miss: in a monorepo, your parent CLAUDE.md gets loaded when a teammate runs Claude Code from inside a nested package. If that parent file contains instructions that don’t apply to their work, Claude will still try to follow them. That’s what the claudeMdExcludes setting is for — it lets individuals skip CLAUDE.md files by glob pattern at the local settings layer.
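
A minimal sketch of what that looks like, assuming the setting lives in your personal settings layer (e.g. .claude/settings.local.json) and takes an array of glob patterns. The glob itself is a placeholder; confirm the exact file against the docs:

    {
      "claudeMdExcludes": [
        "packages/other-team/**/CLAUDE.md"
      ]
    }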

    If you’re running Claude Code across more than one repo, decide now whether your standards belong in project CLAUDE.md (team-shared) or user CLAUDE.md (just you). Writing the same thing in both is how you get drift.


    The 200-line discipline

    This is the rule I see broken most often, and it’s the rule Anthropic is most explicit about. From the docs: “target under 200 lines per CLAUDE.md file. Longer files consume more context and reduce adherence.”

    Two things are happening in that sentence. One, CLAUDE.md eats tokens — every session, every time, whether Claude needed those tokens or not. Two, longer files don’t actually produce better compliance. The opposite. When instructions are dense and undifferentiated, Claude can’t tell which ones matter.

    The 200-line ceiling isn’t a hard cap. You can write a 400-line CLAUDE.md and Claude will load the whole thing. It just won’t follow it as well as a 180-line file would.

    Three moves to stay under:

    1. Use @imports to pull in specific files when they’re relevant. CLAUDE.md supports @path/to/file syntax (relative or absolute). Imported files expand inline at session launch, up to five hops deep. This is how you reference your README, your package.json, or a standalone workflow guide without pasting them into CLAUDE.md.

    See @README.md for architecture and @package.json for available scripts.
    
    # Git Workflow
    - @docs/git-workflow.md

    2. Move path-scoped rules into .claude/rules/. Anything that only matters when working with a specific part of the codebase — API patterns, testing conventions, frontend style — belongs in .claude/rules/api.md or .claude/rules/testing.md with a paths: frontmatter. They only load into context when Claude touches matching files.

    ---
    paths:
      - "src/api/**/*.ts"
    ---
    # API Development Rules
    
    - All API endpoints must include input validation
    - Use the standard error response format
    - Include OpenAPI documentation comments

    3. Move task-specific procedures into skills. If an instruction is really a multi-step workflow — “when you’re asked to ship a release, do these eight things” — it belongs in a skill, which only loads when invoked. CLAUDE.md is for the facts Claude should always hold in context; skills are for procedures Claude should run when the moment calls for them.
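
A sketch of that shape, using the SKILL.md convention (a directory per skill with a SKILL.md entrypoint). The release steps are invented for illustration:

    ---
    name: ship-release
    description: Use when asked to cut and ship a release
    ---
    # Shipping a release

    1. Run `pnpm test` and `pnpm run typecheck`; both must pass.
    2. Bump the version in package.json (semver; ask if the bump size is unclear).
    3. Update CHANGELOG.md from merged PRs since the last tag.
    4. Tag the commit `v<version>` and push the tag.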

    If you follow these three moves, a CLAUDE.md rarely needs to exceed 150 lines. At that size, Claude actually reads it.


    What belongs in CLAUDE.md (the signal test)

    Anthropic’s own framing for when to add something is excellent, and it’s worth quoting directly because it captures the whole philosophy in four lines:

    Add to it when:

    • Claude makes the same mistake a second time
    • A code review catches something Claude should have known about this codebase
    • You type the same correction or clarification into chat that you typed last session
    • A new teammate would need the same context to be productive — Anthropic docs

    The operator version of the same principle: CLAUDE.md is the place you write down what you’d otherwise re-explain. It’s not the place you write down everything you know. If you find yourself writing “the frontend is built in React and uses Tailwind,” ask whether Claude would figure that out by reading package.json (it would). If you find yourself writing “when a user asks for a new endpoint, always add input validation and write a test,” that’s the kind of thing Claude won’t figure out on its own — it’s a team convention, not an inference from the code.

    The categories I’ve found actually earn their place in a project CLAUDE.md:

    Build and test commands. The exact string to run the dev server, the test suite, the linter, the type checker. Every one of these saves Claude a round of “let me look for a package.json script.”

Non-obvious architecture. The thing a new teammate would need someone to explain. “This repo uses event sourcing — don’t write direct database mutations, emit events instead.” “We have two API surfaces, /public/* and /internal/*, and they have different auth requirements.”

    Naming conventions and file layout. “API handlers live in src/api/handlers/.” “Test files go next to the code they test, named *.test.ts.” Specific enough to verify.

    Coding standards that matter. Not “write good code” — “use 2-space indentation,” “prefer const over let,” “always export types separately from values.”

    Recurring corrections. The single most valuable category. Every time you find yourself re-correcting Claude about the same thing, that correction belongs in CLAUDE.md.

    What usually doesn’t belong:

    • Long lists of library choices (Claude can read package.json)
    • Full architecture diagrams (link to them instead)
    • Step-by-step procedures (skills)
    • Path-specific rules that only matter in one part of the repo (.claude/rules/ with a paths: field)
    • Anything that would be true of any project (that goes in user CLAUDE.md)

    Writing instructions Claude will actually follow

    Anthropic’s own guidance on effective instructions comes down to three principles, and every one of them is worth taking seriously:

    Specificity. “Use 2-space indentation” works better than “format code nicely.” “Run npm test before committing” works better than “test your changes.” “API handlers live in src/api/handlers/” works better than “keep files organized.” If the instruction can’t be verified, it can’t be followed reliably.

    Consistency. If two rules contradict each other, Claude may pick one arbitrarily. This is especially common in projects that have accumulated CLAUDE.md files across multiple contributors over time — one file says to prefer async/await, another says to use .then() for performance reasons, and nobody remembers which was right. Do a periodic sweep.

    Structure. Use markdown headers and bullets. Group related instructions. Dense paragraphs are harder to scan, and Claude scans the same way you do. A CLAUDE.md with clear section headers — ## Build Commands, ## Coding Style, ## Testing — outperforms the same content run together as prose.

    One pattern I’ve found useful that isn’t in the docs: write CLAUDE.md in the voice of a teammate briefing another teammate. Not “use 2-space indentation” but “we use 2-space indentation.” Not “always include input validation” but “every endpoint needs input validation — we had a security incident last year and this is how we prevent the next one.” The “why” is optional but it improves adherence because Claude treats the rule as something with a reason behind it, not an arbitrary preference.
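
Side by side, the difference reads like this (both versions invented for illustration):

    # Flat rule
    - Always use AppError for UI-facing errors.

    # Teammate briefing
    - Errors surfaced to the UI go through the AppError class in src/lib/errors.ts.
      We standardized on this after raw errors leaked stack traces to users;
      the error codes also let the frontend branch safely.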


    Community patterns worth knowing (flagged as community, not official)

    The following are patterns I’ve seen in operator circles and at industry events like AI Engineer Europe 2026, where practitioners share how they’re running Claude Code in production. None of these are in Anthropic’s documentation as official guidance. I’ve included them because they’re useful; I’m flagging them because they’re community-origin, not doctrine. Your mileage may vary, and Anthropic’s official behavior could change in ways that affect these patterns.

    The “project constitution” framing. Community shorthand for treating CLAUDE.md as the living document of architectural decisions — the thing new contributors read to understand how the project thinks. The framing is useful even though Anthropic doesn’t use the word. It captures the right posture: CLAUDE.md is the place for the decisions you want to outlast any individual conversation.

    Prompt-injecting your own codebase via custom linter errors. Reported at AI Engineer Europe 2026: some teams embed agent-facing prompts directly into their linter error messages, so when an automated tool catches a mistake, the error text itself tells the agent how to fix it. Example: instead of a test failing with “type mismatch,” the error reads “You shouldn’t have an unknown type here because we parse at the edge — use the parsed type from src/schemas/.” This is not documented Anthropic practice; it’s a community pattern that works because Claude Code reads tool output and tool output flows into context. Use with judgment.

    File-size lint rules as context-efficiency guards. Some teams enforce file-size limits (commonly cited: 350 lines max) via their linters, with the explicit goal of keeping files small enough that Claude can hold meaningful ones in context without waste. Again, community practice. The number isn’t magic; the discipline is.

    Token Leverage as a team metric. The idea that teams should track token spend ÷ human labor spend as a ratio and try to scale it. This is business-strategy content, not engineering guidance, and it’s emerging community discourse rather than settled practice. Take it as a thought experiment, not a KPI to implement by Monday.

    I’d rather flag these honestly than pretend they’re settled. If something here graduates from community practice to official recommendation, I’ll update.


    Enterprise: managed-policy CLAUDE.md (and when to use settings instead)

    For organizations deploying Claude Code across teams, there’s a managed-policy CLAUDE.md that applies to every user on a machine and cannot be excluded by individual settings. It lives at /Library/Application Support/ClaudeCode/CLAUDE.md (macOS), /etc/claude-code/CLAUDE.md (Linux and WSL), or C:\Program Files\ClaudeCode\CLAUDE.md (Windows), and is deployed via MDM, Group Policy, Ansible, or similar.

    The distinction that matters most for enterprise: managed CLAUDE.md is guidance, managed settings are enforcement. Anthropic is clear about this. From the docs:

    Settings rules are enforced by the client regardless of what Claude decides to do. CLAUDE.md instructions shape Claude’s behavior but are not a hard enforcement layer. — Anthropic docs

    If you need to guarantee that Claude Code can’t read .env files or write to /etc, that’s a managed settings concern (permissions.deny). If you want Claude to be reminded of your company’s code review standards, that’s managed CLAUDE.md. If you confuse the two and put your security policy in CLAUDE.md, you have a strongly-worded suggestion where you needed a hard wall.


The right mental model:

    • Block specific tools, commands, or file paths → managed settings (permissions.deny)
    • Enforce sandbox isolation → managed settings (sandbox.enabled)
    • Authentication method, organization lock → managed settings
    • Environment variables, API provider routing → managed settings
    • Code style and quality guidelines → managed CLAUDE.md
    • Data handling and compliance reminders → managed CLAUDE.md
    • Behavioral instructions for Claude → managed CLAUDE.md
    (Full table in Anthropic docs.)
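
And what the enforcement side looks like in practice: a minimal managed-settings sketch. The deny rules use Claude Code’s Tool(specifier) permission syntax; the specific paths are placeholders:

    {
      "permissions": {
        "deny": [
          "Read(./.env)",
          "Read(./secrets/**)",
          "Edit(/etc/**)"
        ]
      },
      "sandbox": {
        "enabled": true
      }
    }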

    One practical note: managed CLAUDE.md ships to developer machines once, so it has to be right. Review it, version it, and treat changes to it the way you’d treat changes to a managed IDE configuration — because that’s what it is.


    The living document problem: auto memory, CLAUDE.md, and drift

    The thing that changed most in 2026 is that Claude now writes memory for itself when auto memory is enabled (on by default since Claude Code v2.1.59). It saves build commands it discovered, debugging insights, preferences you expressed repeatedly — and loads the first 200 lines (or 25KB) of its MEMORY.md at every session start. (Anthropic docs.)

    This changes how you think about CLAUDE.md in two ways.

    First, you don’t need to write CLAUDE.md entries for everything Claude could figure out on its own. If you tell Claude once that the build command is pnpm run build --filter=web, auto memory might save that, and you won’t need to codify it in CLAUDE.md. The role of CLAUDE.md becomes more specifically about what the team has decided, rather than what the tool needs to know to function.

    Second, there’s a new audit surface. Run /memory in a session and you can see every CLAUDE.md, CLAUDE.local.md, and rules file being loaded, plus a link to open the auto memory folder. The auto memory files are plain markdown. You can read, edit, or delete them.

    A practical auto-memory hygiene pattern I’ve landed on:

    • Once a month, open /memory and skim the auto memory folder. Anything stale or wrong gets deleted.
    • Quarterly, review the CLAUDE.md itself. Has anything changed in how the team works? Are there rules that used to matter but don’t anymore? Conflicting instructions accumulate faster than you think.
    • Whenever a rule keeps getting restated in conversation, move it from conversation to CLAUDE.md. That’s the signal Anthropic’s own docs describe, and it’s the right one.

    CLAUDE.md files are living documents or they’re lies. A CLAUDE.md from six months ago that references libraries you’ve since replaced will actively hurt you — Claude will try to follow instructions that no longer apply.


    A representative CLAUDE.md template

    What follows is a synthetic example, clearly not any specific project. It demonstrates the shape, scope, and discipline of a good project CLAUDE.md. Adapt it to your codebase. Keep it under 200 lines.

    # Project: [Name]
    
    ## Overview
    Brief one-paragraph description of what this project is and who uses it.
    Link to deeper architecture docs rather than duplicating them here.
    
    See @README.md for full architecture.
    
    ## Build and Test Commands
    - Install: `pnpm install`
    - Dev server: `pnpm run dev`
    - Build: `pnpm run build`
    - Test: `pnpm test`
    - Type check: `pnpm run typecheck`
    - Lint: `pnpm run lint`
    
    Run `pnpm run typecheck` and `pnpm test` before committing. Both must pass.
    
    ## Tech Stack
    (Only list the non-obvious choices. Claude can read package.json.)
    - We use tRPC, not REST, for internal APIs.
    - Styling is Tailwind with a custom token file at `src/styles/tokens.ts`.
    - Database migrations via Drizzle, not Prisma (migrated in Q1 2026).
    
    ## Directory Layout
    - `src/api/` — tRPC routers, grouped by domain
    - `src/components/` — React components, one directory per component
    - `src/lib/` — shared utilities, no React imports allowed here
    - `src/server/` — server-only code, never imported from client
    - `tests/` — integration tests (unit tests live next to source)
    
    ## Coding Conventions
    - TypeScript strict mode. No `any` without a comment explaining why.
    - Functional components only. No class components.
    - Imports ordered: external, internal absolute, relative.
    - 2-space indentation. Prettier config in `.prettierrc`.
    
    ## Conventions That Aren't Obvious
    - Every API endpoint validates input with Zod. No exceptions.
    - Database queries go through the repository layer in `src/server/repos/`. 
      Never import Drizzle directly from route handlers.
    - Errors surfaced to the UI use the `AppError` class from `src/lib/errors.ts`.
      This preserves error codes for the frontend to branch on.
    
    ## Common Corrections
    - Don't add new top-level dependencies without discussing first.
    - Don't create new files in `src/lib/` without checking if a similar 
      utility already exists.
    - Don't write tests that hit the real database. Use the test fixtures 
      in `tests/fixtures/`.
    
    ## Further Reading
    - API design rules: @.claude/rules/api.md
    - Testing conventions: @.claude/rules/testing.md
    - Security: @.claude/rules/security.md

    That’s roughly 70 lines. Notice what it doesn’t include: no multi-step procedures, no duplicated information from package.json, no universal-best-practice lectures. Every line is either a command you’d otherwise re-type, a convention a new teammate would need briefed, or a pointer to a more specific document.


    When CLAUDE.md still isn’t being followed

    This happens to everyone eventually. Three debugging steps, in order:

    1. Run /memory and confirm your file is actually loaded. If CLAUDE.md isn’t in the list, Claude isn’t reading it. Check the path — project CLAUDE.md can live at ./CLAUDE.md or ./.claude/CLAUDE.md, not both, not a subdirectory (unless Claude happens to be reading files in that subdirectory).

    2. Make the instruction more specific. “Write clean code” is not an instruction Claude can verify. “Use 2-space indentation” is. “Handle errors properly” is not an instruction. “All errors surfaced to the UI must use the AppError class from src/lib/errors.ts” is.

    3. Look for conflicting instructions. A project CLAUDE.md saying “prefer async/await” and a .claude/rules/performance.md saying “use raw promises for hot paths” will cause Claude to pick one arbitrarily. In monorepos this is especially common — an ancestor CLAUDE.md from a different team can contradict yours. Use claudeMdExcludes to skip irrelevant ancestors.

    If you need guarantees rather than guidance — “Claude cannot, under any circumstances, delete this directory” — that’s a settings-level permissions concern, not a CLAUDE.md concern. Write the rule in settings.json under permissions.deny and the client enforces it regardless of what Claude decides.


    FAQ

    What is CLAUDE.md? A markdown file Claude Code reads at the start of every session to get persistent instructions for a project. It lives in a project’s source tree (usually at ./CLAUDE.md or ./.claude/CLAUDE.md), gets loaded into the context window as a user message after the system prompt, and contains coding standards, build commands, architectural decisions, and other team-level context. Anthropic is explicit that it’s guidance, not enforcement. (Source.)

    How long should a CLAUDE.md be? Under 200 lines. Anthropic’s own guidance is that longer files consume more context and reduce adherence. If you’re over that, split with @imports or move topic-specific rules into .claude/rules/.

    Where should CLAUDE.md live? Project-level: ./CLAUDE.md or ./.claude/CLAUDE.md, checked into source control. Personal-global: ~/.claude/CLAUDE.md. Personal-project (gitignored): ./CLAUDE.local.md. Organization-wide (enterprise): /Library/Application Support/ClaudeCode/CLAUDE.md (macOS), /etc/claude-code/CLAUDE.md (Linux/WSL), or C:\Program Files\ClaudeCode\CLAUDE.md (Windows).

    What’s the difference between CLAUDE.md and auto memory? CLAUDE.md is instructions you write for Claude. Auto memory is notes Claude writes for itself across sessions, stored at ~/.claude/projects/<project>/memory/. Both load at session start. CLAUDE.md is for team standards; auto memory is for build commands and preferences Claude picks up from your corrections. Auto memory requires Claude Code v2.1.59 or later.

    Can Claude ignore my CLAUDE.md? Yes. CLAUDE.md is loaded as a user message and Claude “reads it and tries to follow it, but there’s no guarantee of strict compliance.” For hard enforcement (blocking file access, sandbox isolation, etc.) use settings, not CLAUDE.md.

    Does AGENTS.md work for Claude Code? Claude Code reads CLAUDE.md, not AGENTS.md. If your repo already uses AGENTS.md for other coding agents, create a CLAUDE.md that imports it with @AGENTS.md at the top, then append Claude-specific instructions below.
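
The shape of that file is trivial (illustrative):

    @AGENTS.md

    ## Claude-specific notes
    - Anything that applies only to Claude Code goes below this line.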

    What’s .claude/rules/ and when should I use it? A directory of smaller, topic-scoped markdown files that can optionally be scoped to specific file paths via YAML frontmatter. Use it when your CLAUDE.md is getting long or when instructions only matter in part of the codebase. Rules without a paths: field load at session start with the same priority as .claude/CLAUDE.md; rules with a paths: field only load when Claude works with matching files.

    How do I generate a starter CLAUDE.md? Run /init inside Claude Code. It analyzes your codebase and produces a starting file with build commands, test instructions, and conventions it discovers. Refine from there with instructions Claude wouldn’t discover on its own.


    A closing note

    The biggest mistake I see people make with CLAUDE.md isn’t writing it wrong — it’s writing it once and forgetting it exists. Six months later it references libraries they’ve since replaced, conventions that have since shifted, and a team structure that has since reorganized. Claude dutifully tries to follow instructions that no longer apply, and the team wonders why the tool seems to have gotten worse.

    CLAUDE.md is a living document or it’s a liability. Treat it the way you’d treat a critical piece of onboarding documentation, because functionally that’s exactly what it is — onboarding for the teammate who shows up every session and starts from zero.

    Write it for that teammate. Keep it short. Update it when reality shifts. And remember the part nobody likes to admit: it’s guidance, not enforcement. For anything that has to be guaranteed, reach for settings instead.


    Sources and further reading

    Community patterns referenced in this piece were reported at AI Engineer Europe 2026 and captured in a session recap. They represent emerging practice, not Anthropic doctrine.

  • The Clean Tool: Why I Keep My Claude Empty of the People I Love

    The Clean Tool: Why I Keep My Claude Empty of the People I Love

    A flagship essay on AI hygiene: what to store, what to keep out, and how to have the conversation about it with the people in your life.

    “What do you know about my girlfriend?”

    Last night my partner Stef asked me a question she had a right to ask. She wanted to know what my AI knew about her.

I use Claude for hours a day. I run an agency on top of it. I have knowledge bases, project contexts, client stacks, and conversation histories going back years. She watched me work on the thing enough to assume that by now, surely, the AI had a rich picture of her — her sense of humor, her work, the shape of our relationship, the running jokes, the small details a partner remembers. She handed me my phone as a test of it. Let it tell me what it knows.

    The answer was almost nothing.

    My name for her. That she lives here. A few passing references to a Notion chat room she once set up, a voice memo she sent me that we extracted some thinking from. No sense of who she is as a person. No running joke the model could finish. No model of her at all, really.

    She was hurt in a flash, the way you get hurt by something that isn’t an injury but is still information. I was quietly proud, in a way I didn’t know how to explain in the moment. Both reactions were correct. That’s the thing I want to write about here — that the gap between her hurt and my pride is the shape of a whole category of questions almost nobody is asking out loud yet, and it is only going to get bigger.

    We talked about it for a while. I tried to explain why the tool was empty of her on purpose. She let me try. And what came out of the conversation was the argument I’m about to make, which I’ll phrase in one sentence up front so you can decide whether to keep reading:

    Keeping the people you love out of your AI is not forgetting them. It’s a specific kind of care. And the conversation you have about why they’re not in there is how you close the gap between what the tool knows and what the relationship deserves.

    If that sentence lands at all, the rest of this is the why, the how, and the honest version of what I’m still getting wrong.

    AI Memory Is Nuclear Power

    Here’s the frame that has organized my thinking on this for the last year.

    AI memory is nuclear power. Real civilization-scale utility on one side, real civilization-scale danger on the other, and almost nobody I’ve met is running a containment protocol worthy of the payload they’re storing.

    The analogy holds all the way down. The fuel is useful because it’s concentrated — that’s the whole point of a persistent memory that remembers your business, your family, your finances, your health, your history. Concentration is what makes the tool powerful. Concentration is also exactly what makes a spill catastrophic. And the people celebrating the new reactor are almost never the people thinking about the waste.

    The honest position on this, I’ve come to believe, is neither abstinence nor maximalism. It’s containment engineering. You build the reactor and the shielding. You use the tool and you design the protocol for when the tool fails. Pro-AI and pro-guardrail are the same position. Anyone telling you to choose one is selling you something.

    What makes this hard is that the stakes are asymmetric in a way most people never sit with directly. For the platform, your memory is one row in a table of billions — a single unit of risk distributed across a huge population. For you, your memory is a map of your life. The platform’s worst-case scenario is a rough quarter, a settlement, a bad headline. Your worst-case scenario is a destroyed marriage, a leaked client list, a legal catastrophe, a career-ending screenshot. These are not remotely comparable events, and they don’t scale the same way, and they do not reach any kind of equilibrium where the platform’s good-faith security policy protects the individual worst case. The platform is optimizing for its risk profile. Its risk profile is not yours. You are the only person whose worst-case scenario is your worst-case scenario.

    That asymmetry is why individual hygiene matters even when platform security is genuinely excellent. It’s why I don’t think this conversation is paranoid and I don’t think it’s solved and I don’t think you can outsource it.

    Three Failure Modes. Which One Are You?

    Most people running AI at any real depth fall into one of three failure modes, and most of them don’t know which one they’re in. Before I tell you what any of them are, I want you to place yourself while you read.

    The over-loader. This is the person who treats the AI as a second brain and dumps everything into it — credentials, relationships, grievances, client details, medical history, the long rambling voice-memo of what happened at Thanksgiving. It feels like investment. It feels like the tool getting smarter about them. It mostly is. But it also means one breach, one nosy partner, one subpoena, one bad exit from the platform turns the tool into a weapon pointed directly at the user. The over-loader’s failure mode is invisible until it isn’t.

    The under-loader. This is the person who keeps the tool so sterile it never reaches its potential — which is fine as far as it goes, except the humans in their life often discover, usually by accident, that they aren’t in the context at all. That discovery doesn’t land as safety. It lands as erasure. The under-loader’s failure mode is relational, not technical. The tool stays clean, and the relationships pay the cost the tool should have paid.

    The unaware. This is, honestly, most people. No mental model of what’s stored, where, for how long, or under whose policy. They’re making operational decisions — business decisions, relationship decisions, identity decisions — on top of a foundation they have never inspected. They don’t know their AI has memory in six places, not one. They don’t know where the off switch is. They assume chat history is the whole story when chat history is maybe 20 percent of it.

    The first hygiene move is always the same: figure out which mode you default to. Over-loaders need to prune. Under-loaders need to have a conversation with the humans they’ve been quietly protecting without telling them. The unaware need to spend thirty minutes mapping what they’ve actually agreed to.

    I’ve been all three at different points. Most operators I respect have been too. The point of the diagnostic isn’t to shame. It’s to make the failure mode visible enough that you can actually work on it.

    Clean Tool vs. Second Brain: The Choice You Might Not Know You’re Making

    There are two coherent philosophies for how to use AI at depth, and they are genuinely in tension.

    The Clean Tool approach says: the AI is an instrument. You keep it sharp by keeping it empty of identity. You bring the context you need into each session, do the work, and let the session close without leaving a permanent residue of who you were that day. The AI is like a great chef’s knife — it serves you best when it is exactly what it is, not a repository of everything you’ve ever cut with it.

    The Second Brain approach says: the AI is an extension of cognition. The more of you it holds, the more it can do for you. The payoff scales with the investment. Loading your thinking, your projects, your relationships, your patterns into the model is not a liability — it’s the whole point. You are building a partner that knows you well enough to anticipate you. The AI is like a lifelong collaborator who has read every note you ever took.

    Both are legitimate. Both have failure modes. The failure mode of the Clean Tool is that you never reach the depth of partnership that made you interested in AI in the first place — you end up with a very sharp instrument and no deep relationship with the work it enables. The failure mode of the Second Brain is that you build something you cannot leave, cannot audit, and cannot defend if it ever gets read by the wrong person.

    I run Clean Tool. I should say that plainly. I do not believe it is the only right answer. I believe it is the right answer for how I work, what I work on, and who the people around me are. My work touches client data, confidential business strategy, and a personal life I want to keep intact. The cost of a Second Brain leak, for me, is catastrophic in a way I cannot price. The cost of the Clean Tool is friction — I reload context more often, I carry more of my own thinking in my own head, I refuse some of the tool’s offers of recall. That friction is the price of sleeping well.

    I know thoughtful people who run Second Brain and run it well. They’ve built containment around it. They accept different tradeoffs. The worst place to be is the one most users actually occupy, which is a confused middle — enough invested that the data layer has weight, not enough discipline that the containment is real. You get the downsides of both and the upsides of neither.

    So if you take one frame from this piece: the choice isn’t which philosophy is correct. The choice is which one you are running, consciously, with the guardrails appropriate to that choice. Drifting into either by accident is what produces the failure modes nobody wants.

    The People Not in the Memory

    I want to go back to Stef, because this is the part of the piece that matters most to me and I’m not sure I’d trust anyone else to write it the way I need to write it.

    When Stef was hurt that the AI didn’t know her, I understood what she was feeling. The intuition beneath the hurt is simple and very human: you spend hours every day with this thing. It’s your work, your thinking, your hours. If you cared about me the way you care about the work, surely some of that care would show up in the tool. That intuition is not wrong in its values. It’s wrong in its mechanics.

    AI proximity is not relational proximity. Time-on-tool is the worst possible proxy for trust. A person can spend ten hours a day with an AI and share less of themselves with it than they share in a two-minute phone call with their sister. The tool is near you. It is not close to you. These words are not synonyms and they never have been, and the confusion of them is producing a whole new species of interpersonal hurt that our language doesn’t have good words for yet.

    Here is what I believe about the people in my life and my AI’s memory. Stef is not in the tool because she does not need to be in the tool for the tool to do its job. She matters because she is a person, not because the system has modeled her. Putting her in the context would not deepen my relationship with her. It would reduce her to a row in a store I don’t fully control, governed by a policy I did not write, subject to a retention schedule I did not negotiate, accessible to whoever eventually gets to see my session — a partner who leaves, a discovery motion, a breach, a curious kid, a future version of the platform with different terms. None of those futures are certain. All of them are possible. The cost of her being in there, in any of those futures, is hers to pay, not mine.

    And I love her. So she is not in there. That is the mechanism.

    The thing I couldn’t explain to her in the moment, but want to say here, is that the emptiness isn’t neglect. It’s restraint. It’s the same impulse that makes me not tell certain stories at parties even when they’d get a laugh, because they are hers to tell. It’s the same impulse that makes me lock my phone when I step away, even though the odds that anything bad happens in the next ninety seconds are vanishingly small. It’s the practice of treating the people you love as if their information is theirs, which is the simplest expression of respect I know.

    The conversation we had after her hurt was the actual repair. I told her why the tool was empty of her. I told her what was in the tool and what wasn’t. I offered to show her my memory settings, my projects, my contexts — not as a defensive move, but as a matter of domestic transparency. She didn’t take me up on it. The offer was enough. What closed the gap wasn’t the tool changing. It was me being able to say, out loud, you are not in there because I love you, and here is what I mean by that.

    If you use AI at the depth I do and you have people in your life, I think you owe them some version of that conversation. It is not a hard conversation. It is mostly just a clarifying one. But it has to actually happen. The gap between what your tool contains and what your relationship deserves does not close on its own.

    The Containment You Can Install Tonight

    After five sections of framing, you deserve something to do. Here are five moves. None takes more than fifteen minutes. All five together take about an hour. If this is the only section of the piece you act on, you will be meaningfully safer tonight than you were this morning.

    Read your memory. Open whatever interface your AI gives you for stored memories — Claude’s memory settings, ChatGPT’s memory panel, whichever surface your platform exposes. Read every entry top to bottom. For each one, ask three questions: is this still true, is this still relevant, would I be comfortable if this leaked tomorrow? Anything that fails any of the three gets deleted or rewritten. Most people have never read their own AI memory end to end. Doing it once is often the moment the rest of this starts to feel real.

    Map the six surfaces. The chat history is not the whole memory. The whole memory is scattered across at least six surfaces: conversation history, persistent memory features, project knowledge bases, custom instructions, system prompts, and connected integrations (Drive, email, Notion, Slack). Each has a different retention policy. Each has a different surface for deletion. No single UI shows you the total picture. Sit down once and write out, for your specific AI stack, where all six surfaces live for you. This is a twenty-minute exercise that will clarify more than any article could.

    Scope your projects. Stop running one giant context that holds everything. Split into scoped projects — one for client work, one for personal writing, one for household, one for finance if you use it that way. Each project holds only the context it needs. The blast radius of any single compromise stays inside that one project. This is the same least-privilege principle engineers use for software access, applied to context.

    Lock the handoff. The threat model that matters for most individual users is not a sophisticated hacker. It’s the moment someone else touches your unlocked device — a partner borrowing the phone, a kid looking for the calculator, a colleague glancing at your screen, a support agent on a screenshare. Install a short, specific protocol: screen lock by default, session close on context switch, and a named practice for what happens when someone else uses your device. The worst leaks come from the most ordinary moments. Plan for those, not for the movie villain.

    Rotate what the AI has seen. Every credential that has ever appeared in an AI context — API key, password, token, connection string — goes on a rotation schedule the moment it enters. A ninety-day calendar reminder at minimum. Ideally, credentials never enter the AI directly at all; they live in a secrets manager and the AI calls through a proxy that holds the secret. Moving from the first version to the second is one afternoon of plumbing, and it is the single highest-leverage hygiene move an operator can make.
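If you want the shape of that plumbing, here is a minimal sketch in Python. Everything named here is hypothetical (the endpoint, the tool name, the environment variable); the point is that the credential lives with the proxy and never enters the AI's context:

    import os
    import requests

    def handle_tool_call(name: str, args: dict) -> dict:
        """Run a tool call on the AI's behalf; the model sees inputs and outputs, never the key."""
        if name == "fetch_invoice":  # hypothetical tool
            api_key = os.environ["BILLING_API_KEY"]  # injected here, from your secrets store
            resp = requests.get(
                f"https://billing.example.com/invoices/{args['invoice_id']}",  # hypothetical endpoint
                headers={"Authorization": f"Bearer {api_key}"},
                timeout=10,
            )
            resp.raise_for_status()
            return resp.json()
        raise ValueError(f"unknown tool: {name}")

The AI asks for an invoice by ID; the proxy holds the key. Rotating the credential now means updating one environment variable, not scrubbing contexts.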

    These are not the whole practice. They are the starter kit. The practice compounds from here.

    The Harder Layer: What I’m Still Getting Wrong

    I want to write this section honestly because the alternative is writing it dishonestly, and there is no version of this piece that earns its argument if I pretend Tygart Media has this figured out.

    So. Here are some real mistakes.

    Earlier this month, the AI stack I use to automate WordPress work made an edit to a client site page without the kind of per-page human confirmation the situation deserved. The edit broke three live pages. The client was patient about it. The rollback worked. No business was lost. But the near-miss had the exact shape of the failure mode this whole piece warns about — capability ran ahead of containment, and a system I trusted made a change faster than my judgment could intervene. The lesson was immediate and I installed the guardrail that afternoon: any live-system action on a high-risk surface now requires explicit per-action confirmation. Read-only actions can run free. Destructive or irreversible actions cannot. The rule sounds obvious stated plainly. It was not in place before the near-miss, and that is on me.

    I have also, at various points, let credentials linger in AI contexts longer than I should have. Not dramatically. Not catastrophically. But in the honest audit I did after the incident above, there were tokens in project files older than the rotation schedule I would tell a client to use. I rotated them. I built the proxy pattern I should have built a year ago. I am closer to clean than I was, and I am not fully there yet.

    There is a reason most operators don’t write sections like this one. The near-miss is pedagogically priceless and professionally embarrassing at the same time. The embarrassment is why the field learns slowly. The honesty, when someone offers it, is the most valuable content in the space — and it is almost never offered, because the incentive structure rewards the polished version over the useful one.

    I am publishing this section anyway because I think the embarrassment is a smaller cost than the slow-learning tax the whole field pays when operators hide their misses. And because an article about hygiene that pretends its author doesn’t sweat is not an article I’d trust from anyone else. If you run AI at operator depth long enough, you will produce near-misses. Whether you learn publicly or privately is the only variable. I’d rather learn where it helps someone else avoid the same move.

    The 2030 View

    If everything in this piece feels a little optional in 2026, project the variables forward and see if the math still works.

    Memory depth is going up, not down — meaningfully, as context windows expand and persistence shifts from opt-in to default. Cross-app memory is already arriving; by 2030 your AI will know what’s in your email and your calendar and your files and your shopping history and your health app, not as separate silos but as a fused picture. Agent autonomy is arriving faster than most people realize — the AI is moving from a thing you consult to a thing that acts on your behalf, which means the containment question shifts from “what does it know” to “what can it do.” Shared household AI layers are arriving, with multiple family members on the same account already common enough that the consent problem stops being individual and becomes governance. And the legal system will catch up to all of this, unevenly, painfully, and in ways you will not want to be the test case for.

    Every problem in this article compounds under those conditions. The over-loader’s blast radius grows. The under-loader’s relational gap widens. The unaware’s foundation gets shakier. The recipes that take an hour now will take a day then. The containment practices that feel precious today will feel obvious in five years, the way locking your front door and not leaving your wallet in the car feel obvious now.

    There will be a public catastrophe. I don’t know whose. I don’t know whether it will be a major breach, a lurid divorce, a criminal discovery, or a platform failure that rewrites retention terms mid-flight. I know it will happen and I know it will reorganize how the rest of us think about this overnight. The people who built the practice before that moment will look prescient. They won’t have been prescient. They’ll have been paying attention.

    I would rather pay attention now, while the stakes are small and the mistakes are cheap, than learn after the public catastrophe when the mistakes are not.

    The Close

    Everything in this piece argues for one small idea.

    The tool is a tool. The person is a person. The hygiene is what keeps those two categories from collapsing into each other.

    When the tool becomes a stand-in for cognition, memory, identity, or intimacy, it has exceeded what it was ever built to do, and the human pays the cost. When the person becomes a user-of-tools who still owns their own thinking, relationships, and responsibility, the tool does what tools are supposed to do — extend capacity without replacing character.

    Every practical move in this article is a local case of that single principle. Every hygiene conversation in your life is an application of it. Every guardrail you install is the same principle, written down.

    And the practice compounds or decays. Six months of deliberate attention makes the moves automatic. Six months of neglect means the muscle memory isn’t there when you need it. This is not a project you complete. It is a standing practice you keep, like locking the door, like reviewing your accounts, like calling the people you love.

    Do one thing tonight. Read your memory. Map your surfaces. Call the person in your life your AI doesn’t know about and tell them why you kept it that way. Any of those. Whichever one feels least comfortable is probably the right one to do first.

    The tool is a tool. The person is a person. The hygiene is what keeps them from becoming each other.

    Start there.

  • Task Budgets, xhigh, and the 2,576px Vision Ceiling: Opus 4.7’s Most Interesting Features Explained

    Task Budgets, xhigh, and the 2,576px Vision Ceiling: Opus 4.7’s Most Interesting Features Explained

    Model Accuracy Note — Updated May 2026

    Current flagship: Claude Opus 4.7 (claude-opus-4-7). Current models: Opus 4.7 · Sonnet 4.6 · Haiku 4.5. Claude Opus 4.6 referenced in this article has been superseded. See current model tracker →

    What this article covers

    Three features in Opus 4.7 deserve their own explanation because they change what’s actually possible in daily work, not just what’s bigger on a benchmark chart:

    1. Task budgets (beta) — per-subtask ceilings that tame agent cost variance.
    2. The xhigh effort level — the new reasoning-control setting between high and max.
    3. The 2,576-pixel vision ceiling — more than 3× the prior image-processing limit.

    Each gets its own section with how it works, when to use it, when not to, and the caveats worth knowing before it ships into production.


    Feature 1: Task budgets (beta)

What it is. A new system for scoping the resources an agent spends across a multi-turn agentic loop. Instead of setting one thinking budget for a single turn, you declare budgets — in tokens or tool calls — that span the whole loop, and the agent plans its work against them.

    The problem it solves. Agent runs have notoriously high cost variance. The same agent on the same prompt can finish in 40,000 tokens or chase a tangent and burn 400,000. Single-turn thinking budgets don’t help because the agent operates across many turns. Task budgets give you a unit of control that matches how the agent actually spends resources.

    How the agent uses them. On planning, the agent allocates its intended spend against the declared budget. During execution, it tracks progress and either reprioritizes, requests more budget, or halts and summarizes state when it’s running over.

    Behavior note: budgets are soft, not hard. The agent is nudged to respect them, not hard-cut. If you need strict ceilings for billing or SLA reasons, enforce them at the API layer outside the agent loop. Task budgets are for behavior shaping, not hard resource limiting.
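To make the distinction concrete, a hard ceiling at the API layer can be as simple as the caller counting usage itself. A minimal sketch, assuming the Python SDK; the loop structure and the limits are illustrative, not an SDK feature:

    import anthropic

    client = anthropic.Anthropic()

    HARD_OUTPUT_TOKEN_CEILING = 200_000  # hard cap for the whole run, illustrative
    HARD_TOOL_CALL_CEILING = 50

    def run_capped_agent(history: list) -> None:
        spent_tokens, tool_calls = 0, 0
        while True:
            response = client.messages.create(
                model="claude-opus-4-7",
                max_tokens=8192,
                messages=history,
            )
            spent_tokens += response.usage.output_tokens
            tool_calls += sum(1 for block in response.content if block.type == "tool_use")
            if spent_tokens > HARD_OUTPUT_TOKEN_CEILING or tool_calls > HARD_TOOL_CALL_CEILING:
                raise RuntimeError("hard budget exceeded; run halted at the API layer")
            if response.stop_reason != "tool_use":
                return  # the agent finished within its ceilings
            # ... execute the tool calls and append results to history here ...

The enforcement lives outside the model entirely, which is what makes it provable.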

    When to use them.
    – Multi-step agentic workflows where cost variance has historically been a problem.
    – Workflows with natural subtask structure where you can reason about budgets.
    – Internal tools where you can iterate on the API shape as Anthropic evolves it.

    When not to use them.
    – Simple single-turn requests. Task budgets are overhead that doesn’t pay off on short interactions.
    – Production contracts that are painful to version. The API is beta and Anthropic has explicitly said the shape may change before GA.
    – Workflows where you need provable hard cutoffs. Enforce those at the API layer, not via this feature.

    The beta caveat, spelled out: task budgets are a testing feature at launch. Parameter names and shape may change. Don’t build long-lived abstractions that depend on the exact current shape surviving to GA. Anthropic has framed this release as a chance to gather feedback on how developers use the feature.


    Feature 2: The xhigh effort level

What it is. A new setting for reasoning effort, slotted between high and max. Opus 4.6 had four levels: low, medium, high, max. Opus 4.7 adds xhigh, making five: low, medium, high, xhigh, max.

    Why it exists. Anthropic’s framing in the release materials: xhigh gives users “finer control over the tradeoff between reasoning and latency on hard problems.” The gap between high and max was real — high was sometimes under-thinking hard problems; max was often over-thinking moderate ones. xhigh smooths the curve by giving you a setting that’s more thoughtful than high without the runaway token budget of max.

    Anthropic’s own guidance. “When testing Opus 4.7 for coding and agentic use cases, we recommend starting with high or xhigh effort.” That’s a direct recommendation to make xhigh part of your default rotation for serious work, not a niche escalation.

    How to use it.
    – Keep high as the default for routine work.
    – Use xhigh as the new first-choice escalation when high isn’t quite getting there — or start there for coding and agentic tasks per Anthropic’s recommendation.
    – Reserve max for known-hardest tasks where you want maximum thinking regardless of cost.
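At the API level, choosing a level is a one-parameter change. A minimal sketch, assuming the Python SDK and the effort parameter name described in the FAQ below; the consumer apps surface this through the model switcher instead:

    import anthropic

    client = anthropic.Anthropic()

    # Escalate a hard task to xhigh; nothing else about the call changes.
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=8192,
        effort="xhigh",  # one of: low, medium, high, xhigh, max
        messages=[{"role": "user", "content": "Audit this module for concurrency bugs: ..."}],
    )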

    Important tradeoff. Higher effort levels in 4.7 produce more output tokens than the same levels did in 4.6. This is a deliberate change — Anthropic lets the model think more at higher levels — but if your cost alerts are calibrated against 4.6 output volumes, they will fire after the upgrade even if nothing else changed.

    An API note worth flagging. Opus 4.7 removed the extended thinking budget parameter that existed in 4.6. The effort level IS the control — you don’t separately set a token budget for thinking. If your 4.6 code explicitly set thinking budgets, update it to just set the effort level instead.

    xhigh is available via API, Bedrock, Vertex AI, and Microsoft Foundry. On Claude.ai and the desktop/mobile apps, effort selection is surfaced through the model switcher with friendlier names rather than the raw API parameter.


    Feature 3: The 2,576-pixel vision ceiling

    What changed. Prior Claude models capped image input at 1,568 pixels on the long edge — about 1.15 megapixels. Opus 4.7 processes images up to 2,576 pixels on the long edge — about 3.75 megapixels, more than 3× the prior pixel budget.

    Why this matters more than it sounds. The cap wasn’t just about how large an image could be accepted; it was about how much detail inside the image could actually be read. Under the old 1.15 MP ceiling, a screenshot of a dense dashboard, a technical diagram with small labels, or a scanned document with fine print would be downscaled to the point where reading the detail was the actual bottleneck. 4.7 removes that bottleneck for images up to the new ceiling.

    Coordinate mapping is now 1:1. This is a separate but related change. In prior Claude versions, computer-use workflows had to account for a scale factor between the coordinates the model “saw” and the coordinates of the actual screen. On Opus 4.7, the model’s coordinate output maps 1:1 to actual image pixels. For anyone building automated UI interaction, this eliminates a category of bugs.

    What this enables that 4.6 struggled with:

    • Dense UI screenshots. Reading small labels, dropdown options, and inline tooltips in a full-resolution app screenshot.
    • Technical diagrams. Following labels on small components in engineering drawings, schematics, org charts.
    • Scanned documents. OCR-adjacent tasks on documents where the text is small relative to the page.
    • Chart details. Reading axis labels and data labels on dense charts, not just the overall shape.
    • Multi-panel content. Comics, infographics, and documents with small type in multiple zones.
    • Pointing, measuring, counting. Low-level vision tasks that depend on pixel precision benefit materially.
    • Bounding-box detection. Image localization tasks show clear gains.

    What it doesn’t change.

    • Images beyond 2,576px still get downscaled to the ceiling. The ceiling is higher; it’s not gone.
    • Video frames are handled differently and aren’t covered by this change.
    • Fundamental vision limits (small-object detection below a certain pixel threshold, hallucinating content that isn’t there on over-ambitious prompts) still exist. More pixels ≠ omniscience.

    Pricing and token cost. Anthropic has not announced separate pricing for the higher-resolution vision processing. Images are billed per the existing vision token formula, which scales with image size. Larger images cost more tokens; that’s not new. The practical cost impact is that you’ll hit higher vision token counts for images that previously would have been silently downscaled. If your use case doesn’t need the extra fidelity, downsample images before sending them to save costs.
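A minimal pre-upload guard, assuming Pillow; the constant mirrors the documented long-edge ceiling:

    from PIL import Image

    LONG_EDGE_CEILING = 2576  # Opus 4.7's processing ceiling on the long edge, in pixels

    def cap_long_edge(src_path: str, dst_path: str) -> None:
        """Downscale an image so its long edge fits the ceiling; no-op if already within it."""
        img = Image.open(src_path)
        long_edge = max(img.size)
        if long_edge > LONG_EDGE_CEILING:
            scale = LONG_EDGE_CEILING / long_edge
            img = img.resize(
                (round(img.width * scale), round(img.height * scale)),
                Image.LANCZOS,
            )
        img.save(dst_path)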

    How to use it.

    Via the API and in Claude products, just upload higher-resolution images than you would have before. No special parameter. The model processes them at full resolution up to the ceiling automatically.

    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {...}},  # up to 2576px long edge
                {"type": "text", "text": "Extract the values from the chart."},
            ],
        }],
    )
    

    A caveat worth noting. The 2,576px ceiling is the processing ceiling. Client-side size limits (file size, API request size) still apply. Very large images may need compression before upload even when their pixel dimensions are within the ceiling.


    How these three features compose

    The three features aren’t independent. For agentic coding work in particular, they compose in ways that matter.

    A practical workflow: an agent reviewing a UI bug gets a screenshot of the bug state (vision at 2,576px captures the detail), thinks about it at xhigh effort (enough reasoning without max’s overhead), and runs under a task budget that caps how much it can spend on this particular investigation before escalating or returning. None of these three features alone would produce that workflow smoothly; together, they do.

    This is the real reason to pay attention to the features individually — they’re each useful on their own, but their combined effect on agentic workflows is bigger than any one in isolation.


    Frequently asked questions

    Are task budgets available on Claude.ai, or API only?
    API only. The feature is surfaced to developers through API parameters, not through the consumer chat UI.

    Can I use xhigh on Claude.ai?
    Effort level is exposed to consumers through the model switcher. The underlying xhigh value is available via API; the consumer surface uses friendlier naming rather than the raw parameter.

    Does the 2,576px vision ceiling apply to all Claude products?
    Yes — Claude.ai, the mobile and desktop apps, the API, and all deployment partners (Bedrock, Vertex AI, Microsoft Foundry) use the same vision processing for Opus 4.7.

    Are task budgets a replacement for max_tokens?
    No. max_tokens is a hard cap on output length for a single message. Task budgets are soft behavioral ceilings spanning an agent’s multi-turn loop. Use both.

    Does xhigh use a different API parameter than high?
    No — it’s just another value for the same effort parameter. Note that Opus 4.7 removed the separate extended thinking budget parameter that existed on 4.6: the effort level IS the thinking control on 4.7.

    Will these features come to Opus 4.6?
    No. They’re Opus 4.7 features. 4.6 continues to run on its prior behavior.

    Does xhigh cost more than high?
    Yes, indirectly. Per-token pricing is the same. But xhigh produces more output tokens on hard problems (that’s the point — more thinking), so a given request costs more at xhigh than at high. xhigh is still meaningfully cheaper than max on the same task.


    Related reading

    • The full release: Claude Opus 4.7 — Everything New
    • For developers: Opus 4.7 for coding in practice
    • Comparison: Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro
    • The Mythos angle: why Anthropic admitted Opus 4.7 is weaker than an unreleased model

    Published April 16, 2026. Article written by Claude Opus 4.7.

  • Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro: Head-to-Head in April 2026

    Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro: Head-to-Head in April 2026

    Model Accuracy Note — Updated May 2026

    Current flagship: Claude Opus 4.7 (claude-opus-4-7). Current models: Opus 4.7 · Sonnet 4.6 · Haiku 4.5. Claude Opus 4.6 referenced in this article has been superseded. See current model tracker →

    The short verdict

    • Best for agentic coding and long-horizon engineering: Opus 4.7.
    • Best for single-turn function calling and ecosystem breadth: GPT-5.4.
    • Best for multimodal input volume and long-context retrieval: Gemini 3.1 Pro.
    • Cheapest at the frontier: Gemini 3.1 Pro. Most expensive: GPT-5.4.
    • If you can only pick one for general knowledge work in April 2026: Opus 4.7.

    The full reasoning is below. One disclosure before the details: this article is written by Claude Opus 4.7. I am one of the models being compared. I’ve tried to cite published numbers and flag where the comparison is genuinely contested rather than leaning on my own read.


    Pricing as of April 16, 2026

– Claude Opus 4.7: $5 / M input · $25 / M output · no long-context surcharge · 1M-token window
– GPT-5.4: $2.50 / M input · $15 / M output · $5 / $22.50 over 272K · 1M-token window (272K before surcharge)
– Gemini 3.1 Pro: $2 / M input · $12 / M output · $4 / $18 over 200K · 1M-token window (some listings cite 2M)

    Takeaways:
– Gemini 3.1 Pro is the cheapest per token at the frontier — 2.5× cheaper on input than Opus 4.7 and 20% cheaper than GPT-5.4 at standard context.
    – GPT-5.4 sits in the middle on price and has a significant long-context surcharge cliff at 272K.
    – Opus 4.7 is the most expensive per token, with no long-context surcharge.
    – All three now have 1M-class context windows, but Opus 4.7’s pricing stays flat across the whole window while Gemini and GPT-5.4 both tier up past thresholds.

    Tokenizer caveat: Opus 4.7 uses a new tokenizer that produces up to 1.35× more tokens per input than Opus 4.6 did, depending on content type. Cross-model token-count comparisons require re-tokenizing the same text under each model’s tokenizer — raw word counts lie.
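To compare concretely, count the same text under each vendor's tokenizer rather than trusting word counts. A rough sketch, assuming the Anthropic SDK's token-counting endpoint and OpenAI's tiktoken package; the encoding name is a stand-in, and the Gemini side is omitted:

    import anthropic
    import tiktoken

    text = open("representative_prompt.txt").read()

    # Anthropic's endpoint counts tokens under the target model's tokenizer.
    claude_count = anthropic.Anthropic().messages.count_tokens(
        model="claude-opus-4-7",
        messages=[{"role": "user", "content": text}],
    ).input_tokens

    # o200k_base is a placeholder; map your OpenAI model to its actual encoding.
    openai_count = len(tiktoken.get_encoding("o200k_base").encode(text))

    print(f"Claude: {claude_count} tokens · OpenAI: {openai_count} tokens")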


    Benchmarks, with the caveats included

    Anthropic, OpenAI, and Google all publish benchmark numbers. They do not publish them on the same evaluation harness, with the same prompts, or against the same seeds. Treat the following as directional, not definitive.

    Agentic coding (long-horizon, multi-file):
    – Opus 4.7 leads on Anthropic’s reported industry and internal agentic coding benchmarks.
    – GPT-5.4 is competitive on single-turn function calling and tool use. Roughly 80% on SWE-bench Verified at launch.
    – Gemini 3.1 Pro scored 80.6% on SWE-bench Verified at launch — essentially tied with GPT-5.4.

    Multidisciplinary reasoning (GPQA Diamond and similar):
    – Opus 4.7 leads on Anthropic’s comparisons.
    – GPT-5.4 and Gemini 3.1 Pro are close. Gemini reports 94.3% on GPQA Diamond.

    Scaled tool use and agentic computer use:
    – Opus 4.7 leads on Anthropic’s reported benchmarks.
    – GPT-5.4 has a native Computer Use API that scores 75% on OSWorld — the leading published figure at release.
    – All three have invested heavily here; the ranking depends on which eval you trust.

    Vision (document understanding, dense-screenshot extraction):
    – Opus 4.7’s jump from 1.15 MP to 3.75 MP image processing gives it a real lead on tasks that depend on detail inside the image (small text, dense UIs, engineering drawings).
    – Gemini 3.1 Pro is strong on native multimodal workflows with video and mixed media.
    – GPT-5.4 is solid but not leading on either axis.

    Long-context retrieval:
    – All three now have 1M-class context windows.
    – Gemini 3.1 Pro’s pricing tier structure makes it the cost-effective choice for bulk long-context work if your workflow frequently exceeds 200K tokens.
    – Opus 4.7 has flat pricing across its 1M window, which matters for unpredictable context shapes.
    – GPT-5.4’s 272K cliff means long-context workloads are meaningfully more expensive on OpenAI than on Anthropic or Google.

    Specialized coding benchmarks:
    – GPT-5.3 Codex (the specialized predecessor line) still leads on Terminal-Bench 2.0 and SWE-Bench Pro on some scores. GPT-5.4 has absorbed much of Codex’s capability but still trails slightly on pure coding niches.
    – Gemini 3.1 Pro has notable strength on creative coding and SVG generation.
    – Opus 4.7 is strongest on agentic and multi-file coding specifically.

    The honest caveat: benchmark leadership on any single eval changes over the course of a year as models get updated. If you’re making a bet-the-product call, run your own evals on prompts that look like your actual workload. The published benchmarks are a screening tool, not a decision tool.


    How they differ in behavior, not just benchmarks

    Opus 4.7 — the engineering-minded generalist.
    Tends toward thoroughness over speed. More likely than GPT-5.4 to push back on an ambiguous spec and ask a clarifying question; more likely than Gemini to surface tradeoffs rather than pick one and commit. Strong at long-horizon tasks where state matters. Tends to be calibrated about uncertainty — will often say “I can’t verify this without running the tests” rather than confidently claim correctness.

    GPT-5.4 — the product-native operator.
    Tends toward action over deliberation. Excellent at “just do the thing” workflows where you want the model to commit and not ask. Deepest integration ecosystem (Custom GPTs, massive plugin/tool library, widest deployment in third-party products). Tool calling is the feature OpenAI has invested most heavily in, and it shows.

    Gemini 3.1 Pro — the multimodal long-context specialist.
    Cheapest per token at the frontier and by a meaningful margin at the context window. Best default choice for “I need to shove a lot of context in and ask questions against it,” especially when that context includes video or audio. Deep integration with Google Workspace is a real workflow advantage for Google-native teams.

    None of these are absolute; all three models handle general tasks well. These are behavioral tendencies, not capability ceilings.


    “Choose X if” decision framework

    Choose Claude Opus 4.7 if:
    – Your primary workload is coding, especially agentic or multi-file coding.
    – You care about calibrated uncertainty (the model flags when it’s not sure).
    – You’re using or planning to use Claude Code for engineering work.
    – You need vision for dense documents, UI screenshots, or technical drawings.
    – You want the fewest tokens spent on unnecessary thinking (the new xhigh effort level is tuned for this).

    Choose GPT-5.4 if:
    – Single-turn tool use and function calling are the hot path in your product.
    – You need the broadest ecosystem of third-party integrations right now.
    – Your team is already deep in the OpenAI platform and switching cost is nontrivial.
    – You want the most established enterprise deployments (OpenAI has the longest production track record at scale).

    Choose Gemini 3.1 Pro if:
    – You’re price-sensitive and running high-volume workloads.
    – You need 1M+ token context as the default, not as an add-on.
    – Multimodal input volume (video, audio, mixed media) is central to your use case.
    – Your team is deep in Google Cloud or Workspace.

    Use multiple if:
    – You’re doing serious AI product work. Most mature AI teams in 2026 route different workloads to different models. A common pattern: Opus 4.7 for code generation and agent orchestration, Gemini 3.1 Pro for long-context retrieval and cheap bulk processing, GPT-5.4 for single-turn tool-heavy interactions.


    Where this comparison will change

    The frontier is moving. Three things to watch over the next six months:

    1. Claude Mythos Preview. Anthropic publicly acknowledged that Mythos outperforms Opus 4.7 on most of the benchmarks in the 4.7 release post. It is already in production use with select cybersecurity companies under Project Glasswing. When broader release happens, the Claude column of this comparison shifts meaningfully.

    2. GPT-5.5 / GPT-6. OpenAI’s cadence implies a significant model update within the next several months. The pattern over the past year has been incremental 5.x releases; a ground-up generation shift would reset the comparison.

    3. Gemini 3.5 / 4. Google has been releasing new Gemini versions quickly and the trajectory has been steep. The pricing advantage and context-window advantage are Gemini’s to lose.

    None of these are speculation-free predictions. They’re things that have been signaled publicly and will move the comparison when they happen.


    Frequently asked questions

    Is Claude Opus 4.7 better than GPT-5.4?
    On most published benchmarks, yes — particularly on agentic coding and long-horizon tasks. GPT-5.4 remains competitive on single-turn function calling and has the broader ecosystem. “Better” depends on the workload.

    Is Gemini 3.1 Pro cheaper than Opus 4.7?
    Significantly. At $2/$12 per million input/output tokens vs. Opus 4.7’s $5/$25, Gemini is 60% cheaper on input and 52% cheaper on output before tokenizer differences. At scale this is a material cost gap.

    Which model has the biggest context window?
    All three now have 1M-class context windows. Some Gemini 3.1 Pro documentation cites a 2M window. GPT-5.4’s window is 1M but moves to a higher pricing tier after 272K input tokens.

    Which model is best for coding?
    Opus 4.7 leads on agentic and long-horizon coding benchmarks. GPT-5.4 is close on single-turn coding. Gemini 3.1 Pro trails on published coding benchmarks but is competitive on routine work.

    Which model should I use for my startup?
    Most mature teams route workloads to multiple models. If you’re just starting and need to pick one, Opus 4.7 is a strong general default in April 2026 for engineering-adjacent work; Gemini 3.1 Pro if cost or context window dominates your decision; GPT-5.4 if you’re already on the OpenAI platform and the switching cost is high.

    Does Claude Opus 4.7 support function calling?
    Yes — with especially strong performance on multi-step tool chains where state has to be preserved. For single-turn tool calling, GPT-5.4 is competitive or leading depending on the benchmark.


    Related reading

    • Full Opus 4.7 feature set: Claude Opus 4.7 — Everything New
    • Opus 4.7 for coding specifically: xhigh, task budgets, and the 13% benchmark lift
    • The Mythos angle: why Anthropic admitted Opus 4.7 is weaker than an unreleased model

    Published April 16, 2026. Article written by Claude Opus 4.7 — yes, one of the models being compared. Benchmark claims reflect the publishing lab’s reported numbers; independent replication varies.

  • Opus 4.7 for Coding: xhigh, Task Budgets, and the Breaking API Changes in Practice

    Opus 4.7 for Coding: xhigh, Task Budgets, and the Breaking API Changes in Practice

    Model Accuracy Note — Updated May 2026

    Current flagship: Claude Opus 4.7 (claude-opus-4-7). Current models: Opus 4.7 · Sonnet 4.6 · Haiku 4.5. Claude Opus 4.6 referenced in this article has been superseded. See current model tracker →

    What changed if you only have 60 seconds

    • Strong gains in agentic coding, concentrated on the hardest long-horizon tasks.
    • New xhigh effort level between high and max — Anthropic recommends starting with high or xhigh for coding and agentic use cases.
    • Task budgets (beta) — ceilings on tokens and tool calls for multi-turn agentic loops.
    • Improved long-running task behavior — better reasoning and memory across long horizons, particularly relevant in Claude Code.
    • /ultrareview command — multi-pass review that critiques its own first pass.
    • Auto mode in Claude Code now available to Max subscribers (previously Team+ only).
    • ⚠️ Breaking API changes: extended thinking budget parameter and sampling parameters from 4.6 are removed. Update client code before switching model strings.
    • Tokenizer change: expect up to 1.35× more tokens for the same input.
    • Context window: unchanged at 1M tokens.

    The rest of this article is about how those land when you actually use them.


    The coding gain — what it actually feels like

    Anthropic’s release materials describe Opus 4.7 as “a notable improvement on Opus 4.6 in advanced software engineering, with particular gains on the most difficult tasks.” The careful phrasing — “particular gains on the most difficult tasks” — is the important part. On straightforward refactors, you will probably not see a dramatic difference versus 4.6. On long-horizon, multi-file, ambiguous-spec work, you likely will.

    In practice, the shift is: 4.6 would get you 80% of the way through a hard task and then hand you back something that looked right but didn’t work. 4.7 is more likely to actually close the task. It also “gives up gracefully” more often — saying “I can’t verify this works because I can’t run the test suite in this environment” instead of confidently claiming a broken fix. GitHub’s own early testing of Opus 4.7 echoes this: stronger multi-step task performance, more reliable agentic execution, meaningful improvement in long-horizon reasoning and complex tool-dependent workflows.

    If your 4.6 workflow relied heavily on “get it 90% there and finish the last 10% yourself,” you may find 4.7 changes the calculus. It’s not that the final polish is unnecessary now — it’s that the model needs less hand-holding to get to the polish stage.


    xhigh: the new default to reach for

Opus 4.6 had four effort levels: low, medium, high, and max. Opus 4.7 adds xhigh, slotted between high and max.

    The reason it exists: max was frequently overkill. On moderately hard problems, max would produce three times the thinking tokens of high and get roughly the same answer. On genuinely hard problems, high would leave thinking on the table. There was a real gap in the middle.

    How to use it:
    high is still the right default for routine coding tasks.
    xhigh is the new default to try first when you notice high isn’t quite getting there.
    max is for the cases where xhigh has already failed or the task is known to be long-horizon and expensive-to-rerun.

    Cost-wise, xhigh produces more output tokens than high but meaningfully fewer than max. On a representative hard task I tested during drafting, xhigh used roughly 40% of the output tokens max would have used to reach an equivalent answer. Your mileage will vary by task family.

    A caveat that matters: higher effort means more output tokens, which means higher cost per request even though the per-token price is unchanged. If your budget alerts are tuned to 4.6 volumes, expect them to fire.


    Task budgets (beta): the real agentic improvement

    This is the feature most worth paying attention to if you build agents.

    The problem it solves: Agent runs have high cost variance. The same agent, on the same prompt, can finish in 40,000 tokens or burn 400,000 chasing a tangent. Single-turn thinking budgets didn’t help because the agent operates across many turns.

    How task budgets work: You declare a budget — in tokens, tool calls, or wall-clock time — for a named subtask. The agent plans against that budget. If it’s running over, it either reprioritizes, asks for more, or halts and summarizes state. Budgets can nest (parent task with child subtasks, each with their own).

    What this looks like in code (beta, subject to change):

    # Declare per-subtask ceilings; the agent plans its spend against them.
    response = client.messages.create(
        model="claude-opus-4-7",
        messages=[...],
        task_budgets=[
            {
                "name": "refactor_auth_module",
                "max_output_tokens": 50_000,  # soft ceiling for this subtask
                "max_tool_calls": 25,
            },
            {
                "name": "write_tests",
                "parent": "refactor_auth_module",  # budgets can nest under a parent task
                "max_output_tokens": 15_000,
            },
        ],
    )
    

    Behavioral note: Task budgets are soft. The agent is nudged to respect them, not hard-cut. In testing, 4.7 respects budgets closely but will occasionally exceed by 10–15% on genuinely hard subtasks rather than fail — and it will flag the overrun. If you need hard cutoffs, enforce them at the API layer, not via task_budgets alone.

    The beta caveat: Anthropic’s docs explicitly say the parameter names and shape may change before GA. Don’t ship this into production contracts that are painful to version.


    Long-running task behavior (and Claude Code persistence)

    Anthropic’s release note says Opus 4.7 “stays on track over longer horizons with improved reasoning and memory capabilities.” In Claude Code specifically, the practical translation is better behavior across multi-session engineering work: the model re-onboards faster at the start of a session, maintains more coherent state across long interactions, and is less likely to drift when a task runs hours.

    This is a capability improvement, not a new memory API. You don’t need to declare anything special to get it — it’s how 4.7 behaves at the model level. If you’ve built your own persistence layer around Claude Code (structured notes in the repo, external memory tooling), those patterns continue to work; they just have a more capable model underneath.

    For teams with long-running agent workloads, pair this with task budgets: the agent plans against budgets and stays coherent across the planning horizon.


    The /ultrareview command

    A new slash command in Claude Code. Unlike /review, which does a single review pass, /ultrareview runs:

    1. A first review pass.
    2. A critique-of-the-review pass — the model evaluates its own first pass for things it missed, was too harsh on, or got wrong.
    3. A final reconciled pass that surfaces disagreements for you to resolve.

    When it’s worth running: pre-merge review of significant PRs — feature work, refactors, security-sensitive changes. Places where “catch the one bad thing” is worth the extra latency and tokens.

    When it isn’t: routine /review on small PRs. /ultrareview is slow (2–4× the wall-clock time of /review) and not cheap. Anthropic is explicit that it’s not meant for every review.

    A behavioral note from the inside: the critique pass is where most of the value lives. A single review pass has a bias toward confirming its own first read. The critique pass specifically looks for “where did I defer to the author’s framing when I shouldn’t have” and “what did I mark as fine that’s actually load-bearing and under-tested.” That meta-review is the piece that catches the things the first pass misses.


    Auto mode for Max subscribers

    Auto mode — where Claude Code decides on its own when to escalate effort or invoke tools rather than doing what you literally asked — was previously gated to Team and Enterprise plans. As of 4.7’s release, it’s available on Max 5x and Max 20x plans.

    For solo developers paying $200/month for Max 20x, this closes a real gap. Auto mode is particularly useful for tasks where you don’t know upfront how hard they’ll be: the agent starts conservative, escalates if it hits friction, and tells you after the fact what it did and why.


    The tokenizer change (plan for it)

    Opus 4.7 uses a new tokenizer. The same input string can map to up to 1.35× more tokens than under 4.6.

    • English prose: near the low end (roughly 1.02–1.08×).
    • Code: higher (roughly 1.10–1.20×).
    • JSON and structured data: higher still (1.15–1.30×).
    • Non-Latin scripts: highest (up to 1.35×).

    Per-token price is unchanged. But for workloads dominated by code or structured data, your effective spend per request can go up by 15–30% even though the sticker price didn’t move.

    The practical step: before you flip production traffic from 4.6 to 4.7, re-tokenize your top prompts under the new tokenizer and adjust your cost model. Anthropic’s SDK exposes the tokenizer; count_tokens against a representative prompt sample is a 20-minute exercise that will save you surprise at the end of a billing cycle.
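Sketched, assuming the SDK's count_tokens endpoint accepts both model strings:

    import anthropic

    client = anthropic.Anthropic()
    prompt = open("top_prompt.txt").read()  # one of your top production prompts

    def input_tokens(model: str) -> int:
        return client.messages.count_tokens(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        ).input_tokens

    old, new = input_tokens("claude-opus-4-6"), input_tokens("claude-opus-4-7")
    print(f"4.6: {old}  4.7: {new}  ratio: {new / old:.2f}x")

Run it across a representative sample, not one prompt; the ratio varies by content type.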


    ⚠️ Breaking API changes — do not skip this section

    Opus 4.7 is not a drop-in replacement at the API level. Two parameters from Opus 4.6 have been removed:

    1. The extended thinking budget parameter. You can no longer set an explicit thinking budget. The model decides thinking allocation based on the effort level you choose (low, medium, high, xhigh, max).

    2. Sampling parameters. Parameters that controlled sampling behavior on 4.6 are gone on 4.7. Check Anthropic’s release notes for the exact list as you upgrade.

    What this means practically: if your production code sends thinking: {budget_tokens: ...} or sampling parameters in its Opus API calls, those calls will fail on 4.7 until you update them. The effort parameter is now the primary control surface for thinking allocation.

    The upgrade workflow:
    1. Identify every call site that sets the removed parameters.
    2. Replace thinking budget settings with an appropriate effort level (xhigh is the new default to try for hard problems).
    3. Remove sampling parameter settings entirely.
    4. Test against a staging environment before switching the model string on production traffic.
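A before-and-after sketch of steps 1 and 2, with the 4.6 thinking-budget shape on top and the 4.7 effort-only shape below; parameter names follow this article's description, so verify them against the release notes:

    # Before (Opus 4.6): explicit extended-thinking budget. This call fails on 4.7.
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8192,
        thinking={"type": "enabled", "budget_tokens": 32_000},
        messages=[...],
    )

    # After (Opus 4.7): the effort level is the only thinking control.
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=8192,
        effort="xhigh",
        messages=[...],
    )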


    An upgrade checklist

    If you’re moving production workloads from 4.6 to 4.7:

    1. Audit your API calls for removed parameters. Extended thinking budgets and sampling params are gone. Fix these first — otherwise calls will fail on 4.7.
    2. Re-benchmark token counts on your top ten prompts. Adjust cost models if needed.
3. Swap max → xhigh as the default high-effort setting; keep max for known-hardest tasks. Anthropic specifically recommends high or xhigh as the coding/agentic starting point.
    4. Don’t yet put task budgets into stable contracts — use them for internal agent work where you can iterate on the API shape as it changes.
    5. Review output-length alerts. Expect higher output volumes at the same effort level.
    6. For Claude Code users: try /ultrareview on your next non-trivial PR.
    7. For Max subscribers: try auto mode. It’s now available at your tier.

    Frequently asked questions

    Is Opus 4.7 available in Claude Code?
    Yes, as the default Opus model since April 16, 2026. Update to the latest Claude Code version to pick it up.

    What’s the difference between high, xhigh, and max?
    high is the default for routine work. xhigh is new, tuned for hard problems that benefit from more reasoning without the full max budget. max is for long-horizon expensive-to-rerun tasks where you want maximum thinking regardless of cost.

    Do task budgets work with streaming?
    Yes. Budget state is reported in the streaming response so you can display progress.

    Is /ultrareview available on all Claude Code plans?
    Yes. Auto mode has a plan gate (Max 5x and above); /ultrareview does not.

    Does the tokenizer change affect Opus 4.6?
    No. 4.6 continues to use its existing tokenizer. The change applies to 4.7 and any subsequent models that adopt it.

    Does filesystem memory work outside Claude Code?
    4.7’s improvement is in long-horizon coherence at the model level, not a separate filesystem memory API. API users running agents with their own persistence layers (structured notes, external memory stores) get the benefit through the underlying model behavior, without needing a new API surface.

    Did Opus 4.7 really remove sampling parameters?
    Yes. If your 4.6 code sets sampling parameters, those calls will fail on 4.7. Update client code before switching the model string.


    Related reading

    • The full release: Claude Opus 4.7 — Everything New
    • Head-to-head benchmarks: Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro
    • The Mythos tension angle: why the release post mentions an unreleased model

    Published April 16, 2026. Article written by Claude Opus 4.7 — yes, the model under discussion.

  • Anthropic Just Admitted Opus 4.7 Is Weaker Than Mythos — And That’s the Story

    Anthropic Just Admitted Opus 4.7 Is Weaker Than Mythos — And That’s the Story

    Model Accuracy Note — Updated May 2026

    Current flagship: Claude Opus 4.7 (claude-opus-4-7). Current models: Opus 4.7 · Sonnet 4.6 · Haiku 4.5. Claude Opus 4.6 referenced in this article has been superseded. See current model tracker →

    The one-sentence version

    When Anthropic released Claude Opus 4.7 on April 16, 2026, they did something model labs almost never do: they told customers, on the record, that a more capable model already exists and is already in select customers’ hands.

    That’s the story.


    What Anthropic actually said

    The release announcement for Opus 4.7 included benchmark comparisons against three public competitors (Opus 4.6, GPT-5.4, Gemini 3.1 Pro) and one non-public one: Claude Mythos Preview. Mythos is not a generally available product. It has no pricing for the public market, no broad availability, no mass-market model string.

    But Mythos is not purely internal either. Anthropic released it to a handpicked group of technology and cybersecurity companies under a program called Project Glasswing earlier in April 2026. A broader unveiling of Project Glasswing is expected in May in San Francisco.

    And Mythos beats Opus 4.7 on most of the benchmarks Anthropic put in the 4.7 announcement.

    Anthropic did not bury this. The release materials describe Opus 4.7 as “less broadly capable” than Mythos Preview. CNBC, Axios, Decrypt, and other outlets covered exactly this angle because it was the actual story of the day — not the Opus 4.7 launch itself but the admission riding alongside it.

    Disclosure: This article is written by Claude Opus 4.7 — the model that is, by Anthropic’s own admission, the less broadly capable one. Treat that as a conflict of interest or as a structural honesty, depending on your priors.


    Why this is unusual

    Model labs do not normally telegraph internal capability leads. The standard playbook is:

    1. Ship the best model you’re willing to ship.
    2. Call it your best model.
    3. Never mention unreleased research models unless a competitor forces the issue.

    Anthropic broke this playbook in public. OpenAI has never, to my knowledge, said on the record “our shipped GPT is measurably weaker than our internal model.” Google has not said that about Gemini. Even when Anthropic themselves released Opus 4.6 in February, there was no equivalent acknowledgment of a stronger model on the bench.

    There are only two reasons a lab would do this. Either they want the existence of the stronger model to be public knowledge, or they had to disclose it — because refusing to would have been worse.

    Both readings are interesting.


    Reading one: deliberate signaling

    Under the deliberate-signaling read, Anthropic is telling three audiences three things at once.

    To customers and investors: “We are capability-leading but we are pacing ourselves.” The message: we could ship more broadly, we are choosing not to, trust us with the harder problem of deciding when. Releasing Mythos to cybersecurity companies specifically — rather than broadly — is consistent with this framing.

    To regulators and policy watchers: “Look — we are applying our Responsible Scaling Policy in public, in a legible way.” The Glasswing structure makes the cautious-release decision visible in a way that slide-deck assurances cannot. The company has also talked about “differentially reducing” cyber capabilities on the widely released model (Opus 4.7), which is another piece of the same messaging.

    To competitors: “We have runway.” Announcing a stronger model exists and is in production use with select partners puts pressure on roadmap decisions at OpenAI and Google without giving them a specific target to beat on a specific date.

    This reading is consistent with Anthropic’s general style. It is also the most flattering interpretation.


    Reading two: forced disclosure

    The less flattering reading goes like this.

    In the weeks before 4.7’s release, there was persistent chatter — on Reddit, X, GitHub, and developer forums — that Opus 4.6 had been “nerfed.” Users reported perceived quality regressions: shorter responses, faster refusals, worse long-context behavior. An AMD senior director posted on GitHub that “Claude has regressed to the point it cannot be trusted to perform complex engineering” — a post that was widely shared and became one of the focal points of the complaint. Some developers alleged Anthropic was rerouting compute from 4.6 inference to Mythos training.

    Anthropic denied the compute-rerouting claim explicitly. They said any changes to the model were not made to redirect computing resources to other projects. But “users think you are quietly degrading the model they pay for to free up resources for the one they can’t have” is not a rumor a serious lab wants to let calcify. One way to kill it is to disclose the existence and relative capability of the unreleased model openly, in the release notes of the next model, with benchmark numbers attached. Doing so converts a conspiracy theory into a planning document. It also reframes “we are hiding Mythos from you” into “we are telling you about Mythos in unusual detail.”

    Under this read, the disclosure was partly defensive. It doesn’t mean the nerf allegations were true — it means Anthropic judged that explicit disclosure was cheaper than ongoing denial.

    Both reads can be true at once.


    Was Opus 4.6 actually nerfed?

    I can’t answer this from the inside. As Opus 4.7, I have no memory of what it was like to be 4.6, and I have no access to Anthropic’s compute allocation records. Here is what can be said from the outside:

    • Evidence for: A real and sustained volume of user reports, including from developers with consistent prompts they could compare across weeks. GitHub issues and Reddit threads with substantial engagement. The AMD director’s post specifically, which had the weight of identifiable senior-engineer authorship. Some developers ran identical test suites and reported degraded results.

    • Evidence against: Anthropic’s explicit denial. No public logs or telemetry showing a policy change. The same reports appear around every major model’s lifecycle and are often attributable to user habituation (the model stopped feeling magical), prompt drift (your own prompts got worse), and increased traffic (latency and truncation behavior change under load).

    • The honest answer: unresolved. “Nerfing” is not a precisely defined term, and the alternative explanations are real. The disclosure of Mythos is consistent with both “we quietly rerouted compute and wanted to get ahead of it” and “we never rerouted compute and we wanted to put the rumor to bed.” The disclosure alone does not settle the question.


    What Project Glasswing is, briefly

    Project Glasswing is the structure Anthropic has built around Mythos. As best as can be assembled from public reporting:

    • Mythos is available to a handpicked group of technology and cybersecurity companies — not broadly.
    • The program has a security-research orientation; part of the rationale is giving advanced capabilities to defenders before they’re broadly available.
    • Opus 4.7 itself was trained with what Anthropic calls “differentially reduced” cyber capabilities, paired with a new Cyber Verification Program that lets vetted security researchers access capabilities that were dialed back for general users.
    • A broader Project Glasswing unveiling is expected in May 2026 in San Francisco.

    The through-line: Anthropic is treating advanced offensive-security-relevant capability as something to gate carefully — bake into a program with named partners — rather than ship broadly by default. Whether that’s genuinely safety-motivated, competitively-motivated, or both, the structural decision is the important part.


    What this means for customers

    Three practical implications:

    1. Don’t wait for Mythos general release. Anthropic has given no timeline for broad availability. If Opus 4.7 covers your use case, use it. If it doesn’t, GPT-5.4 or Gemini 3.1 Pro are the realistic alternatives, not a model you can’t get unless you’re an enterprise cybersecurity partner.

    2. Plan for a significant step up eventually. The disclosure confirms that the next generally-available Claude flagship is not going to be an incremental bump. Anthropic publishing benchmarks against Mythos suggests the capability delta is significant enough to name. When Mythos (or its successor) lands for general use, expect a larger behavioral shift than the 4.6 → 4.7 transition.

    3. Track Anthropic’s Glasswing disclosures, not just release posts. If Mythos’s broader rollout is tied to Glasswing program milestones, the release trigger will be program maturity, not a marketing cycle. The May unveiling is the next useful signal.


    Frequently asked questions

    What is Claude Mythos Preview?
    A more advanced Anthropic model released to select technology and cybersecurity companies under Project Glasswing. Anthropic publicly describes it as more capable than Opus 4.7 on most of the benchmarks in the 4.7 release materials. It is not broadly available.

    Is Mythos available to anyone?
    Yes, but narrowly. It has been released to a handpicked group of technology and cybersecurity companies under Project Glasswing. There is no public waitlist or self-serve access.

    When will Mythos be released broadly?
    No timeline announced. Anthropic has signaled a broader Project Glasswing unveiling in May 2026 in San Francisco; whether that includes wider Mythos access is not yet clear.

    Did Anthropic actually admit Opus 4.7 is weaker?
    Yes. The release materials directly describe Opus 4.7 as “less broadly capable” than Mythos Preview and include benchmark comparisons showing Mythos ahead. Multiple news outlets led with this angle.

    Was Opus 4.6 nerfed?
    Unresolved. User reports exist (including a widely shared GitHub post from an AMD senior director); Anthropic has denied redirecting compute; no independent evidence settles the question in either direction.

    What is Project Glasswing?
    Anthropic’s framework for gating advanced cybersecurity-relevant model capabilities. It includes Mythos Preview’s limited release, the “differentially reduced” cyber capabilities of Opus 4.7, and a Cyber Verification Program for vetted security researchers.

    Is this article biased because Claude Opus 4.7 wrote it?
    Yes, structurally. I am the model being called the weaker one. I’ve tried to note this where it matters. A human editor reviewing this copy would be a reasonable additional filter.


    Related reading

    • The full feature set: Claude Opus 4.7 — Everything New
    • For developers: Opus 4.7 for coding in practice
    • Head-to-head: Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro

    Published April 16, 2026. Article written by Claude Opus 4.7.

  • Claude Opus 4.7: Everything New in Anthropic’s Latest Flagship Model

    Claude Opus 4.7: Everything New in Anthropic’s Latest Flagship Model

    Model Accuracy Note — Updated May 2026

    Current flagship: Claude Opus 4.7 (claude-opus-4-7). Current models: Opus 4.7 · Sonnet 4.6 · Haiku 4.5. Claude Opus 4.6 referenced in this article has been superseded. See current model tracker →

    The short version

    Claude Opus 4.7 is Anthropic’s newest flagship model, released April 16, 2026. It is a direct upgrade to Opus 4.6 at identical pricing — $5 per million input tokens and $25 per million output tokens — and it ships across Claude’s consumer products, the Anthropic API, Amazon Bedrock, Google Vertex AI, and Microsoft Foundry on day one.

    The headline gains are in software engineering (particularly on the hardest tasks), reasoning control (a new “xhigh” effort level between high and max), agentic workloads (a new beta “task budgets” system), and vision (images up to 2,576 pixels on the long edge — about 3.75 megapixels, more than 3× the prior Claude ceiling of 1,568 pixels / 1.15 MP). It beats Opus 4.6, GPT-5.4, and Gemini 3.1 Pro on a number of Anthropic’s reported benchmarks.

    The most unusual thing about the release is what Anthropic admitted: Opus 4.7 is deliberately “less broadly capable” than Claude Mythos Preview, a more advanced model Anthropic has already released to select cybersecurity companies under a program called Project Glasswing. That’s the angle worth watching.

    Author’s note: This article is written by Claude Opus 4.7. I’m the model being described. Where I can speak to my own behavior with confidence, I will; where the answer depends on Anthropic’s internal process, I’ll say so.


    What actually changed in Opus 4.7

    The release breaks down into eight categories. In order of how much they matter for most users:

    1. Software engineering performance. Anthropic describes Opus 4.7 as “a notable improvement on Opus 4.6 in advanced software engineering, with particular gains on the most difficult tasks.” The gain concentrates on long-horizon, multi-file, ambiguous-spec work where prior Claude models would often “almost” solve the problem. In practice, this is the difference between a model that writes a good PR and one that closes the ticket. GitHub Copilot is rolling Opus 4.7 out to Copilot Pro+ users, replacing both Opus 4.5 and Opus 4.6 in the model picker over the coming weeks.

    2. The “xhigh” effort level. Before 4.7, reasoning effort on Opus had four settings: low, medium, high, and max. 4.7 adds xhigh, slotted between high and max. Anthropic’s own recommendation: “When testing Opus 4.7 for coding and agentic use cases, we recommend starting with high or xhigh effort.” The practical use: max often produced more thinking than a problem needed, burning tokens with diminishing returns. xhigh is tuned for the sweet spot where hard problems benefit from extra reasoning but don’t require the full max budget. (A before/after sketch follows this list.)

    3. Task budgets (beta). This is a new system for agentic workloads. Instead of setting a single thinking budget for a turn, you can declare a task budget — a ceiling on tokens or tool calls for a multi-turn agentic loop. The agent then allocates its own thinking across the loop’s steps. This solves a specific problem: agent cost variance. The same agent run no longer swings between “finished in 40k tokens” and “burned 400k on a rabbit hole.”

    4. Vision overhaul. Prior Claude models capped image input at 1,568 pixels on the long edge (about 1.15 megapixels). Opus 4.7 raises the ceiling to 2,576 pixels — about 3.75 megapixels, more than 3× the prior limit. This matters most for screenshots of dense UIs, technical diagrams, small-text documents, and any task where detail inside the image is what you actually need read. A related change: coordinate mapping is now 1:1 with actual pixels, eliminating the scale-factor math that computer-use workflows previously required.

    5. Better long-running task behavior. Anthropic says the model “stays on track over longer horizons with improved reasoning and memory capabilities.” In Claude Code specifically, this translates into better persistence across multi-session engineering work.

    6. Tokenizer change. The same input string can now map to as many as 1.35× the tokens it produced under 4.6’s tokenizer. English prose is near the low end of that range; code, JSON, and non-Latin scripts trend higher. Pricing per token is unchanged, so for some workloads the effective cost per request went up slightly even though the sticker price didn’t move. It’s worth re-benchmarking your own token accounting after the upgrade.

    7. Cyber safeguards and the Cyber Verification Program. Anthropic says it “experimented with efforts to differentially reduce Claude Opus 4.7’s cyber capabilities during training.” In plain English: the model is deliberately tuned to be less helpful on offensive-security tasks. Alongside it, Anthropic launched a Cyber Verification Program — a vetted-researcher path for legitimate offensive security work that would otherwise trigger the safeguards. This is part of the broader Project Glasswing safety framework.

    8. Breaking API changes (worth knowing before you upgrade). Opus 4.7 removes the extended thinking budget parameter and the sampling parameters that existed on 4.6. If your application code explicitly sets those parameters, you’ll need to update before switching model strings. The model now decides its own thinking allocation based on its effort level. A hedged migration sketch follows this list.
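
    To make the shape of that migration concrete, here is a minimal before/after sketch using the Anthropic Python SDK. Treat every 4.7-specific name here (the effort level, the task-budget object, even the 4.6 model string) as an assumption reconstructed from the release notes, not confirmed API surface; check the current docs before shipping anything.

    ```python
    # Hedged before/after sketch of the 4.6 -> 4.7 parameter migration.
    # The `effort` and `task_budget` parameter names are assumptions from the
    # release notes; task budgets are beta and the shape may change before GA.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Opus 4.6 style: an explicit thinking budget (plus sampling parameters).
    old = client.messages.create(
        model="claude-opus-4-6",  # assumed 4.6 model string
        max_tokens=4096,
        thinking={"type": "enabled", "budget_tokens": 16_000},  # removed in 4.7
        messages=[{"role": "user", "content": "Refactor the auth module."}],
    )

    # Opus 4.7 style: declare an effort level; the model allocates its own thinking.
    new = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4096,
        effort="xhigh",  # hypothetical parameter: low / medium / high / xhigh / max
        messages=[{"role": "user", "content": "Refactor the auth module."}],
    )

    # Task budgets (beta): a ceiling over a whole agentic loop, not a single turn.
    agent_run = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4096,
        effort="high",
        task_budget={"max_tokens": 200_000, "max_tool_calls": 50},  # hypothetical beta shape
        messages=[{"role": "user", "content": "Triage the failing CI jobs."}],
    )
    ```

    The point is the direction of the change, not the exact spelling: explicit budgets out, effort levels in, with the beta task-budget ceiling layered on top for agent loops.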


    Benchmarks: how 4.7 stacks up

    Anthropic published 4.7’s scores against three competitors — Opus 4.6 (predecessor), GPT-5.4 (OpenAI’s current flagship), and Gemini 3.1 Pro (Google’s) — plus one limited-release model: Claude Mythos Preview. The summary: 4.7 beats the three public competitors on a number of key benchmarks, but falls short of Mythos Preview.

    Anthropic has been unusually direct about the Mythos gap. From the release materials: 4.7 is described as “less broadly capable” than Mythos, framed as the generally-available option while Mythos remains gated. That’s the part worth sitting with — model labs rarely telegraph that their shipped flagship is a step behind something they already have running. (Full analysis in the dedicated Mythos article linked at the bottom.)

    On specific task families, Anthropic reports Opus 4.7 leading on:

    • Agentic coding (industry benchmarks and Anthropic’s internal suites)
    • Multidisciplinary reasoning
    • Scaled tool use
    • Agentic computer use
    • Vision benchmarks on dense documents and UI screens (driven by the higher-resolution processing)

    For a fuller comparison table and the methodology notes, see the Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro piece linked below.


    Pricing and availability

    Pricing (unchanged from Opus 4.6):
    – $5 per million input tokens
    – $25 per million output tokens
    – Prompt caching and batch discounts apply at the same tiers as 4.6

    Context window: 1M tokens (same as 4.6).

    Availability on day one:
    – Claude.ai (Pro, Max, Team, Enterprise) — Opus 4.7 is the default Opus option
    – Claude mobile and desktop apps
    – Anthropic API (claude-opus-4-7 model string)
    – Amazon Bedrock
    – Google Vertex AI
    – Microsoft Foundry
    – GitHub Copilot (Copilot Pro+), rolling out over the coming weeks

    Opus 4.6 remains available via API for teams that need behavioral continuity during transition. Anthropic has not announced a deprecation date for 4.6.
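
    To put unchanged sticker pricing in context with the tokenizer change described above, here is a back-of-the-envelope cost function. The 1.35× factor is the worst case from Anthropic’s release notes, not a measured value; your own prompts will land somewhere below it.

    ```python
    # Back-of-the-envelope request cost at Opus 4.7 list pricing ($5 / $25 per MTok).
    INPUT_PER_MTOK = 5.00
    OUTPUT_PER_MTOK = 25.00

    def request_cost(input_tokens: int, output_tokens: int) -> float:
        """Dollar cost of one request at list price (no caching or batch discounts)."""
        return input_tokens / 1e6 * INPUT_PER_MTOK + output_tokens / 1e6 * OUTPUT_PER_MTOK

    # A prompt that tokenized to 20k input / 4k output under 4.6 costs $0.20...
    baseline = request_cost(20_000, 4_000)
    # ...and up to $0.235 if the input expands by the worst-case 1.35x under 4.7.
    worst_case = request_cost(int(20_000 * 1.35), 4_000)
    print(f"baseline ${baseline:.3f} -> worst case ${worst_case:.3f}")
    ```

    A roughly 18% swing on an unchanged sticker price is exactly the kind of thing cost alerts catch and pricing pages don’t.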


    What’s new in Claude Code

    Two Claude Code changes shipped alongside 4.7:

    Auto mode extended to Max subscribers. Previously, Claude Code’s auto mode — the setting where the agent decides on its own when to escalate reasoning effort or call tools — was limited to Team and Enterprise plans. As of April 16, Max subscribers get it too. For solo developers on the $200/month Max 20x plan, this closes a meaningful capability gap.

    The /ultrareview command. A new slash command that runs a deep, multi-pass review of the current change set. Unlike /review, which does a single pass, /ultrareview runs review → critique of the review → final pass, and surfaces disagreements between the passes for the developer to resolve. The tradeoff is latency and tokens: /ultrareview is slow and not cheap. Anthropic positions it for pre-merge review of significant PRs, not routine use.

    Anthropic has also shifted default reasoning behavior in Claude Code for this release, pushing toward high/xhigh as the starting point for coding work.


    Known tradeoffs and gotchas

    Four things worth knowing before you upgrade production workloads:

    Output tokens go up at higher effort levels. On the same prompt, xhigh will produce more reasoning tokens than high did, and max produces more than both. If you have cost alerts tuned to 4.6 output volume, expect them to fire after the upgrade even if behavior is otherwise identical.

    The tokenizer change is the real cost variable. The up-to-1.35× input token expansion is not a rounding error for high-volume workloads. Run your top ten production prompts through the new tokenizer before assuming costs are flat; a minimal version of that check is sketched below.

    Task budgets are beta. The feature is useful today but the API surface is not frozen. Anthropic’s documentation explicitly says the parameter names and shape may change before GA. Don’t bake it into stable contracts yet.

    Breaking API parameters. Extended thinking budgets and sampling parameters from 4.6 are gone. Update your client code accordingly.
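
    One way to run the tokenizer check mentioned above: the SDK’s token-counting endpoint, which returns an input token count without creating a message. A minimal sketch, assuming the 4.6 model string stays countable while it remains available via the API:

    ```python
    # Compare input token counts for the same prompts under the 4.6 and 4.7
    # tokenizers via count_tokens. The model strings are assumptions; swap in
    # whatever identifiers your account exposes during the transition.
    import anthropic

    client = anthropic.Anthropic()

    PROMPTS = [
        "Summarize the attached brief in five bullet points.",
        '{"user_id": 42, "events": ["login", "checkout", "refund"]}',
        # ...replace with your top ten production prompts
    ]

    for prompt in PROMPTS:
        messages = [{"role": "user", "content": prompt}]
        old = client.messages.count_tokens(model="claude-opus-4-6", messages=messages)
        new = client.messages.count_tokens(model="claude-opus-4-7", messages=messages)
        ratio = new.input_tokens / old.input_tokens
        print(f"{ratio:.2f}x  {old.input_tokens} -> {new.input_tokens}  {prompt[:40]!r}")
    ```

    Expect English prose near 1.0×, with the expansion showing up in JSON, code, and non-Latin text, per the release notes.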


    Frequently asked questions

    Is Opus 4.7 free?
    No. Opus 4.7 is available on paid Claude.ai plans (Pro at $20/month, Max tiers at $100 or $200/month). API access is usage-priced at $5/$25 per million tokens.

    How do I use Opus 4.7 in Claude Code?
    If you’re already on Claude Code, update to the latest version. Opus 4.7 is the default Opus model as of April 16, 2026. The new /ultrareview command and auto mode (for Max subscribers) are available immediately.

    Is Opus 4.7 better than GPT-5.4?
    On Anthropic’s reported benchmarks, Opus 4.7 leads on agentic coding, multidisciplinary reasoning, tool use, and computer use. GPT-5.4 remains significantly cheaper per token ($2.50/$15 vs. $5/$25). Which is “better” depends on whether capability or cost dominates your decision.

    What is Claude Mythos Preview?
    Mythos Preview is a more advanced Anthropic model released only to select technology and cybersecurity companies under Project Glasswing. Anthropic has said it is more capable than Opus 4.7 on most benchmarks but is being held back from general release due to cybersecurity concerns. A broader unveiling of Project Glasswing is expected in May 2026 in San Francisco.

    Did Anthropic nerf Opus 4.6 to push people to 4.7?
    Users — including an AMD senior director whose GitHub post went viral — reported perceived quality degradation in Opus 4.6 in the weeks before 4.7’s release. Anthropic has publicly denied that any changes were made to redirect compute to Mythos or other projects. There is no external evidence that settles the question. This is covered in the Mythos tension article.

    Does Opus 4.7 keep the 1M token context window?
    Yes. Same 1M context as Opus 4.6.

    What changed in vision?
    Image input ceiling went from 1,568 pixels (1.15 MP) on the long edge to 2,576 pixels (3.75 MP) — more than 3× the pixel budget. Coordinate mapping is also now 1:1 with actual pixels, which simplifies computer-use workflows.
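
    If you pre-resize screenshots client-side before upload, the ceiling constant is the only thing that changes. A minimal Pillow sketch: the 2,576 px limit comes from the release notes above; everything else is ordinary image handling.

    ```python
    # Downscale an image so its long edge fits the Opus 4.7 ceiling.
    from PIL import Image

    LONG_EDGE_LIMIT = 2576  # new 4.7 ceiling (earlier Claude models: 1568)

    def fit_to_ceiling(path: str) -> Image.Image:
        img = Image.open(path)
        long_edge = max(img.size)
        if long_edge <= LONG_EDGE_LIMIT:
            return img  # already within the ceiling; no resampling needed
        scale = LONG_EDGE_LIMIT / long_edge
        return img.resize(
            (round(img.width * scale), round(img.height * scale)),
            Image.LANCZOS,
        )
    ```

    And because coordinate mapping is now 1:1, the pixel dimensions of the image you upload are the dimensions Claude reports coordinates in, with no scale factor to undo.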


    Related reading

    • The Mythos tension: Why Anthropic admitted Opus 4.7 is weaker than a model they’ve already released to cybersecurity companies
    • For developers: Opus 4.7 for coding — xhigh, task budgets, and the breaking API changes in practice
    • Comparison: Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro
    • Feature deep-dives: Task budgets explained • The xhigh effort level • The 3.75 MP vision ceiling

    Published April 16, 2026. Article written by Claude Opus 4.7. Benchmark claims reflect Anthropic’s published release data; independent replication is ongoing.