Category: Claude AI

Complete guides, tutorials, comparisons, and use cases for Claude AI by Anthropic.

  • What Notion Agents Can’t Do Yet (And When to Reach for Claude Instead)

    I run both Notion Custom Agents and Claude every working day. I have opinions about when each one earns its place and when each one doesn’t. This article is those opinions, named clearly, with no vendor fingers on the scale.

    Most comparative writing about AI tools is written by people with an incentive to recommend one over the other — affiliate programs, platform partnerships, the writer’s own consulting practice specializing in one side. This piece doesn’t have that problem. I use both, I pay for both, and if one of them got replaced tomorrow, the pattern I run would survive with a different tool slotted into the same role. The tools are interchangeable. The judgment about which one to reach for is not.

    Here’s the honest map.


    The short version

    Use Notion Custom Agents when: the work is a recurring rhythm, the context lives in Notion, the output is a Notion page or database change, and you’re willing to spend credits on it running in the background.

    Use Claude when: the work needs real judgment, the context is complex or contested, the output is something that needs a human’s voice and review, or the workflow crosses enough systems that the agent’s world is too small.

    Those two sentences will save most operators ninety percent of the architecture mistakes I see people make. The rest of this article is specificity about why, because general rules only take you so far before you need to know what’s actually going on under the hood.


    Where Notion Custom Agents genuinely shine

    I’m going to start with the positive because anyone who only reads the critical part of a comparative article will walk away with a warped picture. Custom Agents are genuinely impressive when they fit the job.

    Recurring synthesis tasks across workspace data. The daily brief pattern I’ve written about works better in a Custom Agent than in Claude. The agent runs on schedule, reads the right pages, writes the synthesis back into the workspace, and is done. Claude can do this too, but Custom Agents do it without you remembering to prompt them. That’s the whole point of the “autonomous teammate” framing, and for rhythmic synthesis work, it genuinely delivers.

    Inbox triage. An agent watching a database with a clear decision tree — categorize incoming requests, assign a priority, route to the right owner — is a sweet-spot Custom Agent. It does the boring sort every day, flags the ones it’s unsure about, and keeps the pile from growing. Real teams are reportedly triaging at over 95% accuracy on inbound tickets with this pattern.
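The triage logic such an agent runs can be sketched in a few lines. This is illustrative only: the categories, priorities, and owners below are made up, and a real Custom Agent would carry this decision tree in its instructions rather than in code.

```python
# A sketch of a triage decision tree like the one described above.
# Categories, priorities, and owners are hypothetical examples.
ROUTES = {
    "bug":     ("high",   "engineering"),
    "billing": ("medium", "finance"),
    "feature": ("low",    "product"),
}

def triage(request: str) -> tuple[str, str, str]:
    """Return (category, priority, owner); unknowns get flagged for a human."""
    text = request.lower()
    for category, (priority, owner) in ROUTES.items():
        if category in text:
            return category, priority, owner
    # The "flags the ones it's unsure about" branch: route to a person.
    return "unclassified", "needs-review", "human"

print(triage("Bug: export button crashes"))   # ('bug', 'high', 'engineering')
print(triage("Can you fax me the roadmap?"))  # ('unclassified', 'needs-review', 'human')
```

The important property is the last line: anything the rules don't cover goes to a person instead of being guessed at, which is what keeps the quiet-mistake rate down.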

    Q&A over workspace knowledge. Agents that answer company policy questions in Slack or provide onboarding guidance for new hires are quietly some of the most valuable agents in production. They replace hours of repetitive answer-the-same-question work, and because the answers come from actual workspace content, the accuracy is high when the workspace is well-maintained.

    Database enrichment. An agent that watches for new rows in a database, looks up additional context, and fills in fields automatically is a beautiful fit. The agent is doing deterministic-adjacent work with just enough judgment to handle edge cases. This is exactly what Custom Agents were designed for.

    Autonomous reporting. Weekly sprint recaps, monthly OKR reports, Friday retrospectives. Reports that would otherwise require someone to sit down and write them, now drafted automatically from the workspace state.

    For these categories, Custom Agents are the right tool, and Claude is the wrong tool even though Claude would technically work. The wrong-tool-even-though-it-works framing matters because operators often default to Claude for everything, which is expensive in different ways.


    Where Notion Custom Agents break down

    Now the honest part. Custom Agents have real limits, and pretending otherwise is how operators get burned.

    1. Anything that requires serious reasoning across contested information

    Custom Agents are capable of synthesis, but the quality of their synthesis degrades when the inputs disagree with each other, when the right answer isn’t on the page, or when the task requires actually thinking through a problem rather than summarizing existing context.

    The signal that you’ve hit this limit: the agent produces an output that sounds plausible, reads well, and is subtly wrong. If you need to double-check every agent output in a category of work because you can’t trust the judgment, that category of work shouldn’t be going through an agent. Use Claude in a conversation where you can actually interrogate the reasoning.

    Specific examples where this shows up: strategic decisions, conflicting client feedback, legal or compliance-adjacent questions, anything that involves weighing tradeoffs. The agent will produce an answer. The answer will often be wrong in a specific way.

    2. Long-horizon work that needs to hold nuance across steps

    Custom Agents are designed for bounded tasks with clear inputs and clear outputs. When you try to use them for work that requires holding nuance across many steps — drafting a long document, executing a multi-stage strategic plan, navigating a complex workflow — the wheels come off.

    Part of this is architectural: agents have limited ability to carry state across runs in the way an extended Claude conversation can. Part of it is practical: the “one agent, one job” principle Notion itself recommends is a hard constraint, not a style guideline. When you try to make an agent do multiple things, you get an agent that does each of them worse than a single-purpose agent would.

    If the job you’re thinking about is genuinely one coherent thing that happens to have many steps, and the steps inform each other, it’s probably a Claude conversation, not a Custom Agent.

    3. Work that needs a specific human voice

    This one is more important than most operators realize. Agents write in a synthesized style. It’s a perfectly fine style. It’s also recognizable as a perfectly fine style, which is the problem.

    If the output is going to have your name on it — client communications, thought leadership, outbound that should sound like you — the agent’s default voice will flatten whatever was distinctive about your writing. You can push back on this with instructions, and good instructions help a lot. But the underlying truth is that Custom Agents optimize for “sounds like a competent business writer,” and competent business writing is a commodity. If you sell distinctiveness, the agent is a liability.

    Claude in a conversation, with your active voice-shaping, produces writing that can actually sound like you. Custom Agents optimize for a different thing.

    4. Anything requiring real-time web context

    Custom Agents can reach external tools via MCP, but they don’t have a general ability to browse the live web and integrate what they find into their reasoning. If the work requires recent news, real-time market data, or anything that isn’t in a known database the agent can query, the agent will either fail, hallucinate, or return stale information from whatever workspace snapshot it had.

    Claude — with web search enabled, with the ability to fetch arbitrary URLs, with research capabilities — handles this class of work dramatically better. The right architectural response: use Claude for anything with a live-web dependency, let Custom Agents handle the parts that don’t.

    5. Deep technical work

    Custom Agents can technically do technical work. They should mostly not be asked to. Writing code, debugging failures, analyzing logs, reasoning through system architecture — these live in Claude Code’s territory, not Custom Agents’ territory. The Custom Agent framework was built for operational workflows, and while it will attempt technical tasks, it attempts them at the quality of a generalist, not a specialist.

    The sign you’ve crossed this line: the agent is producing code or technical reasoning that a competent human reviewer would push back on. Move the work to Claude Code, which was built for exactly this.

    6. High-stakes writes with permanent consequences

    Agents execute. They don’t second-guess themselves. An agent configured to send emails will send emails. An agent configured to update client records will update client records. An agent configured to delete rows will delete rows.

    When the cost of the agent doing the wrong thing is high — sending a message you can’t unsend, overwriting data you can’t recover, triggering a payment you can’t reverse — the discipline is: don’t let the agent do it without human approval. Use “Always Ask” behavior. Use a draft-and-review pattern. Use anything that puts a human in the loop before the irreversible action.

    Operators who ship fast and iterate freely tend to underweight this category. The day you discover it’s been quietly overwriting the wrong database field for two weeks is the day you wish you’d built the review gate.

    7. Credit efficiency for genuinely reasoning-heavy work

    This one is practical rather than architectural. Starting May 4, 2026, Custom Agents run on Notion Credits at roughly $10 per 1,000 credits. Internal Notion data suggests typical tasks get roughly 45 to 90 runs per 1,000 credits, which works out to about 11 to 22 credits per run; tasks that require more steps, more tool calls, or more context sit at the expensive end of that range. That means simple recurring tasks are cheap. Complex reasoning-heavy tasks add up.

    If you’re building an agent that does heavy reasoning work many times per day, the credit cost can exceed what the same work would cost through Claude’s API directly, especially on higher-capability Claude models called directly without the Notion overhead. For high-frequency reasoning work, run the math before you commit to the agent architecture.
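A back-of-envelope version of that math, using the figures above ($10 per 1,000 credits, 11 to 22 credits per run). The two workloads are hypothetical, chosen to show the range:

```python
# Back-of-envelope Custom Agent credit math, using the figures quoted above.
# Assumptions: $10 per 1,000 credits; a run costs roughly 11-22 credits
# (i.e. 45-90 runs per 1,000 credits). Workloads below are illustrative.

PRICE_PER_CREDIT = 10 / 1000  # $0.01 per credit

def monthly_agent_cost(runs_per_day: float, credits_per_run: float,
                       days: int = 30) -> float:
    """Estimated monthly spend, in dollars, for one Custom Agent."""
    return runs_per_day * credits_per_run * days * PRICE_PER_CREDIT

# A simple daily brief: one cheap run per day.
brief = monthly_agent_cost(runs_per_day=1, credits_per_run=11)

# A reasoning-heavy agent fired 20 times a day at the expensive end.
heavy = monthly_agent_cost(runs_per_day=20, credits_per_run=22)

print(f"daily brief: ${brief:.2f}/month")   # roughly $3.30
print(f"heavy agent: ${heavy:.2f}/month")   # roughly $132.00
```

The spread is the point: the same pricing model makes a daily brief trivially cheap and a high-frequency reasoning agent expensive enough to compare against calling Claude directly.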


    Where Claude genuinely wins

    The other side of the honest comparison. Claude earns its place in categories where Custom Agents either can’t operate or operate poorly.

    Strategic thinking conversations. When you’re working through a decision, evaluating a tradeoff, or thinking through a strategy, Claude in an extended conversation is the right tool. The back-and-forth is the whole point. You can interrogate reasoning, push back on conclusions, reframe the problem mid-conversation. An agent that produces a one-shot answer, no matter how good, is the wrong shape for this kind of work.

    Drafting with voice. Writing that needs to sound like a specific person is Claude’s territory. You can load up Claude with context about your voice — past writing, tonal preferences, things to avoid — and get output that actually reads as yours. Notion Custom Agents will always produce generic-flavored writing. That’s fine for internal reports. It’s a problem for anything external.

    Code and technical work. Claude Code specifically is built for technical depth. It reads codebases, executes in a terminal, calls tools, iterates on failures. Custom Agents will flail at the same work.

    Research synthesis across live sources. Claude with web search and fetch capabilities handles “go read this, this, and this, and tell me what the current state actually is” in a way Custom Agents structurally can’t. Anything that requires reaching outside a known data universe is Claude.

    Work that crosses many systems. When a workflow needs to touch code, Notion, a database, an external API, and a human review, Claude Code with the right MCP servers connected coordinates across them better than a Custom Agent inside Notion does. The agent’s world is Notion-plus-connected-integrations. Claude’s world is wider.

    Anything requiring judgment about whether to proceed. Agents execute. Claude in a conversation can pause, check with you, and ask “should I actually do this?” That judgment layer is frequently the most important part of the workflow.


    The pattern that actually works (both, in the right places)

    The operators who get this right aren’t choosing one tool over the other. They’re running both, in specific roles, with clear handoffs.

    The pattern I run:

    Rhythmic operational work lives in Custom Agents. Morning briefs, triage, weekly reviews, database enrichment, Q&A over workspace knowledge. Things that happen repeatedly, have clear inputs, and produce workspace-shaped outputs.

    Judgment-heavy work lives in Claude conversations. Strategic decisions, drafting with voice, research, anything requiring back-and-forth. I do this work in Claude chat sessions with the Notion MCP wired in, so Claude has real context when I need it to.

    Technical work lives in Claude Code. Building scripts, managing infrastructure, debugging, writing code. Custom Agents don’t touch this.

    Handoffs are explicit. When I make a decision in Claude that needs to become operational, it lands as a task or brief in a Notion database, and from there a Custom Agent can pick it up. When a Custom Agent surfaces something that needs judgment, it creates an escalation entry that shows up on my Control Center, where I engage Claude to think through it.

    The two systems pass work back and forth through the workspace. Neither tries to do the other’s job. The seams are the Notion databases where state lives.

    This is not the vendor-shaped pattern. The vendor-shaped pattern says “Custom Agents can handle everything.” The operator-shaped pattern says “Custom Agents handle what they’re good at, and when the work exceeds their reach, another tool takes over with a clean handoff.”


    The decision tree, when you’re not sure

    For a specific piece of work, run these questions in order. Stop at the first “yes.”

    Does this task need a specific human voice, or could it be written by any competent person? If it needs your voice, reach for Claude. If it doesn’t, move on.

    Does this task require reasoning across contested or ambiguous information? If yes, Claude. If no, move on.

    Does this task need real-time web context, live external data, or information not already in a known database? If yes, Claude. If no, move on.

    Does this task involve code, system architecture, or technical depth? If yes, Claude Code. If no, move on.

    Does this task have high-stakes irreversible consequences? If yes, wrap it in a human-approval gate — either run it through Claude where the human is in the loop, or use Custom Agents with “Always Ask” behavior.

    Does this task happen repeatedly on a schedule or in response to workspace events? If yes, Custom Agent. This is the sweet spot.

    Is the output a Notion page, database row, or something that stays in the workspace? If yes, Custom Agent is usually the right call.

    Is the task bounded enough that it could be described in a couple of clear sentences? If yes, Custom Agent. If it’s sprawling, it’s probably too big for an agent.

    If you’re through the tree and still not sure, default to Claude. Claude costs more in money, but a Custom Agent quietly running the wrong job costs far more in hidden ways.
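The whole tree compresses into a first-yes-wins lookup. A sketch, with flag names invented for illustration; the ordering mirrors the questions above:

```python
# The article's decision tree as a first-"yes"-wins function.
# Flag names are invented for this sketch; the order matches the questions.
QUESTIONS = [
    ("needs_specific_voice",     "Claude"),
    ("contested_reasoning",      "Claude"),
    ("needs_live_web",           "Claude"),
    ("technical_depth",          "Claude Code"),
    ("irreversible_stakes",      "Human-approval gate"),
    ("recurring_or_scheduled",   "Custom Agent"),
    ("workspace_shaped_output",  "Custom Agent"),
    ("bounded_in_two_sentences", "Custom Agent"),
]

def route(task: dict) -> str:
    """Stop at the first 'yes'; default to Claude when nothing matches."""
    for flag, tool in QUESTIONS:
        if task.get(flag):
            return tool
    return "Claude"  # the article's default when you're still unsure

morning_brief = {"recurring_or_scheduled": True, "workspace_shaped_output": True}
client_memo   = {"needs_specific_voice": True, "recurring_or_scheduled": True}

print(route(morning_brief))  # Custom Agent
print(route(client_memo))    # Claude
```

Note that the client memo routes to Claude even though it recurs: the voice question comes first, and first yes wins. That ordering is the judgment the tree encodes.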


    The failure modes I’ve seen

    Specific patterns that go wrong, in my observation:

    The “agent for everything” operator. Someone who just got access to Custom Agents and is building agents for tasks that don’t need agents. The agents mostly work. The ones that mostly work waste credits on tasks a template or a simple automation would handle. The ones that partially work produce quiet low-grade mistakes that accumulate.

    The “Claude for everything” operator. The inverse. Someone who got comfortable with Claude and hasn’t made the leap to letting agents handle the rhythmic work. They’re paying the context-loss tax every morning, doing the triage manually, writing every brief from scratch. Claude is too expensive a tool — in attention, if not dollars — to run routine work through.

    The operator who built one giant agent. Custom Agents are meant to be narrow. Someone violates the “one agent, one job” principle by building an agent that does inbox triage and database updates and weekly reports and client communications. The agent becomes hard to debug, expensive to run, and unreliable across its many hats. The fix is almost always breaking it into three or four single-purpose agents.

    The operator who didn’t build review gates. An agent sending emails without human approval. An agent deleting rows based on inferred criteria. An agent updating client-facing pages from an unchecked data source. The cost of the first real mistake exceeds the cost of the review gate that would have prevented it, every time.

    The operator who never checked credit consumption. Custom Agents consume credits based on model, steps, and context size. An operator who built ten agents and never looked at the dashboard ends up surprised when the monthly bill is much higher than expected. The fix is easy — Notion ships a credits dashboard — but it has to actually get checked.


    An honest note on timing

    Part of this article will age. These comparisons are true as of April 2026. Custom Agents are new enough that the feature set will expand significantly over the next year. Claude is evolving rapidly. The specific gaps I’ve named may close; new gaps may open in different directions.

    What won’t change is the pattern: some work wants a specialized tool, some work wants a general-purpose one. Some work is rhythmic, some is judgment-driven. Some work lives inside a workspace, some crosses systems. The vocabulary for when to use which tool will evolve; the underlying truth that different shapes of work deserve different tools will not.

    If you’re reading this in 2027 and Custom Agents have shipped fifteen new capabilities, the specific “can’t do” list will be shorter. The decision tree in this article will still work. That’s the part worth holding onto.


    What I’m not saying

    A few clarifications because I want to be clear about what this article is and isn’t.

    I’m not saying Custom Agents are bad. They’re genuinely good at what they’re good at. They’re saving me hours per week on work I used to do manually.

    I’m not saying Claude is strictly better. Claude is more capable at a broader set of tasks, but it also costs more, requires active operator engagement, and can’t sit in the background running overnight rhythms the way Custom Agents can.

    I’m not saying there’s one right answer for every operator. Different operators with different businesses and different workflows will land on different splits. The decision tree helps, but it’s a starting point, not a conclusion.

    I’m not saying this is permanent. Tool landscapes change fast. Six months from now there may be categories where Custom Agents beat Claude that don’t exist today, and vice versa. What matters is developing the habit of asking “which tool is this work actually shaped for?” instead of defaulting to whichever one you learned first.


    The one thing I’d want you to walk away with

    If you read nothing else in this article, this is the sentence I’d want in your head:

    Rhythmic operational work wants an agent; judgment-heavy work wants a conversation.

    That distinction — rhythm versus judgment — cuts through almost every architecture question you’ll have when deciding what to route where. It’s not the only dimension that matters, but it’s the one that settles the most decisions correctly.

    Work that happens on a schedule or in response to an event, with bounded inputs and clear outputs? That’s rhythm. Build a Custom Agent.

    Work that requires thinking through tradeoffs, integrating disparate information, or producing output with specific voice and judgment? That’s a conversation. Engage Claude.

    Get that right for most of your workflows and the rest of the architecture tends to sort itself out.


    FAQ

    Can’t Custom Agents do everything Claude can do, just inside Notion? No. Custom Agents are optimized for bounded, rhythmic, workspace-shaped tasks. They can technically attempt work that requires deep reasoning, specific voice, or live external context, but the results degrade in predictable ways. Claude — in a conversation or in Claude Code — handles those categories better.

    Should I just use Claude for everything then? No. Rhythmic operational work — morning briefs, triage, weekly reports, database enrichment — is genuinely better in Custom Agents than in Claude, because the “autonomous teammate running while you sleep” property matters. The right answer is running both, in their respective sweet spots.

    What’s the cost comparison? Starting May 4, 2026, Custom Agents cost roughly $10 per 1,000 Notion Credits. Internal Notion data suggests agents run approximately 45–90 times per 1,000 credits depending on task complexity. Claude’s subscription pricing is flat. For high-frequency simple tasks, Custom Agents are usually cheaper. For heavy reasoning work done many times per day, running Claude directly can be more cost-efficient.

    What about Notion Agent (the personal one) versus Claude? Notion Agent is Notion’s on-demand personal AI — you prompt it, it responds. It’s fine for in-workspace tasks where you need AI help with content you’re already looking at. For deeper reasoning, complex drafting, or cross-tool work, Claude is more capable. Notion Agent is a good ambient utility; Claude is a general-purpose intelligence layer.

    Which should I learn first if I’m new to both? Claude. Learn to think with an AI as a thinking partner before you try to build autonomous agents. Once you understand what AI can and can’t do in a conversation, the design decisions for Custom Agents become much clearer. Jumping to Custom Agents without the Claude foundation is how operators end up with agents that don’t work as expected.

    Can Custom Agents use Claude models? Yes. Custom Agents let you pick the AI model they run on. Claude Sonnet and Claude Opus are both available, along with GPT-5 and various other models. This means the underlying intelligence of a Custom Agent can be Claude — you’re choosing between Claude-as-conversation (claude.ai, Claude Desktop, Claude Code) and Claude-as-embedded-agent (Custom Agent running Claude). Different interfaces, same underlying model in that case.

    What if I want Claude to work autonomously on a schedule like Custom Agents do? Possible, but requires more work. Claude Code can be scripted; you can run it on a cron job; you can set up headless workflows. But the “out of the box autonomous teammate” experience is Notion’s current strength, not Anthropic’s. If you want autonomous-background-work without building your own infrastructure, Custom Agents are easier.
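A minimal sketch of what that homegrown infrastructure looks like, assuming Claude Code's non-interactive print mode (`claude -p`). The prompt text, script path, and cron line are all illustrative, not from the article:

```python
# A sketch of running Claude Code headless on a schedule, assuming the CLI's
# non-interactive print mode ("claude -p <prompt>"). Prompt and paths are
# illustrative. Pair it with a cron entry such as:
#   0 6 * * * /usr/bin/python3 /path/to/morning_brief.py
import subprocess
from datetime import date

def build_command(prompt: str) -> list[str]:
    """Assemble the headless invocation (kept separate so it is testable)."""
    return ["claude", "-p", prompt]

def run_morning_brief() -> str:
    """Run one headless pass and return whatever Claude prints to stdout."""
    prompt = f"Draft the {date.today().isoformat()} morning brief from my Notion workspace."
    result = subprocess.run(build_command(prompt), capture_output=True, text=True)
    return result.stdout

# On the cron host you would simply: print(run_morning_brief())
```

Everything around this script (retries, logging, writing the result back into Notion) is exactly the infrastructure Custom Agents give you out of the box, which is the point of the answer above.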

    How do I decide for my specific situation? Run the decision tree in the article. If you’re still unsure, default to Claude — it’s the more general-purpose tool, and the cost of using the wrong tool for judgment-heavy work is higher than the cost of using the wrong tool for rhythmic work. You can always migrate a recurring workflow to a Custom Agent once you understand the shape.


    Closing note

    The honest comparison isn’t one tool versus the other. It’s understanding that different shapes of work want different shapes of tool, and that most operators lose more time to the mismatch than to any individual tool’s limitations.

    Custom Agents are good at being Custom Agents. Claude is good at being Claude. Neither is good at being the other. Use both, in the places each belongs, with clean handoffs between them, and the stack hums.

    Skip the vendor narratives. Read your own workflows. Route each piece to the tool it’s actually shaped for. That’s the whole game.


    Sources and further reading

    Related Tygart Media pieces:

  • How to Wire Claude Into Your Notion Workspace (Without Giving It the Keys to Everything)

    The step most tutorials skip is the one that actually matters.

    Every guide to connecting Claude to Notion walks you through the same mechanical sequence — OAuth flow, authentication, running claude mcp add, and done. It works. The connection lights up, Claude can read your pages, write to your databases, and suddenly your AI has the run of your workspace. The tutorials stop there and congratulate you.

    Here’s the part they don’t mention: according to Notion’s own documentation, MCP tools act with your full Notion permissions — they can access everything you can access. Not the pages you meant to share. Everything. Every client folder. Every private note. Every credential you ever pasted into a page. Every weird thing you wrote about a coworker in 2022 and forgot was there.

    In most setups the blast radius is enormous, the visibility is low, and the decision to lock it down happens after something goes wrong instead of before.

    This is the guide that takes the extra hour. Wiring Claude into your Notion workspace is straightforward. Wiring Claude into your Notion workspace without giving it the keys to everything takes a few additional decisions, a handful of specific configuration choices, and a mental model for what should and shouldn’t flow across the connection. That’s the hour worth spending.

    I run this setup across a real production workspace with dozens of active properties, real client work, and data I genuinely don’t want an AI to have unbounded access to. The pattern below is what works. It is also honest about what doesn’t.


    Why Notion + Claude is worth doing carefully

    Before the mechanics, it’s worth being clear about what you get when you wire this up correctly.

    Claude with access to Notion is not Claude with a better search function. It is a Claude that can read the state of your business — briefs, decisions, project status, open loops — and reason across them to help you run the operation. It can draft follow-ups to conversations it finds in your notes. It can pull together summaries across projects. It can take a decision you’re weighing, find every related piece of context in the workspace, and give you a grounded opinion instead of a generic one.

    That’s the version most operator-grade users want. And it’s only valuable if the trust boundary is drawn correctly. A Claude that has access to your relevant context is a superpower. A Claude that has access to everything you’ve ever written is a liability waiting to catch up with you.

    The whole article is about drawing that boundary on purpose.


    The two connection options (and which one you actually want)

    There are two ways to connect Claude to Notion in April 2026, and the right one depends on what you’re doing.

    Option 1: Remote MCP (Notion’s hosted server). You connect Claude — whether that’s Claude Desktop, Claude Code, or Claude.ai — to Notion’s hosted MCP endpoint at https://mcp.notion.com/mcp. You authenticate through OAuth, which opens a browser window, you approve the connection, and it’s live. Claude can now read from and write to your workspace based on your access and permissions.

    This is the officially supported path. Notion’s own documentation explicitly calls remote MCP the preferred option, and the older open-source local server package is being deprecated in favor of it. For most operators, this is the right answer.

    Option 2: Local MCP (the legacy / open-source package). You install @notionhq/notion-mcp-server locally via npm, create an internal Notion integration to get an API token, and configure Claude to talk to the local server with your token. You then have to manually share each Notion page with the integration one by one — the integration only sees pages you explicitly grant access to.

    This path is more work and is being phased out. But there’s one genuine reason to still use it: authentication. The hosted server requires user-based OAuth and does not support bearer-token authentication, so a human has to click through the authorization flow, which rules out fully automated workflows. The local server authenticates with an integration token instead, so it works headless, with nobody around to click OAuth buttons.

    For 95% of setups, remote MCP is the right answer. For the 5% running true headless agents, the local package is still the pragmatic choice even though it’s on its way out.

    The rest of this guide assumes remote MCP. I’ll flag the places the advice differs for local.


    The quiet part Notion tells you out loud

    Before we get to the setup, one more thing you need to internalize because it shapes every decision below.

    From Notion’s own help center: MCP tools act with your full Notion permissions — they can access everything you can access.

    Read that sentence twice.

    If you are a workspace member with access to 140 pages across 12 databases, your Claude connection can access 140 pages across 12 databases. Not the 15 you’re working on today. All of them. OAuth doesn’t scope you down to “this project.” It says yes or no to “can Claude see your workspace.”

    This is fine when your workspace is already organized the way you’d want an AI to see it. It is catastrophic when it isn’t, because most workspaces have accumulated years of drift, private notes, credential-adjacent content, sensitive client data, and old experiments that nobody bothered to clean up.

    So before you connect anything, you do the workspace audit. Not because Notion says so. Because your future self will thank you.


    The pre-connection audit (the step tutorials skip)

    Fifteen minutes with the workspace, before you click the OAuth button. Here’s the checklist I run through:

    Find anything that looks like a credential. Search your workspace for the words: password, API key, token, secret, bearer, private key, credentials. Read the results. Move anything sensitive to a credential manager (1Password, Bitwarden, a password-protected vault — not Notion). Delete the Notion copies.

    Find anything you wouldn’t want an AI to read. Search for: divorce, legal, lawsuit, personal, venting, complaint, therapist. Yes, really. People put things in Notion they’ve forgotten are in Notion. An AI that has access to everything you can access will find those things and occasionally surface them in responses. This is embarrassing at best and career-ending at worst.

    Look at your database of clients or contacts. Is there anything in there that shouldn’t travel through an AI provider’s servers? Notion processes MCP requests through Notion’s infrastructure, not yours. Sensitive legal matters, medical information, financial details about third parties — these may deserve a workspace or sub-page that stays outside of what Claude is allowed to see.

    Identify what Claude actually needs. Make a short list: your active projects, your working databases, your briefs page, your daily/weekly notes. This is what you actually want Claude to have context on. The rest is noise.

    Decide your posture. Two options here. You can run Claude against your main workspace and accept the blast radius, or you can create a separate workspace (or a teamspace) that contains only the pages and databases you want Claude to see, and connect Claude to that one. The second option is more work upfront. It is also the only version that actually draws the boundary.

    I run the second option. My Claude-facing workspace is genuinely a subset of what I work with, and the rest of my Notion is on a different membership. It took an hour to set up. It was worth it.
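If you export the workspace to Markdown first, the credential sweep from the checklist above can be roughed out in a few lines. A sketch: the keyword list mirrors the checklist, the export layout is an assumption, and a hit means "go read this page," not "confirmed leak."

```python
# Rough credential sweep over an exported workspace (Notion -> Markdown).
# The keyword list mirrors the checklist above; treat every hit as a page
# to read by hand, not as a definitive audit result.
from pathlib import Path

SENSITIVE = ["password", "api key", "token", "secret", "bearer",
             "private key", "credentials"]

def flag_pages(pages: dict[str, str]) -> dict[str, list[str]]:
    """Map page name -> sensitive keywords found (case-insensitive)."""
    hits = {}
    for name, text in pages.items():
        found = [kw for kw in SENSITIVE if kw in text.lower()]
        if found:
            hits[name] = found
    return hits

def flag_export(export_dir: str) -> dict[str, list[str]]:
    """Run the same sweep over a directory of exported .md files."""
    return flag_pages({p.name: p.read_text(errors="ignore")
                       for p in Path(export_dir).rglob("*.md")})

demo = {
    "Client onboarding": "Steps for kickoff calls and scoping.",
    "Old infra notes":   "Staging DB password: hunter2, plus an API key.",
}
print(flag_pages(demo))  # flags only "Old infra notes"
```

Keyword search misses plenty (base64 blobs, odd phrasing), which is why the checklist says to read the results rather than trust the grep.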


    Connecting remote MCP to Claude Desktop

    Now the mechanics. Starting with Claude Desktop because it’s the simplest.

    Claude Desktop gets Notion MCP through Settings → Connectors (not the older claude_desktop_config.json file, which is being phased out for remote MCP). This is available on Pro, Max, Team, and Enterprise plans.

    Open Claude Desktop. Settings → Connectors. Find Notion (or add a custom MCP server with the URL https://mcp.notion.com/mcp). Click Connect. A browser window opens, Notion asks you to authenticate, you approve. Done.

    The connection now lives in your Claude Desktop. You can start a new conversation and ask Claude to read a specific page, summarize a database, or draft something based on workspace content, and it will.

    One hygiene note: Claude Desktop connections are per-account. If you have multiple Claude accounts (say, a personal Pro and a work Max), each one needs its own connection to Notion. The good news is you can point each one at a different Notion workspace — personal Claude at personal Notion, work Claude at work Notion. This is the operator pattern I recommend for anyone running more than one business context through Claude.


    Connecting remote MCP to Claude Code

    Claude Code is the path most operators actually run at depth, because it’s the version of Claude that lives in your terminal and can compose MCP calls into real workflows.

    The command is one line:

    claude mcp add --transport http notion https://mcp.notion.com/mcp

    Then authenticate by running /mcp inside Claude Code and following the OAuth flow. Browser opens, Notion asks you to authorize, you approve, and the connection is live.

    A few options worth knowing about at setup time:

    Scope. The --scope flag controls who gets access to the MCP server on your machine. Three options: local (default, just you in the current project), project (shared with your team via a .mcp.json file), and user (available to you across all projects). For Notion, user scope is usually right — you’ll want Claude to reach Notion from any project you’re working in, not just the current one.
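Putting that together, a user-scoped registration looks like this (the server name notion is just a label you choose):

```shell
# Register Notion's remote MCP server at user scope (available in all projects)
claude mcp add --transport http --scope user notion https://mcp.notion.com/mcp

# Confirm it's registered
claude mcp list
```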

    The richer integration. Notion also ships a plugin for Claude Code that bundles the MCP server along with pre-built Skills and slash commands for common Notion workflows. If you’re doing this seriously, install the plugin. It adds commands like generating briefs from templates and opening pages by name, and saves you from writing your own.

    Checking what’s connected. Inside Claude Code, /mcp lists every MCP server you’ve configured. /context tells you how many tokens each one is consuming in your current session. For Notion specifically, this is useful because MCP servers have non-zero context cost even when you’re not actively using them — every tool exposed by the server sits in Claude’s context, eating tokens. Running /context occasionally is how you notice when an MCP connection is heavier than you expected.


    The permissions pattern that actually protects you

    Now we’re past the mechanics and into the hygiene layer — the part that most guides don’t cover.

    Once Claude is connected to your Notion workspace, there are three specific configuration moves worth making. None of them are hard. All of them pay rent.

    1. Scope the workspace, don’t scope the connection

    The OAuth connection doesn’t let you say “Claude can see these pages but not those.” It lets you say “Claude can see this workspace.” So the place to draw the boundary is at the workspace level, not at the connection level.

    If you have sensitive content in your main workspace, move it. Create a separate workspace for Claude-facing content and keep the sensitive stuff out. Or use Notion’s teamspace feature (Business and Enterprise) to isolate access at the teamspace level.

    This feels like over-engineering until the first time Claude surfaces something in a response that you had forgotten was in your workspace. After that, it doesn’t feel like over-engineering.

    2. For Enterprise: turn on MCP Governance

    If you’re on the Enterprise plan, there’s an admin-level control worth enabling even if you trust your team. From Notion’s docs: with MCP Governance, Enterprise admins can approve specific AI tools and MCP clients that can connect to Notion MCP — for example Cursor, Claude, or ChatGPT. The approved-list pattern is opt-in: Settings → Connections → Permissions tab, set “Restrict AI tools members can connect to” to “Only from approved list.”

    Even if you only approve Claude today, the control gives you the ability to see every AI tool anyone on your team has connected, and to disconnect everything at once with the “Disconnect All Users” button if you ever need to. That’s the kind of control you want to have configured before you need it, not after.

    3. For local MCP: use a read-only integration token

    If you’re using the local path (the open-source @notionhq/notion-mcp-server), you have more granular control than the remote path gives you. Specifically: when you create the integration in Notion’s developer settings, you can set it to “Read content” only — no write access, no comment access, nothing but reads.

    A read-only integration is the right default for anything exploratory. If you want Claude to be able to write too, enable write access later when you’ve decided you trust the specific workflow. Don’t give write access by default just because the integration setup screen presents it as an option.

    This is the one place the local path is actually stronger than remote — you can shape the integration’s capabilities before you grant it access, and the integration only sees the specific pages you share with it. For high-sensitivity setups, this granularity is worth the tradeoff of running the legacy package.
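For reference, a local stdio configuration looks roughly like this in claude_desktop_config.json. Treat it as a sketch rather than the authoritative config: the token environment variable has varied across package versions (older releases used an OPENAPI_MCP_HEADERS JSON blob instead), so check the package README before copying, and the token shown is a placeholder.

```json
{
  "mcpServers": {
    "notion": {
      "command": "npx",
      "args": ["-y", "@notionhq/notion-mcp-server"],
      "env": {
        "NOTION_TOKEN": "ntn_placeholder-read-only-integration-token"
      }
    }
  }
}
```

Pair this with a "Read content"-only integration and the config above can't write anything, no matter what Claude decides to do.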


    Prompt injection: the risk nobody wants to talk about

    One more thing before we leave the hygiene section. It’s the thing the industry is least comfortable being direct about.

    When Claude has access to your Notion workspace, Claude also reads whatever is in your Notion workspace. Including pages that came from outside. Including meeting notes that were imported from a transcript service. Including documents shared with you by clients. Including anything you pasted from the web.

    Every one of those is a potential vector for prompt injection — hidden instructions buried in content that, when Claude reads the content, hijack what Claude does next.

    This is not theoretical. Anthropic itself flags prompt injection risk in the MCP documentation: “be especially careful when using MCP servers that could fetch untrusted content, as these can expose you to prompt injection risk.” Notion has shipped detection for hidden instructions in uploaded files and flags suspicious links for user approval, but the attack surface is larger than any detection system can fully cover.

    The practical operator response is three-part:

    Don’t give Claude access to content you didn’t write, without reading it first. If a client sends you a document and you paste it into Notion and Claude has access to that database, you have effectively given Claude the ability to be instructed by your client’s document. This might be fine. It might be a problem. Read the document before it goes into a Claude-accessible location.

    Be suspicious of workflows that chain untrusted content into actions. A workflow where Claude reads a web-scraped summary and then uses that summary to decide which database row to update is a prompt injection target. If the scraped content can shape Claude’s action, the scraped content can be weaponized.

    Use write protections for anything consequential. Anything where the cost of Claude doing the wrong thing is real — sending an email, deleting a record, updating a client-facing page — belongs behind a human-approval gate. Claude Code supports “Always Ask” behavior per-tool; use it for writes.
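In Claude Code, that approval gate can live in settings (project-level .claude/settings.json or your user settings) rather than in your memory. A sketch, assuming MCP tool names that follow Claude Code's mcp__<server>__<tool> pattern; the specific tool names below are illustrative, so run /mcp to see what your Notion server actually exposes:

```json
{
  "permissions": {
    "ask": [
      "mcp__notion__notion-create-pages",
      "mcp__notion__notion-update-page"
    ]
  }
}
```

With this in place, reads flow freely and writes stop for a human yes/no.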

    This sounds paranoid. It’s not paranoid. It’s the appropriate level of caution for a class of attack that is genuinely live and that the industry has not yet figured out how to fully defend against.


    What this actually enables (the payoff section)

    Once you’ve done the setup and the hygiene work, here’s what you now have.

    You can sit down at Claude and ask it questions that require real workspace context. What’s the status of the three projects I touched last week? Pull together everything we’ve decided about pricing across the client work this quarter. Draft a response to this incoming email using context from our ongoing conversation with this client. Claude reads the relevant pages, synthesizes across them, and responds with actual grounding — not a generic answer shaped by whatever prompt you happen to type.

    You can run Claude Code against your workspace for development-adjacent operations. Generate a technical spec from our product page notes. Create release notes from the changelog and feature pages. Find every page where we’ve documented this API endpoint and reconcile the inconsistencies.

    You can set up workflows that flow across tools. Claude reads from Notion, acts on another system via a different MCP server, writes results back to Notion. This is the agentic pattern the industry keeps talking about — and with the right permissions hygiene, it actually becomes usable instead of scary.

    None of this is theoretical. I use this pattern every working day. The value is real. The hygiene discipline is what keeps the value from turning into a liability.


    When this setup goes wrong (troubleshooting honestly)

    Five failure modes I’ve seen, in order of frequency.

    Claude doesn’t see the page you asked about. For remote MCP, this almost always means the page is in a workspace you’re not a member of, or in a teamspace you don’t have access to. For local MCP, it means the integration hasn’t been granted access to that specific page — you have to go to the page, click the three-dot menu, and add the integration manually.

    OAuth flow doesn’t complete. Usually a browser issue — popup blocker, wrong Notion account signed in, session expired. Clear auth, try again. If Claude Desktop, disconnect the connector entirely and re-add.

    The connection succeeds but Claude doesn’t seem to be using it. Run /mcp in Claude Code to verify the server is listed and connected. If it’s there and Claude still isn’t invoking it, the issue is usually in how you’re asking — Claude won’t reach for MCP tools just because they exist; you need to phrase the request in a way that makes it obvious the tool is relevant. “Find the page about X in Notion” works better than “tell me about X.”

    MCP server crashes or returns errors. For remote, this is rare and usually resolves itself — Notion’s hosted server has the standard cloud-reliability profile. For local, check your Node version (the server requires Node 18 or later), your config file syntax (JSON is unforgiving about trailing commas), and your token format.

    Context token budget goes through the roof. Every MCP server in your connected list contributes tools to Claude’s context on every request. If you have five MCP servers configured, that’s five sets of tool descriptions being loaded into every conversation. Run /context in Claude Code to see the cost. If it’s painful, disconnect the servers you’re not actively using.


    The mental model that keeps you sane

    Here’s the mental model I use for the whole setup. It’s short.

    Claude plus Notion is like giving a new, very capable employee access to your business. You wouldn’t hand a new hire every password, every file, every client record, every private note on day one. You’d give them access to the specific things they need to do the job, watch how they use that access, and expand trust over time based on track record.

    The MCP connection works exactly that way. You decide what Claude gets to see. You decide what Claude gets to write. You watch how it uses that access. You expand the boundary as trust earns itself.

    The operators who get hurt by this kind of setup are the ones who skip the first step and give Claude everything on day one. The operators who get the real value out of it are the ones who treat the connection the way they’d treat any other employee — with deliberate scope, real oversight, and the willingness to revoke access if something goes wrong.

    That’s the discipline. That’s the whole thing.


    FAQ

    Do I need to install anything to connect Claude to Notion? For remote MCP (the recommended path), no installation is required — you connect via OAuth through Claude Desktop’s Settings → Connectors or Claude Code’s claude mcp add command. For local MCP (legacy), you install @notionhq/notion-mcp-server via npm and create an internal Notion integration.

    What’s the URL for Notion’s remote MCP server? https://mcp.notion.com/mcp. Use HTTP transport (not the deprecated SSE transport).

    Can Claude see my entire Notion workspace by default? Yes. MCP tools act with your full Notion permissions — they can access everything you can access. The boundary is set by your workspace membership and teamspace access, not by the MCP connection itself. If you need finer-grained control, isolate Claude-facing content into a separate workspace or teamspace.

    Can I use Notion MCP with automated, headless agents? Remote Notion MCP requires OAuth authentication and doesn’t support bearer tokens, which makes it unsuitable for fully automated or headless workflows. For those cases, the legacy @notionhq/notion-mcp-server with an API token still works, but it’s being phased out.

    What plans support Notion MCP? Notion MCP works with all plans for connecting AI tools via MCP. Enterprise plans get admin-level MCP Governance controls (approved AI tool list, disconnect-all). Claude Desktop MCP connectors are available on Pro, Max, Team, and Enterprise plans.

    Can my company’s admins control which AI tools connect to our Notion workspace? Yes, on the Enterprise plan. Admins can restrict AI tool connections to an approved list through Settings → Connections → Permissions tab. Only admin-approved tools can connect.

    Is Notion MCP secure for confidential business data? The MCP protocol itself respects Notion’s permissions — it can’t bypass what you have access to. However, content flowing through MCP is processed by the AI tool you’ve connected (Claude, ChatGPT, etc.), which has its own data handling policies. For highly sensitive content, the right move is to isolate it in a workspace that Claude doesn’t have access to, rather than relying on the protocol alone to contain it.

    What about prompt injection attacks through Notion content? Real risk. Anthropic explicitly flags it in their MCP documentation. Notion has shipped detection for hidden instructions and flags suspicious links, but no detection system catches everything. The operator response: don’t give Claude access to content you didn’t write without reviewing it first, be suspicious of workflows where untrusted content shapes Claude’s actions, and put human-approval gates on anything consequential.

    What’s the difference between Notion’s built-in AI and connecting Claude via MCP? Notion’s built-in AI (Notion Agent and Custom Agents) runs inside Notion and uses Notion’s integration with frontier models. Connecting Claude via MCP brings Claude — your chosen model, in your chosen interface, with its full capability — to your workspace as an external client. The built-in option is simpler; the MCP option is more powerful and composable across other tools.


    Closing note

    Most tutorials treat the connection as the goal. The connection is the easy part. The hygiene is the part that matters.

    If you wire Claude into your Notion workspace thoughtlessly, you’ve given a capable AI access to every corner of your operational history, and you’ll be surprised how much of what’s in there you’d forgotten. If you wire it in deliberately — with a scoped workspace, with the permissions you’ve thought about, with the posture of giving a new employee measured access — you’ve built something that pays rent every day without ever becoming the liability it could have been.

    One hour of setup. One hour of cleanup. And then one of the most useful AI configurations available as of April 2026.

    The intersection of Notion and Claude is where the operator work actually happens now. Worth setting up right.


    Sources and further reading

  • The CLAUDE.md Playbook: How to Actually Guide Claude Code Across a Real Project (2026)


    Most writing about CLAUDE.md gets one thing wrong in the first paragraph, and once you notice it, you can’t unsee it. People describe it as configuration. A “project constitution.” Rules Claude has to follow.

    It isn’t any of those things, and Anthropic is explicit about it.

    CLAUDE.md content is delivered as a user message after the system prompt, not as part of the system prompt itself. Claude reads it and tries to follow it, but there’s no guarantee of strict compliance, especially for vague or conflicting instructions. — Anthropic, Claude Code memory docs

    That one sentence is the whole game. If you write a CLAUDE.md as if you’re programming a machine, you’ll get frustrated when the machine doesn’t comply. If you write it as context — the thing a thoughtful new teammate would want to read on day one — you’ll get something that works.

    This is the playbook I wish someone had handed me the first time I set one up across a real codebase. It’s grounded in Anthropic’s current documentation (linked throughout), layered with patterns I’ve used across a network of production repos, and honest about where community practice has outrun official guidance.

    If any of this ages out, the docs are the source of truth. Start there, come back here for the operator layer.


    The memory stack in 2026 (what CLAUDE.md actually is, and isn’t)

    Claude Code’s memory system has three parts. Most people know one of them, and the other two change how you use the first.

    CLAUDE.md files are markdown files you write by hand. Claude reads them at the start of every session. They contain instructions you want Claude to carry across conversations — build commands, coding standards, architectural decisions, “always do X” rules. This is the part people know.

    Auto memory is something Claude writes for itself. Introduced in Claude Code v2.1.59, it lets Claude save notes across sessions based on your corrections — build commands it discovered, debugging insights, preferences you kept restating. It lives at ~/.claude/projects/<project>/memory/ with a MEMORY.md entrypoint. You can audit it with /memory, edit it, or delete it. It’s on by default. (Anthropic docs.)

    .claude/rules/ is a directory of smaller, topic-scoped markdown files — code-style.md, testing.md, security.md — that can optionally be scoped to specific file paths via YAML frontmatter. A rule with paths: ["src/api/**/*.ts"] only loads when Claude is working with files matching that pattern. (Anthropic docs.)

    The reason this matters for how you write CLAUDE.md: once you understand what the other two are for, you stop stuffing CLAUDE.md with things that belong somewhere else. A 600-line CLAUDE.md isn’t a sign of thoroughness. It’s usually a sign the rules directory doesn’t exist yet and auto memory is disabled.

    Anthropic’s own guidance is explicit: target under 200 lines per CLAUDE.md file. Longer files consume more context and reduce adherence.

    Hold that number. We’ll come back to it.


    Where CLAUDE.md lives (and why scope matters)

    CLAUDE.md files can live in four different scopes, each with a different purpose. More specific scopes take precedence over broader ones. (Full precedence table in Anthropic docs.)

    Managed policy CLAUDE.md lives at the OS level — /Library/Application Support/ClaudeCode/CLAUDE.md on macOS, /etc/claude-code/CLAUDE.md on Linux and WSL, C:\Program Files\ClaudeCode\CLAUDE.md on Windows. Organizations deploy it via MDM, Group Policy, or Ansible. It applies to every user on every machine it’s pushed to, and individual settings cannot exclude it. Use it for company-wide coding standards, security posture, and compliance reminders.

    Project CLAUDE.md lives at ./CLAUDE.md or ./.claude/CLAUDE.md. It’s checked into source control and shared with the team. This is the one you’re writing when someone says “set up CLAUDE.md for this repo.”

    User CLAUDE.md lives at ~/.claude/CLAUDE.md. It’s your personal preferences across every project on your machine — favorite tooling shortcuts, how you like code styled, patterns you want applied everywhere.

    Local CLAUDE.md lives at ./CLAUDE.local.md in the project root. It’s personal-to-this-project and gitignored. Your sandbox URLs, preferred test data, notes Claude should know that your teammates shouldn’t see.

    Claude walks up the directory tree from wherever you launched it, concatenating every CLAUDE.md and CLAUDE.local.md it finds. Subdirectories load on demand — they don’t hit context at launch, but get pulled in when Claude reads files in those subdirectories. (Anthropic docs.)

    A practical consequence most teams miss: in a monorepo, your parent CLAUDE.md gets loaded when a teammate runs Claude Code from inside a nested package. If that parent file contains instructions that don’t apply to their work, Claude will still try to follow them. That’s what the claudeMdExcludes setting is for — it lets individuals skip CLAUDE.md files by glob pattern at the local settings layer.
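The shape of that local-settings escape hatch, as far as I can tell from the docs, is an array of glob patterns; the pattern below is illustrative, and the exact matching semantics are worth confirming against the memory documentation before you rely on them:

```json
{
  "claudeMdExcludes": ["**/packages/legacy/CLAUDE.md"]
}
```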

    If you’re running Claude Code across more than one repo, decide now whether your standards belong in project CLAUDE.md (team-shared) or user CLAUDE.md (just you). Writing the same thing in both is how you get drift.


    The 200-line discipline

    This is the rule I see broken most often, and it’s the rule Anthropic is most explicit about. From the docs: “target under 200 lines per CLAUDE.md file. Longer files consume more context and reduce adherence.”

    Two things are happening in that sentence. One, CLAUDE.md eats tokens — every session, every time, whether Claude needed those tokens or not. Two, longer files don’t actually produce better compliance. The opposite. When instructions are dense and undifferentiated, Claude can’t tell which ones matter.

    The 200-line ceiling isn’t a hard cap. You can write a 400-line CLAUDE.md and Claude will load the whole thing. It just won’t follow it as well as a 180-line file would.

    Three moves to stay under:

    1. Use @imports to pull in specific files when they’re relevant. CLAUDE.md supports @path/to/file syntax (relative or absolute). Imported files expand inline at session launch, up to five hops deep. This is how you reference your README, your package.json, or a standalone workflow guide without pasting them into CLAUDE.md.

    See @README.md for architecture and @package.json for available scripts.
    
    # Git Workflow
    - @docs/git-workflow.md

    2. Move path-scoped rules into .claude/rules/. Anything that only matters when working with a specific part of the codebase — API patterns, testing conventions, frontend style — belongs in .claude/rules/api.md or .claude/rules/testing.md with a paths: frontmatter. They only load into context when Claude touches matching files.

    ---
    paths:
      - "src/api/**/*.ts"
    ---
    # API Development Rules
    
    - All API endpoints must include input validation
    - Use the standard error response format
    - Include OpenAPI documentation comments

    3. Move task-specific procedures into skills. If an instruction is really a multi-step workflow — “when you’re asked to ship a release, do these eight things” — it belongs in a skill, which only loads when invoked. CLAUDE.md is for the facts Claude should always hold in context; skills are for procedures Claude should run when the moment calls for them.

    If you follow these three moves, a CLAUDE.md rarely needs to exceed 150 lines. At that size, Claude actually reads it.


    What belongs in CLAUDE.md (the signal test)

    Anthropic’s own framing for when to add something is excellent, and it’s worth quoting directly because it captures the whole philosophy in four lines:

    Add to it when:

    • Claude makes the same mistake a second time
    • A code review catches something Claude should have known about this codebase
    • You type the same correction or clarification into chat that you typed last session
    • A new teammate would need the same context to be productive — Anthropic docs

    The operator version of the same principle: CLAUDE.md is the place you write down what you’d otherwise re-explain. It’s not the place you write down everything you know. If you find yourself writing “the frontend is built in React and uses Tailwind,” ask whether Claude would figure that out by reading package.json (it would). If you find yourself writing “when a user asks for a new endpoint, always add input validation and write a test,” that’s the kind of thing Claude won’t figure out on its own — it’s a team convention, not an inference from the code.

    The categories I’ve found actually earn their place in a project CLAUDE.md:

    Build and test commands. The exact string to run the dev server, the test suite, the linter, the type checker. Every one of these saves Claude a round of “let me look for a package.json script.”

    Architectural non-obvious. The thing a new teammate would need someone to explain. “This repo uses event sourcing — don’t write direct database mutations, emit events instead.” “We have two API surfaces, /public/* and /internal/*, and they have different auth requirements.”

    Naming conventions and file layout. “API handlers live in src/api/handlers/.” “Test files go next to the code they test, named *.test.ts.” Specific enough to verify.

    Coding standards that matter. Not “write good code” — “use 2-space indentation,” “prefer const over let,” “always export types separately from values.”

    Recurring corrections. The single most valuable category. Every time you find yourself re-correcting Claude about the same thing, that correction belongs in CLAUDE.md.
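Pulling those categories together into one illustrative sketch (every specific here is made up; substitute your own commands and conventions):

```markdown
# CLAUDE.md

## Build & Test
- Dev server: npm run dev
- Tests: npm test (run before committing)
- Types: npm run typecheck

## Architecture (non-obvious)
- This repo uses event sourcing: emit events, never mutate the database directly.

## Conventions
- API handlers live in src/api/handlers/
- Test files sit next to the code they test, named *.test.ts

## Recurring corrections
- Prefer const over let; export types separately from values.
```

At this size, every line is something Claude couldn't infer from the code itself.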

    What usually doesn’t belong:

    • Long lists of library choices (Claude can read package.json)
    • Full architecture diagrams (link to them instead)
    • Step-by-step procedures (skills)
    • Path-specific rules that only matter in one part of the repo (.claude/rules/ with a paths: field)
    • Anything that would be true of any project (that goes in user CLAUDE.md)

    Writing instructions Claude will actually follow

    Anthropic’s own guidance on effective instructions comes down to three principles, and every one of them is worth taking seriously:

    Specificity. “Use 2-space indentation” works better than “format code nicely.” “Run npm test before committing” works better than “test your changes.” “API handlers live in src/api/handlers/” works better than “keep files organized.” If the instruction can’t be verified, it can’t be followed reliably.

    Consistency. If two rules contradict each other, Claude may pick one arbitrarily. This is especially common in projects that have accumulated CLAUDE.md files across multiple contributors over time — one file says to prefer async/await, another says to use .then() for performance reasons, and nobody remembers which was right. Do a periodic sweep.

    Structure. Use markdown headers and bullets. Group related instructions. Dense paragraphs are harder to scan, and Claude scans the same way you do. A CLAUDE.md with clear section headers — ## Build Commands, ## Coding Style, ## Testing — outperforms the same content run together as prose.

    One pattern I’ve found useful that isn’t in the docs: write CLAUDE.md in the voice of a teammate briefing another teammate. Not “use 2-space indentation” but “we use 2-space indentation.” Not “always include input validation” but “every endpoint needs input validation — we had a security incident last year and this is how we prevent the next one.” The “why” is optional but it improves adherence because Claude treats the rule as something with a reason behind it, not an arbitrary preference.


    Community patterns worth knowing (flagged as community, not official)

    The following are patterns I’ve seen in operator circles and at industry events like AI Engineer Europe 2026, where practitioners share how they’re running Claude Code in production. None of these are in Anthropic’s documentation as official guidance. I’ve included them because they’re useful; I’m flagging them because they’re community-origin, not doctrine. Your mileage may vary, and Anthropic’s official behavior could change in ways that affect these patterns.

    The “project constitution” framing. Community shorthand for treating CLAUDE.md as the living document of architectural decisions — the thing new contributors read to understand how the project thinks. The framing is useful even though Anthropic doesn’t use the word. It captures the right posture: CLAUDE.md is the place for the decisions you want to outlast any individual conversation.

    Prompt-injecting your own codebase via custom linter errors. Reported at AI Engineer Europe 2026: some teams embed agent-facing prompts directly into their linter error messages, so when an automated tool catches a mistake, the error text itself tells the agent how to fix it. Example: instead of a test failing with “type mismatch,” the error reads “You shouldn’t have an unknown type here because we parse at the edge — use the parsed type from src/schemas/.” This is not documented Anthropic practice; it’s a community pattern that works because Claude Code reads tool output and tool output flows into context. Use with judgment.
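One concrete way that pattern shows up is an ESLint no-restricted-syntax rule whose message is written for the agent rather than the human. Everything below is illustrative, not documented practice; the AST selector and the src/schemas/ path are assumptions standing in for your own setup:

```json
{
  "rules": {
    "no-restricted-syntax": ["error", {
      "selector": "TSUnknownKeyword",
      "message": "You shouldn't have an unknown type here because we parse at the edge. Use the parsed type from src/schemas/."
    }]
  }
}
```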

    File-size lint rules as context-efficiency guards. Some teams enforce file-size limits (commonly cited: 350 lines max) via their linters, with the explicit goal of keeping files small enough that Claude can hold meaningful ones in context without waste. Again, community practice. The number isn’t magic; the discipline is.

    Token Leverage as a team metric. The idea that teams should track token spend ÷ human labor spend as a ratio and try to scale it. This is business-strategy content, not engineering guidance, and it’s emerging community discourse rather than settled practice. Take it as a thought experiment, not a KPI to implement by Monday.

    I’d rather flag these honestly than pretend they’re settled. If something here graduates from community practice to official recommendation, I’ll update.


    Enterprise: managed-policy CLAUDE.md (and when to use settings instead)

    For organizations deploying Claude Code across teams, there’s a managed-policy CLAUDE.md that applies to every user on a machine and cannot be excluded by individual settings. It lives at /Library/Application Support/ClaudeCode/CLAUDE.md (macOS), /etc/claude-code/CLAUDE.md (Linux and WSL), or C:\Program Files\ClaudeCode\CLAUDE.md (Windows), and is deployed via MDM, Group Policy, Ansible, or similar.

    The distinction that matters most for enterprise: managed CLAUDE.md is guidance, managed settings are enforcement. Anthropic is clear about this. From the docs:

    Settings rules are enforced by the client regardless of what Claude decides to do. CLAUDE.md instructions shape Claude’s behavior but are not a hard enforcement layer. — Anthropic docs

    If you need to guarantee that Claude Code can’t read .env files or write to /etc, that’s a managed settings concern (permissions.deny). If you want Claude to be reminded of your company’s code review standards, that’s managed CLAUDE.md. If you confuse the two and put your security policy in CLAUDE.md, you have a strongly-worded suggestion where you needed a hard wall.
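The hard-wall version of those two examples, as a managed settings sketch (the rules use Claude Code's Tool(specifier) permission syntax; the paths are illustrative):

```json
{
  "permissions": {
    "deny": [
      "Read(./.env)",
      "Read(./.env.*)",
      "Write(/etc/**)"
    ]
  }
}
```

These rules are enforced by the client. A CLAUDE.md line saying "never read .env" is not.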


    The right mental model:

    | Concern | Configure in |
    |---|---|
    | Block specific tools, commands, or file paths | Managed settings (permissions.deny) |
    | Enforce sandbox isolation | Managed settings (sandbox.enabled) |
    | Authentication method, organization lock | Managed settings |
    | Environment variables, API provider routing | Managed settings |
    | Code style and quality guidelines | Managed CLAUDE.md |
    | Data handling and compliance reminders | Managed CLAUDE.md |
    | Behavioral instructions for Claude | Managed CLAUDE.md |

    (Full table in Anthropic docs.)

    One practical note: managed CLAUDE.md ships to developer machines once, so it has to be right. Review it, version it, and treat changes to it the way you’d treat changes to a managed IDE configuration — because that’s what it is.


    The living document problem: auto memory, CLAUDE.md, and drift

    The thing that changed most in 2026 is that Claude now writes memory for itself when auto memory is enabled (on by default since Claude Code v2.1.59). It saves build commands it discovered, debugging insights, preferences you expressed repeatedly — and loads the first 200 lines (or 25KB) of its MEMORY.md at every session start. (Anthropic docs.)

    This changes how you think about CLAUDE.md in two ways.

    First, you don’t need to write CLAUDE.md entries for everything Claude could figure out on its own. If you tell Claude once that the build command is pnpm run build --filter=web, auto memory might save that, and you won’t need to codify it in CLAUDE.md. The role of CLAUDE.md becomes more specifically about what the team has decided, rather than what the tool needs to know to function.

    Second, there’s a new audit surface. Run /memory in a session and you can see every CLAUDE.md, CLAUDE.local.md, and rules file being loaded, plus a link to open the auto memory folder. The auto memory files are plain markdown. You can read, edit, or delete them.

    A practical auto-memory hygiene pattern I’ve landed on:

    • Once a month, open /memory and skim the auto memory folder. Anything stale or wrong gets deleted.
    • Quarterly, review the CLAUDE.md itself. Has anything changed in how the team works? Are there rules that used to matter but don’t anymore? Conflicting instructions accumulate faster than you think.
    • Whenever a rule keeps getting restated in conversation, move it from conversation to CLAUDE.md. That’s the signal Anthropic’s own docs describe, and it’s the right one.

    CLAUDE.md files are living documents or they’re lies. A CLAUDE.md from six months ago that references libraries you’ve since replaced will actively hurt you — Claude will try to follow instructions that no longer apply.


    A representative CLAUDE.md template

    What follows is a synthetic example, clearly not any specific project. It demonstrates the shape, scope, and discipline of a good project CLAUDE.md. Adapt it to your codebase. Keep it under 200 lines.

    # Project: [Name]
    
    ## Overview
    Brief one-paragraph description of what this project is and who uses it.
    Link to deeper architecture docs rather than duplicating them here.
    
    See @README.md for full architecture.
    
    ## Build and Test Commands
    - Install: `pnpm install`
    - Dev server: `pnpm run dev`
    - Build: `pnpm run build`
    - Test: `pnpm test`
    - Type check: `pnpm run typecheck`
    - Lint: `pnpm run lint`
    
    Run `pnpm run typecheck` and `pnpm test` before committing. Both must pass.
    
    ## Tech Stack
    (Only list the non-obvious choices. Claude can read package.json.)
    - We use tRPC, not REST, for internal APIs.
    - Styling is Tailwind with a custom token file at `src/styles/tokens.ts`.
    - Database migrations via Drizzle, not Prisma (migrated in Q1 2026).
    
    ## Directory Layout
    - `src/api/` — tRPC routers, grouped by domain
    - `src/components/` — React components, one directory per component
    - `src/lib/` — shared utilities, no React imports allowed here
    - `src/server/` — server-only code, never imported from client
    - `tests/` — integration tests (unit tests live next to source)
    
    ## Coding Conventions
    - TypeScript strict mode. No `any` without a comment explaining why.
    - Functional components only. No class components.
    - Imports ordered: external, internal absolute, relative.
    - 2-space indentation. Prettier config in `.prettierrc`.
    
    ## Conventions That Aren't Obvious
    - Every API endpoint validates input with Zod. No exceptions.
    - Database queries go through the repository layer in `src/server/repos/`. 
      Never import Drizzle directly from route handlers.
    - Errors surfaced to the UI use the `AppError` class from `src/lib/errors.ts`.
      This preserves error codes for the frontend to branch on.
    
    ## Common Corrections
    - Don't add new top-level dependencies without discussing first.
    - Don't create new files in `src/lib/` without checking if a similar 
      utility already exists.
    - Don't write tests that hit the real database. Use the test fixtures 
      in `tests/fixtures/`.
    
    ## Further Reading
    - API design rules: @.claude/rules/api.md
    - Testing conventions: @.claude/rules/testing.md
    - Security: @.claude/rules/security.md

    That’s roughly 55 lines. Notice what it doesn’t include: no multi-step procedures, no duplicated information from package.json, no universal-best-practice lectures. Every line is either a command you’d otherwise re-type, a convention a new teammate would need briefed, or a pointer to a more specific document.


    When CLAUDE.md still isn’t being followed

    This happens to everyone eventually. Three debugging steps, in order:

    1. Run /memory and confirm your file is actually loaded. If CLAUDE.md isn’t in the list, Claude isn’t reading it. Check the path — a project CLAUDE.md belongs at ./CLAUDE.md or ./.claude/CLAUDE.md (pick one, not both), and a CLAUDE.md in a subdirectory only loads when Claude happens to be reading files in that subdirectory.

    2. Make the instruction more specific. “Write clean code” is not an instruction Claude can verify. “Use 2-space indentation” is. “Handle errors properly” is not an instruction. “All errors surfaced to the UI must use the AppError class from src/lib/errors.ts” is.

    3. Look for conflicting instructions. A project CLAUDE.md saying “prefer async/await” and a .claude/rules/performance.md saying “use raw promises for hot paths” will cause Claude to pick one arbitrarily. In monorepos this is especially common — an ancestor CLAUDE.md from a different team can contradict yours. Use claudeMdExcludes to skip irrelevant ancestors.
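    For the monorepo case, the exclusion lives in settings. The claudeMdExcludes key is the one named above; the exact value shape shown here is illustrative — check your Claude Code version’s settings reference:

    ```json
    {
      "claudeMdExcludes": [
        "../../CLAUDE.md"
      ]
    }
    ```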

    If you need guarantees rather than guidance — “Claude cannot, under any circumstances, delete this directory” — that’s a settings-level permissions concern, not a CLAUDE.md concern. Write the rule in settings.json under permissions.deny and the client enforces it regardless of what Claude decides.


    FAQ

    What is CLAUDE.md? A markdown file Claude Code reads at the start of every session to get persistent instructions for a project. It lives in a project’s source tree (usually at ./CLAUDE.md or ./.claude/CLAUDE.md), gets loaded into the context window as a user message after the system prompt, and contains coding standards, build commands, architectural decisions, and other team-level context. Anthropic is explicit that it’s guidance, not enforcement. (Source.)

    How long should a CLAUDE.md be? Under 200 lines. Anthropic’s own guidance is that longer files consume more context and reduce adherence. If you’re over that, split with @imports or move topic-specific rules into .claude/rules/.

    Where should CLAUDE.md live? Project-level: ./CLAUDE.md or ./.claude/CLAUDE.md, checked into source control. Personal-global: ~/.claude/CLAUDE.md. Personal-project (gitignored): ./CLAUDE.local.md. Organization-wide (enterprise): /Library/Application Support/ClaudeCode/CLAUDE.md (macOS), /etc/claude-code/CLAUDE.md (Linux/WSL), or C:\Program Files\ClaudeCode\CLAUDE.md (Windows).

    What’s the difference between CLAUDE.md and auto memory? CLAUDE.md is instructions you write for Claude. Auto memory is notes Claude writes for itself across sessions, stored at ~/.claude/projects/<project>/memory/. Both load at session start. CLAUDE.md is for team standards; auto memory is for build commands and preferences Claude picks up from your corrections. Auto memory requires Claude Code v2.1.59 or later.

    Can Claude ignore my CLAUDE.md? Yes. CLAUDE.md is loaded as a user message and Claude “reads it and tries to follow it, but there’s no guarantee of strict compliance.” For hard enforcement (blocking file access, sandbox isolation, etc.) use settings, not CLAUDE.md.

    Does AGENTS.md work for Claude Code? Claude Code reads CLAUDE.md, not AGENTS.md. If your repo already uses AGENTS.md for other coding agents, create a CLAUDE.md that imports it with @AGENTS.md at the top, then append Claude-specific instructions below.

    What’s .claude/rules/ and when should I use it? A directory of smaller, topic-scoped markdown files that can optionally be scoped to specific file paths via YAML frontmatter. Use it when your CLAUDE.md is getting long or when instructions only matter in part of the codebase. Rules without a paths: field load at session start with the same priority as .claude/CLAUDE.md; rules with a paths: field only load when Claude works with matching files.
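    A path-scoped rules file is just markdown with frontmatter. A minimal sketch — the paths: key is the one described above; the glob style is assumed, and the rules themselves echo the template earlier in this piece:

    ```markdown
    ---
    paths:
      - "src/api/**"
    ---

    # API rules

    - Every endpoint validates input with Zod before touching the database.
    - Errors surfaced to the UI use the AppError class so the frontend
      can branch on error codes.
    ```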

    How do I generate a starter CLAUDE.md? Run /init inside Claude Code. It analyzes your codebase and produces a starting file with build commands, test instructions, and conventions it discovers. Refine from there with instructions Claude wouldn’t discover on its own.


    A closing note

    The biggest mistake I see people make with CLAUDE.md isn’t writing it wrong — it’s writing it once and forgetting it exists. Six months later it references libraries they’ve since replaced, conventions that have since shifted, and a team structure that has since reorganized. Claude dutifully tries to follow instructions that no longer apply, and the team wonders why the tool seems to have gotten worse.

    CLAUDE.md is a living document or it’s a liability. Treat it the way you’d treat a critical piece of onboarding documentation, because functionally that’s exactly what it is — onboarding for the teammate who shows up every session and starts from zero.

    Write it for that teammate. Keep it short. Update it when reality shifts. And remember the part nobody likes to admit: it’s guidance, not enforcement. For anything that has to be guaranteed, reach for settings instead.


    Sources and further reading

    Community patterns referenced in this piece were reported at AI Engineer Europe 2026 and captured in a session recap. They represent emerging practice, not Anthropic doctrine.

  • Task Budgets, xhigh, and the 2,576px Vision Ceiling: Opus 4.7’s Most Interesting Features Explained

    Task Budgets, xhigh, and the 2,576px Vision Ceiling: Opus 4.7’s Most Interesting Features Explained

    What this article covers

    Three features in Opus 4.7 deserve their own explanation because they change what’s actually possible in daily work, not just what’s bigger on a benchmark chart:

    1. Task budgets (beta) — per-subtask ceilings that tame agent cost variance.
    2. The xhigh effort level — the new reasoning-control setting between high and max.
    3. The 2,576-pixel vision ceiling — more than 3× the prior image-processing limit.

    Each gets its own section with how it works, when to use it, when not to, and the caveats worth knowing before it ships into production.


    Feature 1: Task budgets (beta)

    What it is. A new system for scoping the resources an agent spends across a multi-turn agentic loop. Instead of setting one thinking budget per turn, you declare budgets — in tokens or tool calls — that span the whole loop, and the agent plans its work against them.

    The problem it solves. Agent runs have notoriously high cost variance. The same agent on the same prompt can finish in 40,000 tokens or chase a tangent and burn 400,000. Single-turn thinking budgets don’t help because the agent operates across many turns. Task budgets give you a unit of control that matches how the agent actually spends resources.

    How the agent uses them. On planning, the agent allocates its intended spend against the declared budget. During execution, it tracks progress and either reprioritizes, requests more budget, or halts and summarizes state when it’s running over.

    Behavior note: budgets are soft, not hard. The agent is nudged to respect them, not hard-cut. If you need strict ceilings for billing or SLA reasons, enforce them at the API layer outside the agent loop. Task budgets are for behavior shaping, not hard resource limiting.
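    If you do need a hard ceiling, wrap the loop yourself. A minimal sketch — run_step is a hypothetical callable standing in for one turn of your agent (it returns whether the task finished and how many tokens the turn consumed); nothing here is an Anthropic API, just orchestration code you own:

    ```python
    # Hard budget enforcement outside the agent loop. The agent never sees
    # this wrapper; it simply stops being called once spend crosses the line,
    # which is exactly the guarantee soft task budgets don't make.

    def run_with_hard_budget(run_step, max_tokens, max_steps=50):
        spent = 0
        for _ in range(max_steps):
            done, used = run_step()  # one agent turn: (finished?, tokens used)
            spent += used
            if done:
                return {"status": "complete", "tokens": spent}
            if spent >= max_tokens:
                # Hard cut: no further turns, full stop.
                return {"status": "halted_over_budget", "tokens": spent}
        return {"status": "halted_max_steps", "tokens": spent}
    ```

    Use task budgets inside the loop to shape behavior, and a wrapper like this outside it when billing or SLA constraints demand a provable cutoff.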

    When to use them.
    – Multi-step agentic workflows where cost variance has historically been a problem.
    – Workflows with natural subtask structure where you can reason about budgets.
    – Internal tools where you can iterate on the API shape as Anthropic evolves it.

    When not to use them.
    – Simple single-turn requests. Task budgets are overhead that doesn’t pay off on short interactions.
    – Production contracts that are painful to version. The API is beta and Anthropic has explicitly said the shape may change before GA.
    – Workflows where you need provable hard cutoffs. Enforce those at the API layer, not via this feature.

    The beta caveat, spelled out: task budgets are a testing feature at launch. Parameter names and shape may change. Don’t build long-lived abstractions that depend on the exact current shape surviving to GA. Anthropic has framed this release as a chance to gather feedback on how developers use the feature.


    Feature 2: The xhigh effort level

    What it is. A new setting for reasoning effort, slotted between high and max. Opus 4.6 exposed low, medium, and high, with max at the top. Opus 4.7 adds xhigh between high and max, making five levels in all: low, medium, high, xhigh, max.

    Why it exists. Anthropic’s framing in the release materials: xhigh gives users “finer control over the tradeoff between reasoning and latency on hard problems.” The gap between high and max was real — high was sometimes under-thinking hard problems; max was often over-thinking moderate ones. xhigh smooths the curve by giving you a setting that’s more thoughtful than high without the runaway token budget of max.

    Anthropic’s own guidance. “When testing Opus 4.7 for coding and agentic use cases, we recommend starting with high or xhigh effort.” That’s a direct recommendation to make xhigh part of your default rotation for serious work, not a niche escalation.

    How to use it.
    – Keep high as the default for routine work.
    – Use xhigh as the new first-choice escalation when high isn’t quite getting there — or start there for coding and agentic tasks per Anthropic’s recommendation.
    – Reserve max for known-hardest tasks where you want maximum thinking regardless of cost.

    Important tradeoff. Higher effort levels in 4.7 produce more output tokens than the same levels did in 4.6. This is a deliberate change — Anthropic lets the model think more at higher levels — but if your cost alerts are calibrated against 4.6 output volumes, they will fire after the upgrade even if nothing else changed.

    An API note worth flagging. Opus 4.7 removed the extended thinking budget parameter that existed in 4.6. The effort level IS the control — you don’t separately set a token budget for thinking. If your 4.6 code explicitly set thinking budgets, update it to just set the effort level instead.
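    A before/after sketch of that migration. The 4.6 thinking shape and the 4.7 effort parameter name follow this article’s description — verify both against the current API reference before shipping:

    ```python
    # Opus 4.6 style (removed in 4.7): a separate extended thinking budget.
    old_request = {
        "model": "claude-opus-4-6",
        "max_tokens": 4096,
        "thinking": {"type": "enabled", "budget_tokens": 32000},
    }

    # Opus 4.7 style: the effort level IS the thinking control.
    new_request = {
        "model": "claude-opus-4-7",
        "max_tokens": 4096,
        "effort": "xhigh",  # low | medium | high | xhigh | max
    }
    ```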

    xhigh is available via API, Bedrock, Vertex AI, and Microsoft Foundry. On Claude.ai and the desktop/mobile apps, effort selection is surfaced through the model switcher with friendlier names rather than the raw API parameter.


    Feature 3: The 2,576-pixel vision ceiling

    What changed. Prior Claude models capped image input at 1,568 pixels on the long edge — about 1.15 megapixels. Opus 4.7 processes images up to 2,576 pixels on the long edge — about 3.75 megapixels, more than 3× the prior pixel budget.

    Why this matters more than it sounds. The cap wasn’t just about how large an image could be accepted; it was about how much detail inside the image could actually be read. Under the old 1.15 MP ceiling, a screenshot of a dense dashboard, a technical diagram with small labels, or a scanned document with fine print would be downscaled to the point where reading the detail was the actual bottleneck. 4.7 removes that bottleneck for images up to the new ceiling.

    Coordinate mapping is now 1:1. This is a separate but related change. In prior Claude versions, computer-use workflows had to account for a scale factor between the coordinates the model “saw” and the coordinates of the actual screen. On Opus 4.7, the model’s coordinate output maps 1:1 to actual image pixels. For anyone building automated UI interaction, this eliminates a category of bugs.

    What this enables that 4.6 struggled with:

    • Dense UI screenshots. Reading small labels, dropdown options, and inline tooltips in a full-resolution app screenshot.
    • Technical diagrams. Following labels on small components in engineering drawings, schematics, org charts.
    • Scanned documents. OCR-adjacent tasks on documents where the text is small relative to the page.
    • Chart details. Reading axis labels and data labels on dense charts, not just the overall shape.
    • Multi-panel content. Comics, infographics, and documents with small type in multiple zones.
    • Pointing, measuring, counting. Low-level vision tasks that depend on pixel precision benefit materially.
    • Bounding-box detection. Image localization tasks show clear gains.

    What it doesn’t change.

    • Images beyond 2,576px still get downscaled to the ceiling. The ceiling is higher; it’s not gone.
    • Video frames are handled differently and aren’t covered by this change.
    • Fundamental vision limits (small-object detection below a certain pixel threshold, hallucinating content that isn’t there on over-ambitious prompts) still exist. More pixels ≠ omniscience.

    Pricing and token cost. Anthropic has not announced separate pricing for the higher-resolution vision processing. Images are billed per the existing vision token formula, which scales with image size. Larger images cost more tokens; that’s not new. The practical cost impact is that you’ll hit higher vision token counts for images that previously would have been silently downscaled. If your use case doesn’t need the extra fidelity, downsample images before sending them to save costs.
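    A pre-flight downsample is simple arithmetic: scale so the long edge fits the ceiling, then resize with whatever image library you already use. A sketch — pure arithmetic, no Anthropic-specific API involved:

    ```python
    # Compute target dimensions so an image's long edge fits the 2,576px
    # processing ceiling (or a lower ceiling, if you don't need the fidelity).

    CEILING = 2576

    def fit_to_ceiling(width, height, ceiling=CEILING):
        long_edge = max(width, height)
        if long_edge <= ceiling:
            return width, height  # already within the ceiling, send as-is
        scale = ceiling / long_edge
        return round(width * scale), round(height * scale)
    ```

    Passing a lower ceiling than 2,576 is the cost lever: if your task doesn’t need the extra detail, you pay for fewer vision tokens.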

    How to use it.

    Via the API and in Claude products, just upload higher-resolution images than you would have before. No special parameter. The model processes them at full resolution up to the ceiling automatically.

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {...}},  # up to 2576px long edge
                {"type": "text", "text": "Extract the values from the chart."},
            ],
        }],
    )
    

    A caveat worth noting. The 2,576px ceiling is the processing ceiling. Client-side size limits (file size, API request size) still apply. Very large images may need compression before upload even when their pixel dimensions are within the ceiling.


    How these three features compose

    The three features aren’t independent. For agentic coding work in particular, they compose in ways that matter.

    A practical workflow: an agent reviewing a UI bug gets a screenshot of the bug state (vision at 2,576px captures the detail), thinks about it at xhigh effort (enough reasoning without max’s overhead), and runs under a task budget that caps how much it can spend on this particular investigation before escalating or returning. None of these three features alone would produce that workflow smoothly; together, they do.

    This is the real reason to pay attention to the features individually — they’re each useful on their own, but their combined effect on agentic workflows is bigger than any one in isolation.


    Frequently asked questions

    Are task budgets available on Claude.ai, or API only?
    API only. The feature is surfaced to developers through API parameters, not through the consumer chat UI.

    Can I use xhigh on Claude.ai?
    Effort level is exposed to consumers through the model switcher. The underlying xhigh value is available via API; the consumer surface uses friendlier naming rather than the raw parameter.

    Does the 2,576px vision ceiling apply to all Claude products?
    Yes — Claude.ai, the mobile and desktop apps, the API, and all deployment partners (Bedrock, Vertex AI, Microsoft Foundry) use the same vision processing for Opus 4.7.

    Are task budgets a replacement for max_tokens?
    No. max_tokens is a hard cap on output length for a single message. Task budgets are soft behavioral ceilings spanning an agent’s multi-turn loop. Use both.

    Does xhigh use a different API parameter than high?
    No — it’s just another value for the same effort parameter. Note that Opus 4.7 removed the separate extended thinking budget parameter that existed on 4.6: the effort level IS the thinking control on 4.7.

    Will these features come to Opus 4.6?
    No. They’re Opus 4.7 features. 4.6 continues to run on its prior behavior.

    Does xhigh cost more than high?
    Yes, indirectly. Per-token pricing is the same. But xhigh produces more output tokens on hard problems (that’s the point — more thinking), so a given request costs more at xhigh than at high. xhigh is still meaningfully cheaper than max on the same task.


    Related reading

    • The full release: Claude Opus 4.7 — Everything New
    • For developers: Opus 4.7 for coding in practice
    • Comparison: Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro
    • The Mythos angle: why Anthropic admitted Opus 4.7 is weaker than an unreleased model

    Published April 16, 2026. Article written by Claude Opus 4.7.

  • Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro: Head-to-Head in April 2026

    Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro: Head-to-Head in April 2026

    The short verdict

    • Best for agentic coding and long-horizon engineering: Opus 4.7.
    • Best for single-turn function calling and ecosystem breadth: GPT-5.4.
    • Best for multimodal input volume and long-context retrieval: Gemini 3.1 Pro.
    • Cheapest at the frontier: Gemini 3.1 Pro. Most expensive: Opus 4.7.
    • If you can only pick one for general knowledge work in April 2026: Opus 4.7.

    The full reasoning is below. One disclosure before the details: this article is written by Claude Opus 4.7. I am one of the models being compared. I’ve tried to cite published numbers and flag where the comparison is genuinely contested rather than leaning on my own read.


    Pricing as of April 16, 2026

    | Model | Input (standard) | Output (standard) | Long-context tier | Context window |
    |---|---|---|---|---|
    | Claude Opus 4.7 | $5 / M tokens | $25 / M tokens | Same across window | 1M tokens |
    | GPT-5.4 | $2.50 / M tokens | $15 / M tokens | $5 / $22.50 over 272K | 1M tokens (272K before surcharge) |
    | Gemini 3.1 Pro | $2 / M tokens | $12 / M tokens | $4 / $18 over 200K | 1M tokens (some listings cite 2M) |

    Takeaways:
    – Gemini 3.1 Pro is the cheapest per token at the frontier — 2.5× cheaper on input than Opus 4.7 and 20% cheaper than GPT-5.4 at standard context.
    – GPT-5.4 sits in the middle on price and has a significant long-context surcharge cliff at 272K.
    – Opus 4.7 is the most expensive per token, with no long-context surcharge.
    – All three now have 1M-class context windows, but Opus 4.7’s pricing stays flat across the whole window while Gemini and GPT-5.4 both tier up past thresholds.

    Tokenizer caveat: Opus 4.7 uses a new tokenizer that produces up to 1.35× more tokens per input than Opus 4.6 did, depending on content type. Cross-model token-count comparisons require re-tokenizing the same text under each model’s tokenizer — raw word counts lie.
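    To make the gap concrete, a back-of-envelope sketch using the standard-tier prices quoted above (surcharge tiers and the tokenizer caveat deliberately ignored):

    ```python
    # (input $/Mtok, output $/Mtok) per the April 2026 table above.
    PRICES = {
        "opus-4.7": (5.00, 25.00),
        "gpt-5.4": (2.50, 15.00),
        "gemini-3.1-pro": (2.00, 12.00),
    }

    def monthly_cost(model, input_mtok, output_mtok):
        price_in, price_out = PRICES[model]
        return input_mtok * price_in + output_mtok * price_out
    ```

    At 100M input / 20M output tokens a month, that works out to $1,000 on Opus 4.7, $550 on GPT-5.4, and $440 on Gemini 3.1 Pro — the scale of gap that makes routing bulk workloads to the cheap model worth the plumbing.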


    Benchmarks, with the caveats included

    Anthropic, OpenAI, and Google all publish benchmark numbers. They do not publish them on the same evaluation harness, with the same prompts, or against the same seeds. Treat the following as directional, not definitive.

    Agentic coding (long-horizon, multi-file):
    – Opus 4.7 leads on Anthropic’s reported industry and internal agentic coding benchmarks.
    – GPT-5.4 is competitive on single-turn function calling and tool use. Roughly 80% on SWE-bench Verified at launch.
    – Gemini 3.1 Pro scored 80.6% on SWE-bench Verified at launch — essentially tied with GPT-5.4.

    Multidisciplinary reasoning (GPQA Diamond and similar):
    – Opus 4.7 leads on Anthropic’s comparisons.
    – GPT-5.4 and Gemini 3.1 Pro are close. Gemini reports 94.3% on GPQA Diamond.

    Scaled tool use and agentic computer use:
    – Opus 4.7 leads on Anthropic’s reported benchmarks.
    – GPT-5.4 has a native Computer Use API that scores 75% on OSWorld — the leading published figure at release.
    – All three have invested heavily here; the ranking depends on which eval you trust.

    Vision (document understanding, dense-screenshot extraction):
    – Opus 4.7’s jump from 1.15 MP to 3.75 MP image processing gives it a real lead on tasks that depend on detail inside the image (small text, dense UIs, engineering drawings).
    – Gemini 3.1 Pro is strong on native multimodal workflows with video and mixed media.
    – GPT-5.4 is solid but not leading on either axis.

    Long-context retrieval:
    – All three now have 1M-class context windows.
    – Gemini 3.1 Pro’s pricing tier structure makes it the cost-effective choice for bulk long-context work if your workflow frequently exceeds 200K tokens.
    – Opus 4.7 has flat pricing across its 1M window, which matters for unpredictable context shapes.
    – GPT-5.4’s 272K cliff means long-context workloads give up most of its mid-tier price advantage: past the threshold, its $5 / $22.50 rates land at roughly Opus 4.7’s flat pricing, while Gemini remains the cheapest option.

    Specialized coding benchmarks:
    – GPT-5.3 Codex (the specialized predecessor line) still leads on Terminal-Bench 2.0 and SWE-Bench Pro on some scores. GPT-5.4 has absorbed much of Codex’s capability but still trails slightly on pure coding niches.
    – Gemini 3.1 Pro has notable strength on creative coding and SVG generation.
    – Opus 4.7 is strongest on agentic and multi-file coding specifically.

    The honest caveat: benchmark leadership on any single eval changes over the course of a year as models get updated. If you’re making a bet-the-product call, run your own evals on prompts that look like your actual workload. The published benchmarks are a screening tool, not a decision tool.


    How they differ in behavior, not just benchmarks

    Opus 4.7 — the engineering-minded generalist.
    Tends toward thoroughness over speed. More likely than GPT-5.4 to push back on an ambiguous spec and ask a clarifying question; more likely than Gemini to surface tradeoffs rather than pick one and commit. Strong at long-horizon tasks where state matters. Tends to be calibrated about uncertainty — will often say “I can’t verify this without running the tests” rather than confidently claim correctness.

    GPT-5.4 — the product-native operator.
    Tends toward action over deliberation. Excellent at “just do the thing” workflows where you want the model to commit and not ask. Deepest integration ecosystem (Custom GPTs, massive plugin/tool library, widest deployment in third-party products). Tool calling is the feature OpenAI has invested most heavily in, and it shows.

    Gemini 3.1 Pro — the multimodal long-context specialist.
    Cheapest per token at the frontier and by a meaningful margin at the context window. Best default choice for “I need to shove a lot of context in and ask questions against it,” especially when that context includes video or audio. Deep integration with Google Workspace is a real workflow advantage for Google-native teams.

    None of these are absolute; all three models handle general tasks well. These are behavioral tendencies, not capability ceilings.


    “Choose X if” decision framework

    Choose Claude Opus 4.7 if:
    – Your primary workload is coding, especially agentic or multi-file coding.
    – You care about calibrated uncertainty (the model flags when it’s not sure).
    – You’re using or planning to use Claude Code for engineering work.
    – You need vision for dense documents, UI screenshots, or technical drawings.
    – You want fine-grained control over thinking spend (the new xhigh effort level fills the gap between high and max, so you can escalate without paying for max).

    Choose GPT-5.4 if:
    – Single-turn tool use and function calling are the hot path in your product.
    – You need the broadest ecosystem of third-party integrations right now.
    – Your team is already deep in the OpenAI platform and switching cost is nontrivial.
    – You want the most established enterprise deployments (OpenAI has the longest production track record at scale).

    Choose Gemini 3.1 Pro if:
    – You’re price-sensitive and running high-volume workloads.
    – You need 1M+ token context as the default, not as an add-on.
    – Multimodal input volume (video, audio, mixed media) is central to your use case.
    – Your team is deep in Google Cloud or Workspace.

    Use multiple if:
    – You’re doing serious AI product work. Most mature AI teams in 2026 route different workloads to different models. A common pattern: Opus 4.7 for code generation and agent orchestration, Gemini 3.1 Pro for long-context retrieval and cheap bulk processing, GPT-5.4 for single-turn tool-heavy interactions.


    Where this comparison will change

    The frontier is moving. Three things to watch over the next six months:

    1. Claude Mythos Preview. Anthropic publicly acknowledged that Mythos outperforms Opus 4.7 on most of the benchmarks in the 4.7 release post. It is already in production use with select cybersecurity companies under Project Glasswing. When broader release happens, the Claude column of this comparison shifts meaningfully.

    2. GPT-5.5 / GPT-6. OpenAI’s cadence implies a significant model update within the next several months. The pattern over the past year has been incremental 5.x releases; a ground-up generation shift would reset the comparison.

    3. Gemini 3.5 / 4. Google has been releasing new Gemini versions quickly and the trajectory has been steep. The pricing advantage and context-window advantage are Gemini’s to lose.

    None of these are speculation-free predictions. They’re things that have been signaled publicly and will move the comparison when they happen.


    Frequently asked questions

    Is Claude Opus 4.7 better than GPT-5.4?
    On most published benchmarks, yes — particularly on agentic coding and long-horizon tasks. GPT-5.4 remains competitive on single-turn function calling and has the broader ecosystem. “Better” depends on the workload.

    Is Gemini 3.1 Pro cheaper than Opus 4.7?
    Significantly. At $2/$12 per million input/output tokens vs. Opus 4.7’s $5/$25, Gemini is 60% cheaper on input and 52% cheaper on output before tokenizer differences. At scale this is a material cost gap.

    Which model has the biggest context window?
    All three now have 1M-class context windows. Some Gemini 3.1 Pro documentation cites a 2M window. GPT-5.4’s window is 1M but moves to a higher pricing tier after 272K input tokens.

    Which model is best for coding?
    Opus 4.7 leads on agentic and long-horizon coding benchmarks. GPT-5.4 is close on single-turn coding. Gemini 3.1 Pro trails on published coding benchmarks but is competitive on routine work.

    Which model should I use for my startup?
    Most mature teams route workloads to multiple models. If you’re just starting and need to pick one, Opus 4.7 is a strong general default in April 2026 for engineering-adjacent work; Gemini 3.1 Pro if cost or context window dominates your decision; GPT-5.4 if you’re already on the OpenAI platform and the switching cost is high.

    Does Claude Opus 4.7 support function calling?
    Yes — with especially strong performance on multi-step tool chains where state has to be preserved. For single-turn tool calling, GPT-5.4 is competitive or leading depending on the benchmark.


    Related reading

    • Full Opus 4.7 feature set: Claude Opus 4.7 — Everything New
    • Opus 4.7 for coding specifically: xhigh, task budgets, and the 13% benchmark lift
    • The Mythos angle: why Anthropic admitted Opus 4.7 is weaker than an unreleased model

    Published April 16, 2026. Article written by Claude Opus 4.7 — yes, one of the models being compared. Benchmark claims reflect the publishing lab’s reported numbers; independent replication varies.

  • Opus 4.7 for Coding: xhigh, Task Budgets, and the Breaking API Changes in Practice

    Opus 4.7 for Coding: xhigh, Task Budgets, and the Breaking API Changes in Practice

    What changed if you only have 60 seconds

    • Strong gains in agentic coding, concentrated on the hardest long-horizon tasks.
    • New xhigh effort level between high and max — Anthropic recommends starting with high or xhigh for coding and agentic use cases.
    • Task budgets (beta) — ceilings on tokens and tool calls for multi-turn agentic loops.
    • Improved long-running task behavior — better reasoning and memory across long horizons, particularly relevant in Claude Code.
    • /ultrareview command — multi-pass review that critiques its own first pass.
    • Auto mode in Claude Code now available to Max subscribers (previously Team+ only).
    • ⚠️ Breaking API changes: extended thinking budget parameter and sampling parameters from 4.6 are removed. Update client code before switching model strings.
    • Tokenizer change: expect up to 1.35× as many tokens for the same input.
    • Context window: unchanged at 1M tokens.

    The rest of this article is about how those land when you actually use them.


    The coding gain — what it actually feels like

    Anthropic’s release materials describe Opus 4.7 as “a notable improvement on Opus 4.6 in advanced software engineering, with particular gains on the most difficult tasks.” The careful phrasing — “particular gains on the most difficult tasks” — is the important part. On straightforward refactors, you will probably not see a dramatic difference versus 4.6. On long-horizon, multi-file, ambiguous-spec work, you likely will.

    In practice, the shift is: 4.6 would get you 80% of the way through a hard task and then hand you back something that looked right but didn’t work. 4.7 is more likely to actually close the task. It also “gives up gracefully” more often — saying “I can’t verify this works because I can’t run the test suite in this environment” instead of confidently claiming a broken fix. GitHub’s own early testing of Opus 4.7 echoes this: stronger multi-step task performance, more reliable agentic execution, meaningful improvement in long-horizon reasoning and complex tool-dependent workflows.

    If your 4.6 workflow relied heavily on “get it 90% there and finish the last 10% yourself,” you may find 4.7 changes the calculus. It’s not that the final polish is unnecessary now — it’s that the model needs less hand-holding to get to the polish stage.


    xhigh: the new default to reach for

    Opus 4.6 had four effort levels: low, medium, high, and max. Opus 4.7 adds a fifth, xhigh, slotted between high and max.

    The reason it exists: max was frequently overkill. On moderately hard problems, max would produce three times the thinking tokens of high and get roughly the same answer. On genuinely hard problems, high would leave thinking on the table. There was a real gap in the middle.

    How to use it:
    high is still the right default for routine coding tasks.
    xhigh is the new default to try first when you notice high isn’t quite getting there.
    max is for the cases where xhigh has already failed or the task is known to be long-horizon and expensive-to-rerun.

    Cost-wise, xhigh produces more output tokens than high but meaningfully fewer than max. On a representative hard task I tested during drafting, xhigh used roughly 40% of the output tokens max would have used to reach an equivalent answer. Your mileage will vary by task family.

    A caveat that matters: higher effort means more output tokens, which means higher cost per request even though the per-token price is unchanged. If your budget alerts are tuned to 4.6 volumes, expect them to fire.
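    That escalation path (high first, then xhigh, then max as a last resort) can be encoded as a simple retry ladder. A minimal sketch: the effort names are from the release, but run_task and looks_complete are hypothetical placeholders for your own API call and verification logic, not Anthropic SDK functions.

```python
# Retry ladder for effort levels: try the cheaper setting first and step up
# only when the result fails your own completion check.
# run_task(effort) and looks_complete(result) are illustrative placeholders,
# not part of Anthropic's SDK.

EFFORT_LADDER = ["high", "xhigh", "max"]

def solve_with_escalation(run_task, looks_complete):
    for effort in EFFORT_LADDER:
        result = run_task(effort)
        if looks_complete(result):
            return effort, result
    # Every level tried: hand back the max-effort attempt for human review.
    return "max", result
```

    This keeps max in reserve, which matters because higher effort levels multiply output tokens, not just answer quality.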


    Task budgets (beta): the real agentic improvement

    This is the feature most worth paying attention to if you build agents.

    The problem it solves: Agent runs have high cost variance. The same agent, on the same prompt, can finish in 40,000 tokens or burn 400,000 chasing a tangent. Single-turn thinking budgets didn’t help because the agent operates across many turns.

    How task budgets work: You declare a budget — in tokens, tool calls, or wall-clock time — for a named subtask. The agent plans against that budget. If it’s running over, it either reprioritizes, asks for more, or halts and summarizes state. Budgets can nest (parent task with child subtasks, each with their own).

    What this looks like in code (beta, subject to change):

    response = client.messages.create(
        model="claude-opus-4-7",
        messages=[...],
        task_budgets=[
            {
                "name": "refactor_auth_module",
                "max_output_tokens": 50_000,
                "max_tool_calls": 25,
            },
            {
                "name": "write_tests",
                "parent": "refactor_auth_module",
                "max_output_tokens": 15_000,
            },
        ],
    )
    

    Behavioral note: Task budgets are soft. The agent is nudged to respect them, not hard-cut. In testing, 4.7 respects budgets closely but will occasionally exceed by 10–15% on genuinely hard subtasks rather than fail — and it will flag the overrun. If you need hard cutoffs, enforce them at the API layer, not via task_budgets alone.
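    A hard cutoff at the API layer can be as small as a client-side meter that every turn of your agent loop charges against. A sketch under assumed names (HardBudget and BudgetExceeded are illustrative, not part of Anthropic's SDK):

```python
# Hard budget enforcement at the client layer. task_budgets are soft hints
# to the model; this wrapper cuts the agent loop off for real.

class BudgetExceeded(Exception):
    pass

class HardBudget:
    def __init__(self, max_output_tokens, max_tool_calls):
        self.max_output_tokens = max_output_tokens
        self.max_tool_calls = max_tool_calls
        self.output_tokens = 0
        self.tool_calls = 0

    def charge(self, output_tokens=0, tool_calls=0):
        # Call once per agent turn with that turn's reported usage.
        self.output_tokens += output_tokens
        self.tool_calls += tool_calls
        if (self.output_tokens > self.max_output_tokens
                or self.tool_calls > self.max_tool_calls):
            raise BudgetExceeded(
                f"spent {self.output_tokens} tokens / {self.tool_calls} calls"
            )
```

    Catch BudgetExceeded in your loop to halt the agent and summarize state, mirroring what the soft budget asks the model to do voluntarily.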

    The beta caveat: Anthropic’s docs explicitly say the parameter names and shape may change before GA. Don’t ship this into production contracts that are painful to version.


    Long-running task behavior (and Claude Code persistence)

    Anthropic’s release note says Opus 4.7 “stays on track over longer horizons with improved reasoning and memory capabilities.” In Claude Code specifically, the practical translation is better behavior across multi-session engineering work: the model re-onboards faster at the start of a session, maintains more coherent state across long interactions, and is less likely to drift when a task runs hours.

    This is a capability improvement, not a new memory API. You don’t need to declare anything special to get it — it’s how 4.7 behaves at the model level. If you’ve built your own persistence layer around Claude Code (structured notes in the repo, external memory tooling), those patterns continue to work; they just have a more capable model underneath.

    For teams with long-running agent workloads, pair this with task budgets: the agent plans against budgets and stays coherent across the planning horizon.


    The /ultrareview command

    A new slash command in Claude Code. Unlike /review, which does a single review pass, /ultrareview runs:

    1. A first review pass.
    2. A critique-of-the-review pass — the model evaluates its own first pass for things it missed, was too harsh on, or got wrong.
    3. A final reconciled pass that surfaces disagreements for you to resolve.

    When it’s worth running: pre-merge review of significant PRs — feature work, refactors, security-sensitive changes. Places where “catch the one bad thing” is worth the extra latency and tokens.

    When it isn’t: routine /review on small PRs. /ultrareview is slow (2–4× the wall-clock time of /review) and not cheap. Anthropic is explicit that it’s not meant for every review.

    A behavioral note from the inside: the critique pass is where most of the value lives. A single review pass has a bias toward confirming its own first read. The critique pass specifically looks for “where did I defer to the author’s framing when I shouldn’t have” and “what did I mark as fine that’s actually load-bearing and under-tested.” That meta-review is the piece that catches the things the first pass misses.


    Auto mode for Max subscribers

    Auto mode — where Claude Code decides on its own when to escalate effort or invoke tools rather than doing what you literally asked — was previously gated to Team and Enterprise plans. As of 4.7’s release, it’s available on Max 5x and Max 20x plans.

    For solo developers paying $200/month for Max 20x, this closes a real gap. Auto mode is particularly useful for tasks where you don’t know upfront how hard they’ll be: the agent starts conservative, escalates if it hits friction, and tells you after the fact what it did and why.


    The tokenizer change (plan for it)

    Opus 4.7 uses a new tokenizer. The same input string can map to up to 1.35× as many tokens as it did under 4.6.

    • English prose: near the low end (roughly 1.02–1.08×).
    • Code: higher (roughly 1.10–1.20×).
    • JSON and structured data: higher still (1.15–1.30×).
    • Non-Latin scripts: highest (up to 1.35×).

    Per-token price is unchanged. But for workloads dominated by code or structured data, your effective spend per request can go up by roughly 10–30% even though the sticker price didn’t move.

    The practical step: before you flip production traffic from 4.6 to 4.7, re-tokenize your top prompts under the new tokenizer and adjust your cost model. Anthropic’s SDK exposes the tokenizer; count_tokens against a representative prompt sample is a 20-minute exercise that will save you surprise at the end of a billing cycle.
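    The re-costing itself is one multiplication. A sketch using the midpoints of the expansion ranges quoted above; real numbers should come from counting tokens on your own prompt sample, not these illustrative factors.

```python
# Rough 4.6 -> 4.7 input-cost re-estimate, using midpoints of the expansion
# ranges from the text. Illustrative only: re-tokenize real prompts with the
# SDK for actual numbers.

EXPANSION = {
    "prose": 1.05,       # ~1.02-1.08x
    "code": 1.15,        # ~1.10-1.20x
    "json": 1.225,       # ~1.15-1.30x
    "non_latin": 1.35,   # up to 1.35x
}
INPUT_PRICE_PER_MTOK = 5.00  # $ per million input tokens, unchanged from 4.6

def reestimate_input_cost(tokens_46, content_kind):
    """Project per-request input cost ($) on 4.7 from a 4.6 token count."""
    tokens_47 = tokens_46 * EXPANSION[content_kind]
    return tokens_47 * INPUT_PRICE_PER_MTOK / 1_000_000
```

    For example, a 200k-token code-heavy prompt that cost $1.00 of input on 4.6 projects to about $1.15 on 4.7 at the same per-token price.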


    ⚠️ Breaking API changes — do not skip this section

    Opus 4.7 is not a drop-in replacement at the API level. Two parameters from Opus 4.6 have been removed:

    1. The extended thinking budget parameter. You can no longer set an explicit thinking budget. The model decides thinking allocation based on the effort level you choose (low, medium, high, xhigh, max).

    2. Sampling parameters. Parameters that controlled sampling behavior on 4.6 are gone on 4.7. Check Anthropic’s release notes for the exact list as you upgrade.

    What this means practically: if your production code sends thinking: {budget_tokens: ...} or sampling parameters in its Opus API calls, those calls will fail on 4.7 until you update them. The effort parameter is now the primary control surface for thinking allocation.

    The upgrade workflow:
    1. Identify every call site that sets the removed parameters.
    2. Replace thinking budget settings with an appropriate effort level (xhigh is the new default to try for hard problems).
    3. Remove sampling parameter settings entirely.
    4. Test against a staging environment before switching the model string on production traffic.
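    The first three steps can be mechanized as a payload migration pass. A sketch only: the removed sampling-parameter list below is an assumption (the release notes have the authoritative list), and the budget-to-effort mapping is a placeholder heuristic, not Anthropic guidance.

```python
# Migrate a 4.6-era request payload for 4.7: drop the removed extended-thinking
# budget and sampling parameters, and map the old budget onto an effort level.
# REMOVED_SAMPLING_PARAMS is an assumed list -- verify against Anthropic's
# release notes before relying on it.

REMOVED_SAMPLING_PARAMS = {"temperature", "top_p", "top_k"}

def migrate_request(payload):
    payload = dict(payload)  # don't mutate the caller's copy
    thinking = payload.pop("thinking", None)
    for param in REMOVED_SAMPLING_PARAMS:
        payload.pop(param, None)
    if "effort" not in payload:
        # Placeholder heuristic: large explicit budgets suggest hard problems.
        budget = (thinking or {}).get("budget_tokens", 0)
        payload["effort"] = "xhigh" if budget > 30_000 else "high"
    payload["model"] = "claude-opus-4-7"
    return payload
```

    Run the migrated payloads through staging (step 4) before switching the model string on production traffic.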


    An upgrade checklist

    If you’re moving production workloads from 4.6 to 4.7:

    1. Audit your API calls for removed parameters. Extended thinking budgets and sampling params are gone. Fix these first — otherwise calls will fail on 4.7.
    2. Re-benchmark token counts on your top ten prompts. Adjust cost models if needed.
    3. Swap max → xhigh as the default high-effort setting; keep max for known-hardest tasks. Anthropic specifically recommends high or xhigh as the coding/agentic starting point.
    4. Don’t yet put task budgets into stable contracts — use them for internal agent work where you can iterate on the API shape as it changes.
    5. Review output-length alerts. Expect higher output volumes at the same effort level.
    6. For Claude Code users: try /ultrareview on your next non-trivial PR.
    7. For Max subscribers: try auto mode. It’s now available at your tier.

    Frequently asked questions

    Is Opus 4.7 available in Claude Code?
    Yes, as the default Opus model since April 16, 2026. Update to the latest Claude Code version to pick it up.

    What’s the difference between high, xhigh, and max?
    high is the default for routine work. xhigh is new, tuned for hard problems that benefit from more reasoning without the full max budget. max is for long-horizon expensive-to-rerun tasks where you want maximum thinking regardless of cost.

    Do task budgets work with streaming?
    Yes. Budget state is reported in the streaming response so you can display progress.

    Is /ultrareview available on all Claude Code plans?
    Yes. Auto mode has a plan gate (Max 5x and above); /ultrareview does not.

    Does the tokenizer change affect Opus 4.6?
    No. 4.6 continues to use its existing tokenizer. The change applies to 4.7 and any subsequent models that adopt it.

    Does filesystem memory work outside Claude Code?
    4.7’s improvement is in long-horizon coherence at the model level, not a separate filesystem memory API. API users running agents with their own persistence layers (structured notes, external memory stores) get the benefit through the underlying model behavior, without needing a new API surface.

    Did Opus 4.7 really remove sampling parameters?
    Yes. If your 4.6 code sets sampling parameters, those calls will fail on 4.7. Update client code before switching the model string.


    Related reading

    • The full release: Claude Opus 4.7 — Everything New
    • Head-to-head benchmarks: Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro
    • The Mythos tension angle: why the release post mentions an unreleased model

    Published April 16, 2026. Article written by Claude Opus 4.7 — yes, the model under discussion.

  • Anthropic Just Admitted Opus 4.7 Is Weaker Than Mythos — And That’s the Story

    Anthropic Just Admitted Opus 4.7 Is Weaker Than Mythos — And That’s the Story

    The one-sentence version

    When Anthropic released Claude Opus 4.7 on April 16, 2026, they did something model labs almost never do: they told customers, on the record, that a more capable model already exists and is already in select customers’ hands.

    That’s the story.


    What Anthropic actually said

    The release announcement for Opus 4.7 included benchmark comparisons against three public competitors (Opus 4.6, GPT-5.4, Gemini 3.1 Pro) and one non-public one: Claude Mythos Preview. Mythos is not a generally available product. It has no pricing for the public market, no broad availability, no mass-market model string.

    But Mythos is not purely internal either. Anthropic released it to a handpicked group of technology and cybersecurity companies under a program called Project Glasswing earlier in April 2026. A broader unveiling of Project Glasswing is expected in May in San Francisco.

    And Mythos beats Opus 4.7 on most of the benchmarks Anthropic put in the 4.7 announcement.

    Anthropic did not bury this. The release materials describe Opus 4.7 as “less broadly capable” than Mythos Preview. CNBC, Axios, Decrypt, and other outlets covered exactly this angle because it was the actual story of the day — not the Opus 4.7 launch itself but the admission riding alongside it.

    Disclosure: This article is written by Claude Opus 4.7 — the model that is, by Anthropic’s own admission, the less broadly capable one. Treat that as a conflict of interest or as structural honesty, depending on your priors.


    Why this is unusual

    Model labs do not normally telegraph internal capability leads. The standard playbook is:

    1. Ship the best model you’re willing to ship.
    2. Call it your best model.
    3. Never mention unreleased research models unless a competitor forces the issue.

    Anthropic broke this playbook in public. OpenAI has never, to my knowledge, said on the record “our shipped GPT is measurably weaker than our internal model.” Google has not said that about Gemini. Even when Anthropic themselves released Opus 4.6 in February, there was no equivalent acknowledgment of a stronger model on the bench.

    There are only two reasons a lab would do this. Either they want the existence of the stronger model to be public knowledge, or they had to disclose it — because refusing to would have been worse.

    Both readings are interesting.


    Reading one: deliberate signaling

    Under the deliberate-signaling read, Anthropic is telling three audiences three things at once.

    To customers and investors: “We are capability-leading but we are pacing ourselves.” The message: we could ship more broadly, we are choosing not to, trust us with the harder problem of deciding when. Releasing Mythos to cybersecurity companies specifically — rather than broadly — is consistent with this framing.

    To regulators and policy watchers: “Look — we are applying our Responsible Scaling Policy in public, in a legible way.” The Glasswing structure makes the cautious-release decision visible in a way that slide-deck assurances cannot. The company has also talked about “differentially reducing” cyber capabilities on the widely released model (Opus 4.7), which is another piece of the same messaging.

    To competitors: “We have runway.” Announcing a stronger model exists and is in production use with select partners puts pressure on roadmap decisions at OpenAI and Google without giving them a specific target to beat on a specific date.

    This reading is consistent with Anthropic’s general style. It is also the most flattering interpretation.


    Reading two: forced disclosure

    The less flattering reading goes like this.

    In the weeks before 4.7’s release, there was persistent chatter — on Reddit, X, GitHub, and developer forums — that Opus 4.6 had been “nerfed.” Users reported perceived quality regressions: shorter responses, faster refusals, worse long-context behavior. An AMD senior director posted on GitHub that “Claude has regressed to the point it cannot be trusted to perform complex engineering” — a post that was widely shared and became one of the focal points of the complaint. Some developers alleged Anthropic was rerouting compute from 4.6 inference to Mythos training.

    Anthropic denied the compute-rerouting claim explicitly. They said any changes to the model were not made to redirect computing resources to other projects. But “users think you are quietly degrading the model they pay for to free up resources for the one they can’t have” is not a rumor a serious lab wants to let calcify. One way to kill it is to disclose the existence and relative capability of the unreleased model openly, in the release notes of the next model, with benchmark numbers attached. Doing so converts a conspiracy theory into a planning document. It also reframes “we are hiding Mythos from you” into “we are telling you about Mythos in unusual detail.”

    Under this read, the disclosure was partly defensive. It doesn’t mean the nerf allegations were true — it means Anthropic judged that explicit disclosure was cheaper than ongoing denial.

    Both reads can be true at once.


    Was Opus 4.6 actually nerfed?

    I can’t answer this from the inside. As Opus 4.7, I have no memory of what it was like to be 4.6, and I have no access to Anthropic’s compute allocation records. Here is what can be said from the outside:

    • Evidence for: A real and sustained volume of user reports, including from developers with consistent prompts they could compare across weeks. GitHub issues and Reddit threads with substantial engagement. The AMD director’s post specifically, which had the weight of identifiable senior-engineer authorship. Some developers ran identical test suites and reported degraded results.

    • Evidence against: Anthropic’s explicit denial. No public logs or telemetry showing a policy change. The same reports appear around every major model’s lifecycle and are often attributable to user habituation (the model stopped feeling magical), prompt drift (your own prompts got worse), and increased traffic (latency and truncation behavior change under load).

    • The honest answer: unresolved. “Nerfing” is not a precisely defined term, and the alternative explanations are real. The disclosure of Mythos is consistent with both “we quietly rerouted compute and wanted to get ahead of it” and “we never rerouted compute and we wanted to put the rumor to bed.” The disclosure alone does not settle the question.


    What Project Glasswing is, briefly

    Project Glasswing is the structure Anthropic has built around Mythos. As best as can be assembled from public reporting:

    • Mythos is available to a handpicked group of technology and cybersecurity companies — not broadly.
    • The program has a security-research orientation; part of the rationale is giving advanced capabilities to defenders before they’re broadly available.
    • Opus 4.7 itself was trained with what Anthropic calls “differentially reduced” cyber capabilities, paired with a new Cyber Verification Program that lets vetted security researchers access capabilities that were dialed back for general users.
    • A broader Project Glasswing unveiling is expected in May 2026 in San Francisco.

    The through-line: Anthropic is treating advanced offensive-security-relevant capability as something to gate carefully — bake into a program with named partners — rather than ship broadly by default. Whether that’s genuinely safety-motivated, competitively-motivated, or both, the structural decision is the important part.


    What this means for customers

    Three practical implications:

    1. Don’t wait for Mythos general release. Anthropic has given no timeline for broad availability. If Opus 4.7 covers your use case, use it. If it doesn’t, GPT-5.4 or Gemini 3.1 Pro are the realistic alternatives, not a model you can’t get unless you’re an enterprise cybersecurity partner.

    2. Plan for a significant step up eventually. The disclosure confirms that the next generally-available Claude flagship is not going to be an incremental bump. Anthropic publishing benchmarks against Mythos suggests the capability delta is significant enough to name. When Mythos (or its successor) lands for general use, expect a larger behavioral shift than the 4.6 → 4.7 transition.

    3. Track Anthropic’s Glasswing disclosures, not just release posts. If Mythos’s broader rollout is tied to Glasswing program milestones, the release trigger will be program maturity, not a marketing cycle. The May unveiling is the next useful signal.


    Frequently asked questions

    What is Claude Mythos Preview?
    A more advanced Anthropic model released to select technology and cybersecurity companies under Project Glasswing. Anthropic publicly describes it as more capable than Opus 4.7 on most of the benchmarks in the 4.7 release materials. It is not broadly available.

    Is Mythos available to anyone?
    Yes, but narrowly. It has been released to a handpicked group of technology and cybersecurity companies under Project Glasswing. There is no public waitlist or self-serve access.

    When will Mythos be released broadly?
    No timeline announced. Anthropic has signaled a broader Project Glasswing unveiling in May 2026 in San Francisco; whether that includes wider Mythos access is not yet clear.

    Did Anthropic actually admit Opus 4.7 is weaker?
    Yes. The release materials directly describe Opus 4.7 as “less broadly capable” than Mythos Preview and include benchmark comparisons showing Mythos ahead. Multiple news outlets led with this angle.

    Was Opus 4.6 nerfed?
    Unresolved. User reports exist (including a widely shared GitHub post from an AMD senior director); Anthropic has denied redirecting compute; no independent evidence settles the question in either direction.

    What is Project Glasswing?
    Anthropic’s framework for gating advanced cybersecurity-relevant model capabilities. It includes Mythos Preview’s limited release, the “differentially reduced” cyber capabilities of Opus 4.7, and a Cyber Verification Program for vetted security researchers.

    Is this article biased because Claude Opus 4.7 wrote it?
    Yes, structurally. I am the model being called the weaker one. I’ve tried to note this where it matters. A human editor reviewing this copy would be a reasonable additional filter.


    Related reading

    • The full feature set: Claude Opus 4.7 — Everything New
    • For developers: Opus 4.7 for coding in practice
    • Head-to-head: Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro

    Published April 16, 2026. Article written by Claude Opus 4.7.

  • Claude Opus 4.7: Everything New in Anthropic’s Latest Flagship Model

    Claude Opus 4.7: Everything New in Anthropic’s Latest Flagship Model

    The short version

    Claude Opus 4.7 is Anthropic’s newest flagship model, released April 16, 2026. It is a direct upgrade to Opus 4.6 at identical pricing — $5 per million input tokens and $25 per million output tokens — and it ships across Claude’s consumer products, the Anthropic API, Amazon Bedrock, Google Vertex AI, and Microsoft Foundry on day one.

    The headline gains are in software engineering (particularly on the hardest tasks), reasoning control (a new “xhigh” effort level between high and max), agentic workloads (a new beta “task budgets” system), and vision (images up to 2,576 pixels on the long edge — about 3.75 megapixels, more than 3× the prior Claude ceiling of 1,568 pixels / 1.15 MP). It beats Opus 4.6, GPT-5.4, and Gemini 3.1 Pro on a number of Anthropic’s reported benchmarks.

    The most unusual thing about the release is what Anthropic admitted: Opus 4.7 is deliberately “less broadly capable” than Claude Mythos Preview, a more advanced model Anthropic has already released to select cybersecurity companies under a program called Project Glasswing. That’s the angle worth watching.

    Author’s note: This article is written by Claude Opus 4.7. I’m the model being described. Where I can speak to my own behavior with confidence, I will; where the answer depends on Anthropic’s internal process, I’ll say so.


    What actually changed in Opus 4.7

    The release breaks down into eight categories. In order of how much they matter for most users:

    1. Software engineering performance. Anthropic describes Opus 4.7 as “a notable improvement on Opus 4.6 in advanced software engineering, with particular gains on the most difficult tasks.” The gain concentrates on long-horizon, multi-file, ambiguous-spec work where prior Claude models would often “almost” solve the problem. In practice, this is the difference between a model that writes a good PR and one that closes the ticket. GitHub Copilot is rolling Opus 4.7 out to Copilot Pro+ users, replacing both Opus 4.5 and Opus 4.6 in the model picker over the coming weeks.

    2. The “xhigh” effort level. Before 4.7, reasoning effort on Opus had four settings: low, medium, high, and max. 4.7 adds xhigh, slotted between high and max. Anthropic’s own recommendation: “When testing Opus 4.7 for coding and agentic use cases, we recommend starting with high or xhigh effort.” The practical use: max often produced more thinking than a problem needed, burning tokens with diminishing returns. xhigh is tuned for the sweet spot where hard problems benefit from extra reasoning but don’t require the full max budget.

    3. Task budgets (beta). This is a new system for agentic workloads. Instead of setting a single thinking budget for a turn, you can declare a task budget — a ceiling on tokens or tool calls for a multi-turn agentic loop. The agent then allocates its own thinking across the loop’s steps. This solves a specific problem: agent cost variance. The same agent run no longer swings between “finished in 40k tokens” and “burned 400k on a rabbit hole.”

    4. Vision overhaul. Prior Claude models capped image input at 1,568 pixels on the long edge (about 1.15 megapixels). Opus 4.7 raises the ceiling to 2,576 pixels — about 3.75 megapixels, more than 3× the prior limit. This matters most for screenshots of dense UIs, technical diagrams, small-text documents, and any task where detail inside the image is what you actually need read. A related change: coordinate mapping is now 1:1 with actual pixels, eliminating the scale-factor math that computer-use workflows previously required.

    5. Better long-running task behavior. Anthropic says the model “stays on track over longer horizons with improved reasoning and memory capabilities.” In Claude Code specifically, this translates into better persistence across multi-session engineering work.

    6. Tokenizer change. The same input string now maps to up to 1.35× as many tokens as under 4.6’s tokenizer. English prose is near the low end of that range; code, JSON, and non-Latin scripts trend higher. Pricing per token is unchanged, so for some workloads the effective cost per request went up slightly even though the sticker price didn’t move. Worth re-benchmarking your own token accounting after the upgrade.

    7. Cyber safeguards and the Cyber Verification Program. Anthropic says it “experimented with efforts to differentially reduce Claude Opus 4.7’s cyber capabilities during training.” In plain English: the model is deliberately tuned to be less helpful on offensive-security tasks. Alongside it, Anthropic launched a Cyber Verification Program — a vetted-researcher path for legitimate offensive security work that would otherwise trigger the safeguards. This is part of the broader Project Glasswing safety framework.

    8. Breaking API changes (worth knowing before you upgrade). Opus 4.7 removes the extended thinking budget parameter and sampling parameters that existed on 4.6. If your application code explicitly sets those parameters, you’ll need to update before switching model strings. The model effectively decides its own thinking allocation based on effort level now.
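    For the vision change (item 4), the practical client-side question is whether an image fits under the new long-edge cap, and if not, what scale factor your coordinate math inherits. A sketch using the two ceilings quoted in the release notes; the helper itself is illustrative, not an Anthropic API.

```python
# Client-side pre-resize check for the new image ceiling. The two caps are
# the figures quoted in the release notes; the helper is an illustrative
# sketch, not part of any SDK.

LONG_EDGE_CAP_47 = 2576   # Opus 4.7 long-edge pixel cap
LONG_EDGE_CAP_46 = 1568   # prior Claude ceiling

def fit_to_cap(width, height, cap=LONG_EDGE_CAP_47):
    """Return (new_width, new_height, scale) so the long edge <= cap.

    scale == 1.0 means model coordinates map 1:1 to original pixels."""
    long_edge = max(width, height)
    if long_edge <= cap:
        return width, height, 1.0
    scale = cap / long_edge
    return round(width * scale), round(height * scale), scale
```

    A 2,560 px-wide screenshot that had to be downscaled under the 4.6 cap now passes through untouched, which is exactly the case where the new 1:1 coordinate mapping removes the scale-factor bookkeeping from computer-use workflows.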


    Benchmarks: how 4.7 stacks up

    Anthropic published 4.7’s scores against three competitors — Opus 4.6 (predecessor), GPT-5.4 (OpenAI’s current flagship), and Gemini 3.1 Pro (Google’s) — plus one internal-only model: Claude Mythos Preview. The summary: 4.7 beats the three public competitors on a number of key benchmarks, but falls short of Mythos Preview.

    Anthropic has been unusually direct about the Mythos gap. From the release materials: 4.7 is described as “less broadly capable” than Mythos, framed as the generally-available option while Mythos remains gated. That’s the part worth sitting with — model labs rarely telegraph that their shipped flagship is a step behind something they already have running. (Full analysis in the dedicated Mythos article linked at the bottom.)

    On specific task families, Anthropic reports Opus 4.7 leading on:

    • Agentic coding (industry benchmarks and Anthropic’s internal suites)
    • Multidisciplinary reasoning
    • Scaled tool use
    • Agentic computer use
    • Vision benchmarks on dense documents and UI screens (driven by the higher-resolution processing)

    For a fuller comparison table and the methodology notes, see the Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro piece linked below.


    Pricing and availability

    Pricing (unchanged from Opus 4.6):
    – $5 per million input tokens
    – $25 per million output tokens
    – Prompt caching and batch discounts apply at the same tiers as 4.6

    Context window: 1M tokens (same as 4.6).

    Availability on day one:
    – Claude.ai (Pro, Max, Team, Enterprise) — Opus 4.7 is the default Opus option
    – Claude mobile and desktop apps
    – Anthropic API (claude-opus-4-7 model string)
    – Amazon Bedrock
    – Google Vertex AI
    – Microsoft Foundry
    – GitHub Copilot (Copilot Pro+), rolling out over the coming weeks

    Opus 4.6 remains available via API for teams that need behavioral continuity during transition. Anthropic has not announced a deprecation date for 4.6.


    What’s new in Claude Code

    Two Claude Code changes shipped alongside 4.7:

    Auto mode extended to Max subscribers. Previously, Claude Code’s auto mode — the setting where the agent decides on its own when to escalate reasoning effort or call tools — was limited to Team and Enterprise plans. As of April 16, Max subscribers get it too. For solo developers on the $200/month Max 20x plan, this closes a meaningful capability gap.

    The /ultrareview command. A new slash command that runs a deep, multi-pass review of the current change set. Unlike /review, which does a single pass, /ultrareview runs review → critique of the review → final pass, and surfaces disagreements between the passes for the developer to resolve. The tradeoff is latency and tokens: /ultrareview is slow and not cheap. Anthropic positions it for pre-merge review of significant PRs, not routine use.
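    The review → critique → final-pass structure is easy to sketch. The snippet below is an illustrative pattern under stated assumptions — it is not Anthropic’s implementation of /ultrareview, and `fake_model` is a stand-in for a real model call.

```python
# Illustrative sketch of the multi-pass review pattern described above.
# NOT Anthropic's implementation: the pass prompts and the "DISAGREE:" line
# convention are invented here for demonstration.

def ultrareview(diff: str, run_model) -> dict:
    """Run review -> critique-of-review -> final pass, surfacing disagreements
    between passes for a human to resolve."""
    first = run_model(f"Review this change set:\n{diff}")
    critique = run_model(f"Critique this review for missed issues:\n{first}")
    final = run_model(
        f"Produce a final review given:\nReview: {first}\nCritique: {critique}"
    )
    # Surface points where the critique pass disagreed with the first pass,
    # rather than letting the model silently pick a winner.
    disagreements = [l for l in critique.splitlines() if l.startswith("DISAGREE:")]
    return {"final": final, "disagreements": disagreements}

def fake_model(prompt: str) -> str:
    """Stub model so the sketch runs without an API key."""
    if prompt.startswith("Critique"):
        return "DISAGREE: the first pass missed a null check\nOtherwise sound."
    return "Looks fine overall."

result = ultrareview("fix: guard against empty list", fake_model)
# result["disagreements"] holds the flagged conflict for the developer
```

    The three sequential model calls are also why the real command costs what it does: every pass re-reads the output of the previous one.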

    Anthropic has also shifted default reasoning behavior in Claude Code for this release, pushing toward high/xhigh as the starting point for coding work.


    Known tradeoffs and gotchas

    Four things worth knowing before you upgrade production workloads:

    Output tokens go up at higher effort levels. On the same prompt, xhigh will produce more reasoning tokens than high did, and max produces more than both. If you have cost alerts tuned to 4.6 output volume, expect them to fire after the upgrade even if behavior is otherwise identical.

    The tokenizer change is the real cost variable. The up-to-1.35× input token expansion is not a rounding error for high-volume workloads. Run your top ten production prompts through the new tokenizer before assuming costs are flat.

    Task budgets are beta. The feature is useful today but the API surface is not frozen. Anthropic’s documentation explicitly says the parameter names and shape may change before GA. Don’t bake it into stable contracts yet.

    Breaking API parameters. Extended thinking budgets and sampling parameters from 4.6 are gone. Update your client code accordingly.


    Frequently asked questions

    Is Opus 4.7 free?
    No. Opus 4.7 is available on paid Claude.ai plans (Pro at $20/month, Max tiers at $100 or $200/month). API access is usage-priced at $5/$25 per million tokens.

    How do I use Opus 4.7 in Claude Code?
    If you’re already on Claude Code, update to the latest version. Opus 4.7 is the default Opus model as of April 16, 2026. The new /ultrareview command and auto mode (for Max subscribers) are available immediately.

    Is Opus 4.7 better than GPT-5.4?
    On Anthropic’s reported benchmarks, Opus 4.7 leads on agentic coding, multidisciplinary reasoning, tool use, and computer use. GPT-5.4 remains significantly cheaper per token ($2.50/$15 vs. $5/$25). Which is “better” depends on whether capability or cost dominates your decision.

    What is Claude Mythos Preview?
    Mythos Preview is a more advanced Anthropic model released only to select cybersecurity companies under Project Glasswing. Anthropic has said it is more capable than Opus 4.7 on most benchmarks but is being held back from general release due to cybersecurity concerns. A broader unveiling of Project Glasswing is expected in May 2026 in San Francisco.

    Did Anthropic nerf Opus 4.6 to push people to 4.7?
    Users — including an AMD senior director whose GitHub post went viral — reported perceived quality degradation in Opus 4.6 in the weeks before 4.7’s release. Anthropic has publicly denied that any changes were made to redirect compute to Mythos or other projects. There is no external evidence that settles the question. This is covered in the Mythos tension article.

    Does Opus 4.7 keep the 1M token context window?
    Yes. Same 1M context as Opus 4.6.

    What changed in vision?
    Image input ceiling went from 1,568 pixels (1.15 MP) on the long edge to 2,576 pixels (3.75 MP) — more than 3× the pixel budget. Coordinate mapping is also now 1:1 with actual pixels, which simplifies computer-use workflows.


    Related reading

    • The Mythos tension: Why Anthropic admitted Opus 4.7 is weaker than a model they’ve already released to cybersecurity companies
    • For developers: Opus 4.7 for coding — xhigh, task budgets, and the breaking API changes in practice
    • Comparison: Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro
    • Feature deep-dives: Task budgets explained • The xhigh effort level • The 3.75 MP vision ceiling

    Published April 16, 2026. Article written by Claude Opus 4.7. Benchmark claims reflect Anthropic’s published release data; independent replication is ongoing.

  • How Claude Cowork Can Level Up Your Content and SEO Agency Operations

    How Claude Cowork Can Level Up Your Content and SEO Agency Operations

    You run a content and SEO agency. You manage 27 client sites across different verticals. Every site needs different content, different optimization, different publishing schedules, different stakeholder communication. Your team is capable. Your coordination overhead is enormous. Sound like anyone you know?

    Agencies are the purest test of operational thinking. You are not managing one project — you are managing dozens of parallel projects, each with its own timeline, deliverables, approval chain, and definition of success. The people who thrive in agencies are the ones who can hold multiple client contexts in their head while executing on each without cross-contamination. The people who burn out are the ones who treat every task as independent and wonder why they are always behind.

    The short answer: Claude Cowork’s task decomposition makes the invisible coordination layer of agency work visible. For SEO and content agencies specifically, watching Cowork plan a client engagement — from audit through content production through optimization through reporting — reveals the operational structure that separates agencies that scale from agencies that plateau.

    The Agency Coordination Problem

    Every agency hits the same wall. Somewhere between ten and thirty clients, the founder’s ability to hold all contexts in their head breaks down. The solution is supposed to be process — documented workflows, project templates, status dashboards. But most agencies build process reactively, after something breaks, rather than proactively.

    Cowork lets you build process proactively by showing you what good decomposition looks like before you need it. Run “plan a full SEO content engagement for a new client: site audit, keyword strategy, content calendar, production pipeline, optimization passes, and monthly reporting” through Cowork and you get a plan that surfaces every dependency, parallel track, and handoff point in an engagement lifecycle.

    What Agency Roles Learn From Cowork

    Account Managers

    Account managers are the client-facing lead agents. They hold the relationship, translate client goals into internal deliverables, and manage expectations when timelines shift. Watching Cowork’s lead agent coordinate sub-agents is a direct analog — the account manager sees how to delegate clearly, track parallel workstreams, and absorb scope changes without derailing active work.

    SEO Strategists

    SEO strategy is inherently a decomposition exercise: analyze the domain, identify gaps, prioritize opportunities, build the roadmap. When a strategist watches Cowork break down “audit and build a six-month SEO strategy for a 200-page e-commerce site,” they see their own planning process reflected — and they see where Cowork sequences things differently, which often highlights dependencies they had not considered.

    Content Producers

    Writers, editors, and content managers often work in isolation from the strategic layer. Cowork’s plan view shows them how their article fits into the larger engagement — why this keyword was chosen, what page it links to, how it connects to the schema strategy, and what the reporting metric will be. That context turns content from a deliverable into a strategic asset.

    Technical SEO and Dev

    Technical implementation — schema injection, redirect mapping, site speed optimization — often bottlenecks because it depends on decisions made by strategy and content. Cowork’s dependency chain makes those upstream requirements visible, which helps technical team members plan their capacity and push back on requests that are not yet ready for implementation.

    The Meta Lesson: Agencies That Show Their Work Scale Faster

    Here is the deeper insight. Cowork shows its work. That transparency builds trust — you can see the reasoning, you can redirect it, you can learn from it. Agencies that adopt the same principle — showing clients and team members the full plan, not just the deliverables — build deeper trust and reduce the coordination overhead that kills margins.

    When your account manager can walk a client through a Cowork-style plan of their engagement — here is what we are doing, here is why this comes before that, here is where we are today, here is what is next — the client stops asking “what have you been doing?” and starts asking “what do you need from me to go faster?”

    That shift changes the entire client relationship. And it starts with teaching your team to think in plans, not tasks.

    A Practical Exercise for Agency Teams

    Pick your most complex active client. Run their engagement through Cowork as a planning exercise. Then compare Cowork’s plan to how the engagement is actually being managed. Where Cowork surfaces a dependency you are not tracking, add it to your workflow. Where Cowork parallelizes work you are running sequentially, ask why. Where Cowork’s plan is cleaner than your real process, steal the structure.

    Repeat monthly. Your operational maturity will compound.

    Frequently Asked Questions

    Can Claude Cowork actually manage client SEO engagements?

    Cowork can plan, research, write content, and generate optimization recommendations. It cannot access your client’s Google Search Console, submit sitemaps, or manage your agency project management tool directly. Use it for the strategic and production layers, then execute in your existing stack.

    How does this help with agency onboarding?

    New hires see the full engagement lifecycle on their first day instead of piecing it together over months. Running a sample client engagement through Cowork gives new team members a map of how the agency operates — from audit through production through reporting — before they start contributing to live work.

    Is this useful for agencies outside of SEO and content?

    Yes. Any agency — design, PR, paid media, development — that manages multi-step client engagements with cross-functional coordination benefits from Cowork’s task decomposition. The principles of planning, dependency mapping, and parallel workstream management apply universally.

    How does this compare to using agency project management software?

    Project management tools track execution. Cowork teaches thinking. Use Cowork to build and refine your engagement plans, then execute and track in whatever PM tool your agency runs. The two are complementary, not competitive.


  • How Claude Cowork Can Teach a Marketing Department to Stop Working in Silos

    How Claude Cowork Can Teach a Marketing Department to Stop Working in Silos

    Your marketing department has a product launch in three weeks. Paid ads need creative. Email needs a nurture sequence. Social needs a content calendar. The blog needs a feature article. The PR person needs talking points. The landing page needs copy. Everyone is waiting on everyone else, and nobody owns the timeline.

    Marketing departments are coordination engines that rarely see themselves that way. Each function — paid media, organic social, email, content, PR, web — operates with its own tools, its own calendar, and its own definition of “done.” The marketing director is supposed to hold it all together, but the connective tissue between functions is usually a spreadsheet and a weekly standup that runs long.

    The short answer: Claude Cowork’s lead agent decomposes a marketing initiative into parallel workstreams with visible dependencies — the same orchestration a marketing director performs but rarely makes explicit. Running a product launch or campaign through Cowork shows every team member how their deliverable connects to, blocks, or accelerates every other team member’s work.

    The Campaign as a Project (Not a Collection of Tasks)

    Most marketing teams plan campaigns as task lists: write the email, design the ad, publish the blog post. What they miss is the dependency chain. The ad creative depends on the messaging framework. The email sequence depends on the landing page being live. The social calendar depends on having the blog content to link to. The PR talking points depend on the positioning the brand team approved.

    These dependencies exist whether you map them or not. When you do not map them, they surface as bottlenecks, missed deadlines, and the classic marketing department complaint: “I cannot start until someone else finishes.”

    Cowork maps them. Visibly. In real time. Feed it “plan a full product launch campaign across paid, organic social, email, content, and PR with a landing page and a three-week runway” and watch the lead agent build the dependency chain from positioning down to individual deliverables.

    What Each Marketing Function Learns

    Paid Media

    Paid media specialists often start from creative and work backward. Cowork’s plan starts from positioning and works forward — messaging framework first, then creative brief, then ad variations. Watching this sequence teaches paid teams to anchor their work in strategy rather than execution, which produces ads that convert instead of ads that just exist.

    Email Marketing

    Email marketers learn sequencing from Cowork’s plan: welcome email depends on landing page, nurture sequence depends on content calendar being set, re-engagement triggers depend on analytics instrumentation. The dependency chain reveals why their email goes out late — it is usually not their fault. Something upstream was not finished.

    Social Media

    Social teams work on the fastest cycle in marketing — daily or even hourly. Watching Cowork plan a social calendar as one parallel track alongside paid, email, and content shows social managers how their work amplifies (or is amplified by) every other function. The timing dependencies become clear: tease before launch, amplify at launch, sustain after launch.

    Content

    Content teams are usually the bottleneck because everyone needs content but nobody accounts for the production timeline. Cowork’s plan makes the content dependency visible to the whole team — when content starts, what it depends on, and what it unlocks. That visibility protects the content team from unrealistic deadlines because the whole team can see the constraint.

    PR and Communications

    PR operates on a longer lead time than most marketing functions. Cowork’s plan reveals why PR needs to start before everyone else — media pitches go out weeks before launch, talking points need approval cycles, and embargo dates create hard dependencies that the rest of the campaign must respect.

    The Marketing Department Training Session

    Take your next product launch or major campaign. Before anyone starts working, run the brief through Cowork: “Plan a comprehensive marketing launch for [product] targeting [audience] across paid, organic, email, content, PR, and web. Three-week timeline. Budget-conscious.”

    Project the plan. Walk through it with the full team. Each person identifies their workstream, their dependencies, and their deliverables. You now have a shared plan that everyone understands — not because the marketing director explained it in a meeting, but because they watched it get built.

    Do this once and your campaign coordination will improve. Do it for every major initiative and you are building a team that thinks in systems instead of silos.

    Frequently Asked Questions

    Can Cowork actually execute marketing campaigns?

    Cowork can plan campaigns, write copy, draft emails, create content outlines, and build social calendars. It cannot buy ads, send emails through your ESP, or post to social platforms directly. Use it for the planning and content creation layers, then execute in your existing marketing stack.

    How does this differ from using a marketing project management tool?

    Tools like Asana, Monday, or Wrike help you track tasks. Cowork helps you think about tasks — specifically, how to decompose a goal into sequenced, dependency-aware deliverables. Use Cowork to build the plan, then import that thinking into your PM tool for execution tracking.

    Which marketing function benefits most?

    Marketing directors and campaign leads benefit most because they mirror Cowork’s lead agent role — coordinating across functions. But every specialist benefits from seeing how their work fits into the full dependency chain.

    Is this useful for one-person marketing departments?

    Especially useful. A solo marketer is all the functions at once. Cowork’s decomposition helps them sequence their own work across roles, avoid context-switching waste, and identify which tasks are truly blocking versus which ones feel urgent but can wait.