Author: Will Tygart

  • Output Compression Discipline: Concentrated Slices vs Full Meals

    Last refreshed: May 15, 2026

    Most Claude cost analyses focus on input tokens: the knowledge you send in. The underappreciated lever is output compression. Claude is trained to be thorough. Left unconstrained, it produces full meals: preambles, recaps, hedges, transition sentences, closing summaries. All of those tokens cost money, and many of them are unnecessary. Output discipline, meaning getting Claude to deliver concentrated slices instead of full meals, is often the highest-leverage cost reduction available without changing models or switching to async.

    This is part of the Claude on a Budget series. For input-side compression, see The Cold-Start Problem. For pricing mechanics, see Prompt Caching.

    The Default Verbosity Problem

    Ask Claude to “summarize this document” without constraints and you will get: an opening sentence restating the task, a multi-paragraph summary, a bullet-point recap of the summary, and a closing note about what was not covered. The actual information density — insight per token — is low. You paid for 800 tokens of output and needed 150. Multiply across thousands of API calls and you have built a significant cost leak from default model behavior, not from bad prompts.

    The Output Compression Toolkit

    1. Explicit word and token caps in the prompt. “Respond in 150 words or fewer” is one of the most effective single instructions for reducing output tokens. Claude generally respects tight, explicit limits: “Be concise” does not work reliably, while “150 words maximum” usually does. For JSON outputs: “Respond with only valid JSON, no markdown fences, no explanation.” The instruction costs a handful of input tokens per call, and the output it trims on each call is typically many times larger.

    2. Structured output schemas. When you need structured data, define the exact JSON schema. Claude largely stops generating prose and fills fields instead, so you get what you specified and little else. The token reduction versus free-form responses can be large, on the order of 40-70% for equivalent information content, though the figure depends on the task.

    # Free-form -- verbose, unpredictable length
    prompt_verbose = "Summarize the key points of this article and their implications."
    
    # Structured -- tight, predictable, cheaper
    prompt_structured = """Extract from this article:
    {"headline": "string", "key_points": ["string", "string", "string"], "sentiment": "positive|neutral|negative"}
    Respond with valid JSON only. No explanation."""

    3. Role-based compression priming. System prompt framing shapes output length. “You are a precise technical writer who values brevity. Never restate the task. Deliver the answer directly.” produces consistently shorter outputs than a neutral system prompt. This is prompt engineering for token economics, not just quality.

    4. Chained micro-tasks over monolithic requests. Instead of asking Claude to research, analyze, synthesize, and format in one prompt, chain smaller requests. Each call is scoped to one task with tight output constraints. Total tokens across the chain are often lower than a single unconstrained request, and intermediate outputs are cacheable — pairing naturally with the prompt caching strategy.
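
    Techniques 1 to 3 above can be combined in one request builder. The following is a minimal sketch, not the author's implementation: build_compressed_request, parse_structured_reply, and the default values are hypothetical names chosen for illustration. It only assembles the keyword arguments you would pass to client.messages.create and validates the JSON that comes back, so it runs without an API key.

```python
import json

# Role-priming system prompt, quoted from technique 3 above.
BREVITY_SYSTEM = (
    "You are a precise technical writer who values brevity. "
    "Never restate the task. Deliver the answer directly."
)

# Field names taken from the structured prompt example above.
EXPECTED_KEYS = {"headline", "key_points", "sentiment"}


def build_compressed_request(article: str, model: str = "claude-sonnet-4-6",
                             max_tokens: int = 500) -> dict:
    """Assemble kwargs for client.messages.create (hypothetical helper)."""
    instruction = (
        "Extract from this article:\n"
        '{"headline": "string", "key_points": ["string", "string", "string"], '
        '"sentiment": "positive|neutral|negative"}\n'
        "Respond with valid JSON only. No markdown fences. No explanation."
    )
    return {
        "model": model,
        "max_tokens": max_tokens,  # hard ceiling behind the prompt-level cap
        "system": BREVITY_SYSTEM,
        "messages": [
            {"role": "user", "content": f"{instruction}\n\nARTICLE:\n{article}"}
        ],
    }


def parse_structured_reply(reply_text: str) -> dict:
    """Parse the reply and reject anything that is not the requested shape."""
    data = json.loads(reply_text)  # raises ValueError on prose or fenced output
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        raise ValueError(f"reply is missing fields: {sorted(missing)}")
    return data
```

    A caller would pass the result as client.messages.create(**build_compressed_request(text)) and feed the first text block of the response to parse_structured_reply. Rejecting malformed replies early matters here because a reply wrapped in prose is exactly the verbosity the prompt was meant to remove.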

    The Notion Second Brain Application

    The operational implementation at Tygart Media runs this pattern at pipeline level. The Notion second brain eliminates the need for Claude to generate background context — it already exists in structured form. Extractions from Notion arrive as pre-formatted knowledge blocks. Claude’s task is synthesis over existing structured data, not open-ended research and explanation. Output prompts are scoped: “Given this structured data, write a 400-word section for [topic]. No preamble, no conclusion, begin directly with the first point.” The output is a concentrated slice — dense, usable, billable at a fraction of what free-form generation costs for equivalent value.

    Measuring Compression Effectiveness

    Track output_tokens in your API responses. Log them per prompt template. Identify your highest-output templates and run compression interventions — tighter word caps, structured formats, role priming. The target is information density: insight delivered per output token, not raw token count. A 500-token output with 3 actionable insights beats a 200-token output with 1. Compression discipline is about removing the scaffolding (preambles, hedges, recaps) while preserving the load-bearing structure (insight, data, instruction).
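
    Per-template logging can be sketched as follows. OutputTokenLog and the template names are hypothetical, and the token counts are made-up sample values; in a real pipeline each value would come from response.usage.output_tokens. The sketch ranks templates by raw output volume only. Judging information density still requires looking at what the tokens contain.

```python
from collections import defaultdict
from statistics import mean


class OutputTokenLog:
    """Accumulates output token counts per prompt template (hypothetical helper)."""

    def __init__(self) -> None:
        self._samples: dict[str, list[int]] = defaultdict(list)

    def record(self, template: str, output_tokens: int) -> None:
        self._samples[template].append(output_tokens)

    def heaviest(self, top_n: int = 3) -> list[tuple[str, float]]:
        """Templates ranked by mean output tokens, highest first."""
        ranked = sorted(
            ((name, mean(values)) for name, values in self._samples.items()),
            key=lambda item: item[1],
            reverse=True,
        )
        return ranked[:top_n]


log = OutputTokenLog()
# In production: log.record("summary_v1", response.usage.output_tokens)
for template, tokens in [("summary_v1", 800), ("summary_v1", 760),
                         ("classify_v2", 12), ("extract_v1", 210)]:
    log.record(template, tokens)

# The templates at the top of this list are the first candidates for
# tighter word caps, structured formats, or role priming.
print(log.heaviest(2))
```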

    max_tokens as a Hard Ceiling

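    Set max_tokens conservatively in your API calls. This is your financial guardrail, not just a model parameter. For classification tasks: 50 tokens. For short summaries: 200 tokens. For structured JSON extraction: 500 tokens. For article drafts: 1,500-2,000 tokens. The Messages API requires max_tokens on every request, so there is no default to fall back on; copying a generous catch-all value (4,096-8,192) into every call leaves the cost ceiling unjustifiably high. Claude will rarely hit a well-chosen ceiling on constrained tasks, but the ceiling prevents runaway generation on edge-case inputs that can quietly inflate your bill. Note that max_tokens truncates rather than compresses: a response that reaches the limit is cut off mid-output and reports a stop_reason of max_tokens, so pair the ceiling with prompt-level length caps.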
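
    One way to keep these ceilings in a single place is a lookup table keyed by task type, with a check for truncated responses. This is a sketch under stated assumptions: the task-type labels, ceiling_for, and was_truncated are hypothetical names, and the response object is a stand-in so the example runs offline. The stop_reason value max_tokens is what the Messages API reports when generation stops at the ceiling.

```python
from types import SimpleNamespace

# Ceilings from the paragraph above; the task-type keys are hypothetical labels.
MAX_TOKENS_BY_TASK = {
    "classification": 50,
    "short_summary": 200,
    "json_extraction": 500,
    "article_draft": 2000,
}


def ceiling_for(task_type: str) -> int:
    """Look up the ceiling; fail loudly rather than fall back to a large value."""
    try:
        return MAX_TOKENS_BY_TASK[task_type]
    except KeyError:
        raise ValueError(f"no max_tokens ceiling defined for {task_type!r}") from None


def was_truncated(response) -> bool:
    """True when generation stopped because it hit max_tokens."""
    return response.stop_reason == "max_tokens"


# Stand-in for an API response so the sketch runs without a network call.
fake_response = SimpleNamespace(stop_reason="max_tokens")
if was_truncated(fake_response):
    print("Output hit the ceiling: tighten the prompt or raise this template's limit.")
```

    Logging truncations per template shows whether a ceiling is catching rare runaway outputs, as intended, or routinely cutting off legitimate answers.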

  • Prompt Caching: How to Cut Repeated Context Costs by Up to 90%

    Last refreshed: May 15, 2026

    If you’re sending the same large block of context — a knowledge base, a style guide, a long system prompt, a reference document — with every Claude request, you’re paying full input token rate on every single call. Anthropic’s prompt caching collapses that to roughly 10% of the standard input rate for cache hits. For context blocks of 1,000+ tokens sent repeatedly, this is one of the most reliable cost levers available.

    This is part of the Claude on a Budget series. For async workloads, see The Batch API: 50% Off for Non-Urgent Work. For token reduction before the API call, see The Cold-Start Problem: Second Brain and CLAUDE.md.

    How the Cache Works

    Prompt caching is prefix-based. Anthropic caches the exact token sequence up to your cache_control breakpoint. Any subsequent request that begins with that identical prefix hits the cache and pays the cache read rate (~$0.30/M for Sonnet vs. $3.00/M standard, a 90% reduction). By default a cache entry lives for about 5 minutes, and the timer resets each time the entry is read; a longer TTL is available at a higher write price. A minimum cacheable prompt length also applies and varies by model (on the order of 1,024 to 4,096 tokens). Prompts below the minimum are processed without caching, so check the current documentation for the model you use.

    The Pricing Reality

    Model        Standard Input   Cache Write   Cache Read   Read Savings
    Haiku 4.5    $1.00/M          $1.25/M       $0.10/M      90%
    Sonnet 4.6   $3.00/M          $3.75/M       $0.30/M      90%
    Opus 4.7     $5.00/M          $6.25/M       $0.50/M      90%

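    Cache writes cost slightly more than standard input (25% premium). The premium is recovered on the first cache read: one write plus one read costs 1.35x the standard input rate, versus 2x for sending the same block twice uncached. Every cache read after that is 90% cheaper than standard input. For any context block reused at least once within the cache lifetime, caching wins.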
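
    The break-even arithmetic can be checked with a few lines of Python. This is a sketch using the Sonnet rates from the table above; the function names and the 5,000-token block size are illustrative, and the cached case assumes every later request arrives before the cache entry expires.

```python
# Sonnet rates from the table above, in dollars per million tokens.
STANDARD = 3.00
CACHE_WRITE = 3.75
CACHE_READ = 0.30


def uncached_cost(tokens: int, requests: int) -> float:
    """Cost of sending the same block at the standard rate on every request."""
    return tokens / 1_000_000 * STANDARD * requests


def cached_cost(tokens: int, requests: int) -> float:
    """One cache write on the first request, cache reads on all the rest.

    Assumes every later request arrives before the cache entry expires.
    """
    if requests == 0:
        return 0.0
    per_million = tokens / 1_000_000
    return per_million * (CACHE_WRITE + CACHE_READ * (requests - 1))


for n in (1, 2, 10):
    print(n, round(uncached_cost(5_000, n), 5), round(cached_cost(5_000, n), 5))
```

    With a single request caching costs more than not caching, because only the write premium is paid. From the second request onward the cached total is lower, and the gap widens with every additional read.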

    The Implementation

    import anthropic
    
    client = anthropic.Anthropic()
    
    # Large system prompt or knowledge base -- mark for caching
    SYSTEM_CONTEXT = "Your 5,000-token knowledge base or style guide here..."
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_CONTEXT,
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {"role": "user", "content": "Your specific question or task here"}
        ]
    )
    
    # Check cache performance in usage stats
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")

    On the first call, you will see cache_creation_input_tokens populated — that is the write. On subsequent calls with the same prefix, cache_read_input_tokens shows what was served from cache at 10% cost.
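
    The three usage counters can be turned into a dollar figure and a hit ratio. The sketch below assumes, as the API reports it, that input_tokens counts only the tokens that were neither written to nor read from cache, so total input is the sum of all three fields. The helper names are hypothetical, the rates are the Sonnet figures from the pricing table, and the usage objects are stand-ins with illustrative numbers.

```python
from types import SimpleNamespace

# Sonnet rates from the pricing table, in dollars per million tokens.
RATES = {"input": 3.00, "cache_write": 3.75, "cache_read": 0.30}


def input_cost_usd(usage) -> float:
    """Price the three input-side counters reported in response.usage."""
    return (
        usage.input_tokens * RATES["input"]
        + (usage.cache_creation_input_tokens or 0) * RATES["cache_write"]
        + (usage.cache_read_input_tokens or 0) * RATES["cache_read"]
    ) / 1_000_000


def cache_hit_ratio(usage) -> float:
    """Share of all input tokens that were served from cache."""
    read = usage.cache_read_input_tokens or 0
    total = usage.input_tokens + (usage.cache_creation_input_tokens or 0) + read
    return read / total if total else 0.0


# Stand-ins for response.usage on a first and a second call (illustrative numbers).
first = SimpleNamespace(input_tokens=40, cache_creation_input_tokens=5000,
                        cache_read_input_tokens=0)
second = SimpleNamespace(input_tokens=40, cache_creation_input_tokens=0,
                         cache_read_input_tokens=5000)

for label, usage in (("first", first), ("second", second)):
    print(label, f"${input_cost_usd(usage):.5f}", f"{cache_hit_ratio(usage):.1%}")
```

    A hit ratio that stays near zero across repeated calls usually means the prefix is changing between requests or the block is below the minimum cacheable length.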

    What to Cache

    Anything large, stable, and repeated qualifies. The highest-value candidates: long system prompts defining agent behavior or personas; reference documents such as product specs, legal terms, or knowledge bases; few-shot example sets (10+ examples add up fast); conversation history in multi-turn applications where you mark the stable history prefix for caching and leave only the new turn uncached. In agentic pipelines where Claude processes a document repeatedly across multiple analysis passes, cache the document body and vary only the instruction. You pay full rate once per document, cache rate on every subsequent pass.
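
    The multi-pass pattern (cache the document body, vary only the instruction) might look like the sketch below. build_pass_request, the sample document, and the three instructions are hypothetical; the function only assembles keyword arguments for client.messages.create so it runs offline. It assumes the document is long enough to meet the model's minimum cacheable length.

```python
def build_pass_request(document: str, instruction: str,
                       model: str = "claude-sonnet-4-6",
                       max_tokens: int = 500) -> dict:
    """kwargs for client.messages.create: cached document first, instruction last."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": document,
                        # Everything up to and including this block is the cached prefix.
                        "cache_control": {"type": "ephemeral"},
                    },
                    # Only this block changes between passes, so it stays uncached.
                    {"type": "text", "text": instruction},
                ],
            }
        ],
    }


DOCUMENT = "Full text of the document under analysis..."
PASSES = [
    "List the three main claims. 60 words maximum.",
    "List any dates and figures as JSON only.",
    "Classify the tone as positive, neutral, or negative. One word.",
]

requests = [build_pass_request(DOCUMENT, instruction) for instruction in PASSES]
```

    Because the document block is byte-identical in every request and comes before the instruction, the first pass pays the cache write and later passes issued within the cache lifetime can read it back.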

    The CLAUDE.md and Second Brain Connection

    The cold-start reduction strategy covered in Cluster 1 works because you are reducing what gets sent, not how it gets priced. Prompt caching is the complement: for context that must be sent, cache it. Together, the disciplines compound — your CLAUDE.md file keeps context lean; prompt caching ensures whatever you do send repeatedly costs 90% less after the first hit.

    Cache Design Principles

    Place stable content at the front of your prompt. Place dynamic content at the end. The cache key is prefix-matched, so any change to content before the cache_control marker invalidates the cache. If your system prompt has a dynamic date or session ID embedded at the top, you are guaranteeing cache misses on every call. The correct structure: static knowledge base, then cache marker, then dynamic task-specific instruction. Never the reverse.

    You can set up to 4 cache breakpoints in a single request for granular control. Most production implementations need only one or two — the system prompt cache and optionally a conversation history cache in long multi-turn sessions.
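
    The ordering rule can be tested locally before any request is sent. The sketch below hashes everything up to the last cache_control marker and compares two days' worth of system blocks. The fingerprint is only a local approximation for catching accidental prefix changes, not the key Anthropic uses internally, and all function names are hypothetical.

```python
import hashlib
import json
from datetime import date

KNOWLEDGE_BASE = "Static knowledge base or style guide text..."


def build_system_blocks(today: date) -> list[dict]:
    """Correct order: static block first (cached), dynamic block last (uncached)."""
    return [
        {"type": "text", "text": KNOWLEDGE_BASE,
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": f"Today's date is {today.isoformat()}."},
    ]


def build_system_blocks_bad(today: date) -> list[dict]:
    """Anti-pattern: the date sits inside the cached prefix."""
    return [
        {"type": "text",
         "text": f"Today's date is {today.isoformat()}.\n{KNOWLEDGE_BASE}",
         "cache_control": {"type": "ephemeral"}},
    ]


def cached_prefix_fingerprint(blocks: list[dict]) -> str:
    """Hash every block up to and including the last cache_control marker."""
    last = max(i for i, block in enumerate(blocks) if "cache_control" in block)
    prefix = json.dumps(blocks[: last + 1], sort_keys=True)
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


day_one, day_two = date(2026, 5, 11), date(2026, 5, 12)

good_stable = (cached_prefix_fingerprint(build_system_blocks(day_one))
               == cached_prefix_fingerprint(build_system_blocks(day_two)))
bad_stable = (cached_prefix_fingerprint(build_system_blocks_bad(day_one))
              == cached_prefix_fingerprint(build_system_blocks_bad(day_two)))

print(good_stable, bad_stable)
```

    A check like this fits naturally in a unit test: if a refactor moves a timestamp or session ID ahead of the marker, the fingerprints diverge and the test fails before the cache miss shows up on the bill.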

    Operational Impact

    Teams running Claude with consistent system prompts across many requests report effective input costs dropping to near-cache-read rates after the first request. For a content pipeline running 1,000 daily requests with a 5,000-token system prompt: uncached cost is $15/day on input alone at Sonnet rates ($3/M). Cached cost, assuming requests arrive often enough to keep the cache warm: about $1.50/day. That is roughly $13.50/day saved on system prompt tokens alone, or about $4,900/year from one implementation change that takes an afternoon to ship.
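
    The pipeline figures can be reproduced, with the first request's write premium included, in a short calculation. The variable names are illustrative. The best case assumes a single cache write per day with every other request hitting the cache; the worst case, where every request arrives after the entry has expired, is included for contrast.

```python
REQUESTS_PER_DAY = 1_000
SYSTEM_PROMPT_TOKENS = 5_000
STANDARD, CACHE_WRITE, CACHE_READ = 3.00, 3.75, 0.30  # $/M tokens, Sonnet

millions_per_request = SYSTEM_PROMPT_TOKENS / 1_000_000

uncached_daily = REQUESTS_PER_DAY * millions_per_request * STANDARD

# Best case: one write per day, every other request is a cache read.
cached_daily = millions_per_request * (
    CACHE_WRITE + (REQUESTS_PER_DAY - 1) * CACHE_READ
)

# Worst case: the cache expires between requests, so every call pays the write rate.
all_miss_daily = REQUESTS_PER_DAY * millions_per_request * CACHE_WRITE

saved_daily = uncached_daily - cached_daily

print(f"uncached:   ${uncached_daily:.2f}/day")
print(f"cached:     ${cached_daily:.2f}/day")
print(f"all misses: ${all_miss_daily:.2f}/day")
print(f"saved:      ${saved_daily:.2f}/day, ${saved_daily * 365:,.0f}/year")
```

    Counting the one write, the cached figure comes to about $1.52/day rather than $1.50, which leaves the annual saving at roughly $4,900. The all-misses case costs more than not caching at all, which is why request spacing relative to the cache lifetime matters as much as the rates.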
