Most Claude cost analyses focus on input tokens — the knowledge you send in. The underappreciated lever is output compression. Claude is trained to be thorough. Left unconstrained, it produces full meals: preambles, recaps, hedges, transition sentences, closing summaries. Every one of those tokens costs money, and most of them are unnecessary. Output discipline — getting Claude to deliver concentrated slices instead of full meals — is often the highest-leverage cost reduction available without changing models or switching to async.
This is part of the Claude on a Budget series. For input-side compression, see The Cold-Start Problem. For pricing mechanics, see Prompt Caching.
The Default Verbosity Problem
Ask Claude to “summarize this document” without constraints and you will get: an opening sentence restating the task, a multi-paragraph summary, a bullet-point recap of the summary, and a closing note about what was not covered. The actual information density — insight per token — is low. You paid for 800 tokens of output and needed 150. Multiply across thousands of API calls and you have built a significant cost leak from default model behavior, not from bad prompts.
The Output Compression Toolkit
1. Explicit word and token caps in the prompt. “Respond in 150 words or fewer” is the single most effective instruction for reducing output tokens. Claude respects tight limits. “Be concise” does not work reliably; “150 words maximum” does. For JSON outputs: “Respond with only valid JSON, no markdown fences, no explanation.” Every word spent on format instructions pays for itself 10x over in reduced output across repeated calls.
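A minimal sketch of the cap in practice, using the Anthropic Python SDK. The model name is a placeholder; the 150-word instruction does the compression, and max_tokens backstops it at the API level.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

document_text = "..."  # the material you want summarized

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder -- substitute your model
    max_tokens=250,  # hard backstop slightly above the 150-word cap
    messages=[{
        "role": "user",
        "content": "Summarize this document in 150 words maximum. No preamble.\n\n" + document_text,
    }],
)
print(response.content[0].text)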
2. Structured output schemas. When you need structured data, define the exact JSON schema. Claude stops generating prose and fills fields. You get exactly what you specified and nothing more. The token reduction versus free-form responses is typically 40-70% for equivalent information content.
# Free-form -- verbose, unpredictable length
prompt_verbose = "Summarize the key points of this article and their implications."
# Structured -- tight, predictable, cheaper
prompt_structured = """Extract from this article:
{"headline": "string", "key_points": ["string", "string", "string"], "sentiment": "positive|neutral|negative"}
Respond with valid JSON only. No explanation."""
3. Role-based compression priming. System prompt framing shapes output length. “You are a precise technical writer who values brevity. Never restate the task. Deliver the answer directly.” produces consistently shorter outputs than a neutral system prompt. This is prompt engineering for token economics, not just quality.
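A hedged sketch of the same priming through the API's system parameter; the model name and exact wording are illustrative:
import anthropic

client = anthropic.Anthropic()

system_prompt = (
    "You are a precise technical writer who values brevity. "
    "Never restate the task. Deliver the answer directly."
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder -- substitute your model
    max_tokens=300,
    system=system_prompt,  # the priming lives here, not in the user turn
    messages=[{"role": "user", "content": "Explain prompt caching in two sentences."}],
)
print(response.content[0].text)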
4. Chained micro-tasks over monolithic requests. Instead of asking Claude to research, analyze, synthesize, and format in one prompt, chain smaller requests. Each call is scoped to one task with tight output constraints. Total tokens across the chain are often lower than a single unconstrained request, and intermediate outputs are cacheable — pairing naturally with the prompt caching strategy.
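A rough sketch of the chaining pattern. The call_claude helper, the prompts, and the caps are illustrative assumptions, not a prescribed pipeline:
import anthropic

client = anthropic.Anthropic()

def call_claude(prompt: str, max_tokens: int) -> str:
    """One scoped call with a hard output ceiling."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder -- substitute your model
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

article = "..."  # source material

# Step 1: extraction only -- no analysis, tight cap.
facts = call_claude(
    f"List the 5 key facts in this article, one line each, no commentary:\n\n{article}",
    max_tokens=200,
)

# Step 2: analysis over the extracted facts, not the full article.
analysis = call_claude(
    f"In 100 words maximum, state the implications of these facts:\n\n{facts}",
    max_tokens=200,
)

# Step 3: formatting from the analysis alone.
summary = call_claude(
    f"Turn this into 3 bullet points. No preamble:\n\n{analysis}",
    max_tokens=150,
)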
The Notion Second Brain Application
The operational implementation at Tygart Media runs this pattern at the pipeline level. The Notion second brain eliminates the need for Claude to generate background context — it already exists in structured form. Extractions from Notion arrive as pre-formatted knowledge blocks. Claude’s task is synthesis over existing structured data, not open-ended research and explanation. Output prompts are scoped: “Given this structured data, write a 400-word section for [topic]. No preamble, no conclusion, begin directly with the first point.” The output is a concentrated slice — dense, usable, billable at a fraction of what free-form generation costs for equivalent value.
Measuring Compression Effectiveness
Track output_tokens in your API responses. Log them per prompt template. Identify your highest-output templates and run compression interventions — tighter word caps, structured formats, role priming. The target is information density: insight delivered per output token, not raw token count. A 500-token output with 3 actionable insights beats a 200-token output with 1. Compression discipline is about removing the scaffolding (preambles, hedges, recaps) while preserving the load-bearing structure (insight, data, instruction).
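A minimal logging sketch, assuming each call is tagged with its prompt template name; the CSV format is an illustrative choice:
import csv
import time
import anthropic

client = anthropic.Anthropic()

def logged_call(template_name: str, prompt: str, max_tokens: int) -> str:
    """Call the Messages API and record per-template token usage for later analysis."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder -- substitute your model
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    with open("token_usage.csv", "a", newline="") as f:
        csv.writer(f).writerow([
            int(time.time()),
            template_name,
            response.usage.input_tokens,
            response.usage.output_tokens,
        ])
    return response.content[0].text
Group the rows by template name and sort by total output tokens to see where a compression intervention pays off first.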
max_tokens as a Hard Ceiling
Set max_tokens conservatively in your API calls. This is your financial guardrail, not just a model parameter. For classification tasks: 50 tokens. For short summaries: 200 tokens. For structured JSON extraction: 500 tokens. For article drafts: 1,500-2,000 tokens. Leaving max_tokens at the model default (4,096-8,192) on every call leaves your cost ceiling unjustifiably high. Claude will rarely hit a conservative ceiling on constrained tasks, but the cap prevents runaway generation on edge-case inputs that can quietly inflate your bill.
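One way to keep those ceilings explicit is a per-task lookup so no call silently falls back to the default; the numbers mirror the guidance above and are starting points, not hard rules:
# Hard output ceilings per task type -- tune to your own workloads.
MAX_TOKENS_BY_TASK = {
    "classification": 50,
    "short_summary": 200,
    "json_extraction": 500,
    "article_draft": 2000,
}

def max_tokens_for(task: str) -> int:
    # Unknown tasks get a conservative ceiling, never the model default.
    return MAX_TOKENS_BY_TASK.get(task, 500)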
Next: Per-Model Content Shaping: Write Less, Get Cited More →
