If you’re sending the same large block of context — a knowledge base, a style guide, a long system prompt, a reference document — with every Claude request, you’re paying full input token rate on every single call. Anthropic’s prompt caching collapses that to roughly 10% of the standard input rate for cache hits. For context blocks of 1,000+ tokens sent repeatedly, this is one of the most reliable cost levers available.
This is part of the Claude on a Budget series. For async workloads, see The Batch API: 50% Off for Non-Urgent Work. For token reduction before the API call, see The Cold-Start Problem: Second Brain and CLAUDE.md.
How the Cache Works
Prompt caching is prefix-based. Anthropic caches the exact token sequence up to your cache_control breakpoint. Any subsequent request that begins with that identical prefix hits the cache and pays the cache read rate (~$0.30/M for Sonnet vs. $3.00/M standard, a 90% reduction). The cache has a roughly 5-minute time-to-live that refreshes each time the prefix is reused, with extended TTL options for longer-lived contexts. Token minimums apply: 1,024 tokens for Sonnet and Opus, 2,048 for Haiku.
The Pricing Reality
| Model | Standard Input | Cache Write | Cache Read | Read Savings |
|---|---|---|---|---|
| Haiku 4.5 | $1.00/M | $1.25/M | $0.10/M | 90% |
| Sonnet 4.6 | $3.00/M | $3.75/M | $0.30/M | 90% |
| Opus 4.7 | $5.00/M | $6.25/M | $0.50/M | 90% |
Cache writes carry a 25% premium over standard input, so the first request costs slightly more. Break-even comes on the second request: every cache read after the initial write is 90% cheaper, so for any context block read more than once, caching wins.
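As a quick sanity check, here is that arithmetic for a 5,000-token block at the Sonnet rates above (the block size is illustrative):
# Break-even check for a 5,000-token cached prefix at Sonnet rates ($/M tokens).
STANDARD, CACHE_WRITE, CACHE_READ = 3.00, 3.75, 0.30
block_m = 5_000 / 1_000_000  # tokens expressed in millions
two_calls_uncached = 2 * block_m * STANDARD              # $0.03000
two_calls_cached = block_m * (CACHE_WRITE + CACHE_READ)  # $0.02025
print(two_calls_uncached, two_calls_cached)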
The Implementation
import anthropic
client = anthropic.Anthropic()
# Large system prompt or knowledge base -- mark for caching
SYSTEM_CONTEXT = "Your 5,000-token knowledge base or style guide here..."
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_CONTEXT,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Your specific question or task here"}
    ]
)
# Check cache performance in usage stats
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
On the first call, you will see cache_creation_input_tokens populated — that is the write. On subsequent calls with the same prefix, cache_read_input_tokens shows what was served from cache at 10% cost.
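To track savings over time, a small helper along these lines can turn those usage fields into an effective input cost per request (the helper is ours; the rates are the Sonnet figures above):
# Rough cost estimate from the usage fields above, at Sonnet per-million rates.
# input_tokens covers only the uncached portion of the prompt.
def estimate_input_cost(usage, standard=3.00, write=3.75, read=0.30):
    return (
        usage.input_tokens * standard
        + usage.cache_creation_input_tokens * write
        + usage.cache_read_input_tokens * read
    ) / 1_000_000

print(f"Effective input cost: ${estimate_input_cost(response.usage):.6f}")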
What to Cache
Anything large, stable, and repeated qualifies. The highest-value candidates: long system prompts defining agent behavior or personas; reference documents such as product specs, legal terms, or knowledge bases; few-shot example sets (10+ examples add up fast); conversation history in multi-turn applications where you mark the stable history prefix for caching and leave only the new turn uncached. In agentic pipelines where Claude processes a document repeatedly across multiple analysis passes, cache the document body and vary only the instruction. You pay full rate once per document, cache rate on every subsequent pass.
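A sketch of that multi-pass pattern, reusing the client from above (the document text and instructions are placeholders, and the document must clear the token minimum to be cacheable):
# Multi-pass analysis: cache the document body once, vary only the instruction.
DOCUMENT = "Full text of the product spec, contract, or report..."

def analyze(instruction):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": DOCUMENT,
                        "cache_control": {"type": "ephemeral"},  # cached document prefix
                    },
                    {"type": "text", "text": instruction},  # varies per pass, uncached
                ],
            }
        ],
    )

# First pass writes the cache; passes within the TTL read it at 10% of input rate.
for instruction in ["Summarize the key obligations.", "List open risks and ambiguities."]:
    analyze(instruction)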
The CLAUDE.md and Second Brain Connection
The cold-start reduction strategy covered in Cluster 1 works because you are reducing what gets sent, not how it gets priced. Prompt caching is the complement: for context that must be sent, cache it. Together, the disciplines compound — your CLAUDE.md file keeps context lean; prompt caching ensures whatever you do send repeatedly costs 90% less after the first hit.
Cache Design Principles
Place stable content at the front of your prompt. Place dynamic content at the end. The cache key is prefix-matched, so any change to content before the cache_control marker invalidates the cache. If your system prompt has a dynamic date or session ID embedded at the top, you are guaranteeing cache misses on every call. The correct structure: static knowledge base, then cache marker, then dynamic task-specific instruction. Never the reverse.
You can set up to 4 cache breakpoints in a single request for granular control. Most production implementations need only one or two — the system prompt cache and optionally a conversation history cache in long multi-turn sessions.
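A sketch of that two-breakpoint layout, building on the earlier example (SYSTEM_CONTEXT and the sample turns are placeholders):
# Breakpoint 1: the static system prompt. Breakpoint 2: the stable conversation
# history. Only the newest user turn is left uncached.
prior_turns = [
    {"role": "user", "content": "Draft an outline for the launch post."},
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "Here is a first outline...",
                "cache_control": {"type": "ephemeral"},  # breakpoint 2: history prefix
            }
        ],
    },
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_CONTEXT,
            "cache_control": {"type": "ephemeral"},  # breakpoint 1: system prompt
        }
    ],
    messages=prior_turns + [
        {"role": "user", "content": "Now tighten the introduction."}  # new, uncached turn
    ],
)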
Operational Impact
Teams running Claude with a consistent system prompt across many requests report effective input costs dropping to near the cache read rate after the first request. For a content pipeline running 1,000 daily requests with a 5,000-token system prompt, that prompt alone costs $15/day in input tokens uncached at Sonnet rates ($3/M). Cached, after the first request, it costs $1.50/day. That is $13.50/day saved on system prompt tokens alone, roughly $4,900/year from one implementation change that takes an afternoon to ship.
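The arithmetic behind those figures, for anyone who wants to plug in their own volumes (all numbers are the illustrative ones above):
# Daily and annual savings for the illustrative pipeline above.
requests_per_day = 1_000
prompt_tokens = 5_000
daily_m_tokens = requests_per_day * prompt_tokens / 1_000_000  # 5M tokens/day

uncached_per_day = daily_m_tokens * 3.00  # $15.00 at standard Sonnet input rate
cached_per_day = daily_m_tokens * 0.30    # $1.50 at cache read rate
print((uncached_per_day - cached_per_day) * 365)  # ~$4,927/year saved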
Next: Output Compression Discipline: Concentrated Slices vs Full Meals →
