The Batch API: 50% Off for Non-Urgent Claude Work

Every dollar you spend on Claude at full synchronous price is a dollar you’re overpaying for non-urgent work. Anthropic’s Message Batches API delivers a flat 50% discount on both input and output tokens — the same models, the same quality, half the price — with one constraint: results arrive asynchronously, within a 24-hour window and usually much sooner.

This is part of the Claude on a Budget series. If you’re routing models for real-time work, see Model Routing: Haiku vs Sonnet vs Opus. For cutting repeated context costs, see Prompt Caching.

The Math First

Standard Sonnet 4.6 pricing: $3.00 input / $15.00 output per million tokens. Batch Sonnet 4.6: $1.50 input / $7.50 output. Run 1,000 article drafts synchronously and you’re spending full rate on every one. Run the same batch overnight and you cut the bill in half — no model quality change, no output degradation, just a different delivery mechanism.

Model        Sync Input   Sync Output   Batch Input   Batch Output
Haiku 4.5    $1.00/M      $5.00/M       $0.50/M       $2.50/M
Sonnet 4.6   $3.00/M      $15.00/M      $1.50/M       $7.50/M
Opus 4.7     $5.00/M      $25.00/M      $2.50/M       $12.50/M
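
To see the arithmetic on a concrete run, here is a quick back-of-the-envelope calculation for the 1,000-draft example above. The per-draft token counts are illustrative assumptions, not measurements; swap in your own averages.

# Back-of-the-envelope cost for 1,000 drafts on Sonnet 4.6 (rates from the table above)
# Per-draft token counts below are assumptions for illustration only
drafts = 1_000
input_tokens = 200    # assumed prompt size per draft
output_tokens = 700   # assumed length of a ~500-word draft

sync_cost = drafts * (input_tokens * 3.00 + output_tokens * 15.00) / 1_000_000
batch_cost = drafts * (input_tokens * 1.50 + output_tokens * 7.50) / 1_000_000

print(f"Synchronous: ${sync_cost:.2f} | Batch: ${batch_cost:.2f}")
# Synchronous: $11.10 | Batch: $5.55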

What Qualifies as Non-Urgent Work

The honest question is not “does this need to be fast?” — it’s “does this need to be synchronous?” Most content pipelines, data enrichment tasks, classification jobs, and bulk translation runs have no real-time dependency. The user is not waiting at a keyboard. The output feeds a queue. The 24-hour window is irrelevant. Candidates include: nightly article drafts, SEO metadata generation for large post archives, batch product description rewrites, email personalization at scale, sentiment tagging across historical data, bulk summarization of documents or transcripts.

What does not qualify: customer-facing chat, real-time code completion, any workflow where a human is actively waiting for a response.

The API Pattern

import time

import anthropic

client = anthropic.Anthropic()

# Topics to draft; in practice these come from your queue or database
topics = ["prompt caching", "model routing", "batch pipelines"]

# Build your batch — each request is a full message payload with a unique custom_id
requests_list = [
    {
        "custom_id": f"article-{i}",
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 2000,
            "messages": [
                {"role": "user", "content": f"Write a 500-word expert summary of: {topic}"}
            ]
        }
    }
    for i, topic in enumerate(topics)
]

# Submit the batch
batch = client.messages.batches.create(requests=requests_list)
print(f"Batch ID: {batch.id} | Status: {batch.processing_status}")

# Poll until processing ends (results are only available once the whole batch is done)
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(60)

# Retrieve results (order is not guaranteed, so match on custom_id)
for result in client.messages.batches.results(batch.id):
    custom_id = result.custom_id
    if result.result.type == "succeeded":
        text = result.result.message.content[0].text
        print(f"{custom_id}: {text[:100]}...")

Combining Batch API With Prompt Caching

These two discounts stack. If your batch requests share a large system prompt — a style guide, a knowledge base, a persona definition — mark that block with cache_control: {"type": "ephemeral"}. Requests in the batch that hit the same prompt prefix can reuse the cache, though hits are best-effort because batch requests may process concurrently. You pay the cache write rate (25% above base input) on the first hit and the cache read rate (roughly 10% of base input) on every subsequent hit. A 10,000-token system prompt shared across 500 batch requests: you pay the write rate once, the read rate up to 499 times, and you are already on batch pricing for all output tokens. The compounding effect is significant.
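
As a minimal sketch, here is what a single batch request with a cached system block looks like. STYLE_GUIDE is a placeholder for your real shared context; the custom_id and user prompt are illustrative.

# One batch request whose shared system prompt is marked for caching.
# STYLE_GUIDE is a placeholder for your real ~10,000-token shared context.
STYLE_GUIDE = "..."

cached_request = {
    "custom_id": "article-42",
    "params": {
        "model": "claude-sonnet-4-6",
        "max_tokens": 2000,
        "system": [
            {
                "type": "text",
                "text": STYLE_GUIDE,
                "cache_control": {"type": "ephemeral"}  # cached across matching prompt prefixes
            }
        ],
        "messages": [
            {"role": "user", "content": "Write a 500-word expert summary of: prompt caching"}
        ]
    }
}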

Structuring Your Pipeline Around Batch Windows

The practical architecture: identify every Claude call in your current workflow that has no real-time dependency. Move those calls behind a queue. Set a nightly cron that drains the queue into a batch submission at 11 PM. Results are ready by morning. Your synchronous Claude budget drops to customer-facing interactions only — often 20-30% of total volume for content and data operations teams.
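
A minimal sketch of that nightly drain job, assuming a hypothetical load_pending_jobs() stands in for your real queue read (database, Redis, or similar):

# Nightly drain job, run from cron at 11 PM
import anthropic

client = anthropic.Anthropic()

def load_pending_jobs():
    # Placeholder for your real queue read; each job needs an id and a prompt
    return [{"id": "article-1", "prompt": "Write a 500-word expert summary of: prompt caching"}]

def drain_queue_to_batch():
    jobs = load_pending_jobs()
    if not jobs:
        return None
    requests_list = [
        {
            "custom_id": job["id"],
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 2000,
                "messages": [{"role": "user", "content": job["prompt"]}]
            }
        }
        for job in jobs
    ]
    batch = client.messages.batches.create(requests=requests_list)
    return batch.id  # store this so the morning job knows which batch to fetch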

Rate limits are separate for batch vs. synchronous traffic, so batch jobs do not compete with your real-time usage. That is a free operational benefit on top of the price cut.

Error Handling at Scale

Batch results include a result.type field: succeeded, errored, canceled, or expired (requests that did not finish inside the 24-hour window). Always iterate the full result set and collect the failed custom_ids for resubmission. At scale — thousands of requests — you will see occasional errors. Build the retry loop into your pipeline from day one rather than discovering it when 3% of a 10,000-request batch silently fails.
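
A minimal sketch of that retry pass, assuming requests_list, client, and batch are the objects from the submission code above:

# Collect failed custom_ids from a finished batch and resubmit only those requests
by_id = {req["custom_id"]: req for req in requests_list}

failed_ids = []
for result in client.messages.batches.results(batch.id):
    if result.result.type != "succeeded":  # errored, canceled, or expired
        failed_ids.append(result.custom_id)

if failed_ids:
    retry_batch = client.messages.batches.create(
        requests=[by_id[custom_id] for custom_id in failed_ids]
    )
    print(f"Resubmitted {len(failed_ids)} requests as batch {retry_batch.id}")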

The Honest Tradeoff

Batch API is a discipline, not a feature. It requires you to think about your Claude usage in terms of urgency tiers, not just prompt quality. Teams that adopt it consistently cut their Claude bills by 30-50% on total spend — not because every call moves to batch, but because the non-urgent majority does. Combined with model routing (Haiku for triage, Sonnet for batch drafts, Opus only for synchronous high-stakes reasoning), it is the highest-leverage cost lever available in the Anthropic stack today.

Next: Prompt Caching: How to Cut Repeated Context Costs by Up to 90% →