Claude vs GPT vs Gemini: Coding Benchmark Leaderboard (June 2026)

Last verified: June 13, 2026

As of June 13, 2026, the four models most often compared for coding work are Claude Fable 5 and Claude Opus 4.8 from Anthropic, GPT-5.5 from OpenAI, and Gemini 3.1 Pro from Google. This page is a leaderboard built on one rule: every score below comes from a vendor’s own page or official model card that we fetched on the verification date, or the cell is marked as not machine-verifiable or not published. Several vendors publish their benchmark tables as images rather than machine-readable text; where we could not read an official figure directly, we mark the metric and link to the source document instead of estimating. The result is a smaller table than most roundups, but every number in it is one you can click through and check.

Models and pricing (verified specs)

These columns are confirmed from each vendor’s official model documentation. Claude prices, context windows, and cutoffs come from Anthropic’s models overview and the AWS Bedrock model card; GPT-5.5 specs come from OpenAI’s developer docs; Gemini 3.1 Pro specs come from Google DeepMind’s model card and the Gemini API pricing page.

Model	API ID	Input / Output (per Mtok)	Context	Max output	Knowledge cutoff
Claude Fable 5	`claude-fable-5`	$10 / $50	1M	128K	Not stated on overview*
Claude Opus 4.8	`claude-opus-4-8`	$5 / $25	1M	128K	Jan 2026
GPT-5.5	`gpt-5.5`	$5 / $30	1,050,000	128K	Dec 1, 2025
Gemini 3.1 Pro	`gemini-3.1-pro-preview`	$2 / $12 (≤200K)**	1M	64K	Not stated on model card

*Anthropic’s models overview lists Fable 5’s specs and price but does not publish a knowledge-cutoff date for it in the table we fetched.

**Gemini 3.1 Pro uses tiered pricing: $2 / $12 per Mtok for prompts up to 200K tokens, rising to $4 / $18 for prompts above 200K tokens (Google AI pricing page).

GPT-5.5 pricing rises to 2x input / 1.5x output above 272K input tokens (OpenAI developer docs).

Claude Opus 4.8 offers an optional fast mode at $10 / $50 per Mtok (Anthropic).
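The tiered rates above can be turned into a rough per-request cost estimate. The following is a minimal sketch, not a vendor billing formula: the `Pricing` class, the `PRICES` table, and `estimate_cost` are hypothetical names introduced here, and the code assumes that once a prompt exceeds a vendor's threshold the higher rate applies to the whole request rather than only to the tokens above the threshold. Check each vendor's pricing page before relying on that assumption. Opus 4.8's optional fast mode is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens, with an optional long-prompt tier."""
    input_per_mtok: float
    output_per_mtok: float
    threshold_tokens: Optional[int] = None      # prompt size above which the long tier applies
    long_input_per_mtok: Optional[float] = None
    long_output_per_mtok: Optional[float] = None


# Figures copied from the spec table and its footnotes.
# GPT-5.5 long tier: 2x input ($10) and 1.5x output ($45) above 272K input tokens.
PRICES = {
    "claude-fable-5": Pricing(10.0, 50.0),
    "claude-opus-4-8": Pricing(5.0, 25.0),
    "gpt-5.5": Pricing(5.0, 30.0, 272_000, 10.0, 45.0),
    "gemini-3.1-pro-preview": Pricing(2.0, 12.0, 200_000, 4.0, 18.0),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request.

    Assumption (not confirmed by the source): the long-prompt rate applies
    to every token in the request once the prompt exceeds the threshold.
    """
    p = PRICES[model]
    long_prompt = p.threshold_tokens is not None and input_tokens > p.threshold_tokens
    in_rate = p.long_input_per_mtok if long_prompt else p.input_per_mtok
    out_rate = p.long_output_per_mtok if long_prompt else p.output_per_mtok
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


if __name__ == "__main__":
    # Compare a short and a long prompt, each producing 10K output tokens.
    for prompt_size in (100_000, 300_000):
        for model in PRICES:
            cost = estimate_cost(model, prompt_size, 10_000)
            print(f"{model:<24} {prompt_size:>7} in / 10000 out: ${cost:.2f}")
```

Under these assumptions, a prompt that crosses a tier threshold changes the price ranking less than it changes the absolute cost, so it is worth running the estimate at the prompt sizes you actually expect.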

Coding benchmark scores (primary-source only)

Each cell is either a figure we read directly from a primary source on June 13, 2026, or marked “not machine-verifiable” with the source you should consult. A cell without a number never means zero; it means the official figure was not available in readable form during verification. Note the harness and version differences called out in the footnotes: they make cross-vendor cells not strictly comparable.

Benchmark	Claude Fable 5	Claude Opus 4.8	GPT-5.5	Gemini 3.1 Pro
SWE-bench Verified	Not machine-verifiable (see system card)	Not machine-verifiable (see system card)	Not published in retrievable primary source	80.6%
SWE-bench Pro (Public)	Not machine-verifiable (see system card)	Not machine-verifiable (see system card)	Not published in retrievable primary source	54.2%
Terminal-Bench	Not machine-verifiable (see system card)	Not machine-verifiable (see system card)	83.4% (v2.1, Codex CLI harness)†	68.5% (v2.0, Terminus-2 harness)
LiveCodeBench Pro	Not published in retrievable primary source	Not published in retrievable primary source	Not published in retrievable primary source	2887 Elo

† GPT-5.5’s Terminal-Bench 2.1 figure of 83.4% is the score Anthropic attributes to GPT-5.5 “with the Codex CLI harness” in a footnote on its Claude Opus 4.8 announcement page. It is a competitor-reported comparison, not a number we read from OpenAI directly. Google reports Gemini 3.1 Pro on Terminal-Bench 2.0 under the Terminus-2 harness (68.5%); because the version and harness differ, the Gemini and GPT-5.5 Terminal-Bench cells are not directly comparable. Gemini’s SWE-bench Verified (80.6%), SWE-bench Pro Public (54.2%), and LiveCodeBench Pro (2887 Elo) are single-attempt figures from Google’s official Gemini 3.1 Pro model card.

What we could not verify from a primary source

Anthropic publishes its coding comparison tables for Claude Opus 4.8 and Claude Fable 5 as images inside its announcement pages, and the full Claude Opus 4.8 System Card PDF exceeded our fetch size limit, so we could not machine-read those percentages on the verification date. OpenAI’s GPT-5.5 announcement page returned an access error to our fetcher, and its developer-docs model page lists specs and pricing but no benchmark scores. We have therefore left Claude’s and GPT-5.5’s SWE-bench figures out of the table rather than reproduce numbers we could not confirm at the source. For those figures, consult the primary documents linked in our source list: the Claude Opus 4.8 System Card, the Claude Fable 5 and Mythos 5 announcement, and OpenAI’s GPT-5.5 page. If you are choosing a model today, the verified spec table above (price, context, output, cutoff) is the part you can rely on without caveat.

How to read a coding leaderboard

Three cautions apply to any 2026 coding comparison. First, harness matters: the same model scores differently on Terminal-Bench depending on whether it runs under Terminus-2, a Codex CLI scaffold, or a vendor’s internal agent, which is why we annotate every Terminal-Bench cell. Second, version matters: “Terminal-Bench 2.0” and “Terminal-Bench 2.1” are different test sets, and the public and full splits of “SWE-bench Pro” differ, so a single percentage with no version is close to meaningless. Third, a headline score is one slice of behavior; long-horizon agentic coding, tool-call reliability, and context handling over a long session often decide real-world usefulness more than a single pass rate. Treat the verified cells here as a starting point, then test the shortlist on your own repository.
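The first two cautions can be made mechanical. The sketch below is one way to record a leaderboard cell together with its version and harness, and to list the reasons two cells should not be read head-to-head. The `Cell` class and `comparison_caveats` function are hypothetical names introduced for illustration; the two example cells use the Terminal-Bench figures from the table above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Cell:
    """One leaderboard cell plus the context needed to interpret it."""
    model: str
    benchmark: str
    score: Optional[float]          # None: no figure readable at a primary source
    version: Optional[str] = None   # benchmark version, when the source states one
    harness: Optional[str] = None   # agent scaffold the score was produced under
    source: str = ""                # who reported the figure


def comparison_caveats(a: Cell, b: Cell) -> List[str]:
    """Return the reasons two cells are not directly comparable (empty if none)."""
    caveats: List[str] = []
    if a.benchmark != b.benchmark:
        caveats.append(f"benchmark differs ({a.benchmark} vs {b.benchmark})")
    for cell in (a, b):
        if cell.score is None:
            caveats.append(f"{cell.model}: no verified score")
    if a.version != b.version:
        caveats.append(f"version differs ({a.version} vs {b.version})")
    if a.harness != b.harness:
        caveats.append(f"harness differs ({a.harness} vs {b.harness})")
    return caveats


# Terminal-Bench cells as listed in the table above.
gpt_tb = Cell("GPT-5.5", "Terminal-Bench", 83.4, "2.1", "Codex CLI",
              source="Anthropic footnote (competitor-reported)")
gemini_tb = Cell("Gemini 3.1 Pro", "Terminal-Bench", 68.5, "2.0", "Terminus-2",
                 source="Google model card")

if __name__ == "__main__":
    for reason in comparison_caveats(gpt_tb, gemini_tb):
        print(reason)
```

Keeping the version and harness next to every score means a missing annotation shows up as `None` instead of silently passing for a like-for-like result.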

Which model has the highest published coding benchmark score in June 2026?

We cannot crown a single winner from primary sources alone, because Anthropic and OpenAI publish their coding scores in formats we could not machine-verify on June 13, 2026. From figures we could read directly, Google’s Gemini 3.1 Pro model card reports 80.6% on SWE-bench Verified and 54.2% on SWE-bench Pro (Public). Anthropic’s and OpenAI’s comparable figures are in their system cards and announcement pages, which we link in the sources; we did not reproduce them here because they were not readable at the source during verification.

What does Claude Fable 5 cost, and how is it different from Opus 4.8?

Claude Fable 5 (claude-fable-5) is priced at $10 per million input tokens and $50 per million output tokens, with a 1M-token context window and up to 128K output tokens (Anthropic models overview). Claude Opus 4.8 (claude-opus-4-8) is the Opus-tier flagship at $5 / $25 per Mtok, also 1M context and 128K output, with a January 2026 knowledge cutoff. Fable 5 is Anthropic’s most capable widely released model; Opus 4.8 is the lower-priced model most teams will use for everyday agentic coding.

Why are some benchmark cells marked “not machine-verifiable” instead of showing a number?

Because this page only prints scores we could confirm from a primary source on the verification date. Several vendors render their benchmark tables as images, and one large system-card PDF exceeded our fetch limit, so the underlying percentages were not readable to us. Rather than copy figures from third-party trackers, we mark the cell and point you to the official document. It keeps the leaderboard honest at the cost of being shorter.

How do the context windows compare?

Claude Fable 5, Claude Opus 4.8, and Gemini 3.1 Pro each offer a 1M-token context window; GPT-5.5 offers 1,050,000 tokens. Maximum output is 128K tokens for Claude Fable 5, Claude Opus 4.8, and GPT-5.5, and 64K tokens for Gemini 3.1 Pro. Note that Claude Opus 4.8’s context window is 200K on Microsoft Foundry specifically, per Anthropic’s documentation.

Is Terminal-Bench comparable across these models?

Not cell-for-cell. Google reports Gemini 3.1 Pro on Terminal-Bench 2.0 under the Terminus-2 harness (68.5%), while the GPT-5.5 figure we show (83.4%) is Terminal-Bench 2.1 under a Codex CLI harness, as attributed by Anthropic. Different versions and different harnesses mean the two numbers should not be read as a head-to-head result.
