Benchmark
Not all AI models are equal when it comes to writing code. We track official benchmark scores from Anthropic, OpenAI, and Google DeepMind alongside our own hands-on vibe coding tests so you can pick the right model for your next project.
Last updated March 18, 2026 · 7 models
| Benchmark | Setting | GPT-5.4 (OpenAI) | Gemini 3.1 Pro (Google) | Claude Opus 4.6 (Anthropic) | Claude Sonnet 4.6 (Anthropic) |
|---|---|---|---|---|---|
| SWE-bench Verified (agentic coding) | Single attempt | — | 80.6% | 80.8% | 79.6% |
| SWE-Bench Pro (diverse agentic coding tasks) | Single attempt | 57.7% | 54.2% | — | — |
| Terminal-Bench 2.0 (agentic terminal coding) | Terminus-2 harness | 75.1% | 68.5% | 65.4% | 59.1% |
| LiveCodeBench Pro (competitive coding problems from Codeforces, ICPC, and IOI) | Elo | — | 2,887 | — | — |
| SciCode (scientific research coding) | | — | 59.0% | 52.0% | 47.0% |
| OSWorld-Verified (desktop computer use tasks) | | 75.0% | — | 72.7% | 72.5% |
| MCP Atlas (multi-step workflows using MCP) | | 67.2% | 69.2% | 59.5% | 61.3% |
| τ2-bench (agentic and tool use) | Retail | — | 90.8% | 91.9% | 91.7% |
| τ2-bench (agentic and tool use) | Telecom | 98.9% | 99.3% | 99.3% | 97.9% |
| APEX-Agents (long-horizon professional tasks) | | — | 33.5% | 29.8% | — |
| MRCR v2 (long-context performance) | 128K (average) | — | 84.9% | 84.0% | 84.9% |
| Speed (characters per second) | | — | 99 c/s | 126 c/s | 160 c/s |
| GPQA Diamond (scientific knowledge) | No tools | 92.8% | 94.3% | 91.3% | 89.9% |
| Humanity's Last Exam (academic reasoning, full set, text + MM) | No tools | 39.8% | 44.4% | 40.0% | 33.2% |
| Humanity's Last Exam (academic reasoning, full set, text + MM) | Search (blocklist) + Code | 52.1% | 51.4% | 53.1% | 49.0% |
| ARC-AGI v2 (abstract reasoning puzzles) | ARC Prize Verified | 73.3% | 77.1% | 68.8% | 58.3% |
| GDPval-AA (expert tasks) | Elo | — | 1,317 | 1,606 | 1,633 |
| BrowseComp (agentic search) | Search + Python + Browse | 82.7% | 85.9% | 84.0% | 74.7% |
| MMMU-Pro (multimodal understanding and reasoning) | No tools | 81.2% | 80.5% | 73.9% | 74.5% |
| MMMLU (multilingual Q&A) | | — | 92.6% | 91.1% | 89.3% |
| Context window | | 1.0M | 1.0M | 200K | 200K |
| Input price ($/M tokens) | | $2.50 | $2.50 | $5 | $3 |
| Output price ($/M tokens) | | $15 | $15 | $25 | $15 |
| Released | | Mar. 2026 | Feb. 2026 | Feb. 2026 | Feb. 2026 |
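To make the per-token prices concrete, here is a minimal cost sketch. The model names and prices come from the table above; the token counts are hypothetical, chosen only for illustration.

```typescript
// Estimate the cost of one request from per-million-token prices.
// Prices are taken from the table above; token counts below are made up.
interface Pricing {
  inputPerMTok: number;  // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

const models: Record<string, Pricing> = {
  "GPT-5.4":           { inputPerMTok: 2.5, outputPerMTok: 15 },
  "Gemini 3.1 Pro":    { inputPerMTok: 2.5, outputPerMTok: 15 },
  "Claude Opus 4.6":   { inputPerMTok: 5,   outputPerMTok: 25 },
  "Claude Sonnet 4.6": { inputPerMTok: 3,   outputPerMTok: 15 },
};

function requestCostUSD(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * p.inputPerMTok
       + (outputTokens / 1_000_000) * p.outputPerMTok;
}

// Hypothetical workload: a 20K-token prompt that produces a 4K-token completion.
for (const [name, p] of Object.entries(models)) {
  console.log(`${name}: $${requestCostUSD(p, 20_000, 4_000).toFixed(2)}`);
}
```

At those assumed token counts, a Claude Opus 4.6 request works out to roughly $0.20, versus about $0.11 for GPT-5.4 or Gemini 3.1 Pro and $0.12 for Claude Sonnet 4.6.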
Real-World Benchmark Tests
We gave every model the same prompts and ran the outputs ourselves — no cherry-picking, no synthetic scores. Creative HTML and Landing Page tests show the actual rendered results so you can judge quality directly. More scored tests coming soon.
Visual Tests
What makes each test hard:

- Multi-layer fire physics, rising embers, and ambient glow that pulses with the flame
- Merging drop paths, convincing lens refraction, and bokeh cityscape behind the glass
- Wax drip physics, dynamic shadow casting, and a flame that reacts to simulated air
- Organic sine-wave light curtains that shift color, plus a lake reflection that mirrors the sky
- Particle granule physics, realistic pile accumulation, and a smooth flip-and-restart cycle
- Non-repeating flicker timing with per-letter failure states, wet ground reflection, and moth orbital paths
- Bioluminescent glow that lights the water, pulsing bell contraction, and flowing tentacle physics
- Collision detection across ramps and pegs, momentum transfer, and glassy specular highlights
- SVG deformation that looks natural, a working clock showing real time, and a full surrealist scene
- Thousands of stars with differential angular velocity, depth-of-field blur, and nebula color clouds (a sketch of the rotation technique follows below)
No scores for this category. Compare the outputs and decide which model wins.
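For readers curious what these prompts actually exercise, here is a minimal sketch of one technique named above: the differential angular velocity from the starfield test. It is not any model's output, it omits the depth-of-field blur and nebula clouds, and the canvas element id is an assumption.

```typescript
// Stars orbit a common centre, with angular velocity falling off with
// radius, so inner stars "lap" outer ones over time. Canvas id "galaxy"
// is a placeholder for whatever element the page provides.
type Star = { radius: number; angle: number; speed: number; size: number };

const canvas = document.getElementById("galaxy") as HTMLCanvasElement;
const ctx = canvas.getContext("2d")!;
const cx = canvas.width / 2;
const cy = canvas.height / 2;

const stars: Star[] = Array.from({ length: 2000 }, () => {
  const radius = 20 + Math.random() * (Math.min(cx, cy) - 20);
  return {
    radius,
    angle: Math.random() * Math.PI * 2,
    speed: 0.4 / Math.sqrt(radius), // radians per second, slower further out
    size: 0.5 + Math.random() * 1.5,
  };
});

let last = performance.now();
function frame(now: number) {
  const dt = (now - last) / 1000;
  last = now;
  ctx.fillStyle = "rgba(0, 0, 10, 0.3)"; // fade old frames into faint trails
  ctx.fillRect(0, 0, canvas.width, canvas.height);
  ctx.fillStyle = "white";
  for (const s of stars) {
    s.angle += s.speed * dt;
    const x = cx + Math.cos(s.angle) * s.radius;
    const y = cy + Math.sin(s.angle) * s.radius * 0.6; // squash into a tilted disc
    ctx.fillRect(x, y, s.size, s.size);
  }
  requestAnimationFrame(frame);
}
requestAnimationFrame(frame);
```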

