Benchmark

Not all AI models are equal when it comes to writing code. We track official benchmark scores from Anthropic, OpenAI, and Google DeepMind alongside our own hands-on vibe coding tests so you can pick the right model for your next project.

Last updated March 18, 2026 · 7 models

SWE-bench Verified · Agentic coding · Single attempt
SWE-Bench Pro · Diverse agentic coding tasks · Single attempt · 57.7%
Terminal-Bench 2.0 · Agentic terminal coding · Terminus-2 harness · 75.1%
LiveCodeBench Pro · Competitive coding problems from Codeforces, ICPC, and IOI · Elo
SciCode · Scientific research coding
OSWorld-Verified · Desktop computer use tasks · 75.0%
MCP Atlas · Multi-step workflows using MCP · 67.2%
τ2-bench · Agentic and tool use · Retail
τ2-bench · Agentic and tool use · Telecom · 98.9%
APEX-Agents · Long horizon professional tasks
MRCR v2 (128K) · Long context performance · 128K (average)
Speed · Characters per second
GPQA Diamond · Scientific knowledge · No tools · 92.8%
Humanity's Last Exam · Academic reasoning (full set, text + MM) · No tools · 39.8%
Humanity's Last Exam · Academic reasoning (full set, text + MM) · Search (blocklist) + Code · 52.1%
ARC-AGI v2 · Abstract reasoning puzzles · ARC Prize Verified · 73.3%
GDPval-AA Elo · Expert tasks
BrowseComp · Agentic search · Search + Python + Browse · 82.7%
MMMU-Pro · Multimodal understanding and reasoning · No tools · 81.2%
MMMLU · Multilingual Q&A
Context · 1.0M
Input $/M · $2.5
Output $/M · $15
Released · Mar. 2026
Scores sourced from official model cards by Anthropic, OpenAI, and Google DeepMind
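
To make the pricing rows concrete, here is a minimal sketch of how the Input $/M and Output $/M figures listed above translate into the cost of a single request. It assumes "$/M" means US dollars per one million tokens; the function name and the example token counts are hypothetical and chosen only for illustration.

```python
# Assumption: "$/M" means US dollars per one million tokens.
INPUT_USD_PER_M = 2.5    # "Input $/M" figure listed above
OUTPUT_USD_PER_M = 15.0  # "Output $/M" figure listed above


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Hypothetical request: 20,000 prompt tokens and 4,000 tokens of generated code.
cost = estimate_cost_usd(20_000, 4_000)
print(f"${cost:.2f}")  # prints $0.11
```

Output tokens cost six times as much as input tokens at these rates, so for code generation the length of the model's answer usually dominates the bill.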

Real-World Benchmark Tests

We gave every model the same prompts and ran the outputs ourselves, with no cherry-picking and no synthetic scores. The Creative HTML and Landing Page tests show the actual rendered results so you can judge quality directly. More scored tests are coming soon.

What makes it hard

Multi-layer fire physics, rising embers, and ambient glow that pulses with the flame

Merging drop paths, convincing lens refraction, and bokeh cityscape behind the glass

Wax drip physics, dynamic shadow casting, and a flame that reacts to simulated air

Organic sine-wave light curtains that shift color, plus a lake reflection that mirrors the sky

Particle granule physics, realistic pile accumulation, and a smooth flip-and-restart cycle

Non-repeating flicker timing with per-letter failure states, wet ground reflection, and moth orbital paths

Bioluminescent glow that lights the water, pulsing bell contraction, and flowing tentacle physics

Collision detection across ramps and pegs, momentum transfer, and glassy specular highlights

SVG deformation that looks natural, a working clock showing real time, and a full surrealist scene

Thousands of stars with differential angular velocity, depth-of-field blur, and nebula color clouds

You be the judge

No scores for this category. Compare the outputs and decide which model wins.