Benchmark
Not all AI models are equal when it comes to writing code. We track official benchmark scores from Anthropic, OpenAI, and Google DeepMind alongside our own hands-on vibe coding tests so you can pick the right model for your next project.
Last updated March 18, 2026 · 7 models
| Benchmark | Setting | GPT-5.4 (OpenAI) | Gemini 3.1 Pro (Google) | Claude Opus 4.6 (Anthropic) | Claude Sonnet 4.6 (Anthropic) |
|---|---|---|---|---|---|
| SWE-bench Verified (agentic coding) | Single attempt | — | 80.6% | 80.8% | 79.6% |
| SWE-Bench Pro (diverse agentic coding tasks) | Single attempt | 57.7% | 54.2% | — | — |
| Terminal-Bench 2.0 (agentic terminal coding) | Terminus-2 harness | 75.1% | 68.5% | 65.4% | 59.1% |
| LiveCodeBench Pro (competitive coding problems from Codeforces, ICPC, and IOI) | Elo | — | 2,887 | — | — |
| SciCode (scientific research coding) | | — | 59.0% | 52.0% | 47.0% |
| OSWorld-Verified (desktop computer use tasks) | | 75.0% | — | 72.7% | 72.5% |
| MCP Atlas (multi-step workflows using MCP) | | 67.2% | 69.2% | 59.5% | 61.3% |
| τ2-bench (agentic and tool use) | Retail | — | 90.8% | 91.9% | 91.7% |
| τ2-bench (agentic and tool use) | Telecom | 98.9% | 99.3% | 99.3% | 97.9% |
| APEX-Agents (long-horizon professional tasks) | | — | 33.5% | 29.8% | — |
| MRCR v2 (long-context performance) | 128K (average) | — | 84.9% | 84.0% | 84.9% |
| Speed (characters per second) | | — | 99 c/s | 126 c/s | 160 c/s |
| GPQA Diamond (scientific knowledge) | No tools | 92.8% | 94.3% | 91.3% | 89.9% |
| Humanity's Last Exam (academic reasoning, full set, text + MM) | No tools | 39.8% | 44.4% | 40.0% | 33.2% |
| Humanity's Last Exam (academic reasoning, full set, text + MM) | Search (blocklist) + Code | 52.1% | 51.4% | 53.1% | 49.0% |
| ARC-AGI v2 (abstract reasoning puzzles) | ARC Prize Verified | 73.3% | 77.1% | 68.8% | 58.3% |
| GDPval-AA (expert tasks) | Elo | — | 1,317 | 1,606 | 1,633 |
| BrowseComp (agentic search) | Search + Python + Browse | 82.7% | 85.9% | 84.0% | 74.7% |
| MMMU-Pro (multimodal understanding and reasoning) | No tools | 81.2% | 80.5% | 73.9% | 74.5% |
| MMMLU (multilingual Q&A) | | — | 92.6% | 91.1% | 89.3% |
| Context window | | 1.0M | 1.0M | 200K | 200K |
| Input price ($/M tokens) | | $2.50 | $2.50 | $5 | $3 |
| Output price ($/M tokens) | | $15 | $15 | $25 | $15 |
| Released | | Mar. 2026 | Feb. 2026 | Feb. 2026 | Feb. 2026 |
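To make the per-token prices concrete, here is a minimal cost sketch. The model names and prices come from the table above; the token counts are hypothetical, chosen only for illustration.

```typescript
// Estimate the cost of one request from per-million-token prices.
// Prices are taken from the table above; token counts below are made up.
interface Pricing {
  inputPerMTok: number;  // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

const models: Record<string, Pricing> = {
  "GPT-5.4":           { inputPerMTok: 2.5, outputPerMTok: 15 },
  "Gemini 3.1 Pro":    { inputPerMTok: 2.5, outputPerMTok: 15 },
  "Claude Opus 4.6":   { inputPerMTok: 5,   outputPerMTok: 25 },
  "Claude Sonnet 4.6": { inputPerMTok: 3,   outputPerMTok: 15 },
};

function requestCostUSD(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * p.inputPerMTok
       + (outputTokens / 1_000_000) * p.outputPerMTok;
}

// Hypothetical workload: a 20K-token prompt that produces a 4K-token completion.
for (const [name, p] of Object.entries(models)) {
  console.log(`${name}: $${requestCostUSD(p, 20_000, 4_000).toFixed(2)}`);
}
```

At those assumed token counts, a Claude Opus 4.6 request works out to roughly $0.20, versus about $0.11 for GPT-5.4 or Gemini 3.1 Pro and $0.12 for Claude Sonnet 4.6.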
Real-World Benchmark Tests
We gave every model the same prompts and ran the outputs ourselves — no cherry-picking, no synthetic scores. Creative HTML and Landing Page tests show the actual rendered results so you can judge quality directly. More scored tests coming soon.
Visual Tests
What makes each test hard:

- Multi-layer fire physics, rising embers, and ambient glow that pulses with the flame
- Merging drop paths, convincing lens refraction, and bokeh cityscape behind the glass
- Wax drip physics, dynamic shadow casting, and a flame that reacts to simulated air
- Organic sine-wave light curtains that shift color, plus a lake reflection that mirrors the sky
- Particle granule physics, realistic pile accumulation, and a smooth flip-and-restart cycle
- Non-repeating flicker timing with per-letter failure states, wet ground reflection, and moth orbital paths
- Bioluminescent glow that lights the water, pulsing bell contraction, and flowing tentacle physics
- Collision detection across ramps and pegs, momentum transfer, and glassy specular highlights
- SVG deformation that looks natural, a working clock showing real time, and a full surrealist scene
- Thousands of stars with differential angular velocity, depth-of-field blur, and nebula color clouds (a sketch of the rotation technique follows below)
No scores for this category. Compare the outputs and decide which model wins.
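For readers curious what these prompts actually exercise, here is a minimal sketch of one technique named above: the differential angular velocity from the starfield test. It is not any model's output, it omits the depth-of-field blur and nebula clouds, and the canvas element id is an assumption.

```typescript
// Stars orbit a common centre, with angular velocity falling off with
// radius, so inner stars "lap" outer ones over time. Canvas id "galaxy"
// is a placeholder for whatever element the page provides.
type Star = { radius: number; angle: number; speed: number; size: number };

const canvas = document.getElementById("galaxy") as HTMLCanvasElement;
const ctx = canvas.getContext("2d")!;
const cx = canvas.width / 2;
const cy = canvas.height / 2;

const stars: Star[] = Array.from({ length: 2000 }, () => {
  const radius = 20 + Math.random() * (Math.min(cx, cy) - 20);
  return {
    radius,
    angle: Math.random() * Math.PI * 2,
    speed: 0.4 / Math.sqrt(radius), // radians per second, slower further out
    size: 0.5 + Math.random() * 1.5,
  };
});

let last = performance.now();
function frame(now: number) {
  const dt = (now - last) / 1000;
  last = now;
  ctx.fillStyle = "rgba(0, 0, 10, 0.3)"; // fade old frames into faint trails
  ctx.fillRect(0, 0, canvas.width, canvas.height);
  ctx.fillStyle = "white";
  for (const s of stars) {
    s.angle += s.speed * dt;
    const x = cx + Math.cos(s.angle) * s.radius;
    const y = cy + Math.sin(s.angle) * s.radius * 0.6; // squash into a tilted disc
    ctx.fillRect(x, y, s.size, s.size);
  }
  requestAnimationFrame(frame);
}
requestAnimationFrame(frame);
```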

