AI Builder Eval

Game Bench

Can an AI build a playable browser game in one pass? Each model gets the same frozen prompt, a clean workspace, and no human feedback.

Runs scored: 18

Complete browser games built from the same brief.

Highest score: 94.2

Claude Fable 5.0 at $6.73.

Best under $1: 69.7

MiniMax M3 at $0.59.

Human feedback: 0

No corrections or steering during a run.

Efficiency

Game-Bench Index vs Cost

Compare benchmark score against estimated run cost in USD.

Higher is better on the Y-axis. The visible score scale starts at 30 to make model differences easier to read; cost uses a log scale.

Leaderboard

Scores are out of 100; the final score is a weighted blend of the four pillar scores.

Model	Effort	Date	Duration	Tokens	Cost	Gameplay	Adherence	Engineering	Art	Final
Claude Fable 5.0	`high`	09/Jun/26	17m 32s	1.96M	$6.73	91.5	98.0	95.2	96.8	94.2
Claude Opus 4.8	`xhigh`	02/Jun/26	24m 0s	6.3M	$7.03	78.6	94.0	90.4	88.7	85.4
GLM 5.2	`high`	22/Jun/26	40m 29s	507K	$1.58	80.7	96.0	72.9	77.6	80.6
GLM 5.2	`high`	22/Jun/26	40m 29s	507K	$1.58	80.0	96.0	68.8	85.0	80.3
GPT-5 Codex	`xhigh`	02/Jun/26	42m 0s	3.1M	$4.03	73.3	90.0	76.8	74.2	76.8
GLM 5.1	`high`	22/Jun/26	28m 30s	578K	$3.35	72.9	92.0	68.4	85.0	76.5
GLM 5.1	`high`	22/Jun/26	28m 30s	578K	$3.35	74.8	92.0	67.6	79.9	76.4
MiniMax M3	`high`	01/Jun/26	34m 38s	387K	$0.59	65.8	80.0	66.6	76.1	69.7
Kimi K2.6	`high`	02/Jun/26	26m 0s	317K	$0.34	66.1	74.0	61.3	75.3	67.5
Qwen 3.7 Plus	`high`	14/Jun/26	9m 22s	271K	$0.08	51.5	80.0	68.0	45.0	58.9
Qwen 3.7 Plus	`high`	14/Jun/26	9m 22s	271K	$0.08	50.9	80.0	69.9	37.5	58.0
MiMo V2.5 Pro	`max`	01/Jun/26	6m 54s	754K	$0.05	50.0	72.0	45.3	57.3	53.2
MiMo V2.5 Pro	`high`	29/May/26	13m 27s	1.93M	$0.11	43.9	74.0	49.6	65.8	53.1
MiMo V2.5 Pro	`medium`	29/May/26	12m 0s	1.69M	$0.09	48.0	74.0	50.3	40.7	51.4
Kimi K2.7 Code	`high`	12/Jun/26	21m 32s	3.86M	$0.95	42.5	68.0	58.4	45.0	50.7
Gemini 3.5 Flash	`high`	02/Jun/26	17m 0s	2.0M	$1.47	36.5	44.0	47.1	80.1	46.8
DeepSeek V4 Pro	`high`	22/Jun/26	23m 6s	570K	$0.10	36.9	48.0	30.4	55.0	39.7
DeepSeek V4 Pro	`high`	22/Jun/26	23m 6s	570K	$0.10	37.1	44.0	28.3	30.5	34.9
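
The headline figures near the top of the page can be cross-checked against the leaderboard rows. The sketch below is only an illustration: it copies the model name, estimated cost, and final score from each row, and the names `RUNS` and `best_run` are hypothetical helpers, not part of the benchmark's tooling.

```python
# (model, estimated cost in USD, final score) for each of the 18 leaderboard rows.
RUNS = [
    ("Claude Fable 5.0", 6.73, 94.2),
    ("Claude Opus 4.8", 7.03, 85.4),
    ("GLM 5.2", 1.58, 80.6),
    ("GLM 5.2", 1.58, 80.3),
    ("GPT-5 Codex", 4.03, 76.8),
    ("GLM 5.1", 3.35, 76.5),
    ("GLM 5.1", 3.35, 76.4),
    ("MiniMax M3", 0.59, 69.7),
    ("Kimi K2.6", 0.34, 67.5),
    ("Qwen 3.7 Plus", 0.08, 58.9),
    ("Qwen 3.7 Plus", 0.08, 58.0),
    ("MiMo V2.5 Pro", 0.05, 53.2),
    ("MiMo V2.5 Pro", 0.11, 53.1),
    ("MiMo V2.5 Pro", 0.09, 51.4),
    ("Kimi K2.7 Code", 0.95, 50.7),
    ("Gemini 3.5 Flash", 1.47, 46.8),
    ("DeepSeek V4 Pro", 0.10, 39.7),
    ("DeepSeek V4 Pro", 0.10, 34.9),
]


def best_run(runs, max_cost=None):
    """Return the highest-scoring run, optionally limited to runs cheaper than max_cost."""
    eligible = [r for r in runs if max_cost is None or r[1] < max_cost]
    return max(eligible, key=lambda r: r[2]) if eligible else None


print(len(RUNS))                      # number of scored runs
print(best_run(RUNS))                 # best run overall
print(best_run(RUNS, max_cost=1.00))  # best run costing less than $1
```

With these rows, the best run overall is Claude Fable 5.0 (94.2 at $6.73) and the best run under $1 is MiniMax M3 (69.7 at $0.59), which agrees with the summary figures.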

Protocol

How the eval works

Frozen prompt

Every model receives the same game brief, same constraints, and same anti-drift guardrails.

Clean workspace

No memory, previous builds, project context, or hidden examples. The run starts from zero.

One shot

The model plans, builds, tests, and delivers without iterative human steering.

Playable result

The output is judged on whether it loads, plays, communicates its rules, and feels worth trying.

Score method

Public scores expose four use-case pillars. Gameplay blends design, fun, and onboarding. Adherence is prompt-following fidelity. Engineering combines cross-platform support, implementation quality, and QA. Art combines visual art, UI/UX, and audio. The final score is a weighted average of the four pillars: Gameplay 45%, Adherence 15%, Engineering 25%, Art 15%.
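
As a rough illustration of the weighting, the sketch below recomputes a final score from the four pillar scores. It assumes the weights are percentages combined as a plain weighted average; the names `WEIGHTS` and `final_score` are hypothetical, and the benchmark's actual scoring code is not shown on this page.

```python
# Pillar weights in percent, taken from the score method text.
# Assumption: they combine as a simple weighted average.
WEIGHTS = {"gameplay": 45, "adherence": 15, "engineering": 25, "art": 15}


def final_score(pillars: dict[str, float]) -> float:
    """Return the weighted average of the four pillar scores (0-100 scale)."""
    missing = set(WEIGHTS) - set(pillars)
    if missing:
        raise ValueError(f"missing pillar scores: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())  # 100
    return sum(WEIGHTS[name] * pillars[name] for name in WEIGHTS) / total_weight


# Pillar scores from the Claude Fable 5.0 row of the leaderboard.
fable = {"gameplay": 91.5, "adherence": 98.0, "engineering": 95.2, "art": 96.8}
print(f"{final_score(fable):.3f}")
```

For that row the result is about 94.195, in line with the published final score of 94.2. The other leaderboard rows also agree with this weighting to within rounding.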

What this measures

Builder capability under a real artifact constraint

Game Bench asks models for a complete game: input handling, state, rendering, tutorial clarity, play feel, failure states, and enough polish that a human can actually evaluate it. The idea is to test how models perform on tasks closer to real-life usage, not just coding exercises.

Reproducibility

Full frozen prompt

Read the exact prompt every model received: same game brief, same constraints, same one-shot mandate, same mobile and playtest requirements.

Next runs

Let me know which other models or harnesses you think should be added.

Send model names, coding agents, IDE harnesses, or run conditions that would make the benchmark more useful to compare.