๐ŸŽฎ Game Bench

Can an AI build a fun game in one pass?

Each run gives a frontier model the same frozen prompt โ€” build a playable browser game. No iteration, no human feedback. One shot.
We measure wall-clock time, token cost, and whether the result is actually fun to play.

Leaderboard

Run Model Thinking Date Time Cost Verdict Play
high MiMo V2.5 Pro high 2026-05-29 13m 27s $0.11 Pass Play โ†’
medium MiMo V2.5 Pro medium 2026-05-29 ~12 min $0.09 Pass Play โ†’

Run Details

high 2026-05-29 ยท MiMo V2.5 Pro ยท thinking=high
Play Game โ†’
Time13m 27s
Cost$0.11
Tokens1.93M
Lines771
Files1

Build Summary

Canvas-based endless wall-crawling game. Procedural wall generation, particle effects, screen shake. Web Audio API SFX (crawling, fly collection, rain warning, death, milestone chimes, shelter entry). Virtual joystick for mobile. Persistent high score via localStorage. Mute toggle.

Agent Playtest (headless Chromium)

  • Start screen loads cleanly โ€” title, play button, controls hint
  • Movement works โ€” spider climbs, score increases, stamina drains
  • Fly collection works โ€” stamina recovered
  • Rain system verified โ€” warning โ†’ kill if unsheltered
  • Game over shows correct score + best tracking, restart works
  • Zero JS errors across all test runs

Notes

Agent wrote deliverable to workspace root instead of deliverable/ โ€” file placement gap worth tracking. No background music (SFX only). Wall variety is visual-only. Difficulty ramp is time-based.

medium 2026-05-29 ยท MiMo V2.5 Pro ยท thinking=medium
Play Game โ†’
Time~12 min
Cost$0.09
Tokens1.69M

Notes

V3 prompt with sound/SFX requirement, mandated playtest, and explicit "make it fun" objective. No run notes or checklist were filled โ€” agent session did not log detailed metrics. Build is a single self-contained HTML file.

How It Works

๐Ÿ“

Frozen Prompt

Every run gets the identical brief. Same game, same constraints, same anti-drift guardrails. No cherry-picking.

๐Ÿ”’

Sandboxed Agent

Context-free agent โ€” no memory, no skills, no conversation history. Just the prompt and a clean workspace.

โฑ๏ธ

One Shot

No human feedback loops. The model plans, builds, tests, and delivers in a single pass.

๐ŸŽฏ

Scored

Objective gates (does it load, does it play) + subjective quality review (design, UX, polish, fun).

Prompt Version

Version Changes
V3 Wall-crawling spider game. Sound/SFX required, mandated playtest, explicit "make it fun" objective. No web mechanic โ€” pure wall traversal, route planning, stamina, flies, rain, and shelter.

Thinking Levels (MiMo)

Level Reasoning Tokens Cost Impact
off0baseline
minimal~400negligible
low~2,000negligible
medium~4,000+1%
high (default)~10,000+3%
xhigh~20,000+6%
max~40,000+12%