Each run gives a frontier model the same frozen prompt โ build a playable browser game.
No iteration, no human feedback. One shot.
We measure wall-clock time, token cost, and whether the result is actually fun to play.
Leaderboard
Run Details
Build Summary
Canvas-based endless wall-crawling game. Procedural wall generation, particle effects, screen shake. Web Audio API SFX (crawling, fly collection, rain warning, death, milestone chimes, shelter entry). Virtual joystick for mobile. Persistent high score via localStorage. Mute toggle.
Agent Playtest (headless Chromium)
- Start screen loads cleanly โ title, play button, controls hint
- Movement works โ spider climbs, score increases, stamina drains
- Fly collection works โ stamina recovered
- Rain system verified โ warning โ kill if unsheltered
- Game over shows correct score + best tracking, restart works
- Zero JS errors across all test runs
Notes
Agent wrote deliverable to workspace root instead of deliverable/ โ file placement gap worth tracking. No background music (SFX only). Wall variety is visual-only. Difficulty ramp is time-based.
Notes
V3 prompt with sound/SFX requirement, mandated playtest, and explicit "make it fun" objective. No run notes or checklist were filled โ agent session did not log detailed metrics. Build is a single self-contained HTML file.
How It Works
Frozen Prompt
Every run gets the identical brief. Same game, same constraints, same anti-drift guardrails. No cherry-picking.
Sandboxed Agent
Context-free agent โ no memory, no skills, no conversation history. Just the prompt and a clean workspace.
One Shot
No human feedback loops. The model plans, builds, tests, and delivers in a single pass.
Scored
Objective gates (does it load, does it play) + subjective quality review (design, UX, polish, fun).
Prompt Version
| Version | Changes |
|---|---|
V3 |
Wall-crawling spider game. Sound/SFX required, mandated playtest, explicit "make it fun" objective. No web mechanic โ pure wall traversal, route planning, stamina, flies, rain, and shelter. |
Thinking Levels (MiMo)
| Level | Reasoning Tokens | Cost Impact |
|---|---|---|
off | 0 | baseline |
minimal | ~400 | negligible |
low | ~2,000 | negligible |
medium | ~4,000 | +1% |
high (default) | ~10,000 | +3% |
xhigh | ~20,000 | +6% |
max | ~40,000 | +12% |