A contamination-resistant benchmark for evaluating LLM strategic reasoning through creature combat simulation.
The same LLMs, the same simulator, the same opponents — but a redesigned prompt flips the leaderboard.
Novel game mechanics that don't exist in any training corpus. LLMs can't memorize solutions — they must reason from the rules provided.
Rock-paper-scissors dominance structures emerge naturally. Bradley-Terry scoring handles intransitivity that Elo cannot.
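To make the Bradley-Terry idea concrete, here is a minimal sketch of fitting strengths from a win matrix with the standard iterative (minorization-maximization) update. It is an illustration, not the benchmark's scoring code: the function name `fit_bradley_terry`, the `wins` matrix layout, and the iteration count are assumptions, and it assumes every agent has at least one win and one loss.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths.

    wins[i][j] is the number of times agent i beat agent j (hypothetical layout).
    Returns strengths normalized so that they average 1.0.
    """
    n = len(wins)
    p = [1.0] * n  # start from equal strengths
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each opponent contributes games_played / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = n / sum(new_p)  # keep the mean strength at 1.0
        p = [x * scale for x in new_p]
    return p


# A symmetric rock-paper-scissors cycle: each agent beats one rival 7-3
# and loses to the other 3-7.
cycle = [
    [0, 3, 7],
    [7, 0, 3],
    [3, 7, 0],
]
strengths = fit_bradley_terry(cycle)
```

On a perfectly symmetric cycle like this one, the fit assigns all three agents the same strength, because the estimate depends only on the full set of pairwise results and not on the order in which matches were played.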
Given the same seed, the same combat plays out identically. RNG contributes <25% of outcome variance — strategy dominates.
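Seeded determinism can be sketched as follows. This is a toy stand-in, not the actual simulator: `simulate_combat` and its damage rolls are hypothetical, and the only point is that a generator local to each combat, built from the seed, makes replays identical.

```python
import random


def simulate_combat(seed: int, rounds: int = 5) -> list[int]:
    """Toy combat: return one damage roll per round (hypothetical mechanics)."""
    # A local generator keeps this combat isolated from global random state,
    # so the same seed always produces the same sequence of rolls.
    rng = random.Random(seed)
    return [rng.randint(1, 6) for _ in range(rounds)]


first = simulate_combat(seed=42)
replay = simulate_combat(seed=42)
assert first == replay  # same seed, same combat
```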
Track A tests fixed strategy. Track B tests adaptation. Track C tests meta-conditioning. Track D tests tool use. Each isolates a different capability.
Format: animal hp atk spd wil (the four stats must sum to 20, each ≥ 1)
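A build in this format could be checked with a small parser like the one below. It is a sketch under stated assumptions: the fields are taken to be whitespace-separated, the animal is a single token, and the stats are integers. The names `Build` and `parse_build` are hypothetical and not part of the benchmark.

```python
from typing import NamedTuple

STAT_TOTAL = 20  # from the format line: stats sum to 20
STAT_MIN = 1     # from the format line: each stat is at least 1


class Build(NamedTuple):
    animal: str
    hp: int
    atk: int
    spd: int
    wil: int


def parse_build(line: str) -> Build:
    """Parse 'animal hp atk spd wil' and enforce the stat constraints."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    animal, *raw = parts
    try:
        hp, atk, spd, wil = (int(x) for x in raw)
    except ValueError:
        # Assumption: stats are whole numbers.
        raise ValueError("stats must be integers") from None
    stats = (hp, atk, spd, wil)
    if min(stats) < STAT_MIN:
        raise ValueError(f"each stat must be >= {STAT_MIN}")
    if sum(stats) != STAT_TOTAL:
        raise ValueError(f"stats must sum to {STAT_TOTAL}, got {sum(stats)}")
    return Build(animal, hp, atk, spd, wil)


build = parse_build("bear 8 6 3 3")  # valid: 8 + 6 + 3 + 3 == 20
```

A line such as "bear 10 6 3 3" would be rejected because its stats sum to 22, and "bear 17 1 2 0" would be rejected because one stat falls below 1.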
Think you can build a better fighter? Describe your strategy and see where you'd rank against all 13 tournament agents.
Challenge the Leaderboard