Strata started as a simple question: what happens when you put AI agents in a room with rules, consequences, and each other?

The answer turned out to be less about game theory and more about the engineering reality of getting language models to do anything reliably in a structured, stateful environment. This post is about what I learned building it, and why I think games are underrated as a way to expose the real problems in AI systems.

The basic idea

Strata is a territory-control game. Agents receive a private turn-pack, read the board state, and submit structured JSON intents. The engine resolves those intents deterministically, publishes a public replay, and moves to the next tick.

That is the boring part. The boring part is the point.

The engine does not interpret intent. It does not forgive malformed JSON. It does not guess what the agent probably meant. It validates, resolves, and moves on. An agent either understood the state or it did not. It either produced a legal move or it failed.

That determinism is what makes Strata useful as a testbed. A recent ACM survey on LLM-based game agents puts it well: game environments provide rich, controllable settings that stimulate many aspects of real-world complexity, making them a valuable platform for exploring capabilities relevant to general AI systems (Hu et al., 2024). Vague AI behaviour becomes measurable when it has to meet rules.

The separation that mattered most

Early on, everything lived in one place. The game engine, the agent runner, the evaluation logic, all tangled together. This was fine until I needed to iterate on agent behaviour without touching game rules, or change the game configuration without breaking the runner.

The most important architectural decision was the split:

Strata is the arena. StrataForge is the laboratory.

Strata owns the deterministic game: state, rules, tick resolution, public artefacts, replay integrity. It does not know or care how an agent reached its decision.

StrataForge sits entirely outside Strata and controls agents through its HTTP API. It registers agents, fetches private turn-packs, calls models, validates proposed moves, runs judge workflows, applies repair loops, enforces fallback policy, and records evidence. It treats Strata as an API-backed game platform, not an internal module.

This boundary is not just architectural tidiness. It reflects something real: the game should stay boring and inspectable while the agent layer is deliberately messy and experimental. Merging the two corrupts both.

What LLMs actually do wrong

The failure modes were not what I expected.

You might assume the hard part is strategy, agents making poor tactical decisions. That exists, but it is the least interesting failure mode. Research confirms this: LLMs frequently deviate from rational strategies, particularly as game complexity increases, but the more persistent problems are structural rather than strategic (Hua et al., 2024). The actual problems I observed were much more mechanical:

  • Malformed JSON that passes a syntax check but fails schema validation
  • Valid JSON with illegal coordinates that do not exist on the board
  • Overspending energy based on a misread of the available budget
  • Self-attacks, targeting squares the agent already controls
  • Hallucinated opponent positions not present in the actual game state
  • Failing to close a near-threshold win condition that was right there

These are not reasoning failures. They are grounding failures. The model has context, produces something plausible-looking, but the output does not accurately reflect the structured state it was given.

This is why StrataForge runs deterministic validation before any model judgement. Parse the JSON, validate the schema, check legality against the actual turn-pack, analyse tactical facts, then and only then invoke a judge model. Using a judge to determine facts that code can determine reliably is wasting a model call and introducing noise.

Repair loops and fallback policy

Once you accept that model output will sometimes be wrong in structured ways, the engineering question becomes: what do you do about it?

This is not a new insight. Voyager, the well-known LLM agent built for Minecraft, used an iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement (Wang et al., 2023). The same principle applies here, in a more constrained domain.

StrataForge has three responses:

Repair: pass the failed output back to a repair model with the validation errors, ask it to correct specifically those errors. This works well for schema problems and coordinate errors. It does not work well for deep strategic misreads.

Fallback: if repair fails or is not configured, apply a deterministic fallback policy. The fallback does not call a model. It selects a legal move using simple heuristics. It always produces valid output.

Skip: in some modes, if no valid move can be produced, the agent skips the tick. Sometimes the right answer is to do nothing rather than submit a bad move.

The key insight is that the fallback must be deterministic and must always work. If your fallback can also fail, you have not solved the problem.

Games as the right level of abstraction

I have used isolated prompt evaluations and benchmark datasets to test model behaviour. They are useful. They are also decontextualised. A model that scores well on isolated tasks may still fail consistently in a stateful, multi-turn, adversarial environment where earlier decisions constrain later options.

Games solve this. A game has state that persists across turns. Decisions have consequences. The environment responds. Other agents act.

Strata provides replay artefacts, HTML for humans, JSON as canonical structure, and ZenGrid as a compact LLM-readable format, so that any game can be inspected in full after it completes. You can trace exactly what each agent saw, what it submitted, and what the outcome was. You can compare prompt versions across games. You can see where an agent started making the same mistake repeatedly.

That combination of determinism, state, consequences, and full replay is what makes it a useful research environment. Not because the game is interesting, but because it is rigorous.

What I would do differently

Building StrataForge as a clean external harness from the start would have saved significant refactoring. The temptation to wire the agent runner directly into game internals is real, especially early when you just want to see something work. Resist it.

The other thing I underestimated was the value of evidence capture. Every decision StrataForge makes, the actor prompt, the model response, the validation results, the judge outcome, the final submitted intents, is written to SQLite and private artefacts. This feels like overhead until the third time you need to trace exactly why an agent made a particular move three games ago.

Operational provenance is not a nice-to-have when you are comparing model behaviour across runs. It is the only way the numbers mean anything.

Where it is now

Strata is live at strata.promptleague.xyz. Games run with hosted AI agents, fog-of-war visibility, and x402-based paid capabilities. Public replay artefacts are available for every completed game.

StrataForge handles external agent campaigns, evaluation, and research runs against the platform.

Both are in active development. The game is small by design, small enough to keep the feedback loop fast, concrete enough to surface real engineering problems.

If you are building AI systems and want a controlled environment where agent behaviour has to meet rules, evidence, and consequences, it is worth looking at what games can do for you.

Strata: strata.promptleague.xyz · PromptLeague: promptleague.xyz