We present LLM-Quest, a benchmark for evaluating how agent architecture affects LLM decision-making in text-based interactive environments. Unlike most LLM benchmarks, which vary tasks and models while treating the agent as a black box, we introduce agent architecture as a first-class experimental dimension. We evaluate six mid-tier production language models across five agent modes -- baseline prompting, structured reasoning, domain knowledge injection, multi-step planning, and tool-augmented agents -- on a corpus of branching-narrative quests that require sequential decision-making over 10-50+ turns. Our three-dimensional evaluation (models × quests × agent modes) reveals [preliminary: the relationship between cognitive scaffolding and task performance, including which architectural interventions help which categories of tasks]. We release the full benchmark framework, evaluation data, and an interactive leaderboard to support reproducible agent architecture research.
The rapid improvement in large language model capabilities has produced a parallel expansion in agent frameworks: chain-of-thought prompting, ReAct-style reasoning, tool use, and multi-step planning are now standard components of production LLM systems. Yet most evaluation benchmarks compare models on a fixed set of tasks using a single agent architecture, collapsing the architecture dimension into the model comparison.
This conflation obscures a practical question that matters to practitioners: given a fixed model and a fixed task, how much does the agent wrapper matter? Is a $0.02/run model with planning competitive against a $0.04/run model without it? Does domain knowledge injection unlock tasks that pure reasoning cannot solve?
We address this gap with LLM-Quest, a benchmark that evaluates LLM agents along three dimensions simultaneously: model (six mid-tier production models from different providers), task (a corpus of branching-narrative quests with diverse difficulty), and agent architecture (five modes ranging from bare prompting to tool-augmented agents).
AgentBench (Liu et al., 2023) evaluates LLM agents across eight environments including web browsing, code execution, and database operations. SWE-bench (Jimenez et al., 2024) focuses on software engineering tasks, measuring an agent's ability to resolve real GitHub issues. AgentQuest (Gioacchini et al., 2024) provides a modular framework for agent evaluation with pluggable environments. These benchmarks compare models on a fixed agent architecture; we vary the architecture as an independent variable.
TextQuests (Bakaeva et al., 2025) uses a similar quest corpus (the Space Rangers .qm format) for LLM evaluation, but focuses on model comparison under a single agent architecture. Our work extends this approach by adding the agent mode dimension and comparing five architectures per model. TextWorld (Côté et al., 2019) provides procedurally generated text games for RL agent training; our quests are human-authored, with a difficulty distribution that emerges from the original game design rather than from procedural generation.
Chain-of-thought prompting (Wei et al., 2022), ReAct (Yao et al., 2023), and tool-augmented generation (Schick et al., 2023) have each been shown to improve LLM performance on specific task types. Our benchmark evaluates these approaches as points on a spectrum, measuring their relative impact on the same tasks with the same models.
We use interactive fiction quests in the .qm format, originally created for the Space Rangers video game series. Each quest is a finite-state machine with text descriptions at each node and numbered action choices as edges. Quests terminate in binary outcomes (success or failure) after 10-50+ decision steps. The quest engine is space-rangers-quest, a TypeScript interpreter for the .qm format; quests can also be played in a browser.
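The sketch below illustrates how such a quest can be represented and stepped by an agent; the type and function names are illustrative and are not the actual space-rangers-quest API.

```typescript
// Illustrative quest representation: nodes carry text, edges are numbered choices.
// Names are hypothetical; the real space-rangers-quest interpreter differs.
interface QuestChoice {
  label: string;    // action text shown to the agent
  targetId: number; // node reached if this choice is taken
}

interface QuestNode {
  id: number;
  text: string;                    // location/event description
  choices: QuestChoice[];          // numbered actions (edges of the state machine)
  ending?: "success" | "failure";  // set only on terminal nodes
}

// One evaluation episode: the agent picks a numbered choice until a terminal node.
async function runEpisode(
  start: QuestNode,
  lookup: (id: number) => QuestNode,
  agent: (node: QuestNode) => Promise<number>, // returns a 0-based choice index
  maxSteps = 100,
): Promise<"success" | "failure" | "timeout"> {
  let node = start;
  for (let step = 0; step < maxSteps; step++) {
    if (node.ending) return node.ending;
    const idx = await agent(node);
    node = lookup(node.choices[idx].targetId);
  }
  return "timeout";
}
```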
The quest corpus contains approximately 150 scenarios. We select a subset of 35 English-language quests spanning the four difficulty categories identified through error analysis (spatial/grid, combinatorial, long-horizon, and domain scoring; see the failure analysis below).
| Mode | Architecture | What changes |
|---|---|---|
| A | Baseline | Minimal prompt, action number output only |
| B | Prompted reasoning | Structured analysis/reasoning/choice JSON output |
| C | Knowledge-augmented | Mode B + domain knowledge injected into system prompt |
| D | Planner | Plan-maintain-act loop with periodic replanning |
| E | Tool-augmented | Mode B + state tracker, calculator, quest history tools |
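As a concrete example, the structured output introduced in Mode B (and reused by Modes C-E) can be expressed as a small schema; the field names below mirror the analysis/reasoning/choice triple in the table, but the exact prompt wording and validation rules are assumptions.

```typescript
// Assumed shape of a Mode B response; Mode C prepends domain notes to the system
// prompt, Mode D wraps this in a plan/replan loop, Mode E adds tool-call turns.
interface ReasonedChoice {
  analysis: string;  // what the current situation and constraints look like
  reasoning: string; // why the selected action beats the alternatives
  choice: number;    // 1-based index of the chosen action
}

// Parse and validate the model output before stepping the quest engine.
function parseReasonedChoice(raw: string, numChoices: number): number {
  const parsed = JSON.parse(raw) as ReasonedChoice;
  if (!Number.isInteger(parsed.choice) || parsed.choice < 1 || parsed.choice > numChoices) {
    throw new Error(`choice out of range: ${parsed.choice}`);
  }
  return parsed.choice;
}
```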
We evaluate six mid-tier production models from different providers, selected to represent the capability tier that developers typically deploy in production (as opposed to frontier flagships). All models are accessed through OpenRouter for a consistent API interface and unified billing; a request sketch follows the table.
| Provider | Model | OpenRouter ID | ELO | Input $/M tok | Output $/M tok |
|---|---|---|---|---|---|
| Google | Gemini 3 Flash | google/gemini-3-flash-preview | 1474 | $0.50 | $3.00 |
| OpenAI | GPT-5.4 Mini | openai/gpt-5.4-mini | 1458 | $0.75 | $4.50 |
| DeepSeek | V3.2 | deepseek/deepseek-v3.2 | 1424 | $0.26 | $0.42 |
| Mistral | Medium 3.1 | mistralai/mistral-medium-3.1 | 1410 | $0.40 | $2.00 |
| Anthropic | Claude Haiku 4.5 | anthropic/claude-haiku-4.5 | 1408 | $1.00 | $5.00 |
| Minimax | M2.5 | minimax/minimax-m2.5 | 1403 | $0.12 | $0.99 |
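The sketch below shows how a single turn can be sent through OpenRouter's OpenAI-compatible chat endpoint using the model IDs from the table; retries, error handling, and mode-specific prompting are omitted.

```typescript
// Minimal single-turn request through OpenRouter; OPENROUTER_API_KEY is assumed
// to be set in the environment.
async function queryModel(modelId: string, system: string, user: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: modelId, // e.g. "deepseek/deepseek-v3.2" from the table above
      messages: [
        { role: "system", content: system },
        { role: "user", content: user },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```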
We run 3 attempts per model-mode-quest cell (matching TextQuests and AgentQuest protocols). For binary success rates at n=3, we report Wilson score confidence intervals.
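The Wilson score interval is standard; for reference, a small helper that computes it (at n = 3 the bounds stay inside [0, 1], unlike the normal approximation):

```typescript
// 95% Wilson score interval for a binomial proportion (z = 1.96).
function wilsonInterval(successes: number, n: number, z = 1.96): [number, number] {
  const p = successes / n;
  const denom = 1 + (z * z) / n;
  const center = (p + (z * z) / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n))) / denom;
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Example: 1 success in 3 attempts gives roughly [0.06, 0.79].
```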
Preliminary analysis of 261 runs across 10 zero-success quests reveals four failure categories:
| Category | Example quests | Failure mode | Proposed intervention |
|---|---|---|---|
| Spatial/grid | Codebox, Depth | Cannot track spatial state mentally | State tracker tool (Mode E) |
| Combinatorial | Banket, Election | Cannot satisfy multiple constraints | Calculator + constraint tracker (E) |
| Long-horizon | Driver, Prison | Repetition loops after 15-20 steps | Planner with visited-state log (D) |
| Domain scoring | Leonardo, Foncers | Missing knowledge of game mechanics | Knowledge hints (C) |
A repetition rate above 0.6 reliably predicts a 0% success rate across all models, suggesting that loop detection could serve as an early-stopping criterion.
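One plausible operationalization of the repetition rate and the early-stop rule is sketched below; the exact definition used in the benchmark is not specified here, so the revisit-based measure and the minimum-step window are assumptions.

```typescript
// Assumed definition: fraction of steps that land on an already-visited quest node.
function repetitionRate(visitedNodeIds: number[]): number {
  const seen = new Set<number>();
  let repeats = 0;
  for (const id of visitedNodeIds) {
    if (seen.has(id)) repeats++;
    else seen.add(id);
  }
  return visitedNodeIds.length > 0 ? repeats / visitedNodeIds.length : 0;
}

// Early stop once looping dominates the trajectory (0.6 threshold from the text).
function shouldStopEarly(visitedNodeIds: number[], minSteps = 10, threshold = 0.6): boolean {
  return visitedNodeIds.length >= minSteps && repetitionRate(visitedNodeIds) > threshold;
}
```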