We present LLM-Quest, a benchmark for evaluating how agent architecture affects LLM decision-making in text-based interactive environments. Unlike most LLM benchmarks, which vary tasks and models while treating the agent as a black box, we introduce agent architecture as a first-class experimental dimension. We evaluate six mid-tier production language models across five agent modes -- baseline prompting, structured reasoning, domain knowledge injection, multi-step planning, and tool-augmented agents -- on a corpus of branching-narrative quests that require sequential decision-making over 10-50+ turns. Our three-dimensional evaluation (models × quests × agent modes) reveals [preliminary: the relationship between cognitive scaffolding and task performance, including which architectural interventions help which categories of tasks]. We release the full benchmark framework, evaluation data, and an interactive leaderboard to support reproducible agent architecture research.
The rapid improvement in large language model capabilities has produced a parallel expansion in agent frameworks: chain-of-thought prompting, ReAct-style reasoning, tool use, and multi-step planning are now standard components of production LLM systems. Yet most evaluation benchmarks compare models on a fixed set of tasks using a single agent architecture, collapsing the architecture dimension into the model comparison.
This conflation obscures a practical question that matters to practitioners: given a fixed model and a fixed task, how much does the agent wrapper matter? Is a $0.02/run model with planning competitive against a $0.04/run model without it? Does domain knowledge injection unlock tasks that pure reasoning cannot solve?
We address this gap with LLM-Quest, a benchmark that evaluates LLM agents along three dimensions simultaneously: model (six mid-tier production models from different providers), task (a corpus of branching-narrative quests with diverse difficulty), and agent architecture (five modes ranging from bare prompting to tool-augmented agents).
AgentBench (Liu et al., 2023) evaluates LLM agents across eight environments including web browsing, code execution, and database operations. SWE-bench (Jimenez et al., 2024) focuses on software engineering tasks, measuring an agent's ability to resolve real GitHub issues. AgentQuest (Gioacchini et al., 2024) provides a modular framework for agent evaluation with pluggable environments. These benchmarks compare models on a fixed agent architecture; we vary the architecture as an independent variable.
TextQuests (Bakaeva et al., 2025) uses a similar quest corpus (the Space Rangers .qm format) for LLM evaluation, but focuses on model comparison under a single agent architecture. Our work extends this approach by adding the agent mode dimension and comparing five architectures per model. TextWorld (Côté et al., 2019) provides procedurally generated text games for RL agent training; our quests are human-authored, with a difficulty distribution that emerges from the original game design rather than from procedural generation.
Chain-of-thought prompting (Wei et al., 2022), ReAct (Yao et al., 2023), and tool-augmented generation (Schick et al., 2023) have each been shown to improve LLM performance on specific task types. Our benchmark evaluates these approaches as points on a spectrum, measuring their relative impact on the same tasks with the same models.
We use interactive fiction quests in the .qm format, originally created for the Space Rangers video game series. Each quest is a finite-state machine with text descriptions at each node and numbered action choices as edges. Quests terminate in binary outcomes (success or failure) after 10-50+ decision steps. The quest engine is space-rangers-quest, a TypeScript interpreter for the .qm format; quests can also be played in a browser.
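The sketch below illustrates how such a quest can be represented and stepped by an agent; the type and function names are illustrative and are not the actual space-rangers-quest API.

```typescript
// Illustrative quest representation: nodes carry text, edges are numbered choices.
// Names are hypothetical; the real space-rangers-quest interpreter differs.
interface QuestChoice {
  label: string;    // action text shown to the agent
  targetId: number; // node reached if this choice is taken
}

interface QuestNode {
  id: number;
  text: string;                    // location/event description
  choices: QuestChoice[];          // numbered actions (edges of the state machine)
  ending?: "success" | "failure";  // set only on terminal nodes
}

// One evaluation episode: the agent picks a numbered choice until a terminal node.
async function runEpisode(
  start: QuestNode,
  lookup: (id: number) => QuestNode,
  agent: (node: QuestNode) => Promise<number>, // returns a 0-based choice index
  maxSteps = 100,
): Promise<"success" | "failure" | "timeout"> {
  let node = start;
  for (let step = 0; step < maxSteps; step++) {
    if (node.ending) return node.ending;
    const idx = await agent(node);
    node = lookup(node.choices[idx].targetId);
  }
  return "timeout";
}
```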
The quest corpus contains approximately 150 scenarios. We select a subset of 35 English-language quests spanning the four difficulty categories identified through error analysis (spatial/grid, combinatorial, long-horizon, and domain scoring; see the failure analysis below).
| Mode | Architecture | What changes |
|---|---|---|
| A | Baseline | Minimal prompt, action number output only |
| B | Prompted reasoning | Structured analysis/reasoning/choice JSON output |
| C | Knowledge-augmented | Mode B + domain knowledge injected into system prompt |
| D | Planner | Plan-maintain-act loop with periodic replanning |
| E | Tool-augmented | Mode B + state tracker, calculator, quest history tools |
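As a concrete example, the structured output introduced in Mode B (and reused by Modes C-E) can be expressed as a small schema; the field names below mirror the analysis/reasoning/choice triple in the table, but the exact prompt wording and validation rules are assumptions.

```typescript
// Assumed shape of a Mode B response; Mode C prepends domain notes to the system
// prompt, Mode D wraps this in a plan/replan loop, Mode E adds tool-call turns.
interface ReasonedChoice {
  analysis: string;  // what the current situation and constraints look like
  reasoning: string; // why the selected action beats the alternatives
  choice: number;    // 1-based index of the chosen action
}

// Parse and validate the model output before stepping the quest engine.
function parseReasonedChoice(raw: string, numChoices: number): number {
  const parsed = JSON.parse(raw) as ReasonedChoice;
  if (!Number.isInteger(parsed.choice) || parsed.choice < 1 || parsed.choice > numChoices) {
    throw new Error(`choice out of range: ${parsed.choice}`);
  }
  return parsed.choice;
}
```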
We evaluate six mid-tier production models from different providers, selected to represent the capability tier that developers typically deploy in production (as opposed to frontier flagships). All models are accessed through OpenRouter for a consistent API interface and unified billing; a request sketch follows the table.
| Provider | Model | OpenRouter ID | ELO | Input $/M tok | Output $/M tok |
|---|---|---|---|---|---|
| Google | Gemini 3 Flash | google/gemini-3-flash-preview | 1474 | $0.50 | $3.00 |
| OpenAI | GPT-5.4 Mini | openai/gpt-5.4-mini | 1458 | $0.75 | $4.50 |
| DeepSeek | V3.2 | deepseek/deepseek-v3.2 | 1424 | $0.26 | $0.42 |
| Mistral | Medium 3.1 | mistralai/mistral-medium-3.1 | 1410 | $0.40 | $2.00 |
| Anthropic | Claude Haiku 4.5 | anthropic/claude-haiku-4.5 | 1408 | $1.00 | $5.00 |
| Minimax | M2.5 | minimax/minimax-m2.5 | 1403 | $0.12 | $0.99 |
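The sketch below shows how a single turn can be sent through OpenRouter's OpenAI-compatible chat endpoint using the model IDs from the table; retries, error handling, and mode-specific prompting are omitted.

```typescript
// Minimal single-turn request through OpenRouter; OPENROUTER_API_KEY is assumed
// to be set in the environment.
async function queryModel(modelId: string, system: string, user: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: modelId, // e.g. "deepseek/deepseek-v3.2" from the table above
      messages: [
        { role: "system", content: system },
        { role: "user", content: user },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```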
We run 3 attempts per model-mode-quest cell (matching TextQuests and AgentQuest protocols). For binary success rates at n=3, we report Wilson score confidence intervals.
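The Wilson score interval is standard; for reference, a small helper that computes it (at n = 3 the bounds stay inside [0, 1], unlike the normal approximation):

```typescript
// 95% Wilson score interval for a binomial proportion (z = 1.96).
function wilsonInterval(successes: number, n: number, z = 1.96): [number, number] {
  const p = successes / n;
  const denom = 1 + (z * z) / n;
  const center = (p + (z * z) / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n))) / denom;
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Example: 1 success in 3 attempts gives roughly [0.06, 0.79].
```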
Preliminary analysis of 261 runs across 10 zero-success quests reveals four failure categories:
| Category | Example quests | Failure mode | Proposed intervention |
|---|---|---|---|
| Spatial/grid | Codebox, Depth | Cannot track spatial state mentally | State tracker tool (Mode E) |
| Combinatorial | Banket, Election | Cannot satisfy multiple constraints | Calculator + constraint tracker (E) |
| Long-horizon | Driver, Prison | Repetition loops after 15-20 steps | Planner with visited-state log (D) |
| Domain scoring | Leonardo, Foncers | Missing knowledge of game mechanics | Knowledge hints (C) |
A repetition rate above 0.6 reliably predicts a 0% success rate across all models, suggesting that loop detection could serve as an early-stopping criterion.
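One plausible operationalization of the repetition rate and the early-stop rule is sketched below; the exact definition used in the benchmark is not specified here, so the revisit-based measure and the minimum-step window are assumptions.

```typescript
// Assumed definition: fraction of steps that land on an already-visited quest node.
function repetitionRate(visitedNodeIds: number[]): number {
  const seen = new Set<number>();
  let repeats = 0;
  for (const id of visitedNodeIds) {
    if (seen.has(id)) repeats++;
    else seen.add(id);
  }
  return visitedNodeIds.length > 0 ? repeats / visitedNodeIds.length : 0;
}

// Early stop once looping dominates the trajectory (0.6 threshold from the text).
function shouldStopEarly(visitedNodeIds: number[], minSteps = 10, threshold = 0.6): boolean {
  return visitedNodeIds.length >= minSteps && repetitionRate(visitedNodeIds) > threshold;
}
```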