Draft - Work in progress - Results are preliminary

Does Agent Architecture Matter? Evaluating LLM Decision-Making in Text-Based Environments

Kirill Korikov

Code | Leaderboard | Data

Abstract

We present LLM-Quest, a benchmark for evaluating how agent architecture affects LLM decision-making in text-based interactive environments. Unlike most LLM benchmarks that vary tasks and models while treating the agent as a black box, we introduce agent architecture as a first-class experimental dimension. We evaluate six mid-tier production language models across five agent modes -- baseline prompting, structured reasoning, domain knowledge injection, multi-step planning, and tool-augmented agents -- on a corpus of branching-narrative quests that require sequential decision-making over 10-50+ turns. Our three-dimensional evaluation (models x quests x agent modes) reveals [preliminary: the relationship between cognitive scaffolding and task performance, including which architectural interventions help which categories of tasks]. We release the full benchmark framework, evaluation data, and an interactive leaderboard to support reproducible agent architecture research.

1. Introduction

The rapid improvement in large language model capabilities has produced a parallel expansion in agent frameworks: chain-of-thought prompting, ReAct-style reasoning, tool use, and multi-step planning are now standard components of production LLM systems. Yet most evaluation benchmarks compare models on a fixed set of tasks using a single agent architecture, collapsing the architecture dimension into the model comparison.

This conflation obscures a practical question that matters to practitioners: given a fixed model and a fixed task, how much does the agent wrapper matter? Is a $0.02/run model with planning competitive against a $0.04/run model without it? Does domain knowledge injection unlock tasks that pure reasoning cannot solve?

We address this gap with LLM-Quest, a benchmark that evaluates LLM agents along three dimensions simultaneously: model (six mid-tier production models from different providers), task (a corpus of branching-narrative quests with diverse difficulty), and agent architecture (five modes ranging from bare prompting to tool-augmented agents).

Contributions

- LLM-Quest, a benchmark that treats agent architecture as a first-class experimental dimension alongside model and task.
- A three-dimensional evaluation grid: six mid-tier production models x 35 branching-narrative quests x five agent modes.
- A preliminary error taxonomy linking quest failure categories to targeted architectural interventions.
- An open release of the benchmark framework, evaluation data, and an interactive leaderboard.

2. Related work

LLM agent benchmarks

AgentBench (Liu et al., 2023) evaluates LLM agents across eight environments including web browsing, code execution, and database operations. SWE-bench (Jimenez et al., 2024) focuses on software engineering tasks, measuring an agent's ability to resolve real GitHub issues. AgentQuest (Gioacchini et al., NAACL 2024) provides a modular framework for agent evaluation with pluggable environments. These benchmarks compare models on a fixed agent architecture; we vary the architecture as an independent variable.

Text-based evaluation environments

TextQuests (Bakaeva et al., 2025) uses a similar quest corpus (Space Rangers .qm format) for LLM evaluation, but focuses on model comparison with a single agent architecture. Our work extends this approach by adding the agent mode dimension and comparing five architectures per model. TextWorld (Côté et al., 2019) provides procedurally generated text games for RL agent training; our quests are human-authored with organic difficulty distributions.

Agent architectures

Chain-of-thought prompting (Wei et al., 2022), ReAct (Yao et al., 2023), and tool-augmented generation (Schick et al., 2023) have each been shown to improve LLM performance on specific task types. Our benchmark evaluates these approaches as points on a spectrum, measuring their relative impact on the same tasks with the same models.

3. Method

3.1 Task environment

We use interactive fiction quests in the .qm format, originally created for the Space Rangers video game series. Each quest is a finite-state machine with text descriptions at each node and numbered action choices as edges. Quests terminate in binary outcomes (success or failure) after 10-50+ decision steps. The quest engine is space-rangers-quest, a TypeScript interpreter for the .qm format; quests can also be played in a browser.

The quest corpus contains approximately 150 scenarios. We select a subset of 35 English-language quests that span four difficulty categories identified through error analysis:

- Spatial/grid: quests requiring spatial state tracking (e.g. Codebox, Depth)
- Combinatorial: quests requiring multiple constraints to be satisfied simultaneously (e.g. Banket, Election)
- Long-horizon: quests requiring coherent play over 15+ steps without looping (e.g. Driver, Prison)
- Domain scoring: quests whose outcomes depend on game-specific mechanics (e.g. Leonardo, Foncers)

3.2 Agent modes

| Mode | Architecture | What changes |
|------|--------------|--------------|
| A | Baseline | Minimal prompt, action number output only |
| B | Prompted reasoning | Structured analysis/reasoning/choice JSON output |
| C | Knowledge-augmented | Mode B + domain knowledge injected into system prompt |
| D | Planner | Plan-maintain-act loop with periodic replanning |
| E | Tool-augmented | Mode B + state tracker, calculator, quest history tools |
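Mode B's structured output and its handling can be illustrated as follows. The field names (`analysis`, `reasoning`, `choice`) mirror the mode description above, but the exact schema and fallback policy are assumptions, not the benchmark's confirmed implementation.

```python
import json

# Hypothetical Mode B response: structured JSON instead of a bare action number.
raw = """{
  "analysis": "Two exits; the north corridor was already visited.",
  "reasoning": "Avoid revisiting states; the east door is unexplored.",
  "choice": 2
}"""

def parse_choice(raw: str, valid_choices: set[int]) -> int:
    """Extract and validate the chosen action, falling back safely on errors."""
    try:
        choice = int(json.loads(raw)["choice"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return min(valid_choices)  # deterministic fallback on malformed output
    return choice if choice in valid_choices else min(valid_choices)

print(parse_choice(raw, {1, 2, 3}))  # -> 2
```

A deterministic fallback keeps a single malformed generation from aborting an entire multi-turn run.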

3.3 Models

We evaluate six mid-tier production models from different providers, selected to represent the capability tier that developers typically deploy in production (as opposed to frontier flagships). All models are accessed through OpenRouter, which provides a consistent API and unified billing.

| Provider | Model | OpenRouter ID | ELO | Input $/M tokens | Output $/M tokens |
|---|---|---|---|---|---|
| Google | Gemini 3 Flash | google/gemini-3-flash-preview | 1474 | $0.50 | $3.00 |
| OpenAI | GPT-5.4 Mini | openai/gpt-5.4-mini | 1458 | $0.75 | $4.50 |
| DeepSeek | V3.2 | deepseek/deepseek-v3.2 | 1424 | $0.26 | $0.42 |
| Mistral | Medium 3.1 | mistralai/mistral-medium-3.1 | 1410 | $0.40 | $2.00 |
| Anthropic | Claude Haiku 4.5 | anthropic/claude-haiku-4.5 | 1408 | $1.00 | $5.00 |
| Minimax | M2.5 | minimax/minimax-m2.5 | 1403 | $0.12 | $0.99 |
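A single decision-step request through OpenRouter's OpenAI-compatible chat completions endpoint might be built like this. The system prompt and payload fields here are a minimal sketch, not the benchmark's actual prompts; the endpoint URL and header follow OpenRouter's documented API.

```python
import json
import os
import urllib.request

def build_request(model_id: str, observation: str) -> urllib.request.Request:
    """Build one chat-completions call to OpenRouter (constructed, not sent)."""
    payload = {
        "model": model_id,  # e.g. "deepseek/deepseek-v3.2" from the table above
        "messages": [
            {"role": "system",
             "content": "You are playing a text quest. Reply with the action number."},
            {"role": "user", "content": observation},
        ],
        "temperature": 0.0,  # deterministic decoding for reproducibility
    }
    return urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY', '')}",
            "Content-Type": "application/json",
        },
    )
```

Because every provider sits behind the same endpoint, swapping models in the harness reduces to changing the `model` string.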

3.4 Metrics

We run 3 attempts per model-mode-quest cell (matching the TextQuests and AgentQuest protocols). For binary success rates at n=3, we report Wilson score confidence intervals, which remain well-behaved at small sample sizes where the normal approximation breaks down.
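The Wilson interval is a standard closed-form computation:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, center - half), min(1.0, center + half)

# With n=3 the interval is wide: even 3/3 successes only bounds
# the true success rate above ~0.44.
print(wilson_interval(3, 3))  # lower bound ≈ 0.44
```

The width of these intervals at n=3 is why per-cell comparisons should be read cautiously; aggregate comparisons across quests carry most of the statistical weight.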

4. Results

[Pending: full benchmark results. Preliminary data from 360 runs across 6 models and 4 modes is available on the leaderboard. Mode E data collection is in progress.]

4.1 Mode comparison

[Pending: aggregate success rates per mode, across all models and quests. Expected finding: prompted reasoning (B) substantially outperforms baseline (A); knowledge injection (C) and planning (D) provide additional gains on specific quest categories.]

4.2 Per-category analysis

[Pending: success rates broken down by quest difficulty category (spatial, combinatorial, long-horizon, domain). Expected finding: different architectural interventions help different categories -- tools (E) for spatial/combinatorial, planning (D) for long-horizon, knowledge (C) for domain-specific.]

4.3 Error taxonomy

Preliminary analysis of 261 runs across 10 zero-success quests reveals four failure categories:

| Category | Example quests | Failure mode | Proposed intervention |
|---|---|---|---|
| Spatial/grid | Codebox, Depth | Cannot track spatial state mentally | State tracker tool (Mode E) |
| Combinatorial | Banket, Election | Cannot satisfy multiple constraints | Calculator + constraint tracker (E) |
| Long-horizon | Driver, Prison | Repetition loops after 15-20 steps | Planner with visited-state log (D) |
| Domain scoring | Leonardo, Foncers | Missing knowledge of game mechanics | Knowledge hints (C) |

Across all models, a repetition rate above 0.6 reliably predicts a 0% success rate, suggesting that loop detection could serve as an early-stopping criterion.
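Taking repetition rate as the fraction of steps that revisit an already-seen state (our reading of the metric; the benchmark may define it differently), the early-stopping check is a few lines:

```python
def repetition_rate(visited_states: list[str]) -> float:
    """Fraction of steps that land on an already-seen quest state."""
    seen: set[str] = set()
    repeats = 0
    for state in visited_states:
        if state in seen:
            repeats += 1
        seen.add(state)
    return repeats / len(visited_states) if visited_states else 0.0

def should_stop(visited_states: list[str], threshold: float = 0.6,
                min_steps: int = 10) -> bool:
    """Abort a run once the agent is clearly looping (thresholds illustrative)."""
    return (len(visited_states) >= min_steps
            and repetition_rate(visited_states) > threshold)

trace = ["a", "b", "c", "b", "c", "b", "c", "b", "c", "b"]
print(repetition_rate(trace))  # -> 0.7
```

Stopping doomed runs early would cut token spend without changing measured success rates, since the predicted outcome is already 0%.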

4.4 Cost-performance analysis

[Pending: Pareto frontier of success rate vs. cost per run across model-mode combinations. Key question: does a cheap model with good scaffolding beat an expensive model with minimal scaffolding?]
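The planned Pareto frontier over (cost per run, success rate) points can be computed with a single sweep over cost-sorted candidates; the run data below is hypothetical.

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep (cost, success) points not dominated by a cheaper, at-least-as-good one."""
    frontier: list[tuple[float, float]] = []
    best_success = float("-inf")
    for cost, success in sorted(points):  # ascending cost
        if success > best_success:
            frontier.append((cost, success))
            best_success = success
    return frontier

# Hypothetical model-mode combinations: (cost per run, success rate)
runs = [(0.02, 0.55), (0.04, 0.50), (0.03, 0.62), (0.06, 0.70)]
print(pareto_frontier(runs))  # -> [(0.02, 0.55), (0.03, 0.62), (0.06, 0.70)]
```

A model-mode combination off this frontier is strictly dominated: something cheaper achieves at least its success rate.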

5. Discussion

[Pending full results. Discussion will cover: ]

6. Limitations

7. Conclusion

[Pending full results.]

Citation

@misc{korikov2026llmquest,
  title={Does Agent Architecture Matter? Evaluating LLM Decision-Making in Text-Based Environments},
  author={Korikov, Kirill},
  year={2026},
  url={https://yourconscience.github.io/llm_quest_benchmark/}
}

References