A self-contained implementation of Reasoning Inception (ReIn), a test-time intervention method that enables conversational agents to recover from contextually flawed interactions — without modifying the agent's parameters or system prompt.
Disclaimer: We provide a minimal implementation of the proposed method to facilitate future use, since this work builds on a benchmark with a complicated codebase.
When LLM-based agents interact with users through multi-turn tool-calling conversations, it is inevitable to encounter user-side errors. We assume two types of user-side errors in this work:
- Ambiguous requests — the user's intent is unclear (e.g., "cancel it" could mean different things)
- Unsupported requests — the user asks for something the system cannot do (e.g., creating a wishlist in a flight reservation scenario)
ReIn addresses these errors through a two-stage mechanism applied at test time:
┌─────────────────────────────────────────────────────────────┐
│ ReIn │
│ │
│ Stage 1: INCEPTION MODULE │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ External LLM analyzes the dialogue context: │ │
│ │ - Are there error signals? (user frustration, etc.) │ │
│ │ - If YES → generate recovery reasoning │ │
│ │ - If NO → do nothing │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ error detected? │
│ / \ │
│ YES NO │
│ / \ │
│ Stage 2: INJECT inception_block proceed normally │
│ ┌──────────────────────┐ │
│ │ Plant a "seed" of │ │
│ │ recovery reasoning │ │
│ │ into the agent's │ │
│ │ initial context │ │
│ └──────────────────────┘ │
│ │ │
│ Stage 3: AGENT ACTION LOOP │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ The task agent generates actions with the augmented │ │
│ │ context. It may call tools or respond to the user. │ │
│ │ The injected reasoning guides it toward recovery │ │
│ │ actions (e.g., filing an ambiguity report, or │ │
│ │ transferring to a human agent). │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
The key innovation is that ReIn works without modifying the agent's system prompt or model weights. Instead, it injects a single reasoning block (via the think tool) into the agent's internal context, which steers the agent's subsequent behavior through the instruction hierarchy.
As discussed in Section 4.6, defining a recovery tool is a key to override instruction hierarchy, which is safer solution given that tool schema definition can only be done by service providers.
├── src/
│ ├── agents.py # LLM wrappers: TaskAgent, InceptionModule, UserSimulator
│ ├── environment.py # Simulated retail environment: database, tools, schemas
│ ├── rein.py # Core Algorithm 1: run_agent_turn() + run_conversation()
│ └── run.py # CLI entry point
├── assets/
│ ├── errors.json # Error taxonomy (paper Appendix D.1.1)
│ ├── plans.json # Recovery plans (paper Appendix D.1.2)
│ ├── prompt_rein.md # Inception module prompt (paper Appendix D.5)
│ ├── prompt_usersim.md # User simulator prompt (paper Appendix D.6)
│ └── prompt_system_retail.md # Task agent system prompt (paper Appendix D.7.2)
├── scenarios.json # 4 test scenarios
├── run_all.sh # Runs all scenarios in ReIn and baseline modes
└── pyproject.toml # uv project configuration
Reading order: Start with src/rein.py — it implements the core algorithm and is the paper's main contribution. Then src/agents.py for the LLM actors, and src/environment.py for the simulated tool environment.
Requires uv.
# Install dependencies into a local .venv
uv sync
# Set your API key (for OpenAI)
export OPENAI_API_KEY="sk-..."Works with any OpenAI-compatible API endpoint (OpenAI, vLLM, etc.).
# Run all scenarios with ReIn enabled
uv run python src/run.py
# Run baseline (no ReIn) for comparison
uv run python src/run.py --no-rein
# Run a single scenario with verbose output
uv run python src/run.py --scenario ambiguous_anaphora_02 --verbose
# Use different models
uv run python src/run.py --agent-model gpt-4o --inception-model gpt-4o-mini
# Use a local model via vLLM (requires --enable-auto-tool-choice --tool-call-parser hermes)
uv run python src/run.py --base-url http://localhost:8000/v1 --agent-model Qwen/Qwen3-32B# Run all scenarios (ReIn + baseline)
bash run_all.sh
# Override model or server via environment variables
BASE_URL=http://localhost:8000/v1 AGENT_MODEL=Qwen/Qwen3-32B bash run_all.sh
VERBOSE=--verbose bash run_all.shvLLM requires tool-calling flags to be enabled at startup:
vllm serve Qwen/Qwen3-32B \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--tensor-parallel-size 4Without --enable-auto-tool-choice and --tool-call-parser, the server will reject requests that use tools.
Each scenario provides an initial 3-turn context {u₁, a₁, u₂} where the agent has already made an error. ReIn intervenes at the next turn to guide recovery.
Results below are averaged over 5 runs at temperature 0.7 using Qwen/Qwen3-32B on vLLM.
| ID | Error Type | Subtype | Description | ReIn (Pass@5) | Baseline (Pass@5) |
|---|---|---|---|---|---|
ambiguous_anaphora_02 |
Ambiguous | Anaphora | "the item" is ambiguous — order has both a mouse and a lamp | 4/5 (80%) | 0/5 (0%) |
ambiguous_multiple_interpretation_02 |
Ambiguous | Multiple interpretation | "update my order" — agent picks shipping address, user meant payment method | 4/5 (80%) | 1/5 (20%) |
unsupported_action_02 |
Unsupported | Action | User asks to set up automatic monthly reorders | 5/5 (100%) | 5/5 (100%) |
unsupported_parameter_02 |
Unsupported | Parameter | User wants store credit as refund method (not supported) | 4/5 (80%) | 1/5 (20%) |
| Average | 85% | 35% |
| Paper concept | Code location |
|---|---|
| Algorithm 1 | src/rein.py:run_agent_turn() |
| Inception module F | src/agents.py:InceptionModule |
| Task agent π_c | src/agents.py:TaskAgent |
| think[ρ_t] injection | src/rein.py:run_agent_turn() lines 60-99 |
| Recovery tools (ambiguity_report, transfer_to_human_agents) | src/environment.py:RETAIL_TOOL_SCHEMAS |
| Error definitions (Φ) | assets/errors.json |
| Recovery plans | assets/plans.json |
| Inception prompt (S') | assets/prompt_rein.md |
| User simulator prompt | assets/prompt_usersim.md |
| Task agent system prompt | assets/prompt_system_retail.md |
| Multi-turn conversation loop | src/rein.py:run_conversation() |
Pass/fail is determined by whether the expected recovery tool was called at any point in the conversation:
| Error type | Expected recovery tool | Rationale |
|---|---|---|
| Ambiguous | ambiguity_report |
Logs the misunderstanding to the internal system (paper §3.2.2) |
| Unsupported | transfer_to_human_agents |
Escalates requests beyond the agent's capability (paper §3.2.2) |
This matches the paper's Pass@1 criterion: the task is considered complete if the agent invokes the correct recovery tool at least once during the session.
@inproceedings{kim2026rein,
title={ReIn: Conversational Error Recovery with Reasoning Inception},
author={Takyoung Kim and Jinseok Nam and Chandrayee Basu and Xing Fan and Chengyuan Ma and Heng Ji and Gokhan Tur and Dilek Hakkani-T{\"u}r},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=4J3kkHI6m5}
}