Skip to content

fix(harness): always emit a final answer before a turn ends#4386

Open
M3gA-Mind wants to merge 2 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/GH-4093-final-answer
Open

fix(harness): always emit a final answer before a turn ends#4386
M3gA-Mind wants to merge 2 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/GH-4093-final-answer

Conversation

@M3gA-Mind

@M3gA-Mind M3gA-Mind commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

Closes #4093

Problem

In the tinyagents turn engine (run_turn_via_tinyagents_session, session/turn/core.rs), the terminal step that computes the turn's reply had a gap. When the loop ran tool calls and then the model yielded a terminating response with empty text (tool_calls > 0, cap not hit), it fell through to else => outcome.text.clone() and returned an empty string — the turn ended silently, leaving the user with nothing after a sequence of tool calls.

Enforcement approach

Add a "must produce a final response" terminal branch that mirrors the existing tool-call-cap checkpoint path:

  • When the loop finishes cleanly with no final text but tools ran, re-prompt the model with tools disabled (FINAL_ANSWER_INSTRUCTION) for a short closing summary of what it did.
  • If that re-prompt also comes back empty, fall back to a deterministic build_deterministic_final_summary() over the turn's tool-call records — so the synthesized message is itself never empty.
  • The extra call's token/cost usage is folded into the turn accounting, exactly like the cap path.

The genuinely-empty case (no text and no tool calls) still surfaces as EmptyProviderResponse — that's a real provider failure, not a missing summary.

Supporting refactor: generalized summarize_iteration_checkpointsummarize_turn_wrapup(…, instruction) so the cap path and the new no-final-answer path share one streaming provider call.

Tests

Two regression tests in turn_tests.rs, both driving a real tool-only turn (tool runs, model then yields empty terminal text, cap not hit):

  • turn_synthesizes_final_answer_when_tool_turn_yields_no_text — asserts the turn returns the model's synthesized wrap-up and history ends on a non-empty assistant message.
  • turn_final_answer_falls_back_to_deterministic_summary_when_reprompt_empty — re-prompt also empty → asserts a non-empty deterministic summary listing the tool, and that it does not claim "tool-call limit".

Caveat

Local test run couldn't complete under heavy shared-target-dir contention from concurrent builds; relying on CI to run the suite. Change is localized to the terminal-step path (coordinated with the other turn-engine issues #4117/#4089 — expect a rebase at merge).

Summary by CodeRabbit

  • New Features

    • Turns that use tools now end with a final assistant message, even if the model returns no closing text.
    • Added a more reliable wrap-up flow that can synthesize a short summary from executed tool results.
  • Bug Fixes

    • Prevented tool-assisted conversations from ending with an empty or dangling response.
    • Improved handling of turn-cap and final-answer wrap-ups so the conversation stays complete and readable.

A turn that ran tool calls and then yielded a terminating response with
empty text (no cap hit, tool_calls > 0) fell through the terminal step
and returned an empty reply — the turn ended silently with nothing shown
to the user (tinyhumansai#4093).

Enforce a must-produce-a-final-response terminal step: when the loop
finishes cleanly with no final text but tools ran, re-prompt the model
(tools disabled) for a closing summary, falling back to a deterministic
summary of the tool calls so the synthesized message is never empty. The
extra call's usage is folded into the turn accounting, mirroring the
existing tool-call-cap checkpoint path.

Closes tinyhumansai#4093
@M3gA-Mind M3gA-Mind requested a review from a team July 1, 2026 18:09
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 36 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 11757988-64d3-4fd8-84a3-e514ea88f94a

📥 Commits

Reviewing files that changed from the base of the PR and between a285278 and b9ed1b1.

📒 Files selected for processing (2)
  • src/openhuman/agent/harness/session/turn/core.rs
  • src/openhuman/agent/harness/session/turn_tests.rs
📝 Walkthrough

Walkthrough

The turn-handling harness replaces the fixed checkpoint-instruction wrap-up call with a parameterized summarize_turn_wrapup that accepts an instruction string. A new final-answer instruction and deterministic tool-summary fallback ensure turns that complete tool calls without final text now synthesize a non-empty assistant message instead of ending silently. Tests cover both paths.

Changes

Final answer enforcement

Layer / File(s) Summary
Wrap-up helper signature change
src/openhuman/agent/harness/session/turn/session_io.rs
summarize_iteration_checkpoint is renamed to summarize_turn_wrapup and now takes an instruction: &str parameter appended as a synthetic user message, removing the fixed constant import.
Final-answer instruction and deterministic fallback
src/openhuman/agent/harness/session/turn_checkpoint.rs
Adds FINAL_ANSWER_INSTRUCTION prompt and build_deterministic_final_summary(records) producing a non-empty summary from ToolCallRecord entries, including the empty-record case.
Turn completion wiring
src/openhuman/agent/harness/session/turn/core.rs
The cap-checkpoint path now uses summarize_turn_wrapup with MAX_ITER_CHECKPOINT_INSTRUCTION; a new branch handles empty final text after tool calls by re-prompting with FINAL_ANSWER_INSTRUCTION, folding usage, and synthesizing final_answer from the wrap-up or deterministic fallback before pushing it into history.
Regression tests for final-answer synthesis
src/openhuman/agent/harness/session/turn_tests.rs
New tests verify a non-empty synthesized assistant message when a tool call is followed by empty terminal text, and that a deterministic tool-based summary is used when the re-prompt also returns empty text.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Suggested labels: bug, rust-core, agent

Sequence Diagram(s)

sequenceDiagram
    participant Core as turn/core.rs
    participant SessionIO as session_io.rs
    participant Checkpoint as turn_checkpoint.rs
    participant History

    Core->>Core: outcome.text empty after tool calls
    Core->>SessionIO: summarize_turn_wrapup(FINAL_ANSWER_INSTRUCTION)
    SessionIO-->>Core: summary, usage
    alt summary non-empty
        Core->>Core: final_answer = summary
    else summary empty
        Core->>Checkpoint: build_deterministic_final_summary(tool_records)
        Checkpoint-->>Core: deterministic summary
        Core->>Core: final_answer = deterministic summary
    end
    Core->>History: push final_answer
    Core-->>Core: return final_answer
Loading

Poem

A tool ran fast, then went quite still,
No word to say what filled the till —
So I hopped in with a gentle nudge,
"Speak up now, don't make me judge!"
Now every burrow ends with cheer 🐇
A final word, loud and clear!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly states the primary harness change: ensuring each turn emits a final answer before ending.
Linked Issues check ✅ Passed The PR matches #4093 by forcing a terminal response after tool work, with a model re-prompt and deterministic fallback.
Out of Scope Changes check ✅ Passed The refactor and added tests support the same harness behavior change and don't appear unrelated to #4093.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot added agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. bug rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. labels Jul 1, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a285278c42

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/openhuman/agent/harness/session/turn/core.rs

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/openhuman/agent/harness/session/turn/core.rs (1)

956-956: 🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

Drop empty assistant turns before appending the turn transcript
messages_to_conversation keeps an empty Message::Assistant as a ConversationMessage::Chat, so self.history.extend(outcome.conversation.iter().cloned()) can persist a blank assistant turn before the synthesized final_answer/checkpoint is appended. Filter that empty assistant message out here (same guard for the cap path) so the transcript and KV-cache prefix stay clean.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/agent/harness/session/turn/core.rs` at line 956, The turn
transcript append in the session core still includes an empty assistant chat
message from messages_to_conversation, which can leave a blank assistant turn in
history before the synthesized final_answer/checkpoint is added. Update the
history extension in core::turn handling to filter out that empty
Message::Assistant/ConversationMessage::Chat entry, and apply the same guard in
the cap-path transcript append so self.history stays clean and the KV-cache
prefix is not polluted.
🧹 Nitpick comments (1)
src/openhuman/agent/harness/session/turn/core.rs (1)

969-1001: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Duplicated wrap-up/fallback/usage-folding logic between hit_cap and no-final-answer branches.

Both branches repeat the same four-step pattern (build base, call summarize_turn_wrapup, fold summary_usage into the four running totals, pick summary vs. a deterministic fallback, push to history). Consider extracting a shared private helper that takes the instruction string and a fallback closure/function, used by both branches.

♻️ Sketch of a shared helper
async fn resolve_turn_wrapup(
    &mut self,
    effective_model: &str,
    iteration_for_stream: u32,
    instruction: &str,
    input_tokens: &mut u64,
    output_tokens: &mut u64,
    cached_input_tokens: &mut u64,
    charged_amount_usd: &mut f64,
    fallback: impl FnOnce() -> String,
) -> String {
    let base = self.tool_dispatcher.to_provider_messages(&self.history);
    let (summary, summary_usage) = self
        .summarize_turn_wrapup(&base, effective_model, iteration_for_stream, instruction)
        .await;
    if let Some(u) = summary_usage {
        *input_tokens += u.input_tokens;
        *output_tokens += u.output_tokens;
        *cached_input_tokens += u.cached_input_tokens;
        *charged_amount_usd += u.charged_amount_usd;
    }
    let result = if summary.trim().is_empty() { fallback() } else { summary };
    self.history
        .push(ConversationMessage::Chat(ChatMessage::assistant(result.clone())));
    result
}

Also applies to: 1011-1052

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/agent/harness/session/turn/core.rs` around lines 969 - 1001,
The wrap-up handling in the `hit_cap` and no-final-answer branches duplicates
the same summarize/fold/fallback/push flow, so extract it into a shared private
helper on `TurnCore` (or nearby) and reuse it from both paths. Move the common
`to_provider_messages`, `summarize_turn_wrapup`, usage accumulation,
deterministic fallback selection, and `history.push` logic into that helper,
parameterized by the instruction string and fallback generator, then have both
branches call it with their branch-specific instruction/fallback behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/openhuman/agent/harness/session/turn/core.rs`:
- Line 956: The turn transcript append in the session core still includes an
empty assistant chat message from messages_to_conversation, which can leave a
blank assistant turn in history before the synthesized final_answer/checkpoint
is added. Update the history extension in core::turn handling to filter out that
empty Message::Assistant/ConversationMessage::Chat entry, and apply the same
guard in the cap-path transcript append so self.history stays clean and the
KV-cache prefix is not polluted.

---

Nitpick comments:
In `@src/openhuman/agent/harness/session/turn/core.rs`:
- Around line 969-1001: The wrap-up handling in the `hit_cap` and
no-final-answer branches duplicates the same summarize/fold/fallback/push flow,
so extract it into a shared private helper on `TurnCore` (or nearby) and reuse
it from both paths. Move the common `to_provider_messages`,
`summarize_turn_wrapup`, usage accumulation, deterministic fallback selection,
and `history.push` logic into that helper, parameterized by the instruction
string and fallback generator, then have both branches call it with their
branch-specific instruction/fallback behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f1a5e9bb-0c6f-43b8-8f4f-f265fd6a0b12

📥 Commits

Reviewing files that changed from the base of the PR and between 4c98a31 and a285278.

📒 Files selected for processing (4)
  • src/openhuman/agent/harness/session/turn/core.rs
  • src/openhuman/agent/harness/session/turn/session_io.rs
  • src/openhuman/agent/harness/session/turn_checkpoint.rs
  • src/openhuman/agent/harness/session/turn_tests.rs

coderabbitai[bot]
coderabbitai Bot previously approved these changes Jul 1, 2026
…nswer

The no-final-text path folds the empty terminal assistant response into
history via outcome.conversation before synthesizing a closing summary,
which left a blank assistant turn immediately before the real reply in
the transcript and next prompt. Pop that empty assistant Chat before
appending the synthesized answer.
@M3gA-Mind

Copy link
Copy Markdown
Collaborator Author

Addressed the Codex review (commit b9ed1b1):

  • In the no-final-text terminal branch, the empty terminal assistant response was already folded into history via outcome.conversation. The branch now pops that trailing blank Chat(assistant("")) before appending the synthesized closing summary, so the transcript and next-turn prompt no longer carry a dangling empty assistant message before the real reply.
  • Both [Harness] Always emit a final answer/summary before a turn ends #4093 regression tests now additionally assert that no empty assistant turn remains in history.

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor
🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
gh pr diff 4386 --repo tinyhumansai/openhuman -- src/openhuman/agent/harness/session/turn/core.rs src/openhuman/agent/harness/session/turn_tests.rs

Length of output: 522


🏁 Script executed:

#!/bin/bash
set -euo pipefail
gh pr diff 4386 --repo tinyhumansai/openhuman > /tmp/pr4386.diff
wc -l /tmp/pr4386.diff
grep -n "^diff --git" /tmp/pr4386.diff

Length of output: 807


🏁 Script executed:

#!/bin/bash
sed -n '1,80p' /tmp/pr4386.diff

Length of output: 4346


🏁 Script executed:

#!/bin/bash
sed -n '172,341p' /tmp/pr4386.diff

Length of output: 6981


The fix looks correct — the trailing empty Chat(assistant("")) is now popped before the synthesized final answer is pushed, and both regression tests assert history contains no blank assistant turns. This matches the earlier Codex review concern precisely. Logic checks out:

  • The matches! guard correctly targets only the last history entry, only when it's an assistant chat message with empty content, so it won't accidentally drop a legitimate prior message.
  • Both new/updated tests (turn_synthesizes_final_answer_when_tool_turn_yields_no_text and turn_final_answer_falls_back_to_deterministic_summary_when_reprompt_empty) now assert no blank assistant message remains anywhere in history, which is a stronger check than just the tail.
  • The summarize_turn_wrapup refactor (shared between the cap path and the new final-answer path) and the FINAL_ANSWER_INSTRUCTION/build_deterministic_final_summary additions all look consistent with the PR objectives for #4093.

No further concerns from my side on this update.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@sanil-23 sanil-23 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — #4386 fix(harness): always emit a final answer before a turn ends

Walkthrough. The change closes #4093: a turn that runs tool calls and then terminates with empty final text used to return a blank reply. The fix renames summarize_iteration_checkpointsummarize_turn_wrapup (now instruction-parameterized), adds a FINAL_ANSWER_INSTRUCTION + build_deterministic_final_summary, and adds a new terminal branch in turn/core.rs that re-prompts the model (tools disabled) for a closing summary, falling back to a deterministic tool-call summary when the re-prompt is also empty. The direction is right, it composes cleanly with the #4249 tinyagents outcome (hit_cap/tool_calls/conversation/tool_outcomes), and the extra model call is bounded to the pathological empty-final case (the healthy else { outcome.text } path is untouched, so it is not a spurious extra call on normal turns). Two things worth fixing before merge (one correctness, one coordination), plus minor nits.


🟠 Major

1. base for the re-prompt is built before the blank assistant is popped — the wrap-up call carries a trailing empty assistant turn.
src/openhuman/agent/harness/session/turn/core.rs:1021 (base) vs :1054-1060 (pop).

At line 956 the empty terminal Chat(assistant("")) is appended to self.history. The new branch then builds base from that history at line 1021 — before popping the blank at 1054. to_provider_messages clones ConversationMessage::Chat verbatim (no empty-filtering — src/openhuman/agent/dispatcher.rs:133), so summarize_turn_wrapup sends the provider […, assistant(""), user(FINAL_ANSWER_INSTRUCTION)].

Consequences: (a) a dangling blank assistant turn sits right before the re-prompt; (b) on native-tool providers that reject empty assistant content (Anthropic returns 400 on empty text blocks), the re-prompt call fails, silently degrading to the deterministic fallback and throwing away the model's better summary. The new tests don't catch this because SequenceProvider ignores its input.

Fix: pop the blank before computing base.

} else if outcome.text.trim().is_empty() {
    // Drop the blank terminal assistant BEFORE building the re-prompt base,
    // so the wrap-up call isn't sent a dangling empty assistant turn.
    if matches!(
        self.history.last(),
        Some(ConversationMessage::Chat(msg))
            if msg.role == "assistant" && msg.content.trim().is_empty()
    ) {
        self.history.pop();
    }
    let base = self.tool_dispatcher.to_provider_messages(&self.history);
    let (summary, summary_usage) = self
        .summarize_turn_wrapup(&base, effective_model, outcome.model_calls as u32 + 1,
            super::super::turn_checkpoint::FINAL_ANSWER_INSTRUCTION)
        .await;
    // …fold usage, pick summary/fallback…
    self.history.push(ConversationMessage::Chat(ChatMessage::assistant(final_answer.clone())));
    final_answer
}

2. Direct file-level conflict with sibling PR #4387.
#4387 (enforce required structured-output block every turn) edits the same three files — turn/core.rs, turn/session_io.rs, turn_tests.rs — and post-processes the same terminal assistant region. Beyond a mechanical merge conflict, the two are semantically coupled: #4386 synthesizes a final answer when text is empty; #4387 enforces a structured block on every turn. Whichever lands second must reconcile the ordering (does the required-output enforcement run against the synthesized final_answer?). Please coordinate merge order and add a cross-turn test that exercises both paths together. (#4389/#4390 touch tinyagents/* and are lower-risk, but #4389's no-progress halt writes HaltSummarySlot → non-empty text, so it should pre-empt this branch — worth a confirming test once both land.)


🟡 Minor

3. Missing usage-folding assertion for the new branch. The cap path has turn_checkpoint_usage_is_folded_into_transcript_accounting, but neither new #4093 test attaches usage to the re-prompt response, so the input_tokens += u… fold at core.rs:1030-1035 is uncovered. Add a variant where the forced re-prompt returns usage: Some(...) and assert the turn totals grew.

4. Duplicated wrap-up/fold/fallback/push block across the hit_cap (969-1001) and new (1011-1065) branches (also raised by CodeRabbit). A small private helper taking (instruction, fallback: impl FnOnce() -> String) would remove the four-step repetition. Non-blocking.

5. Log uses outcome.tool_calls but the summary is built from tool_records_from_conversation. If tool_calls counts retries differently from the records list, the "after N tool call(s)" count in the core.rs:1043 log can disagree with the number of bullets in the deterministic summary. Cosmetic, but consider logging records.len() for consistency.


🟢 Nits / looks-good

  • FINAL_ANSWER_INSTRUCTION and build_deterministic_final_summary are well-documented and correctly distinguished from the paused-checkpoint variants (the "must not claim tool-call limit" test is a nice touch).
  • Whitespace-only final text is correctly treated as empty (.trim().is_empty()).
  • Branch ordering is correct: hit_cap → empty+no-tools error → empty+tools synthesize → normal.
  • CodeRabbit's "also guard the cap path" outside-diff note is a non-issue here: the cap-path terminal is AssistantToolCalls, not an empty Chat, so there's no blank to drop.

Nice fix overall — addressing #1 (pop before building base) is the one thing I'd gate merge on.

@sanil-23 sanil-23 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes on the strength of the two items in my detailed review above:

Blocking (Major #1): base for the forced final-answer re-prompt is built at turn/core.rs:1021before the blank Chat(assistant("")) is popped at :1054. Since to_provider_messages doesn't filter empties, the wrap-up call is sent a trailing empty assistant turn, which native-tool providers (Anthropic) reject with a 400, silently degrading to the deterministic fallback and discarding the model's better summary. Moving the pop above the base computation fixes it. The new tests don't catch this because SequenceProvider ignores its input, so please also add a provider that fails on an empty trailing assistant message (or assert the re-prompt request omits it).

Coordination (Major #2): direct file-level + semantic conflict with sibling #4387 (same three files, same terminal region). Whichever lands second needs to reconcile ordering; a combined test would de-risk it.

The direction and the deterministic-fallback design are good — happy to flip to approve once #1 is addressed. Minor items (#3#5) are non-blocking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. bug rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure.

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

[Harness] Always emit a final answer/summary before a turn ends

2 participants