fix(harness): always emit a final answer before a turn ends#4386
fix(harness): always emit a final answer before a turn ends#4386M3gA-Mind wants to merge 2 commits into
Conversation
A turn that ran tool calls and then yielded a terminating response with empty text (no cap hit, tool_calls > 0) fell through the terminal step and returned an empty reply — the turn ended silently with nothing shown to the user (tinyhumansai#4093). Enforce a must-produce-a-final-response terminal step: when the loop finishes cleanly with no final text but tools ran, re-prompt the model (tools disabled) for a closing summary, falling back to a deterministic summary of the tool calls so the synthesized message is never empty. The extra call's usage is folded into the turn accounting, mirroring the existing tool-call-cap checkpoint path. Closes tinyhumansai#4093
|
Warning Review limit reachedYou’ve reached a temporary PR review limit under our Fair Usage Limits Policy. Next review available in: 36 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe turn-handling harness replaces the fixed checkpoint-instruction wrap-up call with a parameterized ChangesFinal answer enforcement
Estimated code review effort: 3 (Moderate) | ~25 minutes Suggested labels: Sequence Diagram(s)sequenceDiagram
participant Core as turn/core.rs
participant SessionIO as session_io.rs
participant Checkpoint as turn_checkpoint.rs
participant History
Core->>Core: outcome.text empty after tool calls
Core->>SessionIO: summarize_turn_wrapup(FINAL_ANSWER_INSTRUCTION)
SessionIO-->>Core: summary, usage
alt summary non-empty
Core->>Core: final_answer = summary
else summary empty
Core->>Checkpoint: build_deterministic_final_summary(tool_records)
Checkpoint-->>Core: deterministic summary
Core->>Core: final_answer = deterministic summary
end
Core->>History: push final_answer
Core-->>Core: return final_answer
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a285278c42
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openhuman/agent/harness/session/turn/core.rs (1)
956-956: 🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick winDrop empty assistant turns before appending the turn transcript
messages_to_conversationkeeps an emptyMessage::Assistantas aConversationMessage::Chat, soself.history.extend(outcome.conversation.iter().cloned())can persist a blank assistant turn before the synthesizedfinal_answer/checkpoint is appended. Filter that empty assistant message out here (same guard for the cap path) so the transcript and KV-cache prefix stay clean.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/agent/harness/session/turn/core.rs` at line 956, The turn transcript append in the session core still includes an empty assistant chat message from messages_to_conversation, which can leave a blank assistant turn in history before the synthesized final_answer/checkpoint is added. Update the history extension in core::turn handling to filter out that empty Message::Assistant/ConversationMessage::Chat entry, and apply the same guard in the cap-path transcript append so self.history stays clean and the KV-cache prefix is not polluted.
🧹 Nitpick comments (1)
src/openhuman/agent/harness/session/turn/core.rs (1)
969-1001: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDuplicated wrap-up/fallback/usage-folding logic between
hit_capand no-final-answer branches.Both branches repeat the same four-step pattern (build
base, callsummarize_turn_wrapup, foldsummary_usageinto the four running totals, picksummaryvs. a deterministic fallback, push to history). Consider extracting a shared private helper that takes the instruction string and a fallback closure/function, used by both branches.♻️ Sketch of a shared helper
async fn resolve_turn_wrapup( &mut self, effective_model: &str, iteration_for_stream: u32, instruction: &str, input_tokens: &mut u64, output_tokens: &mut u64, cached_input_tokens: &mut u64, charged_amount_usd: &mut f64, fallback: impl FnOnce() -> String, ) -> String { let base = self.tool_dispatcher.to_provider_messages(&self.history); let (summary, summary_usage) = self .summarize_turn_wrapup(&base, effective_model, iteration_for_stream, instruction) .await; if let Some(u) = summary_usage { *input_tokens += u.input_tokens; *output_tokens += u.output_tokens; *cached_input_tokens += u.cached_input_tokens; *charged_amount_usd += u.charged_amount_usd; } let result = if summary.trim().is_empty() { fallback() } else { summary }; self.history .push(ConversationMessage::Chat(ChatMessage::assistant(result.clone()))); result }Also applies to: 1011-1052
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/agent/harness/session/turn/core.rs` around lines 969 - 1001, The wrap-up handling in the `hit_cap` and no-final-answer branches duplicates the same summarize/fold/fallback/push flow, so extract it into a shared private helper on `TurnCore` (or nearby) and reuse it from both paths. Move the common `to_provider_messages`, `summarize_turn_wrapup`, usage accumulation, deterministic fallback selection, and `history.push` logic into that helper, parameterized by the instruction string and fallback generator, then have both branches call it with their branch-specific instruction/fallback behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/openhuman/agent/harness/session/turn/core.rs`:
- Line 956: The turn transcript append in the session core still includes an
empty assistant chat message from messages_to_conversation, which can leave a
blank assistant turn in history before the synthesized final_answer/checkpoint
is added. Update the history extension in core::turn handling to filter out that
empty Message::Assistant/ConversationMessage::Chat entry, and apply the same
guard in the cap-path transcript append so self.history stays clean and the
KV-cache prefix is not polluted.
---
Nitpick comments:
In `@src/openhuman/agent/harness/session/turn/core.rs`:
- Around line 969-1001: The wrap-up handling in the `hit_cap` and
no-final-answer branches duplicates the same summarize/fold/fallback/push flow,
so extract it into a shared private helper on `TurnCore` (or nearby) and reuse
it from both paths. Move the common `to_provider_messages`,
`summarize_turn_wrapup`, usage accumulation, deterministic fallback selection,
and `history.push` logic into that helper, parameterized by the instruction
string and fallback generator, then have both branches call it with their
branch-specific instruction/fallback behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f1a5e9bb-0c6f-43b8-8f4f-f265fd6a0b12
📒 Files selected for processing (4)
src/openhuman/agent/harness/session/turn/core.rssrc/openhuman/agent/harness/session/turn/session_io.rssrc/openhuman/agent/harness/session/turn_checkpoint.rssrc/openhuman/agent/harness/session/turn_tests.rs
…nswer The no-final-text path folds the empty terminal assistant response into history via outcome.conversation before synthesizing a closing summary, which left a blank assistant turn immediately before the real reply in the transcript and next prompt. Pop that empty assistant Chat before appending the synthesized answer.
|
Addressed the Codex review (commit b9ed1b1):
@coderabbitai review |
🧩 Analysis chain🏁 Script executed: #!/bin/bash
set -euo pipefail
gh pr diff 4386 --repo tinyhumansai/openhuman -- src/openhuman/agent/harness/session/turn/core.rs src/openhuman/agent/harness/session/turn_tests.rsLength of output: 522 🏁 Script executed: #!/bin/bash
set -euo pipefail
gh pr diff 4386 --repo tinyhumansai/openhuman > /tmp/pr4386.diff
wc -l /tmp/pr4386.diff
grep -n "^diff --git" /tmp/pr4386.diffLength of output: 807 🏁 Script executed: #!/bin/bash
sed -n '1,80p' /tmp/pr4386.diffLength of output: 4346 🏁 Script executed: #!/bin/bash
sed -n '172,341p' /tmp/pr4386.diffLength of output: 6981 The fix looks correct — the trailing empty
No further concerns from my side on this update. ✅ Action performedReview finished.
|
sanil-23
left a comment
There was a problem hiding this comment.
Review — #4386 fix(harness): always emit a final answer before a turn ends
Walkthrough. The change closes #4093: a turn that runs tool calls and then terminates with empty final text used to return a blank reply. The fix renames summarize_iteration_checkpoint → summarize_turn_wrapup (now instruction-parameterized), adds a FINAL_ANSWER_INSTRUCTION + build_deterministic_final_summary, and adds a new terminal branch in turn/core.rs that re-prompts the model (tools disabled) for a closing summary, falling back to a deterministic tool-call summary when the re-prompt is also empty. The direction is right, it composes cleanly with the #4249 tinyagents outcome (hit_cap/tool_calls/conversation/tool_outcomes), and the extra model call is bounded to the pathological empty-final case (the healthy else { outcome.text } path is untouched, so it is not a spurious extra call on normal turns). Two things worth fixing before merge (one correctness, one coordination), plus minor nits.
🟠 Major
1. base for the re-prompt is built before the blank assistant is popped — the wrap-up call carries a trailing empty assistant turn.
src/openhuman/agent/harness/session/turn/core.rs:1021 (base) vs :1054-1060 (pop).
At line 956 the empty terminal Chat(assistant("")) is appended to self.history. The new branch then builds base from that history at line 1021 — before popping the blank at 1054. to_provider_messages clones ConversationMessage::Chat verbatim (no empty-filtering — src/openhuman/agent/dispatcher.rs:133), so summarize_turn_wrapup sends the provider […, assistant(""), user(FINAL_ANSWER_INSTRUCTION)].
Consequences: (a) a dangling blank assistant turn sits right before the re-prompt; (b) on native-tool providers that reject empty assistant content (Anthropic returns 400 on empty text blocks), the re-prompt call fails, silently degrading to the deterministic fallback and throwing away the model's better summary. The new tests don't catch this because SequenceProvider ignores its input.
Fix: pop the blank before computing base.
} else if outcome.text.trim().is_empty() {
// Drop the blank terminal assistant BEFORE building the re-prompt base,
// so the wrap-up call isn't sent a dangling empty assistant turn.
if matches!(
self.history.last(),
Some(ConversationMessage::Chat(msg))
if msg.role == "assistant" && msg.content.trim().is_empty()
) {
self.history.pop();
}
let base = self.tool_dispatcher.to_provider_messages(&self.history);
let (summary, summary_usage) = self
.summarize_turn_wrapup(&base, effective_model, outcome.model_calls as u32 + 1,
super::super::turn_checkpoint::FINAL_ANSWER_INSTRUCTION)
.await;
// …fold usage, pick summary/fallback…
self.history.push(ConversationMessage::Chat(ChatMessage::assistant(final_answer.clone())));
final_answer
}2. Direct file-level conflict with sibling PR #4387.
#4387 (enforce required structured-output block every turn) edits the same three files — turn/core.rs, turn/session_io.rs, turn_tests.rs — and post-processes the same terminal assistant region. Beyond a mechanical merge conflict, the two are semantically coupled: #4386 synthesizes a final answer when text is empty; #4387 enforces a structured block on every turn. Whichever lands second must reconcile the ordering (does the required-output enforcement run against the synthesized final_answer?). Please coordinate merge order and add a cross-turn test that exercises both paths together. (#4389/#4390 touch tinyagents/* and are lower-risk, but #4389's no-progress halt writes HaltSummarySlot → non-empty text, so it should pre-empt this branch — worth a confirming test once both land.)
🟡 Minor
3. Missing usage-folding assertion for the new branch. The cap path has turn_checkpoint_usage_is_folded_into_transcript_accounting, but neither new #4093 test attaches usage to the re-prompt response, so the input_tokens += u… fold at core.rs:1030-1035 is uncovered. Add a variant where the forced re-prompt returns usage: Some(...) and assert the turn totals grew.
4. Duplicated wrap-up/fold/fallback/push block across the hit_cap (969-1001) and new (1011-1065) branches (also raised by CodeRabbit). A small private helper taking (instruction, fallback: impl FnOnce() -> String) would remove the four-step repetition. Non-blocking.
5. Log uses outcome.tool_calls but the summary is built from tool_records_from_conversation. If tool_calls counts retries differently from the records list, the "after N tool call(s)" count in the core.rs:1043 log can disagree with the number of bullets in the deterministic summary. Cosmetic, but consider logging records.len() for consistency.
🟢 Nits / looks-good
FINAL_ANSWER_INSTRUCTIONandbuild_deterministic_final_summaryare well-documented and correctly distinguished from the paused-checkpoint variants (the "must not claim tool-call limit" test is a nice touch).- Whitespace-only final text is correctly treated as empty (
.trim().is_empty()). - Branch ordering is correct:
hit_cap→ empty+no-tools error → empty+tools synthesize → normal. - CodeRabbit's "also guard the cap path" outside-diff note is a non-issue here: the cap-path terminal is
AssistantToolCalls, not an emptyChat, so there's no blank to drop.
Nice fix overall — addressing #1 (pop before building base) is the one thing I'd gate merge on.
sanil-23
left a comment
There was a problem hiding this comment.
Requesting changes on the strength of the two items in my detailed review above:
Blocking (Major #1): base for the forced final-answer re-prompt is built at turn/core.rs:1021 — before the blank Chat(assistant("")) is popped at :1054. Since to_provider_messages doesn't filter empties, the wrap-up call is sent a trailing empty assistant turn, which native-tool providers (Anthropic) reject with a 400, silently degrading to the deterministic fallback and discarding the model's better summary. Moving the pop above the base computation fixes it. The new tests don't catch this because SequenceProvider ignores its input, so please also add a provider that fails on an empty trailing assistant message (or assert the re-prompt request omits it).
Coordination (Major #2): direct file-level + semantic conflict with sibling #4387 (same three files, same terminal region). Whichever lands second needs to reconcile ordering; a combined test would de-risk it.
The direction and the deterministic-fallback design are good — happy to flip to approve once #1 is addressed. Minor items (#3–#5) are non-blocking.
Closes #4093
Problem
In the tinyagents turn engine (
run_turn_via_tinyagents_session,session/turn/core.rs), the terminal step that computes the turn'sreplyhad a gap. When the loop ran tool calls and then the model yielded a terminating response with empty text (tool_calls > 0, cap not hit), it fell through toelse => outcome.text.clone()and returned an empty string — the turn ended silently, leaving the user with nothing after a sequence of tool calls.Enforcement approach
Add a "must produce a final response" terminal branch that mirrors the existing tool-call-cap checkpoint path:
FINAL_ANSWER_INSTRUCTION) for a short closing summary of what it did.build_deterministic_final_summary()over the turn's tool-call records — so the synthesized message is itself never empty.The genuinely-empty case (no text and no tool calls) still surfaces as
EmptyProviderResponse— that's a real provider failure, not a missing summary.Supporting refactor: generalized
summarize_iteration_checkpoint→summarize_turn_wrapup(…, instruction)so the cap path and the new no-final-answer path share one streaming provider call.Tests
Two regression tests in
turn_tests.rs, both driving a real tool-only turn (tool runs, model then yields empty terminal text, cap not hit):turn_synthesizes_final_answer_when_tool_turn_yields_no_text— asserts the turn returns the model's synthesized wrap-up and history ends on a non-empty assistant message.turn_final_answer_falls_back_to_deterministic_summary_when_reprompt_empty— re-prompt also empty → asserts a non-empty deterministic summary listing the tool, and that it does not claim "tool-call limit".Caveat
Local test run couldn't complete under heavy shared-target-dir contention from concurrent builds; relying on CI to run the suite. Change is localized to the terminal-step path (coordinated with the other turn-engine issues #4117/#4089 — expect a rebase at merge).
Summary by CodeRabbit
New Features
Bug Fixes