fix(agent-harness): enforce required structured-output block every turn#4387
fix(agent-harness): enforce required structured-output block every turn#4387M3gA-Mind wants to merge 3 commits into
Conversation
…rn (tinyhumansai#4117) Agents frequently omit a mandated JSON block (e.g. a `thoughts` block) that the contract says must be emitted every turn, leaving downstream parsing with nothing. Add an opt-in `RequiredOutputContract` (block key + required sibling keys) on `AgentConfig`. When set, the turn engine validates the final reply and repairs an omitted block before the turn is accepted: 1. one corrective re-prompt with tools disabled, then 2. a synthesized minimal valid block prepended to the prose as a deterministic fallback. The trailing assistant message is rewritten to match and the repair call's usage is folded into the turn accounting. `None` (default) is a no-op, so existing agents are unaffected. Validation reuses the harness JSON extractor. Adds unit tests for the validator/synthesizer and turn regressions for the omitted-block -> repaired and omitted-block -> synthesized paths.
|
Warning Review limit reachedYou’ve reached a temporary PR review limit under our Fair Usage Limits Policy. Next review available in: 34 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughAdds a ChangesRequired Output Contract Enforcement
Estimated code review effort: 3 (Moderate) | ~25 minutes Sequence Diagram(s)sequenceDiagram
participant Turn as Turn Finalization
participant Agent
participant Provider
participant History
Turn->>Agent: enforce_required_output(reply, contract)
Agent->>Agent: output_satisfies_contract(reply)?
alt contract violated
Agent->>Provider: reprompt_for_required_block(repair_instruction)
Provider-->>Agent: repaired text, usage
Agent->>Agent: recheck contract
alt still violated
Agent->>Agent: synthesize_block(contract), prepend to text
end
Agent-->>Turn: repaired text, usage
Turn->>History: replace_last_assistant_reply(repaired text)
else contract satisfied
Agent-->>Turn: None
end
Suggested labels: Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 852b7cef5c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
src/openhuman/agent/harness/session/turn_tests.rs (1)
998-1005: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winPlease exercise the new repair-usage accounting path here.
Both repair responses use
usage: None, so these regressions never cover the new token/cost folding added insrc/openhuman/agent/harness/session/turn/core.rs. A break in repair accounting would still pass this suite.Also applies to: 1064-1069
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/agent/harness/session/turn_tests.rs` around lines 998 - 1005, Update the repair-response tests in turn_tests.rs so they actually exercise the new repair-usage accounting path instead of only using ChatResponse with usage: None. Add repair responses with populated usage data in the relevant test cases (including the one around the existing ChatResponse fixture and the matching case near the second referenced block) so the folding logic in turn/core.rs is covered and regressions in repair token/cost accounting are detected.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/agent/harness/required_output.rs`:
- Around line 34-55: `find_required_block()` is too permissive because it
accepts the first matching JSON object anywhere in the reply, which lets
`output_satisfies_contract()` pass when the required block is buried after other
text. Update `find_required_block` (and any caller logic in
`output_satisfies_contract`/`enforce_required_output`) to only accept a required
block when it appears in the expected position for the reply shape, rather than
scanning all extracted JSON values indiscriminately; keep using
`RequiredOutputContract::all_keys` and `super::parse::extract_json_values`, but
add a positional/ordering check before returning the object.
In `@src/openhuman/agent/harness/session/turn_tests.rs`:
- Around line 1095-1104: The test in turn_tests::synthesize_reply is asserting
the wrong fallback prose: the synthesis path preserves the original model reply,
not the failed corrective re-prompt. Update the assertion near
extract_json_values and the response.contains check to match the preserved prose
from the initial response ("Working on it.") while keeping the existing thoughts
block and request-count checks intact.
In `@src/openhuman/agent/harness/session/turn/core.rs`:
- Around line 101-112: The helper replace_last_assistant_reply currently scans
backward and rewrites the first assistant message it finds, which can mutate an
older transcript entry instead of preserving the tail-only behavior. Update the
logic so it only replaces the last history entry when it is a ChatMessage with
role assistant, and otherwise use the documented append fallback by pushing a
new assistant message without touching earlier messages.
In `@src/openhuman/config/schema/agent.rs`:
- Around line 196-214: The contract helper in AgentContract is treating a blank
block_key as optional, which lets all_keys() and is_active() activate contracts
from required_keys alone. Update AgentContract::all_keys (and therefore
AgentContract::is_active) so a blank block_key makes the contract inactive, and
ensure find_required_block and synthesize_block continue to rely on that
behavior when deciding whether enforcement applies.
---
Nitpick comments:
In `@src/openhuman/agent/harness/session/turn_tests.rs`:
- Around line 998-1005: Update the repair-response tests in turn_tests.rs so
they actually exercise the new repair-usage accounting path instead of only
using ChatResponse with usage: None. Add repair responses with populated usage
data in the relevant test cases (including the one around the existing
ChatResponse fixture and the matching case near the second referenced block) so
the folding logic in turn/core.rs is covered and regressions in repair
token/cost accounting are detected.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: af5bd4d9-d630-4935-bf24-f9d31a4f9bda
📒 Files selected for processing (8)
src/openhuman/agent/harness/mod.rssrc/openhuman/agent/harness/required_output.rssrc/openhuman/agent/harness/session/turn/core.rssrc/openhuman/agent/harness/session/turn/session_io.rssrc/openhuman/agent/harness/session/turn_tests.rssrc/openhuman/config/mod.rssrc/openhuman/config/schema/agent.rssrc/openhuman/config/schema/mod.rs
…idation, cap-path enforcement, inert blank key - find_required_block: require the block in the leading position (first JSON object) so a block buried after another object is repaired, not accepted. - all_keys: a blank block_key makes the contract inert even if required_keys lists siblings, so the defining key can never be silently dropped. - Enforce required output on the cap-checkpoint reply too, not just the normal finish, by moving enforcement after the reply is resolved. - replace_last_assistant_reply: only rewrite the trailing assistant row (last_mut), else append — never mutate an older transcript entry. - Fix the synthesize test assertion: the fallback preserves the ORIGINAL turn reply, not the failed re-prompt. Add positional + inert-key regressions.
Review feedback addressed (commit 7d066dd)All 6 inline points were legitimate and are fixed:
Added regressions for the positional (buried-block rejected, leading-block accepted) and inert-blank-key cases. |
sanil-23
left a comment
There was a problem hiding this comment.
Review — fix(agent-harness): enforce required structured-output block every turn (#4387)
Thorough read of all 8 changed files plus the surrounding turn engine (session/turn/core.rs, session_io.rs, tinyagents/observability.rs, parse.rs) and issue #4117. Overall this is a clean, well-scoped, well-documented change — the pure primitives are nicely separated and unit-tested, the config is opt-in (None default, so zero blast radius for existing agents), the accounting folds the repair call's usage correctly, and the tier-1/tier-2 repair design is sensible. The is_active/all_keys normalization (trim + dedup + blank-block-key → inert) is careful and the edge-case unit tests are good.
A few things worth addressing before merge, none of which are hard blockers.
Major
1. Streaming divergence: the repaired reply is never streamed, so the UI shows text that differs from what's persisted/returned. session_io.rs:246-258 — reprompt_for_required_block calls the provider with stream: None, and enforce_required_output runs after the turn's original reply has already been streamed to the UI via AgentProgress::TextDelta (tinyagents/observability.rs:241-256). Contrast this with the sibling summarize_iteration_checkpoint (session_io.rs:92-116), which does wire a delta_tx so the checkpoint renders incrementally.
Consequences for a user-facing agent that enables required_output:
- Tier 1 (re-prompt succeeds): the user watched "Sure, I'll handle that." stream in, but
turn()returns and persists an entirely different re-prompted answer — silently swapped, not streamed. - Tier 2 (synthesize): a
{"thoughts":"", ...}block is prepended to the original prose, again not reflected in the stream.
Since required_output lives on generic AgentConfig, any agent (including interactive chat) can enable it. Please either (a) stream the tier-1 re-prompt the same way the checkpoint path does, or (b) document that this contract is intended only for non-user-visible/internal-routing agents and note the stream/return divergence. At minimum, call this out in the RequiredOutputContract doc comment.
2. Merge collision with #4386 (fix/GH-4093-final-answer). That PR modifies the same three files in the same finalization region (turn/core.rs, session_io.rs, turn_tests.rs) and also rewrites the final reply before a turn ends. Both mutate the accepted-reply path at the tail of the turn. Whichever lands second will conflict non-trivially — coordinate ordering and re-test the combined finalization path (final-answer synthesis from #4386 feeding into required-output enforcement here). The PR body's collision note mentions #4093/#4089 but understates that #4386 edits the identical lines.
Minor
3. Synthesized fallback satisfies the shape but not the intent of the contract. required_output.rs:79-84 maps every key to "". Downstream routing that keys off next_action gets an empty string — schema-valid but semantically inert, arguably as unusable as a missing block for a consumer that switches on the value. That's a reasonable last-resort, but worth a one-line comment acknowledging that tier-2 guarantees presence, not usefulness, so a future consumer doesn't assume a meaningful value.
4. The feature is dormant — no agent or config sets required_output. Grep shows nothing wires a contract (only tests construct one). The mechanism + unit/turn tests satisfy #4117's first and third acceptance items, but there's no end-to-end path exercised in production and no RPC/UI/about_app surface to enable it. Fine as an infra PR, but please confirm a follow-up will actually attach a contract to the target agent(s), otherwise this is unreachable code.
5. Test coverage gaps. The two turn tests cover re-prompt-recovery and synthesize-fallback, but not:
- the no-op path — a reply that already carries the block makes zero extra provider calls (the cost-critical fast path);
- the cap-checkpoint + required-output interaction (enforcement explicitly claims to run on
hit_cap, per the comment atcore.rs:1031-1038, but no test drives it); - a provider error during the re-prompt (
reprompt_for_required_blockErrarm) falling through to synthesis.
These are cheap to add given the existingSequenceProviderharness.
Questions
- Token/latency cost: each finalized turn whose reply omits the block pays a full-history re-prompt (
to_provider_messages(&self.history)+ repair instruction). For a weak model that omits regularly, that's an extra full-context call on every turn — and on a capped turn it stacks on top of the checkpoint call (N tool calls + checkpoint + repair). Is that acceptable for the intended agent, or should the re-prompt send a trimmed context? - Intermediate steps: #4117's example block is
{"thoughts", "next_action"}, which reads like a per-step reasoning block. This PR enforces only the final turn reply, not intermediate tool-calling model steps. Confirm the downstream consumer only reads the final reply and doesn't expect the block on each step.
Nitpicks
enforce_required_outputdoc says usage isNone"when the re-prompt call made no request or failed" — accurate, but the tier-2 path still returnsSome((synthesized, usage))with that possibly-Noneusage, i.e. a repair with no accounted cost when the re-prompt itself failed. Correct behavior; just slightly non-obvious.find_required_blockrelies onextract_json_valuesreturning the first parseable JSON (prose/stray{tolerated) — the "leading position" doc comment is accurate, nice.
Nice work overall — the separation of pure primitives from the provider-driven orchestration and the accounting/history-rewrite handling are the strong points. My main asks are the streaming/return divergence (#1) and coordinating the #4386 collision (#2).
Automated read-only review; not an approval.
sanil-23
left a comment
There was a problem hiding this comment.
Requesting changes on the strength of one substantive item; the rest are non-blocking (full detail in my review comment above).
Blocking:
- Streaming/return divergence (major #1).
reprompt_for_required_blockruns withstream: Noneafter the original reply has already been streamed to the UI viaTextDelta, so a tier-1 re-prompt silently returns/persists a different answer than the user watched stream in, and tier-2 prepends an unstreamed block. The siblingsummarize_iteration_checkpointalready streams its re-prompt — please mirror that, or explicitly scope+documentRequiredOutputContractto non-user-visible/internal-routing agents. As written this ships a visible/return mismatch for any interactive agent that enables the contract.
Please also address before merge (non-blocking but expected):
- Coordinate the #4386 collision (same three files, same finalization region).
- Add the missing tests: block-present → zero extra calls, cap-checkpoint + enforcement, and the re-prompt-error → synthesize fallback.
Everything else (synth semantics, dormant wiring, token cost, intermediate-step scope) is a comment/question, not a blocker. Mechanism, accounting, and history-rewrite are otherwise solid — happy to re-approve once #1 is resolved.
Closes #4117
Problem
Agents frequently omit a required structured-output block (e.g. the mandated JSON
thoughtsblock) that the contract says must be emitted every turn. Downstream parsing/routing that depends on it gets nothing.Enforcement approach
Add an opt-in
RequiredOutputContract(block_key+ additionalrequired_keys) onAgentConfig. When set to an active contract, the turn engine validates the final reply and repairs an omitted block before the turn is accepted:The trailing assistant message is rewritten to match the repaired reply (transcript / KV-cache stay consistent) and the repair call's usage is folded into the turn accounting.
required_outputdefaults toNone, so existing agents are completely unaffected.Tests
required_output.rs).Files
config/schema/agent.rs—RequiredOutputContracttype +AgentConfig::required_outputagent/harness/required_output.rs(new) — pure validate/synthesize/instruction helpers + unit testssession/turn/session_io.rs—enforce_required_output+reprompt_for_required_blocksession/turn/core.rs— enforcement hook on the final-reply path + trailing-message rewritesession/turn_tests.rs— regression testsAcceptance caveat
Acceptance item 2 ("constrain output via response format / grammar") is addressed via the validate-and-repair loop rather than a native provider constraint:
ChatRequestcurrently has noresponse_format/grammar field, so a provider-level constraint would require plumbing across every provider — out of scope for this localized fix. The repair loop makes the block un-skippable regardless of provider support.Collision note
Touches the harness turn engine (shared with #4093 / #4089); the diff is kept localized to the output-validation path and will rebase onto
upstream/mainat merge.Summary by CodeRabbit
New Features
Bug Fixes
Tests