Skip to content

fix(agent-harness): enforce required structured-output block every turn#4387

Open
M3gA-Mind wants to merge 3 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/GH-4117-required-structured-output
Open

fix(agent-harness): enforce required structured-output block every turn#4387
M3gA-Mind wants to merge 3 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/GH-4117-required-structured-output

Conversation

@M3gA-Mind

@M3gA-Mind M3gA-Mind commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

Closes #4117

Problem

Agents frequently omit a required structured-output block (e.g. the mandated JSON thoughts block) that the contract says must be emitted every turn. Downstream parsing/routing that depends on it gets nothing.

Enforcement approach

Add an opt-in RequiredOutputContract (block_key + additional required_keys) on AgentConfig. When set to an active contract, the turn engine validates the final reply and repairs an omitted block before the turn is accepted:

  1. Validate presence + shape — the reply must contain a JSON object carrying every required key with a non-null value (reuses the harness JSON extractor, so fenced / inline / whole-object replies are all recognised).
  2. Repair (tier 1) — one corrective re-prompt with native tools disabled, seeded with the omitting reply + a corrective instruction; if it recovers a valid block, that becomes the reply.
  3. Repair (tier 2) — otherwise a minimal, schema-valid block is synthesized and prepended to the model's prose, so the accepted turn is never left without a well-formed block.

The trailing assistant message is rewritten to match the repaired reply (transcript / KV-cache stay consistent) and the repair call's usage is folded into the turn accounting. required_output defaults to None, so existing agents are completely unaffected.

Tests

  • Unit tests for the validator / synthesizer / repair-instruction (required_output.rs).
  • Turn regressions: omitted-block → recovered via re-prompt, and omitted-block → synthesized fallback when the re-prompt also omits it.

Files

  • config/schema/agent.rsRequiredOutputContract type + AgentConfig::required_output
  • agent/harness/required_output.rs (new) — pure validate/synthesize/instruction helpers + unit tests
  • session/turn/session_io.rsenforce_required_output + reprompt_for_required_block
  • session/turn/core.rs — enforcement hook on the final-reply path + trailing-message rewrite
  • session/turn_tests.rs — regression tests

Acceptance caveat

Acceptance item 2 ("constrain output via response format / grammar") is addressed via the validate-and-repair loop rather than a native provider constraint: ChatRequest currently has no response_format/grammar field, so a provider-level constraint would require plumbing across every provider — out of scope for this localized fix. The repair loop makes the block un-skippable regardless of provider support.

Collision note

Touches the harness turn engine (shared with #4093 / #4089); the diff is kept localized to the output-validation path and will rebase onto upstream/main at merge.

Summary by CodeRabbit

  • New Features

    • Added support for required structured output in agent responses, including configurable JSON block requirements.
    • Final replies can now be repaired automatically if the required content is missing or incomplete.
  • Bug Fixes

    • Improved turn handling so accepted responses always include the expected structured block.
    • If a retry still fails, the system now preserves the model’s text while adding a minimal valid block.
  • Tests

    • Added coverage for recovery flows, including retry-based repair and fallback synthesis.

…rn (tinyhumansai#4117)

Agents frequently omit a mandated JSON block (e.g. a `thoughts` block) that
the contract says must be emitted every turn, leaving downstream parsing with
nothing.

Add an opt-in `RequiredOutputContract` (block key + required sibling keys) on
`AgentConfig`. When set, the turn engine validates the final reply and repairs
an omitted block before the turn is accepted:
  1. one corrective re-prompt with tools disabled, then
  2. a synthesized minimal valid block prepended to the prose as a
     deterministic fallback.
The trailing assistant message is rewritten to match and the repair call's
usage is folded into the turn accounting. `None` (default) is a no-op, so
existing agents are unaffected.

Validation reuses the harness JSON extractor. Adds unit tests for the
validator/synthesizer and turn regressions for the omitted-block -> repaired
and omitted-block -> synthesized paths.
@M3gA-Mind M3gA-Mind requested a review from a team July 1, 2026 18:10
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 34 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: feac2ad4-9445-4bcf-a8fc-0c1fdb240514

📥 Commits

Reviewing files that changed from the base of the PR and between 852b7ce and 7d066dd.

📒 Files selected for processing (6)
  • src/openhuman/agent/harness/required_output.rs
  • src/openhuman/agent/harness/session/turn/core.rs
  • src/openhuman/agent/harness/session/turn/session_io.rs
  • src/openhuman/agent/harness/session/turn_tests.rs
  • src/openhuman/config/mod.rs
  • src/openhuman/config/schema/agent.rs
📝 Walkthrough

Walkthrough

Adds a RequiredOutputContract configuration type with active-state and key-normalization helpers, wired into AgentConfig. A new harness module validates, re-prompts, and synthesizes required JSON blocks during turn finalization, rewriting persisted assistant history, tracking repair usage/cost, and includes unit and turn-level tests.

Changes

Required Output Contract Enforcement

Layer / File(s) Summary
Contract config type and re-exports
src/openhuman/config/schema/agent.rs, src/openhuman/config/schema/mod.rs, src/openhuman/config/mod.rs
Adds RequiredOutputContract with new, all_keys, is_active; wires optional required_output field into AgentConfig (default None); re-exports the type through schema and config modules.
Validation/repair primitives
src/openhuman/agent/harness/required_output.rs, src/openhuman/agent/harness/mod.rs
Adds output_satisfies_contract, find_required_block, synthesize_block, repair_instruction, plus unit tests; registers the new required_output submodule.
Turn integration and history rewrite
src/openhuman/agent/harness/session/turn/session_io.rs, src/openhuman/agent/harness/session/turn/core.rs
Adds enforce_required_output and reprompt_for_required_block to attempt re-prompt then synthesis repair; replace_last_assistant_reply rewrites the trailing assistant history entry; turn finalization folds repair usage/cost into totals.
Turn-level regression tests
src/openhuman/agent/harness/session/turn_tests.rs
Adds tests for re-prompt recovery and synthesis fallback, asserting provider call counts and repaired history content.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
participant Turn as Turn Finalization
participant Agent
participant Provider
participant History

Turn->>Agent: enforce_required_output(reply, contract)
Agent->>Agent: output_satisfies_contract(reply)?
alt contract violated
    Agent->>Provider: reprompt_for_required_block(repair_instruction)
    Provider-->>Agent: repaired text, usage
    Agent->>Agent: recheck contract
    alt still violated
        Agent->>Agent: synthesize_block(contract), prepend to text
    end
    Agent-->>Turn: repaired text, usage
    Turn->>History: replace_last_assistant_reply(repaired text)
else contract satisfied
    Agent-->>Turn: None
end
Loading

Suggested labels: bug, agent

Poem

A rabbit checks each JSON thought,
"thoughts" and "next_action" — as they ought!
If prose alone comes hopping through,
I re-prompt once, then patch anew.
No empty block escapes my burrow's law. 🐇📋

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: enforcing required structured output in the agent harness each turn.
Linked Issues check ✅ Passed The new contract, validation/repair flow, and regression tests satisfy #4117's requirement to ensure the block is present every turn.
Out of Scope Changes check ✅ Passed The changes stay focused on required-output enforcement, config plumbing, and tests, with no unrelated additions.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot added agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. bug labels Jul 1, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 852b7cef5c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/openhuman/agent/harness/session/turn_tests.rs Outdated
Comment thread src/openhuman/agent/harness/session/turn/core.rs Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
src/openhuman/agent/harness/session/turn_tests.rs (1)

998-1005: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Please exercise the new repair-usage accounting path here.

Both repair responses use usage: None, so these regressions never cover the new token/cost folding added in src/openhuman/agent/harness/session/turn/core.rs. A break in repair accounting would still pass this suite.

Also applies to: 1064-1069

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/agent/harness/session/turn_tests.rs` around lines 998 - 1005,
Update the repair-response tests in turn_tests.rs so they actually exercise the
new repair-usage accounting path instead of only using ChatResponse with usage:
None. Add repair responses with populated usage data in the relevant test cases
(including the one around the existing ChatResponse fixture and the matching
case near the second referenced block) so the folding logic in turn/core.rs is
covered and regressions in repair token/cost accounting are detected.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/agent/harness/required_output.rs`:
- Around line 34-55: `find_required_block()` is too permissive because it
accepts the first matching JSON object anywhere in the reply, which lets
`output_satisfies_contract()` pass when the required block is buried after other
text. Update `find_required_block` (and any caller logic in
`output_satisfies_contract`/`enforce_required_output`) to only accept a required
block when it appears in the expected position for the reply shape, rather than
scanning all extracted JSON values indiscriminately; keep using
`RequiredOutputContract::all_keys` and `super::parse::extract_json_values`, but
add a positional/ordering check before returning the object.

In `@src/openhuman/agent/harness/session/turn_tests.rs`:
- Around line 1095-1104: The test in turn_tests::synthesize_reply is asserting
the wrong fallback prose: the synthesis path preserves the original model reply,
not the failed corrective re-prompt. Update the assertion near
extract_json_values and the response.contains check to match the preserved prose
from the initial response ("Working on it.") while keeping the existing thoughts
block and request-count checks intact.

In `@src/openhuman/agent/harness/session/turn/core.rs`:
- Around line 101-112: The helper replace_last_assistant_reply currently scans
backward and rewrites the first assistant message it finds, which can mutate an
older transcript entry instead of preserving the tail-only behavior. Update the
logic so it only replaces the last history entry when it is a ChatMessage with
role assistant, and otherwise use the documented append fallback by pushing a
new assistant message without touching earlier messages.

In `@src/openhuman/config/schema/agent.rs`:
- Around line 196-214: The contract helper in AgentContract is treating a blank
block_key as optional, which lets all_keys() and is_active() activate contracts
from required_keys alone. Update AgentContract::all_keys (and therefore
AgentContract::is_active) so a blank block_key makes the contract inactive, and
ensure find_required_block and synthesize_block continue to rely on that
behavior when deciding whether enforcement applies.

---

Nitpick comments:
In `@src/openhuman/agent/harness/session/turn_tests.rs`:
- Around line 998-1005: Update the repair-response tests in turn_tests.rs so
they actually exercise the new repair-usage accounting path instead of only
using ChatResponse with usage: None. Add repair responses with populated usage
data in the relevant test cases (including the one around the existing
ChatResponse fixture and the matching case near the second referenced block) so
the folding logic in turn/core.rs is covered and regressions in repair
token/cost accounting are detected.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: af5bd4d9-d630-4935-bf24-f9d31a4f9bda

📥 Commits

Reviewing files that changed from the base of the PR and between 4c98a31 and 852b7ce.

📒 Files selected for processing (8)
  • src/openhuman/agent/harness/mod.rs
  • src/openhuman/agent/harness/required_output.rs
  • src/openhuman/agent/harness/session/turn/core.rs
  • src/openhuman/agent/harness/session/turn/session_io.rs
  • src/openhuman/agent/harness/session/turn_tests.rs
  • src/openhuman/config/mod.rs
  • src/openhuman/config/schema/agent.rs
  • src/openhuman/config/schema/mod.rs

Comment thread src/openhuman/agent/harness/required_output.rs Outdated
Comment thread src/openhuman/agent/harness/session/turn_tests.rs Outdated
Comment thread src/openhuman/agent/harness/session/turn/core.rs Outdated
Comment thread src/openhuman/config/schema/agent.rs
M3gA-Mind added 2 commits July 1, 2026 23:48
…idation, cap-path enforcement, inert blank key

- find_required_block: require the block in the leading position (first JSON
  object) so a block buried after another object is repaired, not accepted.
- all_keys: a blank block_key makes the contract inert even if required_keys
  lists siblings, so the defining key can never be silently dropped.
- Enforce required output on the cap-checkpoint reply too, not just the normal
  finish, by moving enforcement after the reply is resolved.
- replace_last_assistant_reply: only rewrite the trailing assistant row
  (last_mut), else append — never mutate an older transcript entry.
- Fix the synthesize test assertion: the fallback preserves the ORIGINAL turn
  reply, not the failed re-prompt. Add positional + inert-key regressions.
@M3gA-Mind

Copy link
Copy Markdown
Collaborator Author

Review feedback addressed (commit 7d066dd)

All 6 inline points were legitimate and are fixed:

  1. Positional validation (required_output.rs) — find_required_block now requires the mandated block in the leading position (first extracted JSON object). Prose before the block is still accepted; a block buried after another JSON object is rejected and repaired.
  2. Blank block_key is inert (config/schema/agent.rs) — all_keys() returns empty when block_key is blank, so a contract listing only required_keys can never accept/synthesize a block missing the defining key.
  3. Cap-path enforcement (turn/core.rs) — enforcement now runs on the resolved reply for both the normal-finish and cap-checkpoint paths, so a capped turn's checkpoint is also validated/repaired before it's returned and persisted.
  4. Trailing-row rewrite (turn/core.rs) — replace_last_assistant_reply uses history.last_mut(); only the tail assistant row is rewritten, otherwise the append fallback runs (never mutates an older entry).
  5. Test assertion fix (turn_tests.rs) — the synthesize fallback preserves the original turn reply, so the test asserts contains("Working on it.") and that the leading JSON object carries the block.

Added regressions for the positional (buried-block rejected, leading-block accepted) and inert-blank-key cases. cargo check + cargo fmt --check are clean.

@sanil-23 sanil-23 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — fix(agent-harness): enforce required structured-output block every turn (#4387)

Thorough read of all 8 changed files plus the surrounding turn engine (session/turn/core.rs, session_io.rs, tinyagents/observability.rs, parse.rs) and issue #4117. Overall this is a clean, well-scoped, well-documented change — the pure primitives are nicely separated and unit-tested, the config is opt-in (None default, so zero blast radius for existing agents), the accounting folds the repair call's usage correctly, and the tier-1/tier-2 repair design is sensible. The is_active/all_keys normalization (trim + dedup + blank-block-key → inert) is careful and the edge-case unit tests are good.

A few things worth addressing before merge, none of which are hard blockers.

Major

1. Streaming divergence: the repaired reply is never streamed, so the UI shows text that differs from what's persisted/returned. session_io.rs:246-258reprompt_for_required_block calls the provider with stream: None, and enforce_required_output runs after the turn's original reply has already been streamed to the UI via AgentProgress::TextDelta (tinyagents/observability.rs:241-256). Contrast this with the sibling summarize_iteration_checkpoint (session_io.rs:92-116), which does wire a delta_tx so the checkpoint renders incrementally.

Consequences for a user-facing agent that enables required_output:

  • Tier 1 (re-prompt succeeds): the user watched "Sure, I'll handle that." stream in, but turn() returns and persists an entirely different re-prompted answer — silently swapped, not streamed.
  • Tier 2 (synthesize): a {"thoughts":"", ...} block is prepended to the original prose, again not reflected in the stream.

Since required_output lives on generic AgentConfig, any agent (including interactive chat) can enable it. Please either (a) stream the tier-1 re-prompt the same way the checkpoint path does, or (b) document that this contract is intended only for non-user-visible/internal-routing agents and note the stream/return divergence. At minimum, call this out in the RequiredOutputContract doc comment.

2. Merge collision with #4386 (fix/GH-4093-final-answer). That PR modifies the same three files in the same finalization region (turn/core.rs, session_io.rs, turn_tests.rs) and also rewrites the final reply before a turn ends. Both mutate the accepted-reply path at the tail of the turn. Whichever lands second will conflict non-trivially — coordinate ordering and re-test the combined finalization path (final-answer synthesis from #4386 feeding into required-output enforcement here). The PR body's collision note mentions #4093/#4089 but understates that #4386 edits the identical lines.

Minor

3. Synthesized fallback satisfies the shape but not the intent of the contract. required_output.rs:79-84 maps every key to "". Downstream routing that keys off next_action gets an empty string — schema-valid but semantically inert, arguably as unusable as a missing block for a consumer that switches on the value. That's a reasonable last-resort, but worth a one-line comment acknowledging that tier-2 guarantees presence, not usefulness, so a future consumer doesn't assume a meaningful value.

4. The feature is dormant — no agent or config sets required_output. Grep shows nothing wires a contract (only tests construct one). The mechanism + unit/turn tests satisfy #4117's first and third acceptance items, but there's no end-to-end path exercised in production and no RPC/UI/about_app surface to enable it. Fine as an infra PR, but please confirm a follow-up will actually attach a contract to the target agent(s), otherwise this is unreachable code.

5. Test coverage gaps. The two turn tests cover re-prompt-recovery and synthesize-fallback, but not:

  • the no-op path — a reply that already carries the block makes zero extra provider calls (the cost-critical fast path);
  • the cap-checkpoint + required-output interaction (enforcement explicitly claims to run on hit_cap, per the comment at core.rs:1031-1038, but no test drives it);
  • a provider error during the re-prompt (reprompt_for_required_block Err arm) falling through to synthesis.
    These are cheap to add given the existing SequenceProvider harness.

Questions

  • Token/latency cost: each finalized turn whose reply omits the block pays a full-history re-prompt (to_provider_messages(&self.history) + repair instruction). For a weak model that omits regularly, that's an extra full-context call on every turn — and on a capped turn it stacks on top of the checkpoint call (N tool calls + checkpoint + repair). Is that acceptable for the intended agent, or should the re-prompt send a trimmed context?
  • Intermediate steps: #4117's example block is {"thoughts", "next_action"}, which reads like a per-step reasoning block. This PR enforces only the final turn reply, not intermediate tool-calling model steps. Confirm the downstream consumer only reads the final reply and doesn't expect the block on each step.

Nitpicks

  • enforce_required_output doc says usage is None "when the re-prompt call made no request or failed" — accurate, but the tier-2 path still returns Some((synthesized, usage)) with that possibly-None usage, i.e. a repair with no accounted cost when the re-prompt itself failed. Correct behavior; just slightly non-obvious.
  • find_required_block relies on extract_json_values returning the first parseable JSON (prose/stray { tolerated) — the "leading position" doc comment is accurate, nice.

Nice work overall — the separation of pure primitives from the provider-driven orchestration and the accounting/history-rewrite handling are the strong points. My main asks are the streaming/return divergence (#1) and coordinating the #4386 collision (#2).

Automated read-only review; not an approval.

@sanil-23 sanil-23 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes on the strength of one substantive item; the rest are non-blocking (full detail in my review comment above).

Blocking:

  • Streaming/return divergence (major #1). reprompt_for_required_block runs with stream: None after the original reply has already been streamed to the UI via TextDelta, so a tier-1 re-prompt silently returns/persists a different answer than the user watched stream in, and tier-2 prepends an unstreamed block. The sibling summarize_iteration_checkpoint already streams its re-prompt — please mirror that, or explicitly scope+document RequiredOutputContract to non-user-visible/internal-routing agents. As written this ships a visible/return mismatch for any interactive agent that enables the contract.

Please also address before merge (non-blocking but expected):

  • Coordinate the #4386 collision (same three files, same finalization region).
  • Add the missing tests: block-present → zero extra calls, cap-checkpoint + enforcement, and the re-prompt-error → synthesize fallback.

Everything else (synth semantics, dormant wiring, token cost, intermediate-step scope) is a comment/question, not a blocker. Mechanism, accounting, and history-rewrite are otherwise solid — happy to re-approve once #1 is resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. bug

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

[Harness] Required structured output (JSON 'thoughts' block) frequently omitted

2 participants