Skip to content

fix(inference): SSE inactivity watchdog to stop research-agent RESPONSE hang (#4269)#4393

Merged
senamakel merged 2 commits into
tinyhumansai:mainfrom
sanil-23:fix/4269-research-response-watchdog
Jul 3, 2026
Merged

fix(inference): SSE inactivity watchdog to stop research-agent RESPONSE hang (#4269)#4393
senamakel merged 2 commits into
tinyhumansai:mainfrom
sanil-23:fix/4269-research-response-watchdog

Conversation

@sanil-23

@sanil-23 sanil-23 commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Adds a per-chunk SSE inactivity watchdog to the native streaming path (stream_native_chat) that aborts a stalled stream after a configurable idle window (default 90s) instead of parking indefinitely on bytes_stream.next().await.
  • The window resets on every received chunk, so a legitimately long response that keeps emitting tokens is never cut.
  • Applies the same idle bound to the downstream delta send, so a wedged UI/progress consumer can't hang the turn on a full delta channel either.
  • The abort classifies as retryable, so ReliableProvider replays the turn (degrading to non-streaming on retry) — matching existing recovery behaviour.
  • New env knob OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS (default 90, range 1–3600).

Problem

#4269 — the research agent intermittently hangs in the RESPONSE phase after tool calls complete.

Reproduced end-to-end against a staging build by driving the research agent via CDP and looping deep-research queries (arxiv paper + web search — the issue's own repro shape). The failure is an upstream SSE stall on the read side:

21:40:40  [stream] OpenHuman POST .../chat/completions (stream=true, tools=14)
21:42:40  [stream] streaming chat failed, falling back to non-streaming: error decoding response body

Exactly 120s apart: the SSE body goes silent mid-response and the reader parks on bytes_stream.next().await (compatible_stream_native.rs) until the whole-request timeout cuts it. On default config that is a ~2-minute freeze ("cursor blinks, no output"); but #3856 advises operators to raise OPENHUMAN_INFERENCE_TIMEOUT_SECS up to 3600s for long research turns — with that raised, this exact stall hangs for up to an hour = the reported indefinite hang. The stall class appeared on ~2 of every 3 research runs; two runs blew past a 10-minute cap.

The whole-request timeout is the wrong instrument: it cannot distinguish "stalled, zero tokens" from "valid long response still streaming", and operators are told to raise it. A per-token inactivity watchdog bounds the stall independent of that knob.

Solution

  • compatible_timeout.rs — new stream_idle_timeout() (env OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS, default 90s, range 1..=3600), reusing the existing resolver + OnceLock cache pattern.
  • compatible.rsOpenAiCompatibleProvider carries the idle window (defaulted from config); a #[cfg(test)] with_stream_idle_timeout injects a small value in tests.
  • compatible_stream_native.rs — wrap the SSE read in a per-chunk tokio::time::timeout that resets each iteration; route every delta through a new forward_delta helper that applies the same idle bound to the send (dropped receiver = benign Ok; idle timeout = retryable bail). Both bail messages are crafted so reliable::is_non_retryable classifies them retryable.
  • .env.example — document the knob.

Scope note: the observed failure is the silent-stall case (no finish_reason / [DONE] arrives — the body just dies). Honoring the in-band terminal marker to end instantly when an upstream does send [DONE] but lingers the socket is an orthogonal correctness improvement, deferred as a follow-up.

Tests

  • stream_watchdog_trips_on_stalled_read — raw-TCP server flushes 200 SSE headers then goes silent; asserts a retryable watchdog abort (well before the whole-request timeout).
  • stream_watchdog_resets_on_each_chunk — chunks arriving under the idle window stream to completion (no false cut — the no-regression guarantee).
  • stream_watchdog_trips_on_wedged_delta_consumer — a capacity-1, never-drained delta channel trips the send-side watchdog.
  • compatible_timeout resolver tests for the new bound + boundaries.

Submission Checklist

If a section does not apply, mark the item N/A with a one-line reason. Do not delete items.

  • Tests added or updated (happy path + failure/edge case) — stalled-read, reset-on-chunk (happy), wedged-consumer (edge).
  • Diff coverage ≥ 80% — changed lines covered by the new Rust tests (verified locally via cargo-llvm-cov + diff-cover).
  • Coverage matrix updated — N/A: behaviour-only inference-runtime fix, no feature row added/removed/renamed.
  • All affected feature IDs listed under ## Related — N/A: no coverage-matrix feature ID applies.
  • No new external network dependencies introduced — the new tests use an in-process raw-TCP server + wiremock (already a dev-dep).
  • Manual smoke checklist updated if this touches release-cut surfaces — N/A: no release-manual-smoke surface changed.
  • Linked issue closed via Closes #NNN in ## Related.

Impact

  • Runtime impact: inference streaming path (all providers via the OpenAI-compatible provider), especially long research turns.
  • Performance impact: removes the indefinite RESPONSE-phase hang; adds one tokio::time::timeout per SSE chunk (negligible).
  • Tradeoff: a genuinely silent stream is now aborted+retried after the idle window instead of after the whole-request timeout. Retry degrades to non-streaming (existing behaviour). No token loss on the happy path (deltas still stream; the watchdog only fires on a full idle window).
  • No persistence, migration, security, or network-dependency changes.

Related

Summary by CodeRabbit

  • New Features

    • Added a configurable stream-idle watchdog timeout for streamed responses via OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS (documented default and allowed range).
  • Bug Fixes

    • Prevented watchdog false-failures by immediately completing when the terminal [DONE] sentinel is received.
    • Improved handling of slow downstreams: dropped receivers no longer fail the stream, while sustained backpressure triggers a retryable watchdog error.
  • Tests

    • Added coverage for stream-idle timeouts, watchdog reset on each chunk, mid-body upstream stalls, delta backpressure behavior, and [DONE] completion behavior.

@sanil-23 sanil-23 requested a review from a team July 1, 2026 19:26
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 59ce25e7-4718-449e-a315-78bc329840d6

📥 Commits

Reviewing files that changed from the base of the PR and between 35be03b and 17f4eb8.

📒 Files selected for processing (5)
  • .env.example
  • src/openhuman/inference/provider/compatible.rs
  • src/openhuman/inference/provider/compatible_stream_native.rs
  • src/openhuman/inference/provider/compatible_tests.rs
  • src/openhuman/inference/provider/compatible_timeout.rs
✅ Files skipped from review due to trivial changes (1)
  • .env.example
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/openhuman/inference/provider/compatible.rs
  • src/openhuman/inference/provider/compatible_stream_native.rs
  • src/openhuman/inference/provider/compatible_timeout.rs
  • src/openhuman/inference/provider/compatible_tests.rs

📝 Walkthrough

Walkthrough

Adds a configurable per-chunk streaming idle timeout, wires it into OpenAI-compatible streaming, applies watchdogs to SSE reads and delta forwarding, and adds coverage for stalled reads, backpressure, dropped receivers, and terminal completion.

Changes

Stream idle watchdog

Layer / File(s) Summary
Stream idle timeout configuration
.env.example, src/openhuman/inference/provider/compatible_timeout.rs
Adds OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS docs, bounds, cached resolution, and timeout tests.
Provider wiring for stream_idle_timeout
src/openhuman/inference/provider/compatible.rs
Adds stream_idle_timeout to OpenAiCompatibleProvider, initializes it from config, and adds a test-only override.
SSE read and delta forwarding watchdog
src/openhuman/inference/provider/compatible_stream_native.rs
Wraps SSE reads and delta sends in timeouts, adds forward_delta, and stops immediately on [DONE].
Watchdog test coverage
src/openhuman/inference/provider/compatible_tests.rs
Adds helpers and tests for stalled reads, chunk resets, wedged consumers, dropped receivers, and lingering sockets after [DONE].

Estimated code review effort: 4 (Complex) | ~50 minutes

Sequence Diagram(s)

sequenceDiagram
  participant stream_native_chat
  participant bytes_stream
  participant forward_delta
  participant delta_tx

  loop per chunk
    stream_native_chat->>bytes_stream: next()
    alt chunk arrives before idle timeout
      bytes_stream-->>stream_native_chat: SSE chunk
      stream_native_chat->>forward_delta: send(delta)
      forward_delta->>delta_tx: send(delta)
      alt receiver closed
        delta_tx-->>forward_delta: dropped
        forward_delta-->>stream_native_chat: Ok(())
      else send succeeds
        delta_tx-->>forward_delta: sent
        forward_delta-->>stream_native_chat: Ok(())
      else backpressure timeout
        forward_delta-->>stream_native_chat: retryable error
      end
    else no chunk before idle timeout
      bytes_stream-->>stream_native_chat: timeout
      stream_native_chat-->>stream_native_chat: abort retryable watchdog error
    end
  end
  stream_native_chat->>stream_native_chat: stop on [DONE]
Loading

Poem

A rabbit watched the stream go still,
No token hopped across the hill.
Tick-tock went the idle clock,
And stalled streams found a safer dock. 🐇⏱️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Linked Issues check ❓ Inconclusive The watchdog fix and tests address the RESPONSE hang objective, but the 80% diff-coverage requirement cannot be verified from the provided context. Provide coverage results or CI output confirming the changed-lines coverage meets the 80% gate.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title is concise and accurately describes the main change: adding an SSE inactivity watchdog to fix RESPONSE hangs.
Out of Scope Changes check ✅ Passed All described changes support the SSE timeout fix, configuration, or tests; no unrelated code paths are evident.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/inference/provider/compatible_stream_native.rs`:
- Around line 263-293: The SSE loop in compatible_stream_native.rs is still
treating the terminal `[DONE]` event as non-final, so it re-arms the idle
watchdog and can incorrectly fail a completed response if the socket stays open.
Update the stream-reading logic around the `bytes_stream.next()` loop and
`[DONE]` handling so the loop exits immediately on the terminal sentinel instead
of continuing to another read. Add a regression test in the same provider path
that emits `[DONE]` and then keeps the connection open to verify the stream
completes without tripping `stream_idle_timeout`.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 82583067-bccd-4429-9c9a-93da830c0165

📥 Commits

Reviewing files that changed from the base of the PR and between 4c98a31 and 5da7bfd.

📒 Files selected for processing (5)
  • .env.example
  • src/openhuman/inference/provider/compatible.rs
  • src/openhuman/inference/provider/compatible_stream_native.rs
  • src/openhuman/inference/provider/compatible_tests.rs
  • src/openhuman/inference/provider/compatible_timeout.rs

Comment thread src/openhuman/inference/provider/compatible_stream_native.rs
sanil-23 added a commit to sanil-23/openhuman that referenced this pull request Jul 1, 2026
…e watchdog

CodeRabbit review on tinyhumansai#4393: after the terminal [DONE] sentinel the loop re-armed
the idle watchdog, so a provider that sends [DONE] but holds the socket open
would fail an already-complete response as a retryable stall. Break on [DONE];
add a regression that sends [DONE] then keeps the connection open.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sanil-23 added a commit to sanil-23/openhuman that referenced this pull request Jul 2, 2026
…e watchdog

CodeRabbit review on tinyhumansai#4393: after the terminal [DONE] sentinel the loop re-armed
the idle watchdog, so a provider that sends [DONE] but holds the socket open
would fail an already-complete response as a retryable stall. Break on [DONE];
add a regression that sends [DONE] then keeps the connection open.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sanil-23 sanil-23 force-pushed the fix/4269-research-response-watchdog branch from e0e5652 to 35be03b Compare July 2, 2026 06:53
sanil-23 and others added 2 commits July 3, 2026 11:49
…SE hang (tinyhumansai#4269)

The native streaming path parked on `bytes_stream.next().await` when an
upstream flushed 200 then went silent mid-response — cut only by the blunt
whole-request timeout, which tinyhumansai#3856 tells operators to raise up to 1h,
turning the stall into an indefinite hang. Add a per-chunk inactivity
watchdog that resets on every token and aborts a stalled stream with a
retryable error (ReliableProvider replays it); also bound the delta send so
a wedged consumer can't hang the turn on a full channel. New env knob
OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS (default 90s, range 1-3600).

Reproduced against staging via CDP: deep-research streams stalled the full
120s request-timeout window, surfacing as "error decoding response body".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e watchdog

CodeRabbit review on tinyhumansai#4393: after the terminal [DONE] sentinel the loop re-armed
the idle watchdog, so a provider that sends [DONE] but holds the socket open
would fail an already-complete response as a retryable stall. Break on [DONE];
add a regression that sends [DONE] then keeps the connection open.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sanil-23 sanil-23 force-pushed the fix/4269-research-response-watchdog branch from 35be03b to 17f4eb8 Compare July 3, 2026 06:22

@M3gA-Mind M3gA-Mind left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #4393 — SSE inactivity watchdog to stop research-agent RESPONSE hang (#4269)

Walkthrough

Wraps each SSE read in stream_native_chat with a per-chunk inactivity tokio::time::timeout (default 90s, env-overridable, resets on every chunk), and applies the same bound to each downstream delta send via a new forward_delta helper. A stall aborts as a retryable error so ReliableProvider replays the turn (degrading to non-streaming), instead of parking on next().await until the raised whole-request ceiling. Approach is sound, correctly scoped to the reproduced native path, and unusually well tested (stall, per-chunk reset, wedged consumer, dropped receiver, and the [DONE]-lingers regression). Solid work — no blockers from me.

Changes

File Summary
.env.example Documents new OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS knob.
compatible.rs Adds stream_idle_timeout field + test-only with_stream_idle_timeout setter.
compatible_stream_native.rs Read loop → watchdog-armed loop; [DONE] now break not continue; delta sends routed through new forward_delta.
compatible_timeout.rs stream_idle_timeout() resolver (default 90, 1–3600) + unit tests.
compatible_tests.rs 5 integration tests incl. a raw scripted-SSE TCP server for deterministic stalls.

Actionable comments (2)

⚠️ Major / Question

1. src/openhuman/inference/provider/compatible_stream.rs:32 — sibling SSE reader has the identical unguarded park

The other streaming path, sse_bytes_to_chunks, still does a bare while let Some(item) = bytes_stream.next().await with no inactivity bound. It's live — invoked from compatible_provider_impl.rs:1159 (stream_chat_with_system) and :1419 (stream_chat_with_history) — so the same upstream "flush 200 then go silent" stall that #4269 fixes on the native path can still hang a turn here until the whole-request ceiling.

Not necessarily blocking (the research-agent repro goes through stream_native_chat, which this PR covers), but worth confirming: are those two entrypoints reachable for the flows #4269 cares about? If yes, this fix is incomplete without the same watchdog there; if they're legacy/unused, a one-line note on the PR would close the scope question. A follow-up issue is a fine resolution either way.

💡 Consideration

2. src/openhuman/inference/provider/compatible_timeout.rs:33 — 90s idle vs providers that buffer reasoning silently

The comment argues reasoning pauses are safe because thinking streams as reasoning_content deltas that reset the window. That holds for managed/OpenAI-style reasoning, but some providers emit no deltas during a long think and then flush the answer. Against those, a >90s silent think would trip the watchdog and force an (unnecessary) retry. It degrades gracefully — retryable → non-streaming replay still returns an answer — and it's env-tunable, so this is a heads-up, not a defect. Consider a one-line caveat in the rustdoc so an operator debugging "why did my slow local reasoning model retry" finds the knob quickly.

Nitpicks (1)

  • compatible_stream_native.rs — the read loop copies let idle_window = self.stream_idle_timeout; while forward_delta reads self.stream_idle_timeout directly. Harmless, but using one form in both spots reads cleaner.

Verified / looks good

  • Retryable classification is correct. Neither watchdog message ("no response data…" / "delta channel stalled…") matches any is_non_retryable hint (no structured 4xx, no auth/quota/session-expired substrings), and both stream_watchdog_trips_* tests assert !is_non_retryable. So ReliableProvider genuinely replays.
  • [DONE]break 'stream is right, and the stream_stops_on_done_even_if_socket_lingers test nails the exact watchdog×terminal-sentinel interaction that would otherwise fail a completed response as a stall.
  • forward_delta preserves prior semantics for the two benign cases (success; dropped receiver → Ok), adding only the wedge timeout. The old let _ = send().await also awaited, so healthy consumers behave identically.
  • Bounds/env resolution reuse the existing resolve/parse_timeout_secs path; 0/out-of-range/typo all fall back to the default (can't accidentally disable the guard) — covered by stream_idle_parse_respects_bounds.
  • Timeout-cancelled bytes_stream.next() is dropped on the bail path, so there's no partial-read/data-loss concern.

Posted as a comment (not an approval) — leaving the approve/request-changes call to the maintainer.

@senamakel senamakel merged commit f1987a5 into tinyhumansai:main Jul 3, 2026
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Research agent intermittently hangs mid-run

3 participants