fix(inference): SSE inactivity watchdog to stop research-agent RESPONSE hang (#4269) by sanil-23 · Pull Request #4393 · tinyhumansai/openhuman

sanil-23 · 2026-07-01T19:25:59Z

Summary

Adds a per-chunk SSE inactivity watchdog to the native streaming path (stream_native_chat) that aborts a stalled stream after a configurable idle window (default 90s) instead of parking indefinitely on bytes_stream.next().await.
The window resets on every received chunk, so a legitimately long response that keeps emitting tokens is never cut.
Applies the same idle bound to the downstream delta send, so a wedged UI/progress consumer can't hang the turn on a full delta channel either.
The abort classifies as retryable, so ReliableProvider replays the turn (degrading to non-streaming on retry) — matching existing recovery behaviour.
New env knob OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS (default 90, range 1–3600).

Problem

#4269 — the research agent intermittently hangs in the RESPONSE phase after tool calls complete.

Reproduced end-to-end against a staging build by driving the research agent via CDP and looping deep-research queries (arxiv paper + web search — the issue's own repro shape). The failure is an upstream SSE stall on the read side:

21:40:40  [stream] OpenHuman POST .../chat/completions (stream=true, tools=14)
21:42:40  [stream] streaming chat failed, falling back to non-streaming: error decoding response body

Exactly 120s apart: the SSE body goes silent mid-response and the reader parks on bytes_stream.next().await (compatible_stream_native.rs) until the whole-request timeout cuts it. On default config that is a ~2-minute freeze ("cursor blinks, no output"); but #3856 advises operators to raise OPENHUMAN_INFERENCE_TIMEOUT_SECS up to 3600s for long research turns — with that raised, this exact stall hangs for up to an hour = the reported indefinite hang. The stall class appeared on ~2 of every 3 research runs; two runs blew past a 10-minute cap.

The whole-request timeout is the wrong instrument: it cannot distinguish "stalled, zero tokens" from "valid long response still streaming", and operators are told to raise it. A per-token inactivity watchdog bounds the stall independent of that knob.

Solution

compatible_timeout.rs — new stream_idle_timeout() (env OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS, default 90s, range 1..=3600), reusing the existing resolver + OnceLock cache pattern.
compatible.rs — OpenAiCompatibleProvider carries the idle window (defaulted from config); a #[cfg(test)] with_stream_idle_timeout injects a small value in tests.
compatible_stream_native.rs — wrap the SSE read in a per-chunk tokio::time::timeout that resets each iteration; route every delta through a new forward_delta helper that applies the same idle bound to the send (dropped receiver = benign Ok; idle timeout = retryable bail). Both bail messages are crafted so reliable::is_non_retryable classifies them retryable.
.env.example — document the knob.

Scope note: the observed failure is the silent-stall case (no finish_reason / [DONE] arrives — the body just dies). Honoring the in-band terminal marker to end instantly when an upstream does send [DONE] but lingers the socket is an orthogonal correctness improvement, deferred as a follow-up.

Tests

stream_watchdog_trips_on_stalled_read — raw-TCP server flushes 200 SSE headers then goes silent; asserts a retryable watchdog abort (well before the whole-request timeout).
stream_watchdog_resets_on_each_chunk — chunks arriving under the idle window stream to completion (no false cut — the no-regression guarantee).
stream_watchdog_trips_on_wedged_delta_consumer — a capacity-1, never-drained delta channel trips the send-side watchdog.
compatible_timeout resolver tests for the new bound + boundaries.

Submission Checklist

If a section does not apply, mark the item N/A with a one-line reason. Do not delete items.

Tests added or updated (happy path + failure/edge case) — stalled-read, reset-on-chunk (happy), wedged-consumer (edge).
Diff coverage ≥ 80% — changed lines covered by the new Rust tests (verified locally via cargo-llvm-cov + diff-cover).
Coverage matrix updated — N/A: behaviour-only inference-runtime fix, no feature row added/removed/renamed.
All affected feature IDs listed under ## Related — N/A: no coverage-matrix feature ID applies.
No new external network dependencies introduced — the new tests use an in-process raw-TCP server + wiremock (already a dev-dep).
Manual smoke checklist updated if this touches release-cut surfaces — N/A: no release-manual-smoke surface changed.
Linked issue closed via Closes #NNN in ## Related.

Impact

Runtime impact: inference streaming path (all providers via the OpenAI-compatible provider), especially long research turns.
Performance impact: removes the indefinite RESPONSE-phase hang; adds one tokio::time::timeout per SSE chunk (negligible).
Tradeoff: a genuinely silent stream is now aborted+retried after the idle window instead of after the whole-request timeout. Retry degrades to non-streaming (existing behaviour). No token loss on the happy path (deltas still stream; the watchdog only fires on a full idle window).
No persistence, migration, security, or network-dependency changes.

Closes Research agent intermittently hangs mid-run #4269
Context: Feature Request: Add configuration options for all timeout values. #3856 (whole-request timeout raised for long research turns — the reason the stall presents as indefinite).
Alternative approach fix(agent): avoid RESPONSE stalls on progress backpressure #4356 (non-blocking delta forwarders) does not address the read-side stall reproduced here — see my review on that PR.

Summary by CodeRabbit

New Features
- Added a configurable stream-idle watchdog timeout for streamed responses via OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS (documented default and allowed range).
Bug Fixes
- Prevented watchdog false-failures by immediately completing when the terminal [DONE] sentinel is received.
- Improved handling of slow downstreams: dropped receivers no longer fail the stream, while sustained backpressure triggers a retryable watchdog error.
Tests
- Added coverage for stream-idle timeouts, watchdog reset on each chunk, mid-body upstream stalls, delta backpressure behavior, and [DONE] completion behavior.

coderabbitai · 2026-07-01T19:26:27Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 59ce25e7-4718-449e-a315-78bc329840d6

📥 Commits

Reviewing files that changed from the base of the PR and between 35be03b and 17f4eb8.

📒 Files selected for processing (5)

.env.example
src/openhuman/inference/provider/compatible.rs
src/openhuman/inference/provider/compatible_stream_native.rs
src/openhuman/inference/provider/compatible_tests.rs
src/openhuman/inference/provider/compatible_timeout.rs

✅ Files skipped from review due to trivial changes (1)

.env.example

🚧 Files skipped from review as they are similar to previous changes (4)

src/openhuman/inference/provider/compatible.rs
src/openhuman/inference/provider/compatible_stream_native.rs
src/openhuman/inference/provider/compatible_timeout.rs
src/openhuman/inference/provider/compatible_tests.rs

📝 Walkthrough

Walkthrough

Adds a configurable per-chunk streaming idle timeout, wires it into OpenAI-compatible streaming, applies watchdogs to SSE reads and delta forwarding, and adds coverage for stalled reads, backpressure, dropped receivers, and terminal completion.

Changes

Stream idle watchdog

Layer / File(s)	Summary
Stream idle timeout configuration `.env.example`, `src/openhuman/inference/provider/compatible_timeout.rs`	Adds `OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS` docs, bounds, cached resolution, and timeout tests.
Provider wiring for stream_idle_timeout `src/openhuman/inference/provider/compatible.rs`	Adds `stream_idle_timeout` to `OpenAiCompatibleProvider`, initializes it from config, and adds a test-only override.
SSE read and delta forwarding watchdog `src/openhuman/inference/provider/compatible_stream_native.rs`	Wraps SSE reads and delta sends in timeouts, adds `forward_delta`, and stops immediately on `[DONE]`.
Watchdog test coverage `src/openhuman/inference/provider/compatible_tests.rs`	Adds helpers and tests for stalled reads, chunk resets, wedged consumers, dropped receivers, and lingering sockets after `[DONE]`.

Estimated code review effort: 4 (Complex) | ~50 minutes

Sequence Diagram(s)

sequenceDiagram
  participant stream_native_chat
  participant bytes_stream
  participant forward_delta
  participant delta_tx

  loop per chunk
    stream_native_chat->>bytes_stream: next()
    alt chunk arrives before idle timeout
      bytes_stream-->>stream_native_chat: SSE chunk
      stream_native_chat->>forward_delta: send(delta)
      forward_delta->>delta_tx: send(delta)
      alt receiver closed
        delta_tx-->>forward_delta: dropped
        forward_delta-->>stream_native_chat: Ok(())
      else send succeeds
        delta_tx-->>forward_delta: sent
        forward_delta-->>stream_native_chat: Ok(())
      else backpressure timeout
        forward_delta-->>stream_native_chat: retryable error
      end
    else no chunk before idle timeout
      bytes_stream-->>stream_native_chat: timeout
      stream_native_chat-->>stream_native_chat: abort retryable watchdog error
    end
  end
  stream_native_chat->>stream_native_chat: stop on [DONE]

Poem

A rabbit watched the stream go still,
No token hopped across the hill.
Tick-tock went the idle clock,
And stalled streams found a safer dock. 🐇⏱️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Linked Issues check	❓ Inconclusive	The watchdog fix and tests address the RESPONSE hang objective, but the 80% diff-coverage requirement cannot be verified from the provided context.	Provide coverage results or CI output confirming the changed-lines coverage meets the 80% gate.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title is concise and accurately describes the main change: adding an SSE inactivity watchdog to fix RESPONSE hangs.
Out of Scope Changes check	✅ Passed	All described changes support the SSE timeout fix, configuration, or tests; no unrelated code paths are evident.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/inference/provider/compatible_stream_native.rs`:
- Around line 263-293: The SSE loop in compatible_stream_native.rs is still
treating the terminal `[DONE]` event as non-final, so it re-arms the idle
watchdog and can incorrectly fail a completed response if the socket stays open.
Update the stream-reading logic around the `bytes_stream.next()` loop and
`[DONE]` handling so the loop exits immediately on the terminal sentinel instead
of continuing to another read. Add a regression test in the same provider path
that emits `[DONE]` and then keeps the connection open to verify the stream
completes without tripping `stream_idle_timeout`.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 82583067-bccd-4429-9c9a-93da830c0165

📥 Commits

Reviewing files that changed from the base of the PR and between 4c98a31 and 5da7bfd.

📒 Files selected for processing (5)

.env.example
src/openhuman/inference/provider/compatible.rs
src/openhuman/inference/provider/compatible_stream_native.rs
src/openhuman/inference/provider/compatible_tests.rs
src/openhuman/inference/provider/compatible_timeout.rs

…e watchdog CodeRabbit review on tinyhumansai#4393: after the terminal [DONE] sentinel the loop re-armed the idle watchdog, so a provider that sends [DONE] but holds the socket open would fail an already-complete response as a retryable stall. Break on [DONE]; add a regression that sends [DONE] then keeps the connection open. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…SE hang (tinyhumansai#4269) The native streaming path parked on `bytes_stream.next().await` when an upstream flushed 200 then went silent mid-response — cut only by the blunt whole-request timeout, which tinyhumansai#3856 tells operators to raise up to 1h, turning the stall into an indefinite hang. Add a per-chunk inactivity watchdog that resets on every token and aborts a stalled stream with a retryable error (ReliableProvider replays it); also bound the delta send so a wedged consumer can't hang the turn on a full channel. New env knob OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS (default 90s, range 1-3600). Reproduced against staging via CDP: deep-research streams stalled the full 120s request-timeout window, surfacing as "error decoding response body". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…e watchdog CodeRabbit review on tinyhumansai#4393: after the terminal [DONE] sentinel the loop re-armed the idle watchdog, so a provider that sends [DONE] but holds the socket open would fail an already-complete response as a retryable stall. Break on [DONE]; add a regression that sends [DONE] then keeps the connection open. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

M3gA-Mind

PR #4393 — SSE inactivity watchdog to stop research-agent RESPONSE hang (#4269)

Walkthrough

Wraps each SSE read in stream_native_chat with a per-chunk inactivity tokio::time::timeout (default 90s, env-overridable, resets on every chunk), and applies the same bound to each downstream delta send via a new forward_delta helper. A stall aborts as a retryable error so ReliableProvider replays the turn (degrading to non-streaming), instead of parking on next().await until the raised whole-request ceiling. Approach is sound, correctly scoped to the reproduced native path, and unusually well tested (stall, per-chunk reset, wedged consumer, dropped receiver, and the [DONE]-lingers regression). Solid work — no blockers from me.

Changes

File	Summary
`.env.example`	Documents new `OPENHUMAN_INFERENCE_STREAM_IDLE_TIMEOUT_SECS` knob.
`compatible.rs`	Adds `stream_idle_timeout` field + test-only `with_stream_idle_timeout` setter.
`compatible_stream_native.rs`	Read loop → watchdog-armed `loop`; `[DONE]` now `break` not `continue`; delta sends routed through new `forward_delta`.
`compatible_timeout.rs`	`stream_idle_timeout()` resolver (default 90, 1–3600) + unit tests.
`compatible_tests.rs`	5 integration tests incl. a raw scripted-SSE TCP server for deterministic stalls.

Actionable comments (2)

⚠️ Major / Question

1. `src/openhuman/inference/provider/compatible_stream.rs:32` — sibling SSE reader has the identical unguarded park

The other streaming path, sse_bytes_to_chunks, still does a bare while let Some(item) = bytes_stream.next().await with no inactivity bound. It's live — invoked from compatible_provider_impl.rs:1159 (stream_chat_with_system) and :1419 (stream_chat_with_history) — so the same upstream "flush 200 then go silent" stall that #4269 fixes on the native path can still hang a turn here until the whole-request ceiling.

Not necessarily blocking (the research-agent repro goes through stream_native_chat, which this PR covers), but worth confirming: are those two entrypoints reachable for the flows #4269 cares about? If yes, this fix is incomplete without the same watchdog there; if they're legacy/unused, a one-line note on the PR would close the scope question. A follow-up issue is a fine resolution either way.

💡 Consideration

2. `src/openhuman/inference/provider/compatible_timeout.rs:33` — 90s idle vs providers that buffer reasoning silently

The comment argues reasoning pauses are safe because thinking streams as reasoning_content deltas that reset the window. That holds for managed/OpenAI-style reasoning, but some providers emit no deltas during a long think and then flush the answer. Against those, a >90s silent think would trip the watchdog and force an (unnecessary) retry. It degrades gracefully — retryable → non-streaming replay still returns an answer — and it's env-tunable, so this is a heads-up, not a defect. Consider a one-line caveat in the rustdoc so an operator debugging "why did my slow local reasoning model retry" finds the knob quickly.

Nitpicks (1)

compatible_stream_native.rs — the read loop copies let idle_window = self.stream_idle_timeout; while forward_delta reads self.stream_idle_timeout directly. Harmless, but using one form in both spots reads cleaner.

Verified / looks good

Retryable classification is correct. Neither watchdog message ("no response data…" / "delta channel stalled…") matches any is_non_retryable hint (no structured 4xx, no auth/quota/session-expired substrings), and both stream_watchdog_trips_* tests assert !is_non_retryable. So ReliableProvider genuinely replays.
[DONE] → break 'stream is right, and the stream_stops_on_done_even_if_socket_lingers test nails the exact watchdog×terminal-sentinel interaction that would otherwise fail a completed response as a stall.
forward_delta preserves prior semantics for the two benign cases (success; dropped receiver → Ok), adding only the wedge timeout. The old let _ = send().await also awaited, so healthy consumers behave identically.
Bounds/env resolution reuse the existing resolve/parse_timeout_secs path; 0/out-of-range/typo all fall back to the default (can't accidentally disable the guard) — covered by stream_idle_parse_respects_bounds.
Timeout-cancelled bytes_stream.next() is dropped on the bail path, so there's no partial-read/data-loss concern.

Posted as a comment (not an approval) — leaving the approve/request-changes call to the maintainer.

sanil-23 requested a review from a team July 1, 2026 19:26

coderabbitai Bot requested changes Jul 1, 2026

View reviewed changes

Comment thread src/openhuman/inference/provider/compatible_stream_native.rs

coderabbitai Bot approved these changes Jul 1, 2026

View reviewed changes

sanil-23 force-pushed the fix/4269-research-response-watchdog branch from e0e5652 to 35be03b Compare July 2, 2026 06:53

sanil-23 mentioned this pull request Jul 2, 2026

fix(tinyagents): keep attempted tool name in the timeline for unavailable tools #4419

Merged

7 tasks

sanil-23 and others added 2 commits July 3, 2026 11:49

sanil-23 force-pushed the fix/4269-research-response-watchdog branch from 35be03b to 17f4eb8 Compare July 3, 2026 06:22

M3gA-Mind reviewed Jul 3, 2026

View reviewed changes

senamakel merged commit f1987a5 into tinyhumansai:main Jul 3, 2026
19 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(inference): SSE inactivity watchdog to stop research-agent RESPONSE hang (#4269)#4393

fix(inference): SSE inactivity watchdog to stop research-agent RESPONSE hang (#4269)#4393
senamakel merged 2 commits into
tinyhumansai:mainfrom
sanil-23:fix/4269-research-response-watchdog

sanil-23 commented Jul 1, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

M3gA-Mind left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

sanil-23 commented Jul 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Tests

Submission Checklist

Impact

Related

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

M3gA-Mind left a comment

Choose a reason for hiding this comment

PR #4393 — SSE inactivity watchdog to stop research-agent RESPONSE hang (#4269)

Walkthrough

Changes

Actionable comments (2)

⚠️ Major / Question

1. src/openhuman/inference/provider/compatible_stream.rs:32 — sibling SSE reader has the identical unguarded park

💡 Consideration

2. src/openhuman/inference/provider/compatible_timeout.rs:33 — 90s idle vs providers that buffer reasoning silently

Nitpicks (1)

Verified / looks good

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sanil-23 commented Jul 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading

1. `src/openhuman/inference/provider/compatible_stream.rs:32` — sibling SSE reader has the identical unguarded park

2. `src/openhuman/inference/provider/compatible_timeout.rs:33` — 90s idle vs providers that buffer reasoning silently