fix(retriever): AbortSignal threading + embedQuery BM25 fallback (PR668/PR708 clean rebuild)#784
fix(retriever): AbortSignal threading + embedQuery BM25 fallback (PR668/PR708 clean rebuild)#784jlin53882 wants to merge 12 commits into
Conversation
…68/PR708 clean rebuild) Rebuild of PRs CortexReach#668 and CortexReach#708 with cleaner scope. Changes: - Add signal?: AbortSignal to RetrievalContext interface - Add signal param to vectorOnlyRetrieval + hybridRetrieval - Pass signal to embedQuery(query, signal) for true cancellation - Add try/catch in embedQuery: catch → BM25 fallback (James's design) - Remove useLightweightAutoRecall (replaced by fallback design) - Fix retrieveWithTrace() missing mode/diagnostics declarations Design: normal path uses vector/hybrid; if embed fails (timeout/network/API error), fallback to BM25 instead of returning nothing. Doesn't change default behavior. Tests: - retriever-auto-recall-signal.test.mjs: signal propagation + BM25 fallback Refs: PR CortexReach#668, PR CortexReach#708, PR CortexReach#746
|
Review finding: P2: aborted embedding requests still continue into BM25 fallback work The new
If the caller aborts the signal because auto-recall timed out, this path still performs store reads and may return fallback results instead of stopping promptly. That undercuts the cancellation goal and can keep the session lock active after the timeout. Abort errors should probably be rethrown or explicitly short-circuited before attempting BM25 fallback; the fallback should be reserved for non-abort embedding failures. |
When embedQuery() throws due to an abort signal, we now re-throw immediately instead of wasting resources doing BM25 fallback retrieval. Only BM25 fallback is attempted for non-abort errors (network/API failures). Fixes PR784 review comment by TurboTheTurtle: "aborted embedding requests still continue into BM25 fallback work" Updated tests: - AbortError re-thrown in vectorOnlyRetrieval (signal already aborted) - AbortError re-thrown in hybridRetrieval (signal already aborted) - BM25 fallback used for non-abort errors (network error, signal not aborted)
Review fix applied ✅已根據你的 comment 修復。 問題} catch (embedError) {
// 所有的 embed error 都走這裡 — 包括 AbortError
const bm25Results = await this.bm25OnlyRetrieval(...); // ← 不該跑的浪費
...
}當用戶超時取消( 修復在 catch 區塊最前面加了一個 guard: } catch (embedError) {
// Only do BM25 fallback for non-abort errors.
// If the caller aborted the signal, re-throw immediately.
if (signal?.aborted) {
throw embedError; // ← 新的判斷
}
// Fallback to BM25 for genuine failures (network/API errors)
...
}三種情境現在的行為
commit: |
為什麼用 BM25 fallback 而不是 retry?好問題。這是刻意的設計取捨。 兩個選項Retry(重試):embedQuery 失敗 → 等一下再試 → 可能成功 → 可能還是失敗 → 使用者等到超時還是沒結果 BM25 Fallback:embedQuery 失敗 → 立刻放棄嵌入 → 用本地 BM25(不需要網路)→ 毫秒級返回結果 為什麼在 auto-recall 語境下選 BM25 fallback
BM25 fallback 的代價品質稍微低一點(語意 vs 關鍵字)。但:
什麼時候適合 retry?
在 auto-recall 的語境下,BM25 fallback 是合理選擇。 |
Code quality fixes — defensive codingAdded null-check guards in the BM25 fallback path (two locations) to prevent potential runtime crashes if ChangesvectorOnlyRetrieval (line 751-761) // Before
diagnostics.bm25Query = query;
...
diagnostics.dropSummary = buildDropSummary(diagnostics);
this.lastDiagnostics = diagnostics;
// After
if (diagnostics) diagnostics.bm25Query = query;
...
if (diagnostics) {
diagnostics.dropSummary = buildDropSummary(diagnostics);
this.lastDiagnostics = diagnostics;
}hybridRetrieval (line 968-976) — same pattern applied. Why
Severity🟡 Low — no functional bug today, purely defensive. Does not affect the core signal-threading or BM25 fallback logic. No test changes needed. Commit: |
|
We're experiencing the exact issue described in this PR — auto-recall timing out after 30s and stale locks persisting across sessions. This is happening in our production environment on OpenClaw 2026.5.x. Would love to see this merged soon! Thanks for the great work. |
|
Thanks for confirming the production impact. For maintainers: this report lines up with the timeout/stale-session behavior that #784 and #790 address from complementary angles. #784 makes cancellation propagate through retrieval, while #790 prevents late auto-recall work from being injected after the timeout boundary. I’d treat these as linked fixes for the OpenClaw 2026.5.x symptom. |
|
Please rebase onto latest After resolving, please run: npm run test:packaging-and-workflow
npm run test:core-regressionThis fix is high priority once the branch is clean and checks are green, because it lines up with the reported production timeout/stale-lock symptom. Ping when CI is green. |
Issue CortexReach#786 findings addressed: - upgrader: text now uses l2_content (not l0_abstract) for BM25/FTS - compactor: buildMergedEntry produces proper L0/L1/L2 metadata - retriever: rerank topN capped at candidatePoolSize PR: CortexReach#789
…o buildRerankRequest This ensures ALL rerank providers are limited to candidatePoolSize, not just those that support topN in their API body (jina/pinecone/voyage). tei and dashscope ignore the topN parameter, so previously they would receive all candidates regardless of candidatePoolSize setting. Now documents are sliced BEFORE mapping, ensuring tei/dashscope also receive limited candidates (even though they ignore topN). Issue: CortexReach#789 (follow-up)
…rankRequest - Document tei/dashscope/pinecone/voyage/jina encoding differences - Note that documents are sliced before this function for all providers
…ession group Fixes maintainer review finding: GitHub Actions runs ci-test-manifest.mjs, not npm test, so the test was never exercised in PR CI. Issue: CortexReach#786 PR: CortexReach#789
…eMapLegacyCategory F1 fix: Compactor now converts legacy category to new memory_category format using reverseMapLegacyCategory() before writing to smart metadata. This ensures: - Legacy categories (preference/Fact/...) are mapped to new format (preferences/facts/entities/...) - Smart metadata structure is consistent - scoreLexicalHit reads correct memory_category without conflict Issue: CortexReach#786 Fixes: PR#789 review F1 (CHANGES_REQUESTED) PR: CortexReach#789
…ew fairness + tests All 6 Nice-to-Have items from PR CortexReach#789 review: - F2/MR3: tier inherited from highest-priority source (core > working > peripheral) - MR2: access_count inherited as average of source members (not reset to 0) - MR4: l0_abstract truncation uses safeTruncate() to avoid splitting UTF-8 chars - MR5: l1_overview samples across all members fairly (not just members[0]) - MR1: new upgrader test covers l2_content success path (LLM-enriched entry) - F3: rerank candidatePoolSize bounded in DEFAULT_RETRIEVAL_CONFIG (smoke test) Also install jiti devDependency (was missing, blocking test runs) and bump to latest jiti v2.7.0. F1 (buildSmartMetadata call in buildMergedEntry) was already present. Existing 23 compactor tests + 11 fix tests all pass. Refs: CortexReach#786
…ugin.json location Fixes Windows path bug where path.resolve(new URL().pathname) produced 'C:\\C:\\...' double drive letter path. Uses existing testDir constant from fileURLToPath(import.meta.url) instead.
PR #784 Conflict Resolution SummaryCompleted Fixes
All PR Description Changes Implemented
CI
|
CI core-regression Failure AnalysisWhich tests are failing4 subtests in
Root causeThese tests expect certain memories (Goals, short commit messages) to be returned from This is NOT caused by PR #784Proof: This is a pre-existing CI issue from the master branch that was already present before PR #784 was created. The tests were already flaky/broken before this PR. RecommendationThese failures should be addressed separately from PR #784 (either by fixing the test data/setup, or marking them as known issues). PR #784 does not introduce any changes to For maintainers: please review the |
|
Thanks for the work on this. We hit the same class of symptom downstream on OpenClaw 2026.5.x: auto-recall/retrieval can surface I think the retriever-side behavior in this PR is the right shape:
One possible remaining gap: this PR does not appear to modify In our local downstream test, the complementary caller-side change that fixed that part was:
So my read is: #784 covers the retriever fallback/cancellation semantics, but the full OpenClaw 2026.5.x timeout/stale-work symptom likely also needs caller-side auto-recall timeout wiring in |
PR #784 Review: fix(retriever): AbortSignal threading + embedQuery BM25 fallback (PR668/PR708 clean rebuild)Verdict: RESOLVE-CONFLICTS-FIRST | Author: jlin53882 | Merge state: DIRTY
Problem Statement (R1)Auto-recall can time out while the underlying embedding request keeps running, holding retrieval/session resources and causing stale lock or stuck-session behavior. The PR also aims to degrade to local BM25 retrieval when embedding fails for non-abort reasons. Why This Stopped HereGitHub reports
Recommended ActionAuthor should:
Reviewed at 2026-05-23T09:07:40Z | R0+R1 gate | Conflict gate |
Summary
Rebuild of PRs #668 and #708 with cleaner scope — fixing the "stuck session" issue (embedQuery HTTP call holds lock after timeout) and adding BM25 fallback when embedding fails.
Problem
PR #668: Session lock held after timeout
Auto-recall uses
Promise.race(timeout, recallWork())— when timeout resolves after 5s, the hook returnsundefinedbut the in-flight embedQuery HTTP call keeps running. This holds the per-agent memory-runtime mutex for minutes.PR #708: Auto-recall always uses BM25 (quality regression)
PR #708 tried to fix the above by forcing auto-recall to BM25 mode (no embedding, no network). But this changes the default behavior — manual/CLI retrieval still uses hybrid, while auto-recall always uses BM25. Quality regression for normal auto-recall scenarios.
Solution (James's design)
Instead of always using BM25 for auto-recall, use BM25 as a fallback when embedding fails:
Important: When the signal is already aborted (deliberate cancel), we re-throw immediately — no BM25 fallback. The BM25 fallback is reserved for non-abort errors (network failures, API errors, genuine embedding failures).
Changes
src/retriever.tssignal?: AbortSignaltoRetrievalContextsrc/retriever.tssignalparam tovectorOnlyRetrieval+hybridRetrievalsrc/retriever.tssignaltoembedQuery(query, signal)src/retriever.tssignal?.aborted→ re-throw; else → BM25 fallbacksrc/retriever.tsuseLightweightAutoRecall(replaced by fallback design)src/retriever.tsretrieveWithTrace()missing mode/diagnostics declarationstest/retriever-auto-recall-signal.test.mjsscripts/ci-test-manifest.mjscore-regressionKey behaviors
AbortController.signalflows fromRetrievalContext.signal→embedQuery(query, signal)→ HTTP request is cancelled on abortembedQuerythrows because the signal was aborted, re-throw immediately (don't waste resources on BM25)embedQuerythrows for other reasons (network/API), try BM25 instead. Only re-throws if BM25 also returns no results.Comparison
Review fix (v2)
After PR #784 review by TurboTheTurtle, added explicit abort check before BM25 fallback: