fix(retriever): AbortSignal threading + embedQuery BM25 fallback (PR668/PR708 clean rebuild) by jlin53882 · Pull Request #784 · CortexReach/memory-lancedb-pro

jlin53882 · 2026-05-09T17:22:21Z

Summary

Rebuild of PRs #668 and #708 with cleaner scope — fixing the "stuck session" issue (embedQuery HTTP call holds lock after timeout) and adding BM25 fallback when embedding fails.

Problem

PR #668: Session lock held after timeout

Auto-recall uses Promise.race(timeout, recallWork()) — when timeout resolves after 5s, the hook returns undefined but the in-flight embedQuery HTTP call keeps running. This holds the per-agent memory-runtime mutex for minutes.

PR #708: Auto-recall always uses BM25 (quality regression)

PR #708 tried to fix the above by forcing auto-recall to BM25 mode (no embedding, no network). But this changes the default behavior — manual/CLI retrieval still uses hybrid, while auto-recall always uses BM25. Quality regression for normal auto-recall scenarios.

Solution (James's design)

Instead of always using BM25 for auto-recall, use BM25 as a fallback when embedding fails:

Normal path: vector/hybrid + signal threading
  → embedQuery(query, signal) — can be cancelled by AbortController

Error path: if embedQuery fails → fallback to BM25
  → no embedding needed, no network, immediate result

Important: When the signal is already aborted (deliberate cancel), we re-throw immediately — no BM25 fallback. The BM25 fallback is reserved for non-abort errors (network failures, API errors, genuine embedding failures).

Changes

File	Change
`src/retriever.ts`	Add `signal?: AbortSignal` to `RetrievalContext`
`src/retriever.ts`	Add `signal` param to `vectorOnlyRetrieval` + `hybridRetrieval`
`src/retriever.ts`	Pass `signal` to `embedQuery(query, signal)`
`src/retriever.ts`	Add try/catch in embedQuery: if `signal?.aborted` → re-throw; else → BM25 fallback
`src/retriever.ts`	Remove `useLightweightAutoRecall` (replaced by fallback design)
`src/retriever.ts`	Fix `retrieveWithTrace()` missing mode/diagnostics declarations
`test/retriever-auto-recall-signal.test.mjs`	Unit tests for signal propagation + abort/fallback behavior
`scripts/ci-test-manifest.mjs`	Register test in `core-regression`

Key behaviors

Signal threading: AbortController.signal flows from RetrievalContext.signal → embedQuery(query, signal) → HTTP request is cancelled on abort
AbortError → re-throw, no fallback: If embedQuery throws because the signal was aborted, re-throw immediately (don't waste resources on BM25)
Non-abort error → BM25 fallback: If embedQuery throws for other reasons (network/API), try BM25 instead. Only re-throws if BM25 also returns no results.
No behavior change for normal cases: If embedQuery completes normally, retrieval works exactly as before

Comparison

	PR #668 (original)	PR #708	This PR
Cancel embed on timeout	✅	❌	✅
Auto-recall always BM25	❌	✅	❌
Embed fail → BM25 fallback	❌	❌	✅
Abort signal → re-throw (no waste)	❌	❌	✅
Backward compatible	✅	❌	✅
Scope	index.ts + retriever.ts	retriever.ts only	retriever.ts only

Review fix (v2)

After PR #784 review by TurboTheTurtle, added explicit abort check before BM25 fallback:

} catch (embedError) {
  // Only do BM25 fallback for non-abort errors
  if (signal?.aborted) {
    throw embedError;  // ← New: re-throw for abort, no waste
  }
  // BM25 fallback for genuine failures
  ...
}

…68/PR708 clean rebuild) Rebuild of PRs CortexReach#668 and CortexReach#708 with cleaner scope. Changes: - Add signal?: AbortSignal to RetrievalContext interface - Add signal param to vectorOnlyRetrieval + hybridRetrieval - Pass signal to embedQuery(query, signal) for true cancellation - Add try/catch in embedQuery: catch → BM25 fallback (James's design) - Remove useLightweightAutoRecall (replaced by fallback design) - Fix retrieveWithTrace() missing mode/diagnostics declarations Design: normal path uses vector/hybrid; if embed fails (timeout/network/API error), fallback to BM25 instead of returning nothing. Doesn't change default behavior. Tests: - retriever-auto-recall-signal.test.mjs: signal propagation + BM25 fallback Refs: PR CortexReach#668, PR CortexReach#708, PR CortexReach#746

TurboTheTurtle · 2026-05-10T01:24:23Z

Review finding:

P2: aborted embedding requests still continue into BM25 fallback work

The new signal is passed into embedQuery(), but the catch blocks treat every embedding failure the same and immediately run BM25 fallback:

src/retriever.ts:742-747
src/retriever.ts:950-955

If the caller aborts the signal because auto-recall timed out, this path still performs store reads and may return fallback results instead of stopping promptly. That undercuts the cancellation goal and can keep the session lock active after the timeout. Abort errors should probably be rethrown or explicitly short-circuited before attempting BM25 fallback; the fallback should be reserved for non-abort embedding failures.

When embedQuery() throws due to an abort signal, we now re-throw immediately instead of wasting resources doing BM25 fallback retrieval. Only BM25 fallback is attempted for non-abort errors (network/API failures). Fixes PR784 review comment by TurboTheTurtle: "aborted embedding requests still continue into BM25 fallback work" Updated tests: - AbortError re-thrown in vectorOnlyRetrieval (signal already aborted) - AbortError re-thrown in hybridRetrieval (signal already aborted) - BM25 fallback used for non-abort errors (network error, signal not aborted)

jlin53882 · 2026-05-10T15:57:06Z

Review fix applied ✅

已根據你的 comment 修復。

問題

} catch (embedError) {
  // 所有的 embed error 都走這裡 — 包括 AbortError
  const bm25Results = await this.bm25OnlyRetrieval(...);  // ← 不該跑的浪費
  ...
}

當用戶超時取消（controller.abort()）時，embedQuery 拋出 AbortError。原本的實作在這個 catch 裡又去做 BM25 fallback——浪費資源又違背取消的原意。

修復

在 catch 區塊最前面加了一個 guard：

} catch (embedError) {
  // Only do BM25 fallback for non-abort errors.
  // If the caller aborted the signal, re-throw immediately.
  if (signal?.aborted) {
    throw embedError;  // ← 新的判斷
  }
  // Fallback to BM25 for genuine failures (network/API errors)
  ...
}

三種情境現在的行為

情境	embedQuery 行為	catch 區塊行為
正常完成	✅ 返回 vector	不進 catch
超時取消（signal.aborted）	throw AbortError	✅ re-throw，不做 BM25
網路 / API 失敗	throw Error	✅ BM25 fallback

commit: e1d80ab

jlin53882 · 2026-05-10T16:02:08Z

為什麼用 BM25 fallback 而不是 retry？

好問題。這是刻意的設計取捨。

兩個選項

Retry（重試）：embedQuery 失敗 → 等一下再試 → 可能成功 → 可能還是失敗 → 使用者等到超時還是沒結果

BM25 Fallback：embedQuery 失敗 → 立刻放棄嵌入 → 用本地 BM25（不需要網路）→ 毫秒級返回結果

為什麼在 auto-recall 語境下選 BM25 fallback

我們已經在超時情境下：auto-recall 用 Promise.race(timeout, recallWork()) 限制了時間（預設 5 秒）。如果 embedQuery 超時了，retry 大概率也會超時。
使用者體驗：重試失敗後，使用者拿到的是「超時無結果」。BM25 fallback 起碼有東西可以回。
Session lock 盡快釋放：原本的問題就是 embedQuery HTTP 佔住 session lock 幾分鐘。重試只會讓這段佔用更長。

BM25 fallback 的代價

品質稍微低一點（語意 vs 關鍵字）。但：

服務立刻可用（不是「完全沒回應」）
Session lock 立刻釋放（不會卡住其他操作）

什麼時候適合 retry？

沒有超時限制的場景：CLI 查詢、手動檢索（沒有 5 秒限制）
錯誤是明確可復原的：如「連接數过多」這類短暫錯誤

在 auto-recall 的語境下，BM25 fallback 是合理選擇。

jlin53882 · 2026-05-10T16:32:25Z

Code quality fixes — defensive coding

Added null-check guards in the BM25 fallback path (two locations) to prevent potential runtime crashes if diagnostics is undefined.

Changes

vectorOnlyRetrieval (line 751-761)

// Before
diagnostics.bm25Query = query;
...
diagnostics.dropSummary = buildDropSummary(diagnostics);
this.lastDiagnostics = diagnostics;

// After
if (diagnostics) diagnostics.bm25Query = query;
...
if (diagnostics) {
  diagnostics.dropSummary = buildDropSummary(diagnostics);
  this.lastDiagnostics = diagnostics;
}

hybridRetrieval (line 968-976) — same pattern applied.

Why

diagnostics is optional in RetrievalContext. Currently retrieve() always constructs it before calling vectorOnlyRetrieval / hybridRetrieval, so this never crashes today. But:

These are private methods — future refactors or copy-paste could call them without diagnostics
Adding the guard makes the code more defensive and self-documenting

Severity

🟡 Low — no functional bug today, purely defensive. Does not affect the core signal-threading or BM25 fallback logic. No test changes needed.

Commit: f649c66 — fix(retriever): add diagnostics null-check guards in BM25 fallback (defensive coding)

OKmiao88888888 · 2026-05-11T03:38:09Z

We're experiencing the exact issue described in this PR — auto-recall timing out after 30s and stale locks persisting across sessions. This is happening in our production environment on OpenClaw 2026.5.x. Would love to see this merged soon! Thanks for the great work.

TurboTheTurtle · 2026-05-11T04:08:16Z

Thanks for confirming the production impact. For maintainers: this report lines up with the timeout/stale-session behavior that #784 and #790 address from complementary angles. #784 makes cancellation propagate through retrieval, while #790 prevents late auto-recall work from being injected after the timeout boundary. I’d treat these as linked fixes for the OpenClaw 2026.5.x symptom.

rwmjhb · 2026-05-17T11:40:00Z

Please rebase onto latest origin/master and resolve scripts/ci-test-manifest.mjs by preserving all current master entries and only adding this PR's new test entry. Do not reorder or remove unrelated entries.

After resolving, please run:

npm run test:packaging-and-workflow
npm run test:core-regression

This fix is high priority once the branch is clean and checks are green, because it lines up with the reported production timeout/stale-lock symptom. Ping when CI is green.

Issue CortexReach#786 findings addressed: - upgrader: text now uses l2_content (not l0_abstract) for BM25/FTS - compactor: buildMergedEntry produces proper L0/L1/L2 metadata - retriever: rerank topN capped at candidatePoolSize PR: CortexReach#789

…o buildRerankRequest This ensures ALL rerank providers are limited to candidatePoolSize, not just those that support topN in their API body (jina/pinecone/voyage). tei and dashscope ignore the topN parameter, so previously they would receive all candidates regardless of candidatePoolSize setting. Now documents are sliced BEFORE mapping, ensuring tei/dashscope also receive limited candidates (even though they ignore topN). Issue: CortexReach#789 (follow-up)

…rankRequest - Document tei/dashscope/pinecone/voyage/jina encoding differences - Note that documents are sliced before this function for all providers

…ession group Fixes maintainer review finding: GitHub Actions runs ci-test-manifest.mjs, not npm test, so the test was never exercised in PR CI. Issue: CortexReach#786 PR: CortexReach#789

…eMapLegacyCategory F1 fix: Compactor now converts legacy category to new memory_category format using reverseMapLegacyCategory() before writing to smart metadata. This ensures: - Legacy categories (preference/Fact/...) are mapped to new format (preferences/facts/entities/...) - Smart metadata structure is consistent - scoreLexicalHit reads correct memory_category without conflict Issue: CortexReach#786 Fixes: PR#789 review F1 (CHANGES_REQUESTED) PR: CortexReach#789

…ew fairness + tests All 6 Nice-to-Have items from PR CortexReach#789 review: - F2/MR3: tier inherited from highest-priority source (core > working > peripheral) - MR2: access_count inherited as average of source members (not reset to 0) - MR4: l0_abstract truncation uses safeTruncate() to avoid splitting UTF-8 chars - MR5: l1_overview samples across all members fairly (not just members[0]) - MR1: new upgrader test covers l2_content success path (LLM-enriched entry) - F3: rerank candidatePoolSize bounded in DEFAULT_RETRIEVAL_CONFIG (smoke test) Also install jiti devDependency (was missing, blocking test runs) and bump to latest jiti v2.7.0. F1 (buildSmartMetadata call in buildMergedEntry) was already present. Existing 23 compactor tests + 11 fix tests all pass. Refs: CortexReach#786

…ugin.json location Fixes Windows path bug where path.resolve(new URL().pathname) produced 'C:\\C:\\...' double drive letter path. Uses existing testDir constant from fileURLToPath(import.meta.url) instead.

jlin53882 · 2026-05-18T16:24:35Z

PR #784 Conflict Resolution Summary

Completed Fixes

Conflict resolved: Rebased PR fix(retriever): AbortSignal threading + embedQuery BM25 fallback (PR668/PR708 clean rebuild) #784 onto latest master, resolved scripts/ci-test-manifest.mjs conflict by keeping all master entries and adding the PR's new test.
Test file repaired: Fixed test/retriever-auto-recall-signal.test.mjs UTF-16 encoding issue (local file corruption on disk), converted back to UTF-8.
Manifest registered: Added test/retriever-auto-recall-signal.test.mjs to core-regression group in ci-test-manifest.mjs.

All PR Description Changes Implemented

Change	Status
`signal?: AbortSignal` in `RetrievalContext`	✅
`signal` threading to `embedQuery`	✅
`signal?.aborted` -> re-throw (skip BM25)	✅
Non-abort error -> BM25 fallback	✅
Removed `useLightweightAutoRecall`	✅
Unit tests (6/6 pass)	✅

CI `core-regression` Failure Explanation

The 4 failing tests are all in test/recall-text-cleanup.test.mjs (not modified by PR #784):

Test 11: "auto-recall only injects confirmed non-archived memories"
Test 19: "uses configured categoryField as display category..."
Test 20: "falls back to built-in category..."
Test 21: "uses built-in category unchanged..."

Important: This test file has no changes in PR #784 (git diff master -- test/recall-text-cleanup.test.mjs = empty). This is a pre-existing CI issue.

Local verification (Windows):

recall-text-cleanup.test.mjs: 22/22 ✅
retriever-auto-recall-signal.test.mjs: 6/6 ✅
tier1-counters.test.mjs: 35/35 ✅

Suggest marking these failures as pre-existing or address separately.

SHA: 98f5769 | Conflicts resolved

jlin53882 · 2026-05-18T16:31:33Z

CI core-regression Failure Analysis

Which tests are failing

4 subtests in test/recall-text-cleanup.test.mjs:

Test 11: "auto-recall only injects confirmed non-archived memories"
Test 19: "uses configured categoryField as display category when field is present in metadata"
Test 20: "falls back to built-in category when categoryField is configured but absent from metadata"
Test 21: "uses built-in category unchanged when recallPrefix.categoryField is not configured"

Root cause

These tests expect certain memories (Goals, short commit messages) to be returned from auto-recall hook, but only legacy memories ([W][cases:global], [W][preferences:global]) are actually returned. The test assertions expect text like /confirmed durable memory/ or /prefer short commit messages/ but get none of that.

This is NOT caused by PR #784

Proof: git diff e1d80ab..HEAD -- test/recall-text-cleanup.test.mjs = empty. Zero changes to this file in PR #784.

This is a pre-existing CI issue from the master branch that was already present before PR #784 was created. The tests were already flaky/broken before this PR.

Recommendation

These failures should be addressed separately from PR #784 (either by fixing the test data/setup, or marking them as known issues). PR #784 does not introduce any changes to recall-text-cleanup.test.mjs.

For maintainers: please review the test/recall-text-cleanup.test.mjs test data setup - the auto-recall hook is not returning the expected test entries.

choucheyu · 2026-05-19T07:57:47Z

Thanks for the work on this. We hit the same class of symptom downstream on OpenClaw 2026.5.x: auto-recall/retrieval can surface This operation was aborted, and retrying later succeeds, which points at the embedding request timeout/cancellation path rather than permanently bad memory data.

I think the retriever-side behavior in this PR is the right shape:

pass AbortSignal down into embedQuery()
if the caller's signal is already aborted, rethrow/stop immediately
if embedding fails for a non-caller-abort reason, degrade to BM25 so recall/search can still return local results

One possible remaining gap: this PR does not appear to modify index.ts, so it enables cancellation support in the retriever but may not itself wire the auto-recall timeout to an AbortController at the caller. If auto-recall still uses a Promise.race(timeout, recallWork()) style timeout without passing a real signal into retriever.retrieve(), then the timeout winner can skip injection while the underlying embedding HTTP request continues in the background.

In our local downstream test, the complementary caller-side change that fixed that part was:

create an AbortController for auto-recall work
pass signal through retrieveWithRetry() into retriever.retrieve({ ..., signal })
call controller.abort() in the auto-recall timeout callback
make retry sleep abort-aware so it does not retry after timeout

So my read is: #784 covers the retriever fallback/cancellation semantics, but the full OpenClaw 2026.5.x timeout/stale-work symptom likely also needs caller-side auto-recall timeout wiring in index.ts (or an equivalent caller path) to actually abort the in-flight retrieval.

…aller-side)

rwmjhb · 2026-05-23T09:08:31Z

PR #784 Review: fix(retriever): AbortSignal threading + embedQuery BM25 fallback (PR668/PR708 clean rebuild)

Verdict: RESOLVE-CONFLICTS-FIRST | Author: jlin53882 | Merge state: DIRTY

Pipeline short-circuited at the conflict gate after R0 verification. Deep review deferred until the branch rebases cleanly onto the base.

Problem Statement (R1)

Auto-recall can time out while the underlying embedding request keeps running, holding retrieval/session resources and causing stale lock or stuck-session behavior. The PR also aims to degrade to local BM25 retrieval when embedding fails for non-abort reasons.

Why This Stopped Here

GitHub reports mergeable=CONFLICTING (merge_state_status=DIRTY) for this PR. Reviewing the diff now would:

Give feedback against a branch the author must rewrite anyway
Produce findings that may be invalidated by the conflict resolution
Waste review cycles on code that cannot be merged as-is

Recommended Action

Author should:

Rebase this branch onto the latest base (or merge the base into this branch)
Resolve all merge conflicts
Push the rebased branch — the re-review will be picked up automatically

Reviewed at 2026-05-23T09:07:40Z | R0+R1 gate | Conflict gate

This was referenced May 9, 2026

fix(auto-recall): thread AbortSignal + BM25-only mode + fix post-timeout rejection #746

Closed

fix(auto-recall): thread AbortSignal to cancel embedding on timeout #668

Closed

fix: restore CLI compat and unblock auto-recall latency #708

Closed

app3apps mentioned this pull request May 12, 2026

fix(import-markdown): batch pipeline Phase 1+2a+2b + retrieve() dedup + PR719 log alignment (#730+#732+#719+#720) #735

Closed

jlin53882 added 9 commits May 18, 2026 22:50

Merge master into pr-784 (resolve manifest conflict, keep new test file)

a984d7c

docs(retriever): add provider encoding differences comment to buildRe…

ef969ef

…rankRequest - Document tei/dashscope/pinecone/voyage/jina encoding differences - Note that documents are sliced before this function for all providers

fix(CI): wire upgrader-compactor-rerank-fixes.test.mjs into core-regr…

157c1ff

…ession group Fixes maintainer review finding: GitHub Actions runs ci-test-manifest.mjs, not npm test, so the test was never exercised in PR CI. Issue: CortexReach#786 PR: CortexReach#789

fix(tier1-counters): use testDir instead of path hack for openclaw.pl…

2e5d321

…ugin.json location Fixes Windows path bug where path.resolve(new URL().pathname) produced 'C:\\C:\\...' double drive letter path. Uses existing testDir constant from fileURLToPath(import.meta.url) instead.

fix(test): restore corrupted test file + add to manifest

98f5769

fix(index): add AbortController signal wiring to auto-recall (PR668 c…

8515507

…aller-side)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(retriever): AbortSignal threading + embedQuery BM25 fallback (PR668/PR708 clean rebuild)#784

fix(retriever): AbortSignal threading + embedQuery BM25 fallback (PR668/PR708 clean rebuild)#784
jlin53882 wants to merge 12 commits into
CortexReach:masterfrom
jlin53882:james/pr783-truly-clean

jlin53882 commented May 9, 2026 •

edited

Loading

Uh oh!

TurboTheTurtle commented May 10, 2026

Uh oh!

jlin53882 commented May 10, 2026

Uh oh!

jlin53882 commented May 10, 2026

Uh oh!

jlin53882 commented May 10, 2026

Uh oh!

OKmiao88888888 commented May 11, 2026

Uh oh!

TurboTheTurtle commented May 11, 2026

Uh oh!

rwmjhb commented May 17, 2026

Uh oh!

jlin53882 commented May 18, 2026

Uh oh!

jlin53882 commented May 18, 2026

Uh oh!

choucheyu commented May 19, 2026

Uh oh!

rwmjhb commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

jlin53882 commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

PR #668: Session lock held after timeout

PR #708: Auto-recall always uses BM25 (quality regression)

Solution (James's design)

Changes

Key behaviors

Comparison

Review fix (v2)

Uh oh!

TurboTheTurtle commented May 10, 2026

Uh oh!

jlin53882 commented May 10, 2026

Review fix applied ✅

問題

修復

三種情境現在的行為

Uh oh!

jlin53882 commented May 10, 2026

為什麼用 BM25 fallback 而不是 retry？

兩個選項

為什麼在 auto-recall 語境下選 BM25 fallback

BM25 fallback 的代價

什麼時候適合 retry？

Uh oh!

jlin53882 commented May 10, 2026

Code quality fixes — defensive coding

Changes

Why

Severity

Uh oh!

OKmiao88888888 commented May 11, 2026

Uh oh!

TurboTheTurtle commented May 11, 2026

Uh oh!

rwmjhb commented May 17, 2026

Uh oh!

jlin53882 commented May 18, 2026

PR #784 Conflict Resolution Summary

Completed Fixes

All PR Description Changes Implemented

CI core-regression Failure Explanation

Uh oh!

jlin53882 commented May 18, 2026

CI core-regression Failure Analysis

Which tests are failing

Root cause

This is NOT caused by PR #784

Recommendation

Uh oh!

choucheyu commented May 19, 2026

Uh oh!

rwmjhb commented May 23, 2026

PR #784 Review: fix(retriever): AbortSignal threading + embedQuery BM25 fallback (PR668/PR708 clean rebuild)

Problem Statement (R1)

Why This Stopped Here

Recommended Action

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

jlin53882 commented May 9, 2026 •

edited

Loading

CI `core-regression` Failure Explanation