What this is
A note on a known sensitivity in app/src/app/api/compile/match/route.ts:172, not a prescribed fix. Anyone who wants to take it on should treat the rest of this as context, not a spec.
The shape of the problem
The TF-IDF cosine floor at Phase 3 (Match) is currently a single hardcoded 0.15. The value is the right floor for some source shapes and the wrong floor for others, and the right answer for mixed corpora isn't obvious.
What "0.15" does:
- For short web-heavy sources (blog posts, READMEs, articles in the 2-10K char range), cosine similarities against existing wiki pages sit comfortably in the 0.2-0.5 band. 0.15 is below the noise band — it catches real candidates and the triage LLM filters precision after.
- For long rigor-heavy sources (academic papers, technical reports, multi-thousand-word dense essays), the same calculation produces similarities that compress downward, into the 0.08-0.25 range. The 0.15 floor sits inside the legitimate-match band, so real matches start landing below it. The "EXPLAINING THE FAVORITE-LONGSHOT BIAS" case in the inline comment at match/route.ts:162-170 is exactly this: 58K-char paper vs 20-page corpus, correct matches at 0.208 / 0.185 / 0.137 / 0.118. All four sat under what used to be the 0.3 floor (which is what triggered the recalibration to 0.15), and two of them (0.137, 0.118) still sit under 0.15.
- The lower the floor, the more triage LLM calls per source; the slice(0, 3) cap keeps the worst case at sourceCount × 3 triage calls. Triage is the cheapest LLM call site in the pipeline (thinking_budget=0, max_output_tokens=512), so the cost of being too permissive is bounded and visible in the daily-cap accounting. The cost of being too strict is invisible: real matches silently never reach triage.
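To make the floor-plus-cap interaction concrete, here is a minimal sketch of the selection step as described above. It is not the code in match/route.ts: the Candidate shape, the function names, the page ids, and the use of >= rather than > at the floor are all assumptions for illustration. Only the 0.15 floor, the top-3 cap, and the four similarity values come from this note.

```typescript
// Hypothetical shape; the real route's types are not shown in this note.
interface Candidate {
  pageId: string;
  similarity: number; // TF-IDF cosine against an existing wiki page
}

const TFIDF_FLOOR = 0.15; // the hardcoded floor discussed above
const TOP_K = 3; // mirrors the slice(0, 3) cap

// Keep candidates at or above the floor (>= is an assumption), best first, capped at topK.
function selectCandidates(
  scored: Candidate[],
  floor: number = TFIDF_FLOOR,
  topK: number = TOP_K,
): Candidate[] {
  return scored
    .filter((c) => c.similarity >= floor)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, topK);
}

// Upper bound on triage calls for a session: every source fills its cap.
function maxTriageCalls(sourceCount: number, topK: number = TOP_K): number {
  return sourceCount * topK;
}

// The four correct matches from the long-paper case; page ids are placeholders.
const paperMatches: Candidate[] = [
  { pageId: "page-a", similarity: 0.208 },
  { pageId: "page-b", similarity: 0.185 },
  { pageId: "page-c", similarity: 0.137 },
  { pageId: "page-d", similarity: 0.118 },
];

const atOldFloor = selectCandidates(paperMatches, 0.3); // nothing reaches triage
const atCurrentFloor = selectCandidates(paperMatches); // page-a and page-b only
```

With these inputs the old 0.3 floor passes none of the four correct matches and the current 0.15 floor passes two, which is the silent-miss failure mode in miniature. The permissive side stays bounded either way: four sources can never cost more than 12 triage calls.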
Why "just make it content-type-aware" isn't trivially the answer
The naive fix is a length branch — short content gets 0.15, long content gets something lower. That works for homogeneous sessions (a batch of all-academic-papers, or all-blog-posts). It doesn't cleanly work for the realistic case, which is mixed content in the same compile session and the same wiki.
Three things make mixed content hard:
- The wiki page corpus is shared. When a short web source and a long academic source both run TF-IDF against the same set of existing pages, the similarities they produce live on different scales. A 0.18 from the short source and a 0.18 from the long source don't mean the same thing, but the comparison is happening against the same pages. Length-branching the source's threshold doesn't normalize across this; it just makes each source's recall criterion source-shape-aware.
- The threshold isn't the only knob. slice(0, 3) caps the candidate set per source; raising the cap on long-source paths is another way to reach the same outcome. The right answer might be "lower floor AND larger top-K for long content" or "same floor but a different ranking signal entirely" (BM25, character-n-gram cosine, something embedding-based using the same all-MiniLM-L6-v2 model that already runs locally).
- Mixed sessions surface a calibration question we don't currently answer. What's the "right" recall for a session that has 3 blog posts and 1 paper? Should the paper's matches and the posts' matches use comparable floors so triage cost is even, or different floors so each source individually surfaces its real candidates regardless of comparative cost?
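The "floor AND top-K" idea and the mixed-session cost question can be sketched together. Everything numeric below except the 0.15 floor and the cap of 3 is a placeholder: the 20000-char boundary, the 0.10 long-source floor, and the long-source cap of 5 are uncalibrated assumptions, and the function names are hypothetical.

```typescript
interface MatchKnobs {
  floor: number; // minimum TF-IDF cosine to reach triage
  topK: number; // per-source candidate cap
}

// Placeholder values, not calibrated against any corpus.
const LONG_SOURCE_CHARS = 20000;
const SHORT_KNOBS: MatchKnobs = { floor: 0.15, topK: 3 }; // current behavior
const LONG_KNOBS: MatchKnobs = { floor: 0.1, topK: 5 }; // assumed long-form settings

// Each source picks its knobs from its own length; the shared page corpus is untouched.
function knobsForSource(charCount: number): MatchKnobs {
  return charCount >= LONG_SOURCE_CHARS ? LONG_KNOBS : SHORT_KNOBS;
}

// Worst-case triage calls for a session: sum of each source's cap.
function maxTriageCallsMixed(sourceCharCounts: number[]): number {
  return sourceCharCounts.reduce((sum, n) => sum + knobsForSource(n).topK, 0);
}

// Three blog-post-sized sources and one 58K-char paper.
const mixedSession = [4000, 6000, 9000, 58000];
const worstCase = maxTriageCallsMixed(mixedSession); // 3 + 3 + 3 + 5 = 14
```

Under these assumed knobs the mixed session's worst case rises from 12 to 14 triage calls, and the paper alone accounts for 5 of them. That is the uneven-cost trade-off the third point asks about, stated as a number rather than resolved.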
Directions worth thinking about, none of them mandates
- Length-based branch. Cheapest landing. Pick a LONG_SOURCE_CHARS boundary (the existing extract-route profile already classifies short/medium/long), pick a LONG_TFIDF_FLOOR. Doesn't address the cross-source scale problem but reasonably catches the long-form recall miss.
- Profile-based branch. extractions.profile is already computed and persisted. A profile → threshold map is more principled than a raw length check and would let medium-entity-heavy vs medium-concept-heavy get different floors if that turns out to matter.
- Normalize the similarity scale, not the threshold. Map raw cosine into a z-score or quantile against the corpus's own similarity distribution per source. Same floor everywhere, but expressed in a unit that's comparable across source shapes. Heavier change; more principled.
- Different ranking signal for long-form. TF-IDF rewards rare-token overlap; for dense vocabulary-rich material the rare tokens that matter (entity names, technical terms) get drowned out by the rest. An embedding-based candidate ranker for the long path (against the Chroma collection that already exists) might recall the right pages without a threshold tuning loop. Bigger change again.
- Live with the current behavior and a documented gotcha. Pin a note in CLAUDE.md and docs/spec/02-pipeline.md (the next time those get updated) so anyone wondering why their academic-paper compile produced a thin source-summary-only result has a thread to pull on. Lowest-effort, doesn't solve the underlying issue, may be the right answer if other priorities outweigh this.
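The "normalize the scale, not the threshold" direction is the least obvious of these, so here is one way it might look, assuming each source's raw cosines against all pages are available as a plain array. The function names and the Z_FLOOR value are hypothetical, and the two input arrays are invented purely to illustrate the scale difference; they are not measurements from the pipeline.

```typescript
// Standardize one source's raw cosines against that source's own
// distribution over the page corpus (population mean and std).
function zScores(similarities: number[]): number[] {
  const n = similarities.length;
  if (n === 0) return [];
  const mean = similarities.reduce((s, x) => s + x, 0) / n;
  const variance = similarities.reduce((s, x) => s + (x - mean) ** 2, 0) / n;
  const std = Math.sqrt(variance);
  // A flat distribution carries no ranking signal; return zeros rather than divide by zero.
  if (std === 0) return similarities.map(() => 0);
  return similarities.map((x) => (x - mean) / std);
}

const Z_FLOOR = 1.0; // hypothetical single floor, expressed in standard deviations

// Indices of pages whose standardized similarity clears the floor.
function pagesAboveZFloor(similarities: number[], zFloor: number = Z_FLOOR): number[] {
  return zScores(similarities)
    .map((z, i) => ({ z, i }))
    .filter(({ z }) => z >= zFloor)
    .map(({ i }) => i);
}

// Invented inputs: both sources score 0.18 against some page.
const shortSourceSims = [0.45, 0.3, 0.18, 0.22, 0.25]; // 0.18 is this source's weakest page
const longSourceSims = [0.18, 0.05, 0.04, 0.06, 0.07]; // 0.18 is this source's clear best

const shortZ = zScores(shortSourceSims);
const longZ = zScores(longSourceSims);
```

In this toy input the same raw 0.18 standardizes to a negative score for the short source and to nearly two standard deviations above the mean for the long one, so a single Z_FLOOR treats them differently where a single raw floor cannot. If this were pursued, the computation would more plausibly live next to the cosine code in the NLP service than in the route; the TypeScript here is only for consistency with the other sketches.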
What's known vs unknown
- Known: The threshold is content-shape-sensitive in practice. 0.15 came from recalibrating against long-form; the recalibration didn't fix the problem for all content shapes, it shifted which shape it's wrong for.
- Known: The cost of being too permissive is bounded and observable (triage call count × triage price, capped at sourceCount × 3). The cost of being too strict is silent missed matches.
- Unknown: What the right LONG_TFIDF_FLOOR actually is. Needs calibration against a long-form corpus the way 0.15 was calibrated against the prediction-markets one. The favorite-longshot-bias paper case suggests something in the 0.08-0.12 range would be safe, but one data point isn't a calibration.
- Unknown: Whether the problem is better solved at the threshold layer at all, or whether the ranking signal itself needs to change for long-form material.
- Unknown: How mixed-content sessions should be reasoned about. The current code doesn't acknowledge them as a distinct shape; whatever fix lands here should at minimum name what it does for mixed sessions, even if "treat each source independently" is the chosen non-answer.
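A calibration pass for LONG_TFIDF_FLOOR could start as a simple recall sweep over known-correct matches. The sketch below assumes such a labeled set exists; today the only data is the four similarities from the single paper case, so its output is an illustration of the method, not a calibration. Function and type names are hypothetical.

```typescript
interface FloorReport {
  floor: number;
  recall: number; // fraction of known-correct matches that would reach triage
}

// Fraction of known-correct similarities at or above the floor (>= is an assumption).
function recallAtFloor(correctSimilarities: number[], floor: number): number {
  if (correctSimilarities.length === 0) return 0;
  const kept = correctSimilarities.filter((s) => s >= floor).length;
  return kept / correctSimilarities.length;
}

// Evaluate several candidate floors against the same labeled set.
function sweepFloors(correctSimilarities: number[], floors: number[]): FloorReport[] {
  return floors.map((floor) => ({
    floor,
    recall: recallAtFloor(correctSimilarities, floor),
  }));
}

// The one long-form data point available: the favorite-longshot-bias paper.
const paperCase = [0.208, 0.185, 0.137, 0.118];
const report = sweepFloors(paperCase, [0.3, 0.15, 0.12, 0.1, 0.08]);
```

On this single case recall is 0 at 0.3, 0.5 at 0.15, 0.75 at 0.12, and 1 at 0.10 and 0.08; the top of the suggested 0.08-0.12 range would still drop the 0.118 match. The sweep ignores the top-K cap and says nothing about how many incorrect candidates each floor admits, so a real calibration would need to track triage call counts alongside recall.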
Files in scope (for whoever picks this up)
app/src/app/api/compile/match/route.ts — the floor, the inline comment, the per-source candidate filter
nlp-service/routers/extraction.py — /extract/tfidf-overlap is where the cosine computation lives; if normalization changes, it changes here
app/src/lib/db.ts — would need a getProfileForSource(source_id) helper for the profile-based direction