fix(web-search): match Brave's current result-title markup #688
ly-wang19 wants to merge 2 commits into
Brave moved the web-result title from `<span class="search-snippet-title">` to `<div class="… search-snippet-title …">`, so `parseBraveSearchHtml` hit `if (!title) continue` for every snippet and returned 0 results against the live page. The existing test stayed green because its fixture still used the old `<span>` markup (drifted from reality).
Accept either `<span>` or `<div>` for the title, and update the fixtures to the current markup (keeping one legacy `<span>` case).
Verified end-to-end against a real search.brave.com scrape: 0 results before, real results after.
Closes THU-MAIC#687
+1, independently hit the same bug and can confirm this fix against the live page (before finding this PR; my duplicate #719 is closed in favor of this one).
Live data from a scrape of
One optional nit:
Use a backreference (`<(span|div)…></\1>`) so the title's closing tag must match its opening tag, per review feedback. Prevents a malformed `<span …>…</div>` from being mis-parsed as a title. Title text moves to capture group 2; existing matched-tag markup (div and legacy span) is unaffected. Adds a regression test for the mismatched-tag case.
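For readers following the review thread, here is a minimal sketch of the matched-tag idea described in this commit. It is not the code from `lib/web-search/brave.ts`: the names `MATCHED_TAG_TITLE` and `parseTitle` are hypothetical, and the attribute handling is an assumption. It only illustrates how `\1` ties the closing tag to the captured opening tag, with the title text in capture group 2.

```typescript
// Hypothetical pattern, not the project's actual regex.
// Group 1 captures the tag name (span or div); group 2 captures the title text.
// The backreference \1 requires the closing tag to repeat the opening tag name.
const MATCHED_TAG_TITLE =
  /<(span|div)[^>]*class="[^"]*search-snippet-title[^"]*"[^>]*>([\s\S]*?)<\/\1>/;

// Returns the trimmed title, or null when no matched-tag title is found.
function parseTitle(snippet: string): string | null {
  const match = MATCHED_TAG_TITLE.exec(snippet);
  const title = match?.[2]?.trim();
  return title ? title : null;
}

const divTitle =
  '<div class="title search-snippet-title line-clamp-1" title="Title">Title</div>';
const spanTitle = '<span class="search-snippet-title">Legacy</span>';
// Malformed: opens with <span> but closes with </div>, and no </span> follows.
const mismatched = '<span class="search-snippet-title">Oops</div>';

console.log(parseTitle(divTitle));   // current markup is accepted
console.log(parseTitle(spanTitle));  // legacy markup is accepted
console.log(parseTitle(mismatched)); // rejected: closing tag does not match
```

Note that the backreference only rejects the malformed case when no matching closing tag appears later in the same snippet, so this sketch assumes the regex runs against one snippet block at a time.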
Thanks for the independent confirmation against the live page, and for closing #719 in favor of this; appreciated. Applied your backreference suggestion: the title regex now ties the closing tag to the captured opening tag (
What & why
The keyless Brave scrape (`parseBraveSearchHtml`, `lib/web-search/brave.ts`) returns 0 results against Brave's current HTML. Brave moved the result title element:
Before: `<span class="search-snippet-title">Title</span>`
Now: `<div class="title search-snippet-title line-clamp-1 …" title="Title">Title</div>`
The title regex matched `<span>` only, so `if (!title) continue;` skipped every snippet and produced empty results. The existing test stayed green because its fixture still used the old `<span>` markup; it had drifted from the live page and gave false confidence.
Closes #687.
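The failure mode described above can be reproduced in isolation. The following is a rough sketch under stated assumptions: `SPAN_ONLY_TITLE`, `SPAN_OR_DIV_TITLE`, and `extractTitles` are hypothetical stand-ins, not the actual regexes or parser in `lib/web-search/brave.ts`. It shows how a `<span>`-only pattern combined with the `if (!title) continue` guard silently drops snippets that use the current `<div>` markup.

```typescript
// Hypothetical stand-ins for the old and new title patterns.
const SPAN_ONLY_TITLE =
  /<span[^>]*class="[^"]*search-snippet-title[^"]*"[^>]*>([\s\S]*?)<\/span>/;
const SPAN_OR_DIV_TITLE =
  /<(?:span|div)[^>]*class="[^"]*search-snippet-title[^"]*"[^>]*>([\s\S]*?)<\/(?:span|div)>/;

// Collects titles from snippet blocks, skipping any block without a title.
function extractTitles(snippets: string[], titleRe: RegExp): string[] {
  const titles: string[] = [];
  for (const snippet of snippets) {
    const title = titleRe.exec(snippet)?.[1]?.trim();
    if (!title) continue; // the guard that drops a snippet when the regex misses
    titles.push(title);
  }
  return titles;
}

const legacy = '<span class="search-snippet-title">Legacy title</span>';
const current =
  '<div class="title search-snippet-title line-clamp-1" title="Current title">Current title</div>';

// The span-only pattern sees only the legacy snippet; the div snippet is skipped.
console.log(extractTitles([legacy, current], SPAN_ONLY_TITLE));
// Accepting either tag recovers both titles.
console.log(extractTitles([legacy, current], SPAN_OR_DIV_TITLE));
```

A fixture made only of `legacy`-style snippets would pass with either pattern, which is consistent with the test staying green while the live page returned nothing.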
Verified end-to-end (live scrape)
A real `search.brave.com` fetch (with the app's exact `BRAVE_HEADERS`) returns HTTP 200 with 20 `data-type="web"` snippet blocks; the current parser extracts 0. With the fix, the same scrape returns real results, e.g. `Photosynthesis - Wikipedia -> https://en.wikipedia.org/wiki/Photosynthesis` with content.
Fix
Accept `<span>` or `<div>` for the title element, and update `tests/web-search/brave.test.ts` fixtures to the current markup (keeping one legacy `<span>` case for back-compat).
Test plan
`npx vitest run tests/web-search/brave.test.ts` (2 pass); `tsc`, `prettier`, and `eslint` clean.
This unblocks #642 (keyless Brave as a server default), whose review asked to confirm that keyless returns results end-to-end.