fix(web-search): match Brave's current result-title markup#688

Open
ly-wang19 wants to merge 2 commits into
THU-MAIC:mainfrom
ly-wang19:fix/brave-scraper-title-markup

@ly-wang19

Contributor

What & why

The keyless Brave scrape (parseBraveSearchHtml, lib/web-search/brave.ts) returns 0 results against Brave’s current HTML. Brave moved the result title element:

  • old: <span class="search-snippet-title">Title</span>
  • now: <div class="title search-snippet-title line-clamp-1 …" title="Title">Title</div>

The title regex matched <span> only, so if (!title) continue; skipped every snippet → empty results. The existing test stayed green because its fixture still used the old <span> markup — it had drifted from the live page and gave false confidence.
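The failure mode can be reproduced in isolation. The sketch below is not the code from lib/web-search/brave.ts: `SPAN_ONLY_TITLE` and `extractTitle` are hypothetical stand-ins that approximate a span-only title pattern, and the `<div>` fixture is a simplified form of the current markup with the elided classes left out.

```typescript
// Hypothetical span-only pattern, approximating the pre-fix behaviour.
// The real regex in brave.ts may differ in detail.
const SPAN_ONLY_TITLE =
  /<span[^>]*class="[^"]*search-snippet-title[^"]*"[^>]*>([\s\S]*?)<\/span>/i;

// Legacy markup: the title lives in a <span>.
const legacySnippet =
  '<span class="search-snippet-title">Photosynthesis - Wikipedia</span>';

// Current markup (simplified): the title lives in a <div>.
const currentSnippet =
  '<div class="title search-snippet-title line-clamp-1" title="Photosynthesis - Wikipedia">Photosynthesis - Wikipedia</div>';

// Returns the title text, or null when the pattern does not match.
function extractTitle(snippetHtml: string): string | null {
  const match = SPAN_ONLY_TITLE.exec(snippetHtml);
  return match?.[1]?.trim() ?? null;
}

const legacyTitle = extractTitle(legacySnippet); // "Photosynthesis - Wikipedia"
const currentTitle = extractTitle(currentSnippet); // null

console.log({ legacyTitle, currentTitle });
```

A null title is what drives the `if (!title) continue;` branch, so a page made up entirely of `<div>` titles yields an empty result list even though the fetch itself succeeds.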

Closes #687.

Verified end-to-end (live scrape)

A real search.brave.com fetch (with the app’s exact BRAVE_HEADERS) returns HTTP 200 with 20 data-type="web" snippet blocks; the current parser extracts 0. With the fix, the same scrape returns real results, e.g. Photosynthesis - Wikipedia -> https://en.wikipedia.org/wiki/Photosynthesis with content.

Heads-up: keyless scraping is best-effort — rapid repeat requests get rate-limited/challenged by Brave (some return an empty page). That’s a separate docs/UX caveat, not addressed here.

Fix

Accept <span> or <div> for the title element, and update tests/web-search/brave.test.ts fixtures to the current markup (keeping one legacy <span> case for back-compat).
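As a rough illustration of "accept either tag", the sketch below uses a non-capturing alternation for the tag name. `TITLE_EITHER_TAG` and `extractTitleEitherTag` are hypothetical names, and the pattern is an assumption about the shape of the fix rather than a copy of the diff.

```typescript
// Assumed pattern: the tag name may be span or div; the title stays in group 1.
const TITLE_EITHER_TAG =
  /<(?:span|div)[^>]*class="[^"]*search-snippet-title[^"]*"[^>]*>([\s\S]*?)<\/(?:span|div)>/i;

// Returns the title text, or null when no title element is found.
function extractTitleEitherTag(snippetHtml: string): string | null {
  const match = TITLE_EITHER_TAG.exec(snippetHtml);
  return match?.[1]?.trim() ?? null;
}

// Legacy <span> markup still matches (back-compat case).
const fromLegacySpan = extractTitleEitherTag(
  '<span class="search-snippet-title">Legacy title</span>',
); // "Legacy title"

// Current <div> markup (simplified) now matches as well.
const fromCurrentDiv = extractTitleEitherTag(
  '<div class="title search-snippet-title line-clamp-1" title="Current title">Current title</div>',
); // "Current title"

// Any other element carrying the class is still ignored.
const fromOtherTag = extractTitleEitherTag(
  '<p class="search-snippet-title">Not a title element</p>',
); // null

console.log({ fromLegacySpan, fromCurrentDiv, fromOtherTag });
```

Keeping one legacy `<span>` fixture alongside the `<div>` fixtures, as the updated tests do, guards both branches of the alternation.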

Test plan

npx vitest run tests/web-search/brave.test.ts (2 pass), tsc/prettier/eslint clean.

This un-blocks #642 (keyless Brave as a server default), whose review asked to confirm keyless returns results end-to-end.

Brave moved the web-result title from `<span class="search-snippet-title">`
to `<div class="… search-snippet-title …">`, so parseBraveSearchHtml hit
`if (!title) continue` for every snippet and returned 0 results against the
live page. The existing test stayed green because its fixture still used the
old <span> markup (drifted from reality).

Accept either <span> or <div> for the title, and update the fixtures to the
current markup (keeping one legacy <span> case). Verified end-to-end against a
real search.brave.com scrape: 0 results before, real results after.

Closes THU-MAIC#687

@tongshu2023

Contributor

+1 — independently hit the same bug and can confirm this fix against the live page (before finding this PR; my duplicate #719 is closed in favor of this one).

Live data from a scrape of https://search.brave.com/search?q=photosynthesis just now (2026-06-10), using the exact BRAVE_HEADERS from brave.ts: HTTP 200, 20 data-type=web snippets, 0 <span class=...search-snippet-title...> matches, 20 <div class=...search-snippet-title...> matches. So the span-only regex extracts exactly 0 of 20, and this PR's pattern matches all 20.

One optional nit: <\/(?:span|div)> would also accept a mismatched pair like <span ...>...</div>. A backreference makes the close tag track the open tag: /<(span|div)[^>]*class="[^"]*search-snippet-title[^"]*"[^>]*>([\s\S]*?)<\/\1>/i (title moves to capture group 2). Harmless either way given stripHtml, so feel free to ignore.

Use a backreference (`<(span|div)…></\1>`) so the title's closing tag must
match its opening tag, per review feedback. Prevents a malformed
`<span …>…</div>` from being mis-parsed as a title. Title text moves to
capture group 2; existing matched-tag markup (div and legacy span) is
unaffected. Adds a regression test for the mismatched-tag case.

@ly-wang19

Contributor Author

Thanks for the independent confirmation against the live page, and for closing #719 in favor of this — appreciated.

Applied your backreference suggestion: the title regex now ties the closing tag to the captured opening tag (<(span|div)…></\1>, with the title in capture group 2), so a malformed <span …>…</div> is no longer picked up. Added a regression test for the mismatched-tag case. Existing matched markup (current div and legacy span) is unaffected.
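For reference, a minimal sketch of the backreference behaviour described above. `TITLE_MATCHED_TAG` and `extractMatchedTitle` are hypothetical names, and the quoting inside the `class` attribute match is an assumption; the pattern in brave.ts may differ.

```typescript
// Group 1 captures the opening tag name; \1 requires the same name on the
// closing tag. The title text is therefore in capture group 2.
const TITLE_MATCHED_TAG =
  /<(span|div)[^>]*class="[^"]*search-snippet-title[^"]*"[^>]*>([\s\S]*?)<\/\1>/i;

// Returns the title text, or null when no well-formed title element is found.
function extractMatchedTitle(snippetHtml: string): string | null {
  const match = TITLE_MATCHED_TAG.exec(snippetHtml);
  return match?.[2]?.trim() ?? null;
}

// Matched pairs behave as before.
const matchedDiv = extractMatchedTitle(
  '<div class="title search-snippet-title">Current</div>',
); // "Current"
const matchedSpan = extractMatchedTitle(
  '<span class="search-snippet-title">Legacy</span>',
); // "Legacy"

// A malformed <span …>…</div> pair is no longer picked up.
const mismatched = extractMatchedTitle(
  '<span class="search-snippet-title">Broken</div>',
); // null

console.log({ matchedDiv, matchedSpan, mismatched });
```

One caveat of this sketch: if a matching closing tag appears later in the same input, the lazy group will extend up to it, so the check is only as tight as the snippet boundaries passed in.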

Development

Successfully merging this pull request may close these issues.

Brave keyless scraper returns 0 results — result-title markup changed (span→div)

2 participants