fix(web-search): accept <div> result titles in Brave keyless scraper#717

Closed
tongshu2023 wants to merge 1 commit into THU-MAIC:main from tongshu2023:fix/brave-scraper-title-div-687

Conversation

@tongshu2023

Contributor

Summary

Fixes #687 — the keyless Brave web-search path returns 0 results because Brave moved the result title from a <span> to a <div>.

Problem

parseBraveSearchHtml matched the title with a regex that only accepts <span class="search-snippet-title">. Brave's current markup is:

<div class="title search-snippet-title line-clamp-1 …" title="Title">Title</div>

So titleMatch was always null, if (!title) continue; skipped every snippet, and the scraper returned an empty list even though the page contained 20 data-type="web" blocks (see the live evidence in #687).
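The failure mode described above can be reproduced in isolation. This is a minimal sketch, not the repository's code: the span-only pattern is an assumed reconstruction of the old regex, and `extractTitles` and the two snippet strings are hypothetical stand-ins for the real parser loop and page.

```typescript
// Assumed reconstruction of the old span-only title pattern (not copied from the repository).
const spanOnlyTitle =
  /<span[^>]*class="[^"]*search-snippet-title[^"]*"[^>]*>([\s\S]*?)<\/span>/i;

// Hypothetical snippet blocks modelled on the <div> markup quoted above.
const snippets: string[] = [
  '<div data-type="web"><div class="title search-snippet-title line-clamp-1" title="First">First</div></div>',
  '<div data-type="web"><div class="title search-snippet-title line-clamp-1" title="Second">Second</div></div>',
];

// Hypothetical helper mirroring the described loop: match the title, skip the block if it is missing.
function extractTitles(blocks: string[], pattern: RegExp): string[] {
  const titles: string[] = [];
  for (const block of blocks) {
    const titleMatch = block.match(pattern);
    const title = titleMatch ? titleMatch[1].trim() : "";
    if (!title) continue; // every <div>-titled block is dropped here
    titles.push(title);
  }
  return titles;
}

const extracted = extractTitles(snippets, spanOnlyTitle);
console.log(extracted.length); // 0
```

Because the skip is silent, the page can contain any number of valid blocks and the caller still sees an empty list rather than an error.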

Fix

Accept either tag for the title element, using a backreference so the open/close tags stay consistent:

/<(span|div)[^>]*class="[^"]*search-snippet-title[^"]*"[^>]*>([\s\S]*?)<\/\1>/i
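To illustrate how the backreference behaves, the sketch below runs the pattern above against three inputs. The `matchTitle` helper and the sample strings are hypothetical; only the regex itself comes from this PR. The title text lands in capture group 2, since group 1 now holds the tag name.

```typescript
// The pattern from this PR: group 1 is the tag name, group 2 is the title content.
const titlePattern =
  /<(span|div)[^>]*class="[^"]*search-snippet-title[^"]*"[^>]*>([\s\S]*?)<\/\1>/i;

// Hypothetical helper: returns the trimmed title text, or null when nothing matches.
function matchTitle(html: string): string | null {
  const m = html.match(titlePattern);
  return m ? m[2].trim() : null;
}

// Sample inputs (assumed, modelled on the markup discussed in this PR).
const currentMarkup =
  '<div class="title search-snippet-title line-clamp-1" title="Title">Title</div>';
const legacyMarkup = '<span class="search-snippet-title">Legacy title</span>';
const mismatchedMarkup = '<div class="search-snippet-title">Broken</span>';

const currentTitle = matchTitle(currentMarkup); // "Title"
const legacyTitle = matchTitle(legacyMarkup); // "Legacy title"
const mismatchedTitle = matchTitle(mismatchedMarkup); // null: <div> must close with </div>

console.log(currentTitle, legacyTitle, mismatchedTitle);
```

One caveat of this approach: the lazy `[\s\S]*?` stops at the first closing tag of the same name, so a title element that itself contained a nested `<div>` would be cut short. The markup quoted in this PR has plain text inside the title, so that case does not arise here.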

Tests

  • Updated the parseBraveSearchHtml and searchWithBrave fixtures to Brave's current <div> markup (the old fixtures had drifted from the live page, which is why the suite stayed green through the breakage).
  • Added a dedicated legacy <span> case so back-compat stays covered.
  • tests/web-search: 25 tests green; prettier --check clean.

Note on live verification: repeat scrapes from my IP are currently answered with HTTP 429 (Brave rate-limits rapid keyless requests — consistent with the best-effort caveat in #687), so the fixtures mirror the live markup captured in the issue's evidence rather than a fresh scrape.

Brave moved the result title from <span class="search-snippet-title">
to <div class="title search-snippet-title line-clamp-1 ...">, so the
title regex matched nothing and every snippet was skipped - the keyless
Brave path returned 0 results against the live page.

Accept both tags (backreference keeps open/close consistent), update the
test fixtures to the current markup, and keep one legacy <span> case for
back-compat.

Closes THU-MAIC#687
@tongshu2023

Contributor Author

Apologies — I missed that #688 (opened earlier by the team) already covers this. Closing in favor of that PR. The tests added here are free to cherry-pick if useful.

Development

Successfully merging this pull request may close these issues.

Brave keyless scraper returns 0 results — result-title markup changed (span→div)
