fix(web-search): accept <div> result titles in Brave keyless scraper #717
Closed
tongshu2023 wants to merge 1 commit into
Brave moved the result title from <span class="search-snippet-title"> to <div class="title search-snippet-title line-clamp-1 ...">, so the title regex matched nothing and every snippet was skipped - the keyless Brave path returned 0 results against the live page. Accept both tags (backreference keeps open/close consistent), update the test fixtures to the current markup, and keep one legacy <span> case for back-compat. Closes THU-MAIC#687
Apologies, I missed that #688 (opened earlier by the team) already covers this. Closing in favor of that PR. The tests added here are free to cherry-pick if useful.
Summary
Fixes #687: the keyless Brave web-search path returns 0 results because Brave moved the result title from a <span> to a <div>.
Problem
parseBraveSearchHtml matched the title with a regex that only accepted <span class="search-snippet-title">. Brave's current markup puts the title in <div class="title search-snippet-title line-clamp-1 ...">, so titleMatch was always null, if (!title) continue; skipped every snippet, and the scraper returned an empty list even though the page contained 20 data-type="web" blocks (see the live evidence in #687).
Fix
Accept either tag for the title element, using a backreference so the open/close tags stay consistent:
/<(span|div)[^>]*class="[^"]*search-snippet-title[^"]*"[^>]*>([\s\S]*?)<\/\1>/i
Tests
- Updated the parseBraveSearchHtml and searchWithBrave fixtures to Brave's current <div> markup (the old fixtures had drifted from the live page, which is why the suite stayed green through the breakage).
- Kept one legacy <span> case so back-compat stays covered.
- tests/web-search: 25 tests green; prettier --check clean.
Note on live verification: repeat scrapes from my IP are currently answered with HTTP 429 (Brave rate-limits rapid keyless requests, consistent with the best-effort caveat in #687), so the fixtures mirror the live markup captured in the issue's evidence rather than a fresh scrape.
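For reviewers who want to see the backreference behave, here is a minimal TypeScript sketch of the title match described in the Fix section. Only the first regex is taken from this PR. The span-only pattern is a hypothetical stand-in for the old behaviour (not copied from the repository), extractTitle is an illustrative helper rather than the real parseBraveSearchHtml, and the sample markup is simplified.

```typescript
// Regex from the Fix section: group 1 is the tag name, group 2 the title HTML.
const TITLE_RE =
  /<(span|div)[^>]*class="[^"]*search-snippet-title[^"]*"[^>]*>([\s\S]*?)<\/\1>/i;

// Hypothetical stand-in for the old span-only pattern (an assumption, not repo code).
const LEGACY_SPAN_ONLY_RE =
  /<span[^>]*class="[^"]*search-snippet-title[^"]*"[^>]*>([\s\S]*?)<\/span>/i;

// Hypothetical helper: returns the plain-text title, or null when nothing matches.
function extractTitle(snippetHtml: string, re: RegExp = TITLE_RE): string | null {
  const match = re.exec(snippetHtml);
  if (!match) return null;
  // In both patterns the title HTML is the last capture group.
  const raw = match[match.length - 1] ?? "";
  // Strip nested tags and surrounding whitespace; treat an empty title as no match.
  return raw.replace(/<[^>]+>/g, "").trim() || null;
}

// Simplified sample markup (assumed for illustration, not a live capture).
const currentMarkup =
  '<div class="title search-snippet-title line-clamp-1">Example <b>title</b></div>';
const legacyMarkup = '<span class="search-snippet-title">Legacy title</span>';
const mismatched = '<div class="search-snippet-title">Broken</span>';

console.log(extractTitle(currentMarkup)); // new pattern, <div> title
console.log(extractTitle(currentMarkup, LEGACY_SPAN_ONLY_RE)); // span-only pattern
console.log(extractTitle(legacyMarkup)); // new pattern, legacy <span> title
console.log(extractTitle(mismatched)); // open/close tags disagree
```

The backreference \1 is what lets a single pattern cover both tags without also accepting a <div> that is closed by </span>. A span-only pattern of the kind sketched here finds nothing in the <div> markup, which matches the empty result list reported in #687.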