fix scraper Playwright cleanup hangs#432
Merged
Merged
Conversation
Prevent Playwright render failures from leaving page or browser cleanup promises unbounded. Reset the browser after render failures and force-kill Chromium if browser close itself hangs.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Hardens the Playwright-based HTML rendering middleware to prevent scrape workers from hanging indefinitely when rendering or cleanup operations stall (e.g., repeated timeouts on JS-heavy pages), and adds regression tests to validate recovery behavior.
Changes:
- Introduces a shared
withTimeouthelper and applies bounded timeouts to Playwright cleanup (route, page, context, browser). - Resets the shared Chromium browser after render failures or cleanup failures, including force-killing the browser process when close hangs.
- Adds tests covering render timeout recovery and hung page cleanup recovery.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/scraper/middleware/HtmlPlaywrightMiddleware.ts | Adds timeout-bounded cleanup, browser reset logic, and force-kill fallback for hung browser closes. |
| src/scraper/middleware/HtmlPlaywrightMiddleware.test.ts | Adds regression tests for browser reset on render timeout and bounded cleanup when page.close hangs. |
Comments suppressed due to low confidence (1)
src/scraper/middleware/HtmlPlaywrightMiddleware.ts:200
- closeBrowser() unconditionally sets this.browser = null in finally. Because scrapes run concurrent batches (Promise.all) and HtmlPipeline reuses a single HtmlPlaywrightMiddleware instance, another task can launch a new browser while this close is still awaiting; the finally block would then clobber the new instance, leaking a live Chromium process and leaving the middleware with a null reference.
} finally {
// Always set to null to allow fresh browser on next request
this.browser = null;
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Avoid the Joplin GFM table conversion path hanging on very large tables by splitting oversized tables into smaller row chunks before Turndown runs. This preserves full table content while bounding per-table conversion work. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep forced Playwright process cleanup from throwing out of failure-handling paths when SIGKILL delivery itself fails or reports no target. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Preserve caption and colgroup metadata when chunking large tables, keep unsplittable oversized tables as HTML via placeholders, and make Playwright reset reasons distinguish render and cleanup failures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Repeated Playwright render timeouts on JS-heavy pages could leave cleanup awaits unbounded, which made a scrape worker appear alive but stop making progress. The WezTerm full-site repro also exposed a second hang class: a very large HTML table could make Joplin's GFM table conversion CPU-bound for more than two minutes. This PR hardens both paths so failures recover and pathological table conversion is avoided.
Summary
browserTimeoutMsand selector/cleanup waits bounded bypageTimeoutMs.Notes
A local
fetch-urlrun againsthttps://wezterm.org/colorschemes/n/index.htmlcompleted successfully. The broader WezTerm crawl then exposed the separate table-conversion stall onhttps://wezterm.org/config/lua/wezterm/nerdfonts.html?q=. The broader table strategy discussion is tracked separately as Related #433.Fixes #428
Validation
npx vitest run src/scraper/middleware/HtmlPlaywrightMiddleware.test.tsnpx vitest run src/scraper/middleware/HtmlToMarkdownMiddleware.test.tsnpm run typechecknpm run lintnpm testfetch-urlforhttps://wezterm.org/config/lua/wezterm/nerdfonts.html?q=completed in about 24s instead of exceeding the 2-minute isolated Joplin GFM repro watchdog.