Skip to content

fix scraper Playwright cleanup hangs#432

Merged
arabold merged 4 commits into
mainfrom
arabold/fix-playwright-cleanup-hang
Jun 6, 2026
Merged

fix scraper Playwright cleanup hangs#432
arabold merged 4 commits into
mainfrom
arabold/fix-playwright-cleanup-hang

Conversation

@arabold

@arabold arabold commented Jun 6, 2026

Copy link
Copy Markdown
Owner

Repeated Playwright render timeouts on JS-heavy pages could leave cleanup awaits unbounded, which made a scrape worker appear alive but stop making progress. The WezTerm full-site repro also exposed a second hang class: a very large HTML table could make Joplin's GFM table conversion CPU-bound for more than two minutes. This PR hardens both paths so failures recover and pathological table conversion is avoided.

Summary

  • Bound Playwright route, page, context, and browser cleanup with configured timeouts.
  • Reset the shared Chromium browser after render failures or cleanup failures so later pages start from a fresh browser state.
  • Force-kill the Chromium process if browser close itself hangs.
  • Keep navigation bounded by browserTimeoutMs and selector/cleanup waits bounded by pageTimeoutMs.
  • Split oversized HTML tables into smaller row chunks before GFM conversion, preserving full table content while avoiding the Joplin plugin's large-table hot path.
  • Add regression tests for render timeout recovery, hung page cleanup, and oversized table splitting.

Notes

A local fetch-url run against https://wezterm.org/colorschemes/n/index.html completed successfully. The broader WezTerm crawl then exposed the separate table-conversion stall on https://wezterm.org/config/lua/wezterm/nerdfonts.html?q=. The broader table strategy discussion is tracked separately as Related #433.

Fixes #428

Validation

  • npx vitest run src/scraper/middleware/HtmlPlaywrightMiddleware.test.ts
  • npx vitest run src/scraper/middleware/HtmlToMarkdownMiddleware.test.ts
  • npm run typecheck
  • npm run lint
  • npm test
  • Local fetch-url for https://wezterm.org/config/lua/wezterm/nerdfonts.html?q= completed in about 24s instead of exceeding the 2-minute isolated Joplin GFM repro watchdog.

Prevent Playwright render failures from leaving page or browser cleanup promises unbounded. Reset the browser after render failures and force-kill Chromium if browser close itself hangs.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 6, 2026 15:53

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Hardens the Playwright-based HTML rendering middleware to prevent scrape workers from hanging indefinitely when rendering or cleanup operations stall (e.g., repeated timeouts on JS-heavy pages), and adds regression tests to validate recovery behavior.

Changes:

  • Introduces a shared withTimeout helper and applies bounded timeouts to Playwright cleanup (route, page, context, browser).
  • Resets the shared Chromium browser after render failures or cleanup failures, including force-killing the browser process when close hangs.
  • Adds tests covering render timeout recovery and hung page cleanup recovery.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/scraper/middleware/HtmlPlaywrightMiddleware.ts Adds timeout-bounded cleanup, browser reset logic, and force-kill fallback for hung browser closes.
src/scraper/middleware/HtmlPlaywrightMiddleware.test.ts Adds regression tests for browser reset on render timeout and bounded cleanup when page.close hangs.
Comments suppressed due to low confidence (1)

src/scraper/middleware/HtmlPlaywrightMiddleware.ts:200

  • closeBrowser() unconditionally sets this.browser = null in finally. Because scrapes run concurrent batches (Promise.all) and HtmlPipeline reuses a single HtmlPlaywrightMiddleware instance, another task can launch a new browser while this close is still awaiting; the finally block would then clobber the new instance, leaking a live Chromium process and leaving the middleware with a null reference.
      } finally {
        // Always set to null to allow fresh browser on next request
        this.browser = null;
      }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/scraper/middleware/HtmlPlaywrightMiddleware.ts
arabold and others added 2 commits June 6, 2026 10:30
Avoid the Joplin GFM table conversion path hanging on very large tables by splitting oversized tables into smaller row chunks before Turndown runs. This preserves full table content while bounding per-table conversion work.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep forced Playwright process cleanup from throwing out of failure-handling paths when SIGKILL delivery itself fails or reports no target.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment thread src/scraper/middleware/HtmlToMarkdownMiddleware.ts Outdated
Comment thread src/scraper/middleware/HtmlToMarkdownMiddleware.ts
Comment thread src/scraper/middleware/HtmlPlaywrightMiddleware.ts Outdated
Preserve caption and colgroup metadata when chunking large tables, keep unsplittable oversized tables as HTML via placeholders, and make Playwright reset reasons distinguish render and cleanup failures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@arabold arabold merged commit 35f9342 into main Jun 6, 2026
2 checks passed
@arabold arabold deleted the arabold/fix-playwright-cleanup-hang branch June 6, 2026 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Repeated Playwright waitForSelector timeouts on JS-heavy pages cause service to silently hang

2 participants