Skip to content

Coalesce concurrent session refreshes and fix refresh-path deadlock#6

Open
evanqhuang wants to merge 2 commits intoAdvik-B:masterfrom
evanqhuang:fix/concurrent-403-refresh
Open

Coalesce concurrent session refreshes and fix refresh-path deadlock#6
evanqhuang wants to merge 2 commits intoAdvik-B:masterfrom
evanqhuang:fix/concurrent-403-refresh

Conversation

@evanqhuang
Copy link
Copy Markdown

Summary

Scraper is documented as safe to share across goroutines, but three interacting concurrency defects in lib/cloudscraper.go break that guarantee:

  1. Thundering-herd in handle403. The in403Retry bool elects one retrying goroutine; the other N−1 concurrent 403 callers return ErrMaxRetriesExceeded without retrying. A single Cloudflare blip turns into N−1 spurious failures, and the error is a lie — those callers never exceeded any retry budget.
  2. Deadlock on first scheduled refresh. do() holds s.mu while calling refreshSession, which calls s.Gets.dos.mu.Lock(). sync.Mutex is not reentrant, so the first automatic refresh (default: one hour after first request) deadlocks the scraper.
  3. Data race on UserAgent / cipher suites. handle403's refresh path releases s.mu before calling refreshSession, which reassigns s.UserAgent and mutates transport cipher suites; concurrent do() calls read s.UserAgent.Headers with no lock. go test -race flags this once tests exist.

Fix

  • golang.org/x/sync/singleflight to coalesce refreshes under a single "session-refresh" key. Concurrent refresh requests collapse into one upstream probe, and a failed refresh propagates the same error to every coalesced caller — no blind retry against stale state.
  • Split do into do + doWithRefresh(req, allowRefresh). refreshSession's root-URL probe calls it with allowRefresh=false, breaking the do → refreshSession → do reentrancy that both caused the deadlock and would otherwise make singleflight.Do wait on itself. submitChallengeForm also uses allowRefresh=false: a failed challenge-form POST must not recursively enter the refresh group.
  • Lock discipline. s.mu now guards only state transitions — UserAgent swap, sessionStartTime, cipher suites, cookie jar — and is released before any I/O. doWithRefresh snapshots s.UserAgent under s.mu before using it.
  • Meaningful refresh failure. A 403 on the refresh probe is treated as a failed refresh (matches the original semantics from before the refactor).

Test plan

New file lib/cloudscraper_concurrency_test.go — the repo previously had zero tests. All four run under go test -race:

  • TestHandle403_CoalescesConcurrentRefreshes — 20 concurrent 403 callers produce ≤ 3 refresh probes (ideally 1) and all succeed.
  • TestAutoRefresh_NoDeadlock — 1ms SessionRefreshInterval + 20 concurrent callers complete within 10s. Old code hangs.
  • TestRefreshFailure_PropagatesToAllCallers — upstream permanently 403 → every coalesced caller receives a refresh-failure error.
  • TestConcurrentReadsUnderRefresh_NoDataRace — 50 goroutines × 3 iterations while forcing repeated refreshes; -race detects any unguarded UserAgent read.

Verification:

go build ./...
go vet ./...
go test -race -count=5 ./...

All pass locally, 5 iterations clean.

Dependency

Adds golang.org/x/sync v0.15.0 as a direct dependency. Stdlib-adjacent module, single package used (singleflight). Rolling our own coalescer via sync.Cond is more code and more bug surface for no benefit.

Three interacting concurrency defects made Scraper unsafe to share across
goroutines, contrary to its documented use:

* handle403 used a single in403Retry bool to elect one retrying goroutine;
  all other concurrent 403 callers immediately returned ErrMaxRetriesExceeded
  without retrying, so a single Cloudflare blip turned into N-1 spurious
  failures and a misleading error.

* The scheduled-refresh block in do() held s.mu across refreshSession's
  network probe, which re-entered s.Get -> s.do -> s.mu.Lock(). sync.Mutex
  is not reentrant, so the first automatic refresh (default: one hour after
  the first request) deadlocked the scraper.

* handle403 invoked refreshSession without holding s.mu while refreshSession
  reassigned s.UserAgent and mutated cipher suites; concurrent callers in
  do() read s.UserAgent.Headers with no lock, racing the writes.

Fix:

* Replace the in403Retry bool with a singleflight.Group keyed on
  "session-refresh". Concurrent refresh requests collapse into one upstream
  probe, and a failed refresh propagates the same error to every coalesced
  caller instead of triggering a blind retry against stale state.

* Split do() into do(req) + doWithRefresh(req, allowRefresh). The scheduled
  refresh check now takes s.mu only to snapshot sessionStartTime, then
  releases it before any I/O. refreshSession's root-URL probe uses
  doWithRefresh(req, false), breaking the do -> refreshSession -> do
  reentrancy that caused the deadlock and would otherwise cause
  singleflight.Do to wait on itself.

* Guard UserAgent reads with s.mu at request time; guard the whole refresh
  state transition (UserAgent, sessionStartTime, cipher suites, cookie jar)
  under s.mu in refreshSession so doWithRefresh's snapshot is consistent.

* Treat a 403 from the refresh probe as a failed refresh so callers see a
  meaningful error rather than silently continuing with an unchanged session.

* submitChallengeForm now uses doWithRefresh(req, false): a failed challenge
  submission is not a normal 403 and must not recursively enter the
  singleflight group that the refresh probe may already hold.
Covers the regressions fixed in the previous commit:

* TestHandle403_CoalescesConcurrentRefreshes asserts that N concurrent 403s
  produce O(1) refresh probes (not N) and that every caller succeeds, not
  N-1 spurious ErrMaxRetriesExceeded failures.

* TestAutoRefresh_NoDeadlock forces a refresh on every request with a
  millisecond SessionRefreshInterval. Old code deadlocks on first refresh;
  new code completes within the 10s test deadline.

* TestRefreshFailure_PropagatesToAllCallers verifies that if the upstream
  is still 403 after refresh, all coalesced callers receive the same
  refresh-failure error rather than racing on stale state.

* TestConcurrentReadsUnderRefresh_NoDataRace runs 50 goroutines x 3
  iterations while forcing repeated refreshes; under -race, any unguarded
  read of Scraper.UserAgent or the transport cipher suites trips the
  detector.

Tests disable stealth mode to avoid its default 500ms-2s human-like delay
between requests, which would make the concurrency assertions impractical.

Run with `go test -race -count=5 ./lib/` to shake out scheduler-induced
flakes.
@evanqhuang evanqhuang force-pushed the fix/concurrent-403-refresh branch from ced7d19 to 89573c0 Compare April 18, 2026 06:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant