fix(opal-server): reconnect broadcaster on backbone disconnect (PER-15065)#915
fix(opal-server): reconnect broadcaster on backbone disconnect (PER-15065)#915zeevmoney wants to merge 7 commits into
Conversation
A brief broadcaster-backbone outage escalated into a fleet-wide client
connection-drop storm that only a worker restart cleared: the shared
broadcaster reader task completed on disconnect and was never restarted
while clients stayed connected, so every reconnecting client was cancelled.
Add ReconnectingBroadcaster (reader reconnects with bounded exponential
backoff, stays pending across a transient backbone loss) and an idempotent
SafeConnectionManager (eliminates the ValueError('list.remove(x): x not in
list') churn). Gate via OPAL_BROADCAST_RECONNECT_* config keys. Includes
unit + integration tests with negative controls that reproduce the bug, and
an extended app-tests e2e (graceful + ungraceful backbone drop).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
✅ Deploy Preview for opal-docs ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
… broadcaster gap Builds on the reconnecting broadcaster so a transient backbone outage no longer silently desyncs OPAL server instances: - (B) bounded outbound replay buffer: broadcasts that fail while the backbone is down are queued and replayed on reconnect so peers that re-subscribe catch up. - (A) resync on recovery (the guarantee): after any gap each worker forces its own clients to reconnect and re-fetch full policy + data state, so the fleet converges (a worker may have missed incoming peer updates during its gap). Pins the broadcaster's listening context during a resync so recycling clients does not drop the listener count to 0 and cancel the reader; single-flight recovery plus a lock around the buffer/overflow flag; drops un-serializable buffered items so a poison payload cannot wedge the buffer; cancels pending recovery tasks on shutdown. Adds OPAL_BROADCAST_REPLAY_BUFFER_SIZE / _RESYNC_ON_RECONNECT / _RESYNC_SETTLE_SECONDS. Includes multi-instance convergence tests and extends the app-tests e2e with a publish-during-outage consistency scenario. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A wedged broadcaster reader (reader task absent/dead while clients depend on it) previously left /healthcheck returning 200, so k8s never routed away from or restarted the bad worker — a transient backbone blip could wedge a pod for hours. Add ReconnectingBroadcaster.is_reader_healthy(): unhealthy only when listeners are present and the reader task is missing or done (crashed / gave up) — it stays healthy through a normal transient reconnect, so the probe does not flap during a backbone blip. /healthcheck returns 503 in that wedged state (/ stays a trivial liveness 200), gated by OPAL_BROADCAST_HEALTHCHECK_ENABLED (default true). Unit tests for the health contract plus an end-to-end route test (200 / 503 / kill-switch). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d DB kill Drives a real ReconnectingBroadcaster reader against a fault-injectable backbone: - transient DB kill -> reader stays pending (reconnecting) -> is_reader_healthy() stays True / /healthcheck stays 200 (the no-flap property: no needless restart during a normal blip), and recovers on bus.recover(); - permanent kill that exhausts reconnect retries -> reader task done -> is_reader_healthy() False / /healthcheck 503. Plus route-level checks (pending -> 200, gave-up -> 503 with body, / stays 200). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The docformatter hook pinned at v1.7.5 ships a second hook (`docformatter-venv`) declaring `language: python_venv`, which prek and pre-commit>=4 reject for the whole manifest — so hooks could not initialize locally. CI passed only because it pins `pre-commit<4` on Python 3.12. - Bump docformatter v1.7.5 -> v1.7.6 (drops the `docformatter-venv` hook; keeps `language: python`). v1.7.6 is byte-identical to v1.7.5 on this codebase, so the bump introduces no reformatting (v1.7.7 changes docstring wrapping, hence v1.7.6). - Pin the hook toolchain to Python 3.12 via default_language_version: the pinned black 23.1.0 / isort / docformatter do not run on 3.13+, so hooks failed on machines whose default interpreter is newer. - Apply docformatter to healthcheck_db_kill_test.py (the one file that predated this fix and had not been run through the hook). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds resilience to OPAL server’s pub/sub layer so transient broadcast-backbone disconnects don’t cascade into client disconnect storms or leave workers silently wedged, while also making broadcaster-reader failure visible via /healthcheck for k8s self-healing.
Changes:
- Introduces
ReconnectingBroadcaster(bounded backoff + jitter, replay buffer, post-gap resync hook) andSafeConnectionManager(idempotent disconnect + staggered close) to harden pub/sub behavior. - Updates server health endpoints so
/remains trivial liveness, while/healthcheckcan return 503 when a reconnecting broadcaster is wedged and clients depend on it. - Adds new config keys plus extensive unit/integration/e2e tests and docs updates; updates pre-commit config and app-tests to validate the regression scenarios.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| packages/opal-server/opal_server/pubsub_resilience.py | Adds reconnecting broadcaster + replay/resync logic and idempotent connection manager. |
| packages/opal-server/opal_server/pubsub.py | Wires resilient broadcaster/config, installs SafeConnectionManager, and registers resync-on-reconnect behavior. |
| packages/opal-server/opal_server/server.py | Makes /healthcheck broadcaster-aware (503 on wedged reader) and adds a trivial / liveness route. |
| packages/opal-server/opal_server/config.py | Adds broadcaster reconnect/replay/resync/healthcheck configuration flags and tunables. |
| packages/opal-server/opal_server/tests/safe_connection_manager_test.py | Unit tests for idempotent disconnect and staggered close behavior. |
| packages/opal-server/opal_server/tests/reconnecting_broadcaster_test.py | Unit tests for reconnecting reader lifecycle, backoff bounds, and health predicate. |
| packages/opal-server/opal_server/tests/healthcheck_endpoint_test.py | End-to-end tests for / and /healthcheck behavior (including kill switch). |
| packages/opal-server/opal_server/tests/healthcheck_db_kill_test.py | Simulates DB/backbone kill to validate non-flapping health and 200/503 transitions. |
| packages/opal-server/opal_server/tests/broadcaster_reconnect_integration_test.py | Integrates with PubSubEndpoint.main_loop to ensure clients aren’t cancelled across backbone drops. |
| packages/opal-server/opal_server/tests/broadcaster_consistency_integration_test.py | Multi-instance convergence tests covering replay + resync single-flight behavior. |
| documentation/docs/getting-started/running-opal/run-opal-server/broadcast-interface.mdx | Documents broadcaster reconnection/resync/replay configuration (needs one more env var entry). |
| app-tests/run.sh | Extends e2e script to test graceful/ungraceful backbone outages, storm regression guards, and consistency across gaps. |
| app-tests/docker-compose-app-tests.yml | Adds Postgres healthcheck for the broadcast channel container. |
| .pre-commit-config.yaml | Pins hook runtime to Python 3.12 and bumps docformatter rev to avoid pre-commit manifest issues. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| | Env Var Name | Default | Function | | ||
| | :------------------------------------------- | :------ | :----------------------------------------------------------------------------------------------------------------------------------------------- | | ||
| | OPAL_BROADCAST_RECONNECT_ENABLED | `true` | Reconnect the broadcaster reader on a backbone disconnect instead of dropping all client connections. Set to `false` for the legacy behavior. | | ||
| | OPAL_BROADCAST_RECONNECT_MAX_RETRIES | `0` | Maximum consecutive reconnect attempts before giving up and letting the worker restart. `0` means retry forever. | | ||
| | OPAL_BROADCAST_RECONNECT_BACKOFF_MIN_SECONDS | `0.5` | Minimum backoff (seconds) between reconnect attempts. | | ||
| | OPAL_BROADCAST_RECONNECT_BACKOFF_MAX_SECONDS | `30` | Maximum backoff (seconds) between reconnect attempts. | | ||
| | OPAL_BROADCAST_REPLAY_BUFFER_SIZE | `10000` | Max number of outbound broadcasts buffered while the backbone is down and replayed on reconnect (`0` disables buffering). On overflow the oldest are dropped. | | ||
| | OPAL_BROADCAST_RESYNC_ON_RECONNECT | `true` | After a backbone gap, force this worker's clients to reconnect so they re-fetch full policy + data state. Set to `false` to rely only on best-effort replay. | | ||
| | OPAL_BROADCAST_RESYNC_SETTLE_SECONDS | `2` | Grace period after a reconnect before replaying buffered broadcasts and resyncing clients, to let peer servers re-subscribe. | |
…an retries) The e2e failed in CI for two reasons: 1. The cross-instance consistency check was racy. A data update published to one server while the backbone is down reaches the second server's client only via the replay buffer (the base data config restores only /static on a resync), and with the default reconnect backoff (max 30s) the second server could still be mid-reconnect when the first replayed after the 2s settle, missing it. Pin the broadcaster timing in the compose env (BACKOFF_MIN=0.5, BACKOFF_MAX=2, RESYNC_SETTLE_SECONDS=3) so both replicas re-subscribe before the replay fires. 2. The retry loop re-ran main() without tearing the stack down, so generate_opal_keys could not bind host port 7002 and the previous attempt's stale transient client ERRORs tripped check_no_error. Tear the stack down between attempts. Validated locally: the full app-tests suite (including the graceful + ungraceful broadcaster kills and the consistency scenario) now passes on the first attempt. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rden recovery Remove the previous fix attempt for the broadcaster connection-loss storm — the OPAL_BROADCAST_CONN_LOSS_BUGFIX_EXPERIMENT_ENABLED flag — now that the reconnecting broadcaster is the real fix. - Drop the experiment flag; derive ignore_broadcaster_disconnected from the broadcaster type: False for ReconnectingBroadcaster (a completed reader surfaces the disconnect so clients reconnect), True (library-safe) for the stock EventBroadcaster rollback so the legacy path degrades to "stale but connected" instead of the storm. - Pin the whole post-gap recovery in a listening context so the reader cannot be cancelled mid-recovery (previously only close_all_staggered was pinned). - Flush the replay buffer without holding the buffer lock across network I/O; re-enqueue the unsent tail on a mid-drain transport failure. - Cancel AND join child tasks on reader shutdown. - Remove the write-only _buffer_overflowed flag and the config-doc claim that overflow triggers a resync (it does not; the resync fires unconditionally on every gap). Tests: add reconnect/disconnect/slow-connection coverage — flaky-reconnect-recovers, slow-connect-stays-pending, subscribe-fail-reconnects, partial-replay-requeue, buffer overflow/disabled, deterministic single-flight, max_retries>0-then-recover, skip-own, shutdown-cancels-children, and the resync-pins-the-reader-alive invariant. The consistency harness now holds a listening context to mirror a connected client. Reviewed by python-pro + fastapi-pro; findings addressed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
EliMoshkovich
left a comment
There was a problem hiding this comment.
Deep Review (correctness, data consistency/integrity, security)
Verdict
High-quality, well-tested PR that fixes a real production failure mode. Recommend approval after addressing findings 1–3. The core mechanism (reconnecting reader + idempotent disconnect + healthcheck) is sound, and I verified its assumptions against the actual upstream fastapi_websocket_pubsub 1.0.1 / fastapi_websocket_rpc sources. The findings below are about edge-case lifecycle interactions and the precise scope of the consistency "guarantee" — none invalidate the design.
What I verified against upstream (all correct)
__read_notifications__/__broadcast_notifications__have trailing double underscores → no name mangling, so the overrides genuinely dispatch (upstream_subscribe_to_all_topicscallsself.__broadcast_notifications__— resolves to the subclass).- The double-disconnect storm is real:
WebsocketRPCEndpoint.main_loopcallshandle_disconnect()(disconnect #1) and its outer bareexceptcallsmanager.disconnect()again (#2); upstreamConnectionManager.disconnectis an unguardedlist.remove.PubSubEndpointoffers no manager injection point, so theendpoint.endpoint.managerattribute swap (with the loudRuntimeErrorguard) is the only option. The manager is swapped before any connection is served — safe. - The staging "wedge" is real: upstream
start_reader_taskconnects-and-raises insideEventBroadcasterContextManager.__aenter__after incrementing_listen_count; the exception skips__aexit__, leaking the count, leavingget_reader_task() → Noneandasyncio.wait([task, None])→TypeError. The override (reader loop owns connection) eliminates the raise path. - The recovery pin (
get_listening_context()around the whole recovery) correctly prevents the upstream context manager from cancelling the reader whenclose_all_staggereddrives the client-held listen count down; upstream__aexit__cancels without awaiting, so no deadlock with_cancel_pending_tasks. - The resync premise holds client-side: close 1012 →
websocketsraisesConnectionClosedError(not ClosedOK) → client retries →DataUpdater.on_connect→get_base_policy_data()full refetch. - No ack-semantics regression:
EventNotifier.callback_subscribersalready swallowed broadcast exceptions upstream, so publish was always fire-and-forget; buffering strictly improves on silent loss.
Findings
1. HIGH — _had_prior_connection conflates "backbone gap" with "reader task restart" → spurious resync, worst-case self-sustaining close loop
_had_prior_connection is instance-level state (pubsub_resilience.py, __read_notifications__). When the last client disconnects, upstream's context manager cancels the reader and sets _subscription_task = None. When the next client connects, a fresh reader task starts, sees _had_prior_connection == True from its predecessor, and schedules a full gap recovery — flush + resync that closes the client that just connected.
- Benign case: every reader restart costs each client one spurious 1012-close + full policy/data refetch.
- Pathological case (small deployments): close client → 2s tail sleep → pin releases → count hits 0 → reader cancelled → client reconnects (its retry backoff has grown past the settle window) → fresh reader → spurious recovery → close again — a self-sustaining churn loop, ironically a miniature version of the storm being fixed.
This affects default deployments: OPAL_STATISTICS_ENABLED defaults to False, so nothing holds a permanent listening context and the listen count is driven purely by clients. The e2e runs with OPAL_STATISTICS_ENABLED=true (reader pinned forever at startup), so this path is untested.
Fix: make gap detection reader-task-local (a local variable in __read_notifications__ instead of the instance attribute). It's strictly more correct: a reader restart only happens after the listener count hit 0, i.e. no clients existed to miss anything, and the first client back does a full refetch anyway.
2. MEDIUM — single-flight recovery can permanently skip a gap's replay and resync
_schedule_gap_recovery drops the request if a recovery is in flight. The flap-during-settle case is handled (the in-flight recovery's later flush picks up the new buffer entries). But a gap that opens and closes after the in-flight recovery's flush (i.e., during _fire_reconnect, which sleeps up to ~2×settle ≈ 4s) gets nothing: its buffered broadcasts sit unflushed and no resync fires — until some future gap triggers the next recovery, which may be days away. A managed-Postgres restart is exactly a multi-drop flap on this timescale.
Fix: replace skip with a "rerun requested" flag — when _schedule_gap_recovery finds a live recovery, mark it; _recover_after_gap loops while the flag is set.
3. MEDIUM — after max-retries give-up, the worker doesn't actually restart; the storm regime returns until k8s acts
The docstrings say exhausting retries lets "the worker restart", but in-process nothing restarts it: the reader-done → _graceful_shutdown() callback exists only in the statistics-enabled path (server.py). With statistics off and BROADCAST_RECONNECT_MAX_RETRIES > 0: after give-up, the done reader stays installed (_subscription_task is reset to None only when the count reaches 0 — which churn may never allow), so every reconnecting client's asyncio.wait completes instantly and it's dropped — the original storm. Recovery depends on /healthcheck 503 + a liveness probe pointed at it, which the PR only recommends as a deployment follow-up.
The default (0 = retry forever) avoids this, but the knob is an operator footgun. Fix: on give-up, schedule the same graceful shutdown the statistics path uses (or document loudly that MAX_RETRIES > 0 requires the liveness probe). Related: a backbone that flaps connect-OK/insta-close never increments attempt (it's reset on connect), so MAX_RETRIES can never trip on flap loops — the counter only sees consecutive exceptions.
4. MEDIUM (docs/data-integrity) — the "resync = guarantee" claim is broader than the mechanism
Two precise caveats the broadcast-interface.mdx text should carry:
- Scope of the guarantee. A client's resync refetch covers policy + the configured data sources (
get_base_policy_data). Runtime-published incremental updates (e.g.POST /data/configwith inlinedataor one-off URLs) are not part of that baseline — for those, the best-effort replay buffer is the only recovery path. The guarantee holds when data sources serve full current truth, but for generic OPAL users with ad-hoc inline updates, a dropped replay (buffer overflow, finding 2, mid-flush crash) is a permanently lost update. The e2e implicitly proves this: convergence ofconsistency_userrequires the replay log lines. - Replay reordering. A buffered update replayed at settle-time can land on peers after a newer live update published post-recovery (per-worker ordering is enforced; cross-worker is not). With URL-fetch entries this self-heals (fetch returns current truth); with inline-data entries, stale-wins is possible if the peer's own resync happened to complete before the replay arrived. Narrow window, worth a sentence in the docs.
5. LOW — partial-replay re-enqueue can evict the newest entries
_outbound_buffer.extendleft(reversed(unsent)) on a bounded deque: if concurrent failures refilled the buffer during the lock-free publish phase, extendleft evicts from the right — the newest items — inverting the documented drop-oldest policy. Cheap fix: rebuild as unsent + existing, truncated from the front.
6. LOW — resync amplification on a flapping backbone
Every re-subscribe after a gap fires a full recovery; each resync = fleet-wide client recycle + full policy/data refetch against git/data sources. Single-flight dedups overlaps but each flap-recovery cycle fires anew. Consider a minimum interval / cooldown between resyncs. Also: the buffer is bounded by count (10000), not bytes — large inline payloads could pin significant per-worker memory during a long outage.
Security assessment
No significant concerns; a few observations:
- No new auth surface.
/healthcheckand/were already unauthenticated; the 503 body leaks one bit ("backbone wedged") to unauthenticated callers — standard for probes, acceptable. No external input can influence the health state. - Buffered payloads can contain secrets (data-update entries may embed fetcher credentials). They are held in memory only and never logged —
_buffer_outbound/flush log only counts and exception reprs, consistent with the recent FetcherConfig log-redaction fix (a846b50). Good. - Forced reconnect doesn't weaken auth: clients closed with 1012 re-authenticate (JWT) on reconnect like any new connection.
- DoS angle: anyone able to flap the backbone can trigger repeated fleet-wide refetch storms (finding 6) — but backbone access is already a trusted position; the resync kill-switch and a cooldown would bound it.
random.uniformfor jitter is non-cryptographic use — fine.
Tests & quality
Test quality is genuinely strong: negative controls reproducing both upstream bugs, a real two-instance convergence test with NOTIFY-like semantics, the health-state matrix, and a live fault→reconnect→give-up driver. Gaps: the statistics-off reader-restart lifecycle (finding 1), a gap closing during the late recovery phase (finding 2), and partial-replay-with-refill eviction (finding 5). test_resync_is_single_flight...'s len(calls) <= 2 is a weak assertion (passes with 0 calls), though the follow-up assertions partially compensate. The .pre-commit-config.yaml toolchain pin is unrelated housekeeping — fine, well-explained, ideally a separate commit.
Bottom line: the availability fix is correct and verified against the real upstream internals; the consistency machinery works but has three lifecycle edge cases (findings 1–3) worth fixing before merge — they're each small, localized changes — and the docs should state the guarantee's actual scope (finding 4).
🤖 Generated with Claude Code
Changes proposed
Fixes a production incident where a brief broadcaster-backbone (Postgres
LISTEN/NOTIFY, Redis, Kafka) disconnect escalated into a self-sustaining, fleet-wide OPAL-client connection-drop storm that only a worker restart cleared — closes the cross-instance consistency gap the availability fix would otherwise open, and makes the failure visible to k8s so it can self-heal.Root cause (in the pub/sub libraries OPAL wraps): with
ignore_broadcaster_disconnected=False, every client websocket waits on a single shared broadcaster reader task. On a backbone drop the reader completes/never-recovers, so reconnecting clients are cancelled indefinitely; a non-idempotentdisconnectadds aValueError('list.remove(x): x not in list')storm. The same fragility has two faces: a loud storm (prod, when the reader was already running and the drop cancels clients) and a silent wedge (staging, when the drop madestart_reader_taskraise → leaked listener count →get_reader_task()returnsNone→asyncio.wait([client, None])TypeErrorcrashes every new/wshandler) — both needing a manual restart.This PR (Phase 1 — self-contained in OPAL):
ReconnectingBroadcaster): the reader reconnects with bounded backoff + jitter and stays pending across a transient drop, so clients are never spuriously cancelled. Because itsstart_reader_taskno longer connects-and-raises, the stagingNone-readerTypeErrorwedge cannot occur either.SafeConnectionManager): eliminates thelist.removestorm./healthcheck(new): a wedged reader (reader task absent/dead while clients depend on it) now returns 503 so a k8s readiness/liveness probe can route away from / restart the worker instead of leaving it broken for hours. It stays healthy through a normal transient reconnect (no probe flap)./remains a trivial liveness 200.OPAL_BROADCAST_CONN_LOSS_BUGFIX_EXPERIMENT_ENABLED(the earlier, partial fix attempt).ignore_broadcaster_disconnectedis now derived from the broadcaster type —Falsefor the reconnecting broadcaster (a completed reader surfaces the disconnect so clients reconnect), library-safeTruefor the stockEventBroadcasterrollback (OPAL_BROADCAST_RECONNECT_ENABLED=False) so the legacy path degrades to stale-but-connected rather than the storm. Blast radius: a deployment still setting that env var has it harmlessly ignored (Confi drops unknown keys).OPAL_BROADCAST_RECONNECT_ENABLED,…_MAX_RETRIES,…_BACKOFF_MIN/MAX_SECONDS,OPAL_BROADCAST_REPLAY_BUFFER_SIZE,OPAL_BROADCAST_RESYNC_ON_RECONNECT,OPAL_BROADCAST_RESYNC_SETTLE_SECONDS,OPAL_BROADCAST_HEALTHCHECK_ENABLED(all default-on, reversible)./healthcheckroute test plus a simulated-DB-kill test (the probe stays 200 while the reader reconnects through a transient outage — no flap — and flips to 503 only when reconnect gives up), and an extendedapp-testse2e (graceful + ungraceful backbone kill, publish-during-outage convergence, regression guards). Docs updated inbroadcast-interface.mdx.A follow-up (Phase 2) upstreams the fixes to
fastapi_websocket_rpc/fastapi_websocket_pubsub, bumps the pins, and removes these OPAL-side stop-gaps.Check List (Check all the applicable boxes)
Note to reviewers
Internal tracking: Permit PER-15065.
Reader-task lifecycle (the availability fix)
flowchart TD A["Backbone connection drops"] --> B{"Reconnect enabled?"} B -- "no (stock behavior)" --> C["Shared reader task completes or fails to start"] C --> D["get_reader_task done or None"] D --> E["Clients cancelled (storm) or new /ws crash (wedge)"] E --> F["Needs a manual restart"] B -- "yes (this PR)" --> G["Reader retries with backoff, stays pending"] G --> H["Clients keep their websocket; fan-out resumes on reconnect"] G -- "after max retries" --> CConsistency across a gap
Ordering is enforced: on recovery a worker replays the buffer first, then fires the resync.
Why the broadcaster-aware healthcheck
Two real incidents showed the OPAL server can get wedged in pub/sub state it cannot self-heal, while k8s saw the pods as healthy the whole time (
/healthcheckreturned 200 unconditionally) — so nothing routed away or restarted, and a ~20-second backbone blip became a multi-hour outage.is_reader_healthy()is judged against the listener count (unhealthy only when listeners depend on a missing/completed reader), so it does not flap during a normal transient reconnect; the probe'sfailureThresholdabsorbs any rare race. Recommended deployment follow-up: point the OPAL server liveness probe at/healthcheck(today only readiness uses it) so a wedged worker is restarted, and disableauto_minor_version_upgradeon the broadcaster RDS (the trigger in both incidents).Reviewed by python-pro + fastapi-pro (two rounds); fixes applied
Round 1: reader self-cancellation during resync (pin a listening context); single-flight recovery + a lock around the buffer; cancel pending recovery tasks on shutdown; drop un-serializable buffered items; replaced a production
assertwithraise; per-worker jitter to avoid a fleet-wide resync stampede; corrected e2e log guards.Round 2 (after removing the experiment flag): make the
BROADCAST_RECONNECT_ENABLED=Falserollback safe (ignore_broadcaster_disconnected=Truefor the stock broadcaster — degrade to stale, not storm); pin the whole post-gap recovery so the reader can't be cancelled mid-recovery; flush the replay buffer without holding the buffer lock across network I/O and re-enqueue the unsent tail on a mid-drain failure; cancel and join child tasks on shutdown; remove the write-only_buffer_overflowedflag and the phantom config-doc claim. Findings that were verified non-issues (a non-observable healthcheck startup race, infinite-retry "healthy while retrying",asyncio.Lockconstruction) were left as-is with rationale.How it was tested
/healthcheck200/503/kill-switch test). Fullopal-server+opal-commonsuites pass; existingOpalServerstartup test unaffected.app-tests/run.sh(CI E2E Tests job) — graceful + ungraceful backbone kill, publish-during-outage consistency,no list.remove+ replay-ran guards.Blast radius
SafeConnectionManagerand the healthcheck logic apply to every server; existing startup test confirms no regression. Reconnect/resync/healthcheck activate only with a broadcaster configured and are each gated by a default-on, runtime-reversible flag.Generated with Claude Code