Skip to content

fix(fetcher): release fetched payload after each worker iteration#916

Open
Kyzgor wants to merge 1 commit into
permitio:masterfrom
Kyzgor:fix/fetch-worker-frame-retention
Open

fix(fetcher): release fetched payload after each worker iteration#916
Kyzgor wants to merge 1 commit into
permitio:masterfrom
Kyzgor:fix/fetch-worker-frame-retention

Conversation

@Kyzgor

@Kyzgor Kyzgor commented Jun 8, 2026

Copy link
Copy Markdown

Fixes Issue

Closes #844
Closes #770

Changes proposed

Fixes a memory high-water mark in the data fetching engine: a fetch worker held its last
fetched dataset (and HTTP response) in memory while it was idle.

  • opal_common/fetcher/engine/fetch_worker.py: null all five per-iteration locals
    (event, callback, fetcher, res, data) in the worker's finally block, which runs
    after the exception handlers so they still see event, and pre-initialize
    fetcher = res = data = None before the try (those three may be unassigned if an early
    exception occurs). An idle worker now holds no payload between queue items.
  • opal_common/fetcher/tests/fetch_worker_retention_test.py (new): a regression test that an
    idle worker does not retain its last fetched payload. It uses a weakref, so it fails if the
    cleanup is removed.

Root cause

fetch_worker runs while True: event, callback = await queue.get(); .... In CPython a frame
keeps its locals bound until they are reassigned or the frame exits, so a worker blocked on the
next queue.get() still referenced the previous iteration's event/fetcher/res/data,
which pins the last fetched dataset and its aiohttp response. With FETCHING_WORKER_COUNT
workers (default 6) this retained up to N full datasets, proportional to payload size, released
only on process restart. OPA's own memory was unaffected (the data is correctly stored there);
the retention was entirely client-side in the worker frames. That is also why it looked like a
leak yet heap inspection after things settle found nothing: the objects free as soon as a worker
handles a new item, but idle workers keep the last one.

This is the same mechanism behind both reports. #844 triggers a fetch on every reconnect
("Initial load"), and #770 triggers one on every periodic datasource refresh.

Check List (Check all the applicable boxes)

  • I sign off on contributing this submission to open-source
  • My code follows the code style of this project.
  • My change requires changes to the documentation.
  • I have updated the documentation accordingly.
  • All new and existing tests passed.
  • This PR does not contain plagiarized content.
  • The title of my pull request is a short description of the requested changes.

Screenshots

N/A. The behaviour is memory usage, quantified below.

Note to reviewers

Before/after, from a local harness driving the real DataUpdater -> fetch -> store path (full
reproduction steps available on request):

Scenario Before fix After fix
#844 reconnect loop (2 MB payload, 12 reconnects, default 6 workers) staircase +~73 MB, ~6x payload retained per reconnect, never released plateau, +~5 MB total; retained dataset copies 6 -> 1
#770 periodic refresh (13 MB payload, real OPA, 12 cycles) climbs to ~630 MB high-water mark (about 6 workers x ~80 MB), held until restart flat ~159 MB (+0.11 MB/cycle, i.e. noise)

The defining signature of the bug was that the retained-dataset count equalled the worker count.
After the fix the retained count is 1 (measured with the default FETCHING_WORKER_COUNT of 6),
and since the retention was structurally one dataset per worker it no longer scales with the
pool size. malloc_trim does not reclaim the pre-fix growth, which confirms genuine retention
rather than allocator fragmentation.

Correctness: the null-out is in finally, which runs after the except blocks that call
engine._on_failure(err, event), so the failure handler still receives the real event. The
existing fetcher and data_updater tests pass (10), and with the new regression test all 11 pass.

Scope: this PR only changes the worker frame-retention behaviour. During the investigation I also
found a small, separate unbounded object accumulation in the pub/sub reconnect path
(asyncio.as_completed in fastapi-websocket-pubsub). It grows one task and one queue per
reconnect but is only KB-scale and lives in a different repository, so I've raised it separately
and it is not part of this change. The growth fixed here is a bounded high-water mark under
normal (gc-enabled) runtime.

CI note: the security/snyk check tends to error on fork PRs (org token/quota) rather than
indicating a code problem. Happy to have a maintainer re-run it.

A fetch worker runs `while True: event, callback = await queue.get(); ...`.
Python keeps a frame's locals bound until they are reassigned or the frame
exits, so a worker blocked on the next `queue.get()` kept the previous
iteration's `event`/`fetcher`/`res`/`data` alive — pinning the last fetched
dataset and its HTTP response in memory. With N fetching workers this retained
up to N full datasets indefinitely: a per-update memory high-water mark,
proportional to payload size, that only reset on process restart.

Null these locals in the worker's `finally` block so an idle worker holds no
payload. Adds a regression test asserting an idle worker does not retain its
last fetched payload.

Closes permitio#844
Closes permitio#770
@netlify

netlify Bot commented Jun 8, 2026

Copy link
Copy Markdown

Deploy Preview for opal-docs canceled.

Name Link
🔨 Latest commit a93b6a0
🔍 Latest deploy log https://app.netlify.com/projects/opal-docs/deploys/6a2706f7de4a660007b728dd

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Memory Leak in OPAL Client During Reconnection OPAL Client memory consumption over time increase

1 participant