fix(fetcher): release fetched payload after each worker iteration by Kyzgor · Pull Request #916 · permitio/opal

Kyzgor · 2026-06-08T18:16:20Z

Fixes Issue

Closes #844
Closes #770

Changes proposed

Fixes a memory high-water mark in the data fetching engine: a fetch worker held its last
fetched dataset (and HTTP response) in memory while it was idle.

opal_common/fetcher/engine/fetch_worker.py: null all five per-iteration locals
(event, callback, fetcher, res, data) in the worker's finally block, which runs
after the exception handlers so they still see event, and pre-initialize
fetcher = res = data = None before the try (those three may be unassigned if an early
exception occurs). An idle worker now holds no payload between queue items.
opal_common/fetcher/tests/fetch_worker_retention_test.py (new): a regression test that an
idle worker does not retain its last fetched payload. It uses a weakref, so it fails if the
cleanup is removed.

Root cause

fetch_worker runs while True: event, callback = await queue.get(); .... In CPython a frame
keeps its locals bound until they are reassigned or the frame exits, so a worker blocked on the
next queue.get() still referenced the previous iteration's event/fetcher/res/data,
which pins the last fetched dataset and its aiohttp response. With FETCHING_WORKER_COUNT
workers (default 6) this retained up to N full datasets, proportional to payload size, released
only on process restart. OPA's own memory was unaffected (the data is correctly stored there);
the retention was entirely client-side in the worker frames. That is also why it looked like a
leak yet heap inspection after things settle found nothing: the objects free as soon as a worker
handles a new item, but idle workers keep the last one.

This is the same mechanism behind both reports. #844 triggers a fetch on every reconnect
("Initial load"), and #770 triggers one on every periodic datasource refresh.

Check List (Check all the applicable boxes)

I sign off on contributing this submission to open-source
My code follows the code style of this project.
My change requires changes to the documentation.
I have updated the documentation accordingly.
All new and existing tests passed.
This PR does not contain plagiarized content.
The title of my pull request is a short description of the requested changes.

Screenshots

N/A. The behaviour is memory usage, quantified below.

Note to reviewers

Before/after, from a local harness driving the real DataUpdater -> fetch -> store path (full
reproduction steps available on request):

Scenario	Before fix	After fix
#844 reconnect loop (2 MB payload, 12 reconnects, default 6 workers)	staircase +~73 MB, ~6x payload retained per reconnect, never released	plateau, +~5 MB total; retained dataset copies 6 -> 1
#770 periodic refresh (13 MB payload, real OPA, 12 cycles)	climbs to ~630 MB high-water mark (about 6 workers x ~80 MB), held until restart	flat ~159 MB (+0.11 MB/cycle, i.e. noise)

The defining signature of the bug was that the retained-dataset count equalled the worker count.
After the fix the retained count is 1 (measured with the default FETCHING_WORKER_COUNT of 6),
and since the retention was structurally one dataset per worker it no longer scales with the
pool size. malloc_trim does not reclaim the pre-fix growth, which confirms genuine retention
rather than allocator fragmentation.

Correctness: the null-out is in finally, which runs after the except blocks that call
engine._on_failure(err, event), so the failure handler still receives the real event. The
existing fetcher and data_updater tests pass (10), and with the new regression test all 11 pass.

Scope: this PR only changes the worker frame-retention behaviour. During the investigation I also
found a small, separate unbounded object accumulation in the pub/sub reconnect path
(asyncio.as_completed in fastapi-websocket-pubsub). It grows one task and one queue per
reconnect but is only KB-scale and lives in a different repository, so I've raised it separately
and it is not part of this change. The growth fixed here is a bounded high-water mark under
normal (gc-enabled) runtime.

CI note: the security/snyk check tends to error on fork PRs (org token/quota) rather than
indicating a code problem. Happy to have a maintainer re-run it.

A fetch worker runs `while True: event, callback = await queue.get(); ...`. Python keeps a frame's locals bound until they are reassigned or the frame exits, so a worker blocked on the next `queue.get()` kept the previous iteration's `event`/`fetcher`/`res`/`data` alive — pinning the last fetched dataset and its HTTP response in memory. With N fetching workers this retained up to N full datasets indefinitely: a per-update memory high-water mark, proportional to payload size, that only reset on process restart. Null these locals in the worker's `finally` block so an idle worker holds no payload. Adds a regression test asserting an idle worker does not retain its last fetched payload. Closes permitio#844 Closes permitio#770

netlify · 2026-06-08T18:16:26Z

✅ Deploy Preview for opal-docs canceled.

Name	Link
🔨 Latest commit	`a93b6a0`
🔍 Latest deploy log	https://app.netlify.com/projects/opal-docs/deploys/6a2706f7de4a660007b728dd

This was referenced Jun 8, 2026

Memory Leak in OPAL Client During Reconnection #844

Open

OPAL Client memory consumption over time increase #770

Open

feat(client): bound concurrent in-flight data updates to cap peak memory #917

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(fetcher): release fetched payload after each worker iteration#916

fix(fetcher): release fetched payload after each worker iteration#916
Kyzgor wants to merge 1 commit into
permitio:masterfrom
Kyzgor:fix/fetch-worker-frame-retention

Kyzgor commented Jun 8, 2026 •

edited

Loading

Uh oh!

netlify Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Kyzgor commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fixes Issue

Changes proposed

Root cause

Check List (Check all the applicable boxes)

Screenshots

Note to reviewers

Uh oh!

netlify Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for opal-docs canceled.

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Kyzgor commented Jun 8, 2026 •

edited

Loading

netlify Bot commented Jun 8, 2026 •

edited

Loading