fix: detect JobRunner subprocess death and stop memory-profiler ESRCH spam#526
fix: detect JobRunner subprocess death and stop memory-profiler ESRCH spam#526Alpaca233 wants to merge 5 commits into
Conversation
… spam A JobRunner subprocess dying mid-acquisition (e.g. segfault in a fork()-unsafe native library) was previously undetected: the parent kept dispatching save jobs into a queue nothing was consuming, and the acquisition appeared to complete normally while hundreds of timepoints were lost. The only visible symptom was a flood of memory-profiler tracebacks about /proc/<pid>/smaps_rollup. - JobRunner: spawn a daemon watchdog thread in start() that blocks on self.sentinel. When the subprocess exits, it distinguishes expected shutdown (via _shutdown_event) from unexpected death and invokes a registered handler with the exitcode. kill() now also sets _shutdown_event so intentional termination is not flagged. - MultiPointWorker: registers a handler that logs the failure and calls request_abort_fn(), so the acquisition loop exits on the next abort check instead of silently rotting for hours. - memory_profiler._get_linux_pss_mb: catch ProcessLookupError in addition to FileNotFoundError/PermissionError/ValueError. Reading /proc/<zombie>/smaps_rollup returns ESRCH, which is a sibling of FileNotFoundError under OSError and was not being caught. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
There was a problem hiding this comment.
Pull request overview
This PR improves acquisition reliability by detecting unexpected JobRunner subprocess death (e.g., segfault/SIGKILL/OOM) and aborting the acquisition promptly, while also silencing repetitive Linux memory-profiler ESRCH warnings when /proc/<pid>/smaps_rollup disappears mid-sample.
Changes:
- Add a
JobRunnerwatchdog thread (waits onProcess.sentinel) and a configurable unexpected-exit handler callback. - Wire
MultiPointWorkerto abort the acquisition when aJobRunnerdies unexpectedly. - Treat
ProcessLookupError(ESRCH) as a benign failure mode when reading Linux PSS from/proc.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| software/control/core/multi_point_worker.py | Registers a JobRunner unexpected-exit handler to abort acquisitions on subprocess death. |
| software/control/core/memory_profiler.py | Suppresses ESRCH tracebacks by catching ProcessLookupError while reading /proc smaps data. |
| software/control/core/job_processing.py | Adds watchdog thread + handler plumbing to JobRunner to detect unexpected subprocess exit. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| self._job_runners.append((job_class, job_runner)) | ||
| if job_runner is not None: | ||
| job_runner.set_unexpected_exit_handler(self._on_job_runner_died) |
There was a problem hiding this comment.
For newly created JobRunner instances, the handler is registered after job_runner.start(). There is a small but real race where the subprocess can die during warmup (or fail to start) before the handler is set, in which case the watchdog logs the error but won’t abort the acquisition. Consider registering the unexpected-exit handler before starting the process (and similarly ensuring pre-warmed runners have the handler set as soon as they’re created/handed off).
| if self._shutdown_event is not None: | ||
| self._shutdown_event.set() | ||
| super().kill() | ||
|
|
There was a problem hiding this comment.
The watchdog treats an exit as "expected" only when _shutdown_event is set. In this codebase JobRunner.terminate() is used during shutdown (e.g., MultiPointController.close), but terminate() does not set _shutdown_event, so the watchdog will log "died UNEXPECTEDLY" and invoke the handler during intentional termination. Consider overriding terminate() (and any other explicit-stop path you use) to set _shutdown_event the same way as kill()/shutdown() before signaling the process.
| def terminate(self): | |
| # Mark as expected so the watchdog treats the exit as intentional. | |
| if self._shutdown_event is not None: | |
| self._shutdown_event.set() | |
| super().terminate() |
- Override JobRunner.terminate() to set _shutdown_event so an intentional terminate() (e.g., from MultiPointController.close) is not reported as "died UNEXPECTEDLY" by the watchdog. Matches the kill() override. - Register the unexpected-exit handler on freshly created JobRunners before start(), eliminating the race where the subprocess could die during warmup before the handler was installed. - Register the handler on pre-warmed runners the moment MultiPointWorker adopts them (before set_acquisition_info), narrowing the uncovered window to the pre-handoff phase only. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- Replace the "shutdown_event is set" signal with a dedicated _intentional_exit bool. _shutdown_event is a multiprocessing primitive that shutdown() nulls during cleanup, which opened a narrow race where the watchdog could read None and misclassify an intentional shutdown as unexpected death. The new flag survives cleanup and is set by all three explicit-stop paths: kill(), terminate(), and shutdown(). - Drop narrative comments that restate what the code already says, per project style. Keep the WHY comment about _intentional_exit vs _shutdown_event. - Collapse a split f-string log message to a single literal. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- Add tests/control/core/test_job_runner_watchdog.py covering SIGKILL detection, intentional kill/terminate/shutdown suppression, handler exception isolation, and the _intentional_exit-survives-shutdown regression from commit 24751ec. - multi_point_worker: check is_alive() alongside is_ready() before adopting a pre-warmed runner. is_ready() reads a multiprocessing.Event the subprocess sets early in run(); once the subprocess dies the Event remains set in shared memory, so is_ready() alone can't distinguish a live runner from a corpse. Without is_alive(), a runner that segfaults during pre-warm would be adopted into the acquisition and resume the silent-rot failure mode this PR is meant to fix. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Summary
JobRunnerthat detects unexpected subprocess death (segfault, SIGKILL, OOM) viaProcess.sentineland invokes a handler.MultiPointWorkerwires this torequest_abort_fn()so the acquisition stops within one timepoint check instead of silently queuing saves into a dead worker for hours.ProcessLookupErrorin_get_linux_pss_mbalongsideFileNotFoundError/PermissionError/ValueError, so reading/proc/<zombie>/smaps_rollupno longer produces a WARNING + traceback on every sample.Background
On a 48-hour run, the
JobRunnersubprocess segfaulted ~9 hours in insidelibnvidia-glcore.so(inherited viafork()from the Qt/OpenGL parent — NVIDIA's userspace driver is famously notfork()-safe). The parent never noticed:/proc/<pid>was enumerated as a zombie, 465 subsequent timepoint save jobs queued into amultiprocessing.Queuenothing was consuming, and the UI reported normal completion. The only log signal was ~6,900 memory-profiler ESRCH tracebacks plus 466Backpressure reset() called with 1 jobs pendingwarnings.These fixes do not address the root cause (fork-unsafety — that is a separate follow-up to switch
JobRunnertospawnafter a payload-picklability audit). They turn a silent multi-hour data-loss failure into a loud, actionable error on the next timepoint check.Test plan
pytest tests/control/core/test_job_runner_shutdown.py tests/control/core/test_job_runner_pending.py tests/control/core/test_backpressure.py— 63 passedpytest tests/control/core/test_job_processing_downsampled.py tests/control/core/test_drain_all_issue.py— 13 passedpytest tests/control/core/test_memory_profiler.py— 56 passed (1 skipped, platform-specific)os.kill(runner.pid, SIGKILL)on a live JobRunner → watchdog fires within milliseconds, handler receivesexitcode=-9(-SIGKILL), ERROR log identifies PID and "died UNEXPECTEDLY"pre-commit run(Black) passes🤖 Generated with Claude Code