fix: detect JobRunner subprocess death and stop memory-profiler ESRCH spam#526

Open
Alpaca233 wants to merge 5 commits into Cephla-Lab:master from Alpaca233:fix/jobrunner-death-detection

Conversation

@Alpaca233
Collaborator

Summary

  • Add a watchdog thread to JobRunner that detects unexpected subprocess death (segfault, SIGKILL, OOM) via Process.sentinel and invokes a handler (sketched after this list). MultiPointWorker wires this to request_abort_fn() so the acquisition stops within one timepoint check instead of silently queuing saves into a dead worker for hours.
  • Catch ProcessLookupError in _get_linux_pss_mb alongside FileNotFoundError/PermissionError/ValueError, so reading /proc/<zombie>/smaps_rollup no longer produces a WARNING + traceback on every sample.
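
To make the watchdog shape concrete, here is a minimal sketch of the sentinel-based design. The names (_intentional_exit, set_unexpected_exit_handler) are taken from the commit messages below; the body is illustrative, not the repo's exact code.

import multiprocessing as mp
import threading
from multiprocessing.connection import wait

class JobRunner(mp.Process):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._intentional_exit = False
        self._exit_handler = None

    def set_unexpected_exit_handler(self, handler):
        self._exit_handler = handler

    def start(self):
        super().start()
        # Daemon thread: blocks on the process sentinel until the child exits.
        threading.Thread(target=self._watchdog, daemon=True).start()

    def _watchdog(self):
        wait([self.sentinel])  # returns when the subprocess dies, for any reason
        self.join()            # reap the child so exitcode is populated
        if not self._intentional_exit and self._exit_handler is not None:
            self._exit_handler(self.exitcode)  # e.g. -9 after SIGKILL

    def kill(self):
        self._intentional_exit = True  # suppress the watchdog for explicit stops
        super().kill()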

Background

On a 48-hour run, the JobRunner subprocess segfaulted about 9 hours in, inside libnvidia-glcore.so (inherited via fork() from the Qt/OpenGL parent; NVIDIA's userspace driver is famously not fork()-safe). The parent never noticed: /proc/<pid> was enumerated as a zombie, 465 subsequent timepoint save jobs were queued into a multiprocessing.Queue that nothing was consuming, and the UI reported normal completion. The only log signal was ~6,900 memory-profiler ESRCH tracebacks plus 466 "Backpressure reset() called with 1 jobs pending" warnings.
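
The ESRCH spam has a simple shape: ProcessLookupError is the errno.ESRCH member of the OSError hierarchy, so a catch list built around FileNotFoundError misses it. An illustrative sketch of the fixed read path (the function name is from the PR; the body is a guess at its shape, not the repo's code):

def _get_linux_pss_mb(pid):
    # Proportional set size (PSS) in MB from the kernel's per-process rollup.
    try:
        with open(f"/proc/{pid}/smaps_rollup") as f:
            for line in f:
                if line.startswith("Pss:"):
                    return int(line.split()[1]) / 1024.0  # kernel reports kB
    except (FileNotFoundError, PermissionError, ProcessLookupError, ValueError):
        # Reading a zombie's smaps_rollup fails with ESRCH, raised by Python
        # as ProcessLookupError; previously uncaught, hence a traceback per sample.
        return None
    return None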

These fixes do not address the root cause (fork-unsafety; switching JobRunner to spawn after a payload-picklability audit is a separate follow-up). They turn a silent multi-hour data-loss failure into a loud, actionable error on the next timepoint check.

Test plan

  • pytest tests/control/core/test_job_runner_shutdown.py tests/control/core/test_job_runner_pending.py tests/control/core/test_backpressure.py — 63 passed
  • pytest tests/control/core/test_job_processing_downsampled.py tests/control/core/test_drain_all_issue.py — 13 passed
  • pytest tests/control/core/test_memory_profiler.py — 56 passed (1 skipped, platform-specific)
  • Manual smoke test (see the snippet after this list): os.kill(runner.pid, SIGKILL) on a live JobRunner → watchdog fires within milliseconds, handler receives exitcode=-9 (-SIGKILL), ERROR log identifies PID and "died UNEXPECTEDLY"
  • pre-commit run (Black) passes
  • Full acquisition smoke test on hardware (kill runner mid-run in simulation, confirm acquisition aborts cleanly)
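
For reference, the manual SIGKILL smoke test above boils down to:

import os, signal

os.kill(runner.pid, signal.SIGKILL)
# The watchdog unblocks on the sentinel within milliseconds and the
# registered handler is invoked with exitcode == -signal.SIGKILL (-9).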

🤖 Generated with Claude Code

Alpaca233 and others added 2 commits April 16, 2026 18:02
… spam

A JobRunner subprocess dying mid-acquisition (e.g. segfault in a
fork()-unsafe native library) was previously undetected: the parent kept
dispatching save jobs into a queue nothing was consuming, and the
acquisition appeared to complete normally while hundreds of timepoints
were lost. The only visible symptom was a flood of memory-profiler
tracebacks about /proc/<pid>/smaps_rollup.

- JobRunner: spawn a daemon watchdog thread in start() that blocks on
  self.sentinel. When the subprocess exits, it distinguishes expected
  shutdown (via _shutdown_event) from unexpected death and invokes a
  registered handler with the exitcode. kill() now also sets
  _shutdown_event so intentional termination is not flagged.
- MultiPointWorker: registers a handler that logs the failure and calls
  request_abort_fn(), so the acquisition loop exits on the next abort
  check instead of silently rotting for hours.
- memory_profiler._get_linux_pss_mb: catch ProcessLookupError in
  addition to FileNotFoundError/PermissionError/ValueError. Reading
  /proc/<zombie>/smaps_rollup fails with ESRCH, which Python raises as
  ProcessLookupError, a sibling of FileNotFoundError under OSError that
  was not being caught.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Contributor

Copilot AI left a comment


Pull request overview

This PR improves acquisition reliability by detecting unexpected JobRunner subprocess death (e.g., segfault/SIGKILL/OOM) and aborting the acquisition promptly, while also silencing repetitive Linux memory-profiler ESRCH warnings when /proc/<pid>/smaps_rollup disappears mid-sample.

Changes:

  • Add a JobRunner watchdog thread (waits on Process.sentinel) and a configurable unexpected-exit handler callback.
  • Wire MultiPointWorker to abort the acquisition when a JobRunner dies unexpectedly.
  • Treat ProcessLookupError (ESRCH) as a benign failure mode when reading Linux PSS from /proc.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

  • software/control/core/multi_point_worker.py: Registers a JobRunner unexpected-exit handler to abort acquisitions on subprocess death.
  • software/control/core/memory_profiler.py: Suppresses ESRCH tracebacks by catching ProcessLookupError while reading /proc smaps data.
  • software/control/core/job_processing.py: Adds watchdog thread + handler plumbing to JobRunner to detect unexpected subprocess exit.


Comment on lines +356 to +374

self._job_runners.append((job_class, job_runner))
if job_runner is not None:
    job_runner.set_unexpected_exit_handler(self._on_job_runner_died)

Copilot AI Apr 17, 2026


For newly created JobRunner instances, the handler is registered after job_runner.start(). There is a small but real race where the subprocess can die during warmup (or fail to start) before the handler is set, in which case the watchdog logs the error but won’t abort the acquisition. Consider registering the unexpected-exit handler before starting the process (and similarly ensuring pre-warmed runners have the handler set as soon as they’re created/handed off).
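
Illustrative ordering (constructor arguments are hypothetical; the point is installing the handler before start()):

job_runner = JobRunner(job_class)
job_runner.set_unexpected_exit_handler(self._on_job_runner_died)  # install first
job_runner.start()  # a death during warmup is now caught by the watchdog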

def kill(self):
    if self._shutdown_event is not None:
        self._shutdown_event.set()
    super().kill()


Copilot AI Apr 17, 2026


The watchdog treats an exit as "expected" only when _shutdown_event is set. In this codebase JobRunner.terminate() is used during shutdown (e.g., MultiPointController.close), but terminate() does not set _shutdown_event, so the watchdog will log "died UNEXPECTEDLY" and invoke the handler during intentional termination. Consider overriding terminate() (and any other explicit-stop path you use) to set _shutdown_event the same way as kill()/shutdown() before signaling the process.

Suggested change
def terminate(self):
    # Mark as expected so the watchdog treats the exit as intentional.
    if self._shutdown_event is not None:
        self._shutdown_event.set()
    super().terminate()

Alpaca233 and others added 3 commits April 17, 2026 11:07
- Override JobRunner.terminate() to set _shutdown_event so an intentional
  terminate() (e.g., from MultiPointController.close) is not reported as
  "died UNEXPECTEDLY" by the watchdog. Matches the kill() override.
- Register the unexpected-exit handler on freshly created JobRunners
  before start(), eliminating the race where the subprocess could die
  during warmup before the handler was installed.
- Register the handler on pre-warmed runners the moment MultiPointWorker
  adopts them (before set_acquisition_info), narrowing the uncovered
  window to the pre-handoff phase only.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- Replace the "shutdown_event is set" signal with a dedicated
  _intentional_exit bool. _shutdown_event is a multiprocessing
  primitive that shutdown() nulls during cleanup, which opened a
  narrow race where the watchdog could read None and misclassify an
  intentional shutdown as unexpected death. The new flag survives
  cleanup and is set by all three explicit-stop paths: kill(),
  terminate(), and shutdown().
- Drop narrative comments that restate what the code already says,
  per project style. Keep the WHY comment about _intentional_exit
  vs _shutdown_event.
- Collapse a split f-string log message to a single literal.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- Add tests/control/core/test_job_runner_watchdog.py covering SIGKILL
  detection, intentional kill/terminate/shutdown suppression, handler
  exception isolation, and the _intentional_exit-survives-shutdown
  regression from commit 24751ec.
- multi_point_worker: check is_alive() alongside is_ready() before
  adopting a pre-warmed runner. is_ready() reads a multiprocessing.Event
  the subprocess sets early in run(); once the subprocess dies the Event
  remains set in shared memory, so is_ready() alone can't distinguish a
  live runner from a corpse. Without is_alive(), a runner that segfaults
  during pre-warm would be adopted into the acquisition and resume the
  silent-rot failure mode this PR is meant to fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
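
Illustrative shape of that adoption guard (is_ready()/is_alive() come from the commit message; the surrounding names are hypothetical):

runner = prewarmed.pop(job_class, None)
if runner is not None and runner.is_ready() and runner.is_alive():
    # is_ready() reads a shared Event that stays set after the child dies,
    # so it cannot distinguish a live runner from a corpse on its own.
    job_runner = runner
else:
    job_runner = make_fresh_runner(job_class)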