fix: detect JobRunner subprocess death and stop memory-profiler ESRCH spam#526

Open
Alpaca233 wants to merge 5 commits into Cephla-Lab:master from Alpaca233:fix/jobrunner-death-detection

Conversation

@Alpaca233
Collaborator

Summary

  • Add a watchdog thread to JobRunner that detects unexpected subprocess death (segfault, SIGKILL, OOM) via Process.sentinel and invokes a handler (sketched after this list). MultiPointWorker wires this to request_abort_fn() so the acquisition stops within one timepoint check instead of silently queuing saves into a dead worker for hours.
  • Catch ProcessLookupError in _get_linux_pss_mb alongside FileNotFoundError/PermissionError/ValueError, so reading /proc/<zombie>/smaps_rollup no longer produces a WARNING + traceback on every sample.
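
To make the watchdog shape concrete, here is a minimal sketch of the sentinel-based design. The names (_intentional_exit, set_unexpected_exit_handler) are taken from the commit messages below; the body is illustrative, not the repo's exact code.

import multiprocessing as mp
import threading
from multiprocessing.connection import wait

class JobRunner(mp.Process):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._intentional_exit = False
        self._exit_handler = None

    def set_unexpected_exit_handler(self, handler):
        self._exit_handler = handler

    def start(self):
        super().start()
        # Daemon thread: blocks on the process sentinel until the child exits.
        threading.Thread(target=self._watchdog, daemon=True).start()

    def _watchdog(self):
        wait([self.sentinel])  # returns when the subprocess dies, for any reason
        self.join()            # reap the child so exitcode is populated
        if not self._intentional_exit and self._exit_handler is not None:
            self._exit_handler(self.exitcode)  # e.g. -9 after SIGKILL

    def kill(self):
        self._intentional_exit = True  # suppress the watchdog for explicit stops
        super().kill()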

Background

On a 48-hour run, the JobRunner subprocess segfaulted about 9 hours in, inside libnvidia-glcore.so (inherited via fork() from the Qt/OpenGL parent; NVIDIA's userspace driver is famously not fork()-safe). The parent never noticed: /proc/<pid> was enumerated as a zombie, 465 subsequent timepoint save jobs were queued into a multiprocessing.Queue that nothing was consuming, and the UI reported normal completion. The only log signal was ~6,900 memory-profiler ESRCH tracebacks plus 466 "Backpressure reset() called with 1 jobs pending" warnings.
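
The ESRCH spam has a simple shape: ProcessLookupError is the errno.ESRCH member of the OSError hierarchy, so a catch list built around FileNotFoundError misses it. An illustrative sketch of the fixed read path (the function name is from the PR; the body is a guess at its shape, not the repo's code):

def _get_linux_pss_mb(pid):
    # Proportional set size (PSS) in MB from the kernel's per-process rollup.
    try:
        with open(f"/proc/{pid}/smaps_rollup") as f:
            for line in f:
                if line.startswith("Pss:"):
                    return int(line.split()[1]) / 1024.0  # kernel reports kB
    except (FileNotFoundError, PermissionError, ProcessLookupError, ValueError):
        # Reading a zombie's smaps_rollup fails with ESRCH, raised by Python
        # as ProcessLookupError; previously uncaught, hence a traceback per sample.
        return None
    return None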

These fixes do not address the root cause (fork-unsafety; switching JobRunner to spawn after a payload-picklability audit is a separate follow-up). They turn a silent multi-hour data-loss failure into a loud, actionable error on the next timepoint check.

Test plan

  • pytest tests/control/core/test_job_runner_shutdown.py tests/control/core/test_job_runner_pending.py tests/control/core/test_backpressure.py — 63 passed
  • pytest tests/control/core/test_job_processing_downsampled.py tests/control/core/test_drain_all_issue.py — 13 passed
  • pytest tests/control/core/test_memory_profiler.py — 56 passed (1 skipped, platform-specific)
  • Manual smoke test (see the snippet after this list): os.kill(runner.pid, SIGKILL) on a live JobRunner → watchdog fires within milliseconds, handler receives exitcode=-9 (-SIGKILL), ERROR log identifies PID and "died UNEXPECTEDLY"
  • pre-commit run (Black) passes
  • Full acquisition smoke test on hardware (kill runner mid-run in simulation, confirm acquisition aborts cleanly)
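
For reference, the manual SIGKILL smoke test above boils down to:

import os, signal

os.kill(runner.pid, signal.SIGKILL)
# The watchdog unblocks on the sentinel within milliseconds and the
# registered handler is invoked with exitcode == -signal.SIGKILL (-9).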

🤖 Generated with Claude Code

Alpaca233 and others added 2 commits April 16, 2026 18:02
… spam

A JobRunner subprocess dying mid-acquisition (e.g. segfault in a
fork()-unsafe native library) was previously undetected: the parent kept
dispatching save jobs into a queue nothing was consuming, and the
acquisition appeared to complete normally while hundreds of timepoints
were lost. The only visible symptom was a flood of memory-profiler
tracebacks about /proc/<pid>/smaps_rollup.

- JobRunner: spawn a daemon watchdog thread in start() that blocks on
  self.sentinel. When the subprocess exits, it distinguishes expected
  shutdown (via _shutdown_event) from unexpected death and invokes a
  registered handler with the exitcode. kill() now also sets
  _shutdown_event so intentional termination is not flagged.
- MultiPointWorker: registers a handler that logs the failure and calls
  request_abort_fn(), so the acquisition loop exits on the next abort
  check instead of silently rotting for hours.
- memory_profiler._get_linux_pss_mb: catch ProcessLookupError in
  addition to FileNotFoundError/PermissionError/ValueError. Reading
  /proc/<zombie>/smaps_rollup fails with ESRCH, which Python raises as
  ProcessLookupError, a sibling of FileNotFoundError under OSError that
  was not being caught.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Contributor

Copilot AI left a comment


Pull request overview

This PR improves acquisition reliability by detecting unexpected JobRunner subprocess death (e.g., segfault/SIGKILL/OOM) and aborting the acquisition promptly, while also silencing repetitive Linux memory-profiler ESRCH warnings when /proc/<pid>/smaps_rollup disappears mid-sample.

Changes:

  • Add a JobRunner watchdog thread (waits on Process.sentinel) and a configurable unexpected-exit handler callback.
  • Wire MultiPointWorker to abort the acquisition when a JobRunner dies unexpectedly.
  • Treat ProcessLookupError (ESRCH) as a benign failure mode when reading Linux PSS from /proc.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

  • software/control/core/multi_point_worker.py: Registers a JobRunner unexpected-exit handler to abort acquisitions on subprocess death.
  • software/control/core/memory_profiler.py: Suppresses ESRCH tracebacks by catching ProcessLookupError while reading /proc smaps data.
  • software/control/core/job_processing.py: Adds watchdog thread + handler plumbing to JobRunner to detect unexpected subprocess exit.


Comment on lines +356 to +374

self._job_runners.append((job_class, job_runner))
if job_runner is not None:
    job_runner.set_unexpected_exit_handler(self._on_job_runner_died)

Copilot AI Apr 17, 2026


For newly created JobRunner instances, the handler is registered after job_runner.start(). There is a small but real race where the subprocess can die during warmup (or fail to start) before the handler is set, in which case the watchdog logs the error but won’t abort the acquisition. Consider registering the unexpected-exit handler before starting the process (and similarly ensuring pre-warmed runners have the handler set as soon as they’re created/handed off).
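
Illustrative ordering (constructor arguments are hypothetical; the point is installing the handler before start()):

job_runner = JobRunner(job_class)
job_runner.set_unexpected_exit_handler(self._on_job_runner_died)  # install first
job_runner.start()  # a death during warmup is now caught by the watchdog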

def kill(self):
    if self._shutdown_event is not None:
        self._shutdown_event.set()
    super().kill()


Copilot AI Apr 17, 2026


The watchdog treats an exit as "expected" only when _shutdown_event is set. In this codebase JobRunner.terminate() is used during shutdown (e.g., MultiPointController.close), but terminate() does not set _shutdown_event, so the watchdog will log "died UNEXPECTEDLY" and invoke the handler during intentional termination. Consider overriding terminate() (and any other explicit-stop path you use) to set _shutdown_event the same way as kill()/shutdown() before signaling the process.

Suggested change
def terminate(self):
    # Mark as expected so the watchdog treats the exit as intentional.
    if self._shutdown_event is not None:
        self._shutdown_event.set()
    super().terminate()

Alpaca233 and others added 3 commits April 17, 2026 11:07
- Override JobRunner.terminate() to set _shutdown_event so an intentional
  terminate() (e.g., from MultiPointController.close) is not reported as
  "died UNEXPECTEDLY" by the watchdog. Matches the kill() override.
- Register the unexpected-exit handler on freshly created JobRunners
  before start(), eliminating the race where the subprocess could die
  during warmup before the handler was installed.
- Register the handler on pre-warmed runners the moment MultiPointWorker
  adopts them (before set_acquisition_info), narrowing the uncovered
  window to the pre-handoff phase only.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- Replace the "shutdown_event is set" signal with a dedicated
  _intentional_exit bool. _shutdown_event is a multiprocessing
  primitive that shutdown() nulls during cleanup, which opened a
  narrow race where the watchdog could read None and misclassify an
  intentional shutdown as unexpected death. The new flag survives
  cleanup and is set by all three explicit-stop paths: kill(),
  terminate(), and shutdown().
- Drop narrative comments that restate what the code already says,
  per project style. Keep the WHY comment about _intentional_exit
  vs _shutdown_event.
- Collapse a split f-string log message to a single literal.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- Add tests/control/core/test_job_runner_watchdog.py covering SIGKILL
  detection, intentional kill/terminate/shutdown suppression, handler
  exception isolation, and the _intentional_exit-survives-shutdown
  regression from commit 24751ec.
- multi_point_worker: check is_alive() alongside is_ready() before
  adopting a pre-warmed runner. is_ready() reads a multiprocessing.Event
  the subprocess sets early in run(); once the subprocess dies the Event
  remains set in shared memory, so is_ready() alone can't distinguish a
  live runner from a corpse. Without is_alive(), a runner that segfaults
  during pre-warm would be adopted into the acquisition and resume the
  silent-rot failure mode this PR is meant to fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
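
Illustrative shape of that adoption guard (is_ready()/is_alive() come from the commit message; the surrounding names are hypothetical):

runner = prewarmed.pop(job_class, None)
if runner is not None and runner.is_ready() and runner.is_alive():
    # is_ready() reads a shared Event that stays set after the child dies,
    # so it cannot distinguish a live runner from a corpse on its own.
    job_runner = runner
else:
    job_runner = make_fresh_runner(job_class)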