Fix orphaned in_progress jobs after worker crash/restart #59
Merged
nielsrolf merged 1 commit into longtermrisk:v0.9 on Apr 14, 2026
Conversation
When a worker pod crashes (OOM, SIGKILL, power loss), the atexit shutdown handler never fires, leaving jobs stuck in in_progress status. If the pod restarts quickly with the same worker_id, the cluster manager's unresponsive-worker cleanup never triggers either (the worker keeps pinging). This causes orphaned jobs to accumulate: each crash leaves one more zombie in_progress job that no worker is executing.

Fix: on startup, before entering the job loop, revert any in_progress jobs assigned to this worker_id back to pending. A freshly started worker process cannot be executing anything, so any such jobs are guaranteed orphans from a previous lifetime.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
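A minimal sketch of the startup recovery step, assuming a job store that can be queried and updated by status and worker_id. `job_store`, `find`, and `update` are hypothetical names standing in for whatever persistence layer the repo actually uses:

```python
def recover_orphaned_jobs(job_store, worker_id: str) -> int:
    """Revert orphaned in_progress jobs from a previous process lifetime.

    Runs once on startup, before entering the job loop. A freshly started
    worker cannot be executing anything, so every in_progress job still
    assigned to this worker_id was abandoned by a crashed predecessor.
    """
    # find/update are illustrative job-store calls, not the repo's real API.
    orphans = job_store.find(status="in_progress", worker_id=worker_id)
    for job in orphans:
        # Return the job to the queue so any worker can pick it up again.
        job_store.update(job.id, status="pending", worker_id=None)
    return len(orphans)
```

Doing this on the next startup, rather than in a shutdown hook, is what makes the approach robust to SIGKILL and OOM kills: no cleanup code needs to run in the dying process at all.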