Fix premature global idle when a parked Sync waiter becomes runnable#195
Fix premature global idle when a parked Sync waiter becomes runnable#195TrentHouliston wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a scheduler edge case in NUClear’s thread pool where a latched pending_idle signal could trigger an idle epoch before attempting to dequeue newly-runnable work, allowing a premature global idle while runnable tasks are still queued.
Changes:
- Reorders
Pool::get_task()handling sopending_idleis consumed as a wakeup hint without firing idle up-front, deferring the idle decision to the dequeue/!gotpath. - Adds a regression test that reproduces the “Director” ping-pong topology and asserts correct ordering under idle-driven progression.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
src/threading/scheduler/Pool.cpp |
Stops firing idle epochs pre-dequeue when pending_idle is latched, preventing premature global idle. |
tests/tests/dsl/IdleDirectorPingPong.cpp |
Adds a regression test covering the parked-waiter-becomes-runnable idle ordering scenario. |
2a0c1da to
fa3cbf8
Compare
fa3cbf8 to
63bfb11
Compare
| // The relaxed-ish load short-circuits the (more expensive) store on the common path | ||
| // where nothing has been latched, so a busy worker never pays for the exclusive | ||
| // cacheline acquire that an unconditional store would force every iteration. | ||
| if (pending_idle.load(std::memory_order_acquire)) { | ||
| pending_idle.store(false, std::memory_order_release); | ||
| } |
There was a problem hiding this comment.
🤖 Good catch - reverted this back to pending_idle.exchange(false, std::memory_order_acq_rel) (from the plain store suggested in an earlier review pass). Keeping the consume as an atomic RMW preserves the intended symmetry with register_external_waiter()'s exchange(true) producer-side check, so a concurrent waiter parking right as the worker consumes the latch can't have its signal silently dropped. Added a comment explaining why this must stay an exchange. Verified the fix with a full local ctest run (66/66 passing) plus 20 repeated runs of the deterministic regression test.
There was a problem hiding this comment.
🤖 Correction to my previous reply: switching to exchange while still discarding its return value doesn't actually add any protection over store - both just place false into the same total modification order for this atomic, so there's no observable difference between them here. My earlier reply overclaimed.
Reverted back to pending_idle.store(false, std::memory_order_release) and rewrote the comment to explain the actual reason this is safe: submit() (when a drained task is enqueued) and unregister_external_waiter() (when external_waiters returns to 0) both call notify_one()/notify_all() unconditionally, under the pool's mutex, on every state transition the wait predicate actually depends on. pending_idle's own notify (in register_external_waiter(), gated to the 0→1 transition) is purely a latency optimization to promptly wake an already-sleeping worker - if it's ever skipped due to a race, correctness is still preserved by those other unconditional notifies. Re-validated with a full local ctest run (66/66 passing).
d37adca to
b5b625e
Compare
Pool::get_task honoured the pending_idle latch by firing an idle epoch before attempting to dequeue. When a parked external waiter (e.g. a task blocked on a Sync group's token) becomes runnable and is drained into the pool's queue, firing idle up-front dropped the pool's active count and released its active_pools slot while a runnable task was still queued, letting a global idle epoch fire prematurely. This reorders idle-driven reactions and, with shutdown-on-idle, can quiesce the powerplant while real work is pending. It manifested in the NUbots Director, which dispatches provider reactions onto the default pool while still holding its Sync<Director> token; the re-entrant provider parks, and on token release the premature idle reordered the Director's idle-driven steps. Consume the pending_idle latch without firing idle: its only job is to wake the worker so it re-checks its queue. The existing dequeue-first / !got path then decides correctly - a drained-runnable waiter is dequeued and run (no idle), while a still-parked waiter leaves the queue empty so the !got branch fires idle exactly as before (preserving cross-pool idle-wake / deadlock-break behaviour). Add the IdleDirectorPingPong regression test reproducing the Director topology. It is fully deterministic and uses no sleeps: the provider and the global idle reaction share a single-worker pool, and priority ordering (REALTIME idle vs LOW provider) means that if the buggy scheduler fires idle while the drained provider is still queued, the idle reaction is dequeued first and observes the pending provider. It fails deterministically before the fix and passes after.
b5b625e to
7235e88
Compare
Summary
Pool::get_taskhonoured thepending_idlelatch by firing an idle epoch before attempting to dequeue. When a parked external waiter (e.g. a task blocked on aSyncgroup's token) becomes runnable and is drained into the pool's queue, firing idle up-front dropped the pool's active count and released itsactive_poolsslot while a runnable task was still queued — letting a global idle epoch fire prematurely.This reorders idle-driven reactions and, with shutdown-on-idle, can quiesce the powerplant while real work is still pending.
How it manifested
It was found via the NUbots Director, which dispatches provider reactions onto the default pool while still holding its
Synctoken. The re-entrant provider parks as an external waiter (armingpending_idle); on token release the parked task is drained into the pool's queue (now runnable), but the stale latch fires idle first, reordering the Director's idle-driven steps and (with shutdown-on-idle) causing early quiescence/hangs.The fix
Consume the
pending_idlelatch without firing idle — its only job is to wake the worker so it re-checks its queue. The existing dequeue-first /!gotpath then decides correctly:!gotbranch fires idle exactly as before (preserving the cross-pool idle-wake / deadlock-break behaviour).Regression test
tests/tests/dsl/IdleParkedSyncWaiter.cppreproduces the topology that triggered the bug (a REALTIME reaction on a concurrency-1 pool holding aSynctoken while re-entering it, driven by deterministicon<Trigger<Step<N>>>steps), generalised away from any Director-specific naming. It fails deterministically before the fix and passes after:PASS=0 FAIL=30PASS=30 FAIL=30Full local
ctestsuite passes.