Fix premature global idle when a parked Sync waiter becomes runnable by TrentHouliston · Pull Request #195 · Fastcode/NUClear

TrentHouliston · 2026-07-03T00:28:38Z

Summary

Pool::get_task honoured the pending_idle latch by firing an idle epoch before attempting to dequeue. When a parked external waiter (e.g. a task blocked on a Sync group's token) becomes runnable and is drained into the pool's queue, firing idle up-front dropped the pool's active count and released its active_pools slot while a runnable task was still queued — letting a global idle epoch fire prematurely.

This reorders idle-driven reactions and, with shutdown-on-idle, can quiesce the powerplant while real work is still pending.

How it manifested

It was found via the NUbots Director, which dispatches provider reactions onto the default pool while still holding its Sync token. The re-entrant provider parks as an external waiter (arming pending_idle); on token release the parked task is drained into the pool's queue (now runnable), but the stale latch fires idle first, reordering the Director's idle-driven steps and (with shutdown-on-idle) causing early quiescence/hangs.

The fix

Consume the pending_idle latch without firing idle — its only job is to wake the worker so it re-checks its queue. The existing dequeue-first / !got path then decides correctly:

a drained-runnable waiter is dequeued and run (no idle),
a still-parked waiter leaves the queue empty so the !got branch fires idle exactly as before (preserving the cross-pool idle-wake / deadlock-break behaviour).

Regression test

tests/tests/dsl/IdleParkedSyncWaiter.cpp reproduces the topology that triggered the bug (a REALTIME reaction on a concurrency-1 pool holding a Sync token while re-entering it, driven by deterministic on<Trigger<Step<N>>> steps), generalised away from any Director-specific naming. It fails deterministically before the fix and passes after:

Without fix: PASS=0 FAIL=30
With fix: PASS=30 FAIL=30

Full local ctest suite passes.

Copilot

Pull request overview

This PR fixes a scheduler edge case in NUClear’s thread pool where a latched pending_idle signal could trigger an idle epoch before attempting to dequeue newly-runnable work, allowing a premature global idle while runnable tasks are still queued.

Changes:

Reorders Pool::get_task() handling so pending_idle is consumed as a wakeup hint without firing idle up-front, deferring the idle decision to the dequeue/!got path.
Adds a regression test that reproduces the “Director” ping-pong topology and asserts correct ordering under idle-driven progression.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
`src/threading/scheduler/Pool.cpp`	Stops firing idle epochs pre-dequeue when `pending_idle` is latched, preventing premature global idle.
`tests/tests/dsl/IdleDirectorPingPong.cpp`	Adds a regression test covering the parked-waiter-becomes-runnable idle ordering scenario.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

TrentHouliston · 2026-07-03T03:25:27Z

+                // The relaxed-ish load short-circuits the (more expensive) store on the common path
+                // where nothing has been latched, so a busy worker never pays for the exclusive
+                // cacheline acquire that an unconditional store would force every iteration.
+                if (pending_idle.load(std::memory_order_acquire)) {
+                    pending_idle.store(false, std::memory_order_release);
                }


🤖 Good catch - reverted this back to pending_idle.exchange(false, std::memory_order_acq_rel) (from the plain store suggested in an earlier review pass). Keeping the consume as an atomic RMW preserves the intended symmetry with register_external_waiter()'s exchange(true) producer-side check, so a concurrent waiter parking right as the worker consumes the latch can't have its signal silently dropped. Added a comment explaining why this must stay an exchange. Verified the fix with a full local ctest run (66/66 passing) plus 20 repeated runs of the deterministic regression test.

🤖 Correction to my previous reply: switching to exchange while still discarding its return value doesn't actually add any protection over store - both just place false into the same total modification order for this atomic, so there's no observable difference between them here. My earlier reply overclaimed.

Reverted back to pending_idle.store(false, std::memory_order_release) and rewrote the comment to explain the actual reason this is safe: submit() (when a drained task is enqueued) and unregister_external_waiter() (when external_waiters returns to 0) both call notify_one()/notify_all() unconditionally, under the pool's mutex, on every state transition the wait predicate actually depends on. pending_idle's own notify (in register_external_waiter(), gated to the 0→1 transition) is purely a latency optimization to promptly wake an already-sleeping worker - if it's ever skipped due to a race, correctness is still preserved by those other unconditional notifies. Re-validated with a full local ctest run (66/66 passing).

Pool::get_task honoured the pending_idle latch by firing an idle epoch before attempting to dequeue. When a parked external waiter (e.g. a task blocked on a Sync group's token) becomes runnable and is drained into the pool's queue, firing idle up-front dropped the pool's active count and released its active_pools slot while a runnable task was still queued, letting a global idle epoch fire prematurely. This reorders idle-driven reactions and, with shutdown-on-idle, can quiesce the powerplant while real work is pending. It manifested in the NUbots Director, which dispatches provider reactions onto the default pool while still holding its Sync<Director> token; the re-entrant provider parks, and on token release the premature idle reordered the Director's idle-driven steps. Consume the pending_idle latch without firing idle: its only job is to wake the worker so it re-checks its queue. The existing dequeue-first / !got path then decides correctly - a drained-runnable waiter is dequeued and run (no idle), while a still-parked waiter leaves the queue empty so the !got branch fires idle exactly as before (preserving cross-pool idle-wake / deadlock-break behaviour). Add the IdleDirectorPingPong regression test reproducing the Director topology. It is fully deterministic and uses no sleeps: the provider and the global idle reaction share a single-worker pool, and priority ordering (REALTIME idle vs LOW provider) means that if the buggy scheduler fires idle while the drained provider is still queued, the idle reaction is dequeued first and observes the pending provider. It fails deterministically before the fix and passes after.

TrentHouliston requested a review from Copilot July 3, 2026 00:52

Copilot started reviewing on behalf of TrentHouliston July 3, 2026 00:52 View session

Copilot AI reviewed Jul 3, 2026

View reviewed changes

Comment thread tests/tests/dsl/IdleDirectorPingPong.cpp Outdated

Comment thread tests/tests/dsl/IdleParkedSyncWaiter.cpp

Comment thread tests/tests/dsl/IdleDirectorPingPong.cpp Outdated

Comment thread tests/tests/dsl/IdleDirectorPingPong.cpp Outdated

TrentHouliston force-pushed the fix/premature-global-idle-parked-waiter branch 2 times, most recently from 2a0c1da to fa3cbf8 Compare July 3, 2026 02:14

TrentHouliston requested a review from Copilot July 3, 2026 02:15

Copilot started reviewing on behalf of TrentHouliston July 3, 2026 02:16 View session

Copilot AI reviewed Jul 3, 2026

View reviewed changes

Comment thread tests/tests/dsl/IdleParkedSyncWaiter.cpp Outdated

Comment thread tests/tests/dsl/IdleParkedSyncWaiter.cpp

Comment thread src/threading/scheduler/Pool.cpp

Comment thread src/threading/scheduler/Pool.cpp

TrentHouliston force-pushed the fix/premature-global-idle-parked-waiter branch from fa3cbf8 to 63bfb11 Compare July 3, 2026 02:26

TrentHouliston requested a review from Copilot July 3, 2026 03:02

Copilot started reviewing on behalf of TrentHouliston July 3, 2026 03:02 View session

Copilot started work on behalf of TrentHouliston July 3, 2026 03:04 View session

Copilot AI reviewed Jul 3, 2026

View reviewed changes

Copilot finished work on behalf of TrentHouliston July 3, 2026 03:07

TrentHouliston force-pushed the fix/premature-global-idle-parked-waiter branch 2 times, most recently from d37adca to b5b625e Compare July 3, 2026 03:43

TrentHouliston force-pushed the fix/premature-global-idle-parked-waiter branch from b5b625e to 7235e88 Compare July 3, 2026 03:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix premature global idle when a parked Sync waiter becomes runnable#195

Fix premature global idle when a parked Sync waiter becomes runnable#195
TrentHouliston wants to merge 1 commit into
mainfrom
fix/premature-global-idle-parked-waiter

TrentHouliston commented Jul 3, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

TrentHouliston Jul 3, 2026

Uh oh!

TrentHouliston Jul 3, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

TrentHouliston commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How it manifested

The fix

Regression test

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

TrentHouliston Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

TrentHouliston Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

TrentHouliston commented Jul 3, 2026 •

edited

Loading