Skip to content

Fix premature global idle when a parked Sync waiter becomes runnable#195

Open
TrentHouliston wants to merge 1 commit into
mainfrom
fix/premature-global-idle-parked-waiter
Open

Fix premature global idle when a parked Sync waiter becomes runnable#195
TrentHouliston wants to merge 1 commit into
mainfrom
fix/premature-global-idle-parked-waiter

Conversation

@TrentHouliston

@TrentHouliston TrentHouliston commented Jul 3, 2026

Copy link
Copy Markdown
Member

Summary

Pool::get_task honoured the pending_idle latch by firing an idle epoch before attempting to dequeue. When a parked external waiter (e.g. a task blocked on a Sync group's token) becomes runnable and is drained into the pool's queue, firing idle up-front dropped the pool's active count and released its active_pools slot while a runnable task was still queued — letting a global idle epoch fire prematurely.

This reorders idle-driven reactions and, with shutdown-on-idle, can quiesce the powerplant while real work is still pending.

How it manifested

It was found via the NUbots Director, which dispatches provider reactions onto the default pool while still holding its Sync token. The re-entrant provider parks as an external waiter (arming pending_idle); on token release the parked task is drained into the pool's queue (now runnable), but the stale latch fires idle first, reordering the Director's idle-driven steps and (with shutdown-on-idle) causing early quiescence/hangs.

The fix

Consume the pending_idle latch without firing idle — its only job is to wake the worker so it re-checks its queue. The existing dequeue-first / !got path then decides correctly:

  • a drained-runnable waiter is dequeued and run (no idle),
  • a still-parked waiter leaves the queue empty so the !got branch fires idle exactly as before (preserving the cross-pool idle-wake / deadlock-break behaviour).

Regression test

tests/tests/dsl/IdleParkedSyncWaiter.cpp reproduces the topology that triggered the bug (a REALTIME reaction on a concurrency-1 pool holding a Sync token while re-entering it, driven by deterministic on<Trigger<Step<N>>> steps), generalised away from any Director-specific naming. It fails deterministically before the fix and passes after:

  • Without fix: PASS=0 FAIL=30
  • With fix: PASS=30 FAIL=30

Full local ctest suite passes.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a scheduler edge case in NUClear’s thread pool where a latched pending_idle signal could trigger an idle epoch before attempting to dequeue newly-runnable work, allowing a premature global idle while runnable tasks are still queued.

Changes:

  • Reorders Pool::get_task() handling so pending_idle is consumed as a wakeup hint without firing idle up-front, deferring the idle decision to the dequeue/!got path.
  • Adds a regression test that reproduces the “Director” ping-pong topology and asserts correct ordering under idle-driven progression.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
src/threading/scheduler/Pool.cpp Stops firing idle epochs pre-dequeue when pending_idle is latched, preventing premature global idle.
tests/tests/dsl/IdleDirectorPingPong.cpp Adds a regression test covering the parked-waiter-becomes-runnable idle ordering scenario.

Comment thread tests/tests/dsl/IdleDirectorPingPong.cpp Outdated
Comment thread tests/tests/dsl/IdleParkedSyncWaiter.cpp
Comment thread tests/tests/dsl/IdleDirectorPingPong.cpp Outdated
Comment thread tests/tests/dsl/IdleDirectorPingPong.cpp Outdated
@TrentHouliston TrentHouliston force-pushed the fix/premature-global-idle-parked-waiter branch 2 times, most recently from 2a0c1da to fa3cbf8 Compare July 3, 2026 02:14
@TrentHouliston TrentHouliston requested a review from Copilot July 3, 2026 02:15

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment thread tests/tests/dsl/IdleParkedSyncWaiter.cpp Outdated
Comment thread tests/tests/dsl/IdleParkedSyncWaiter.cpp
Comment thread src/threading/scheduler/Pool.cpp
Comment thread src/threading/scheduler/Pool.cpp

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment on lines +358 to 363
// The relaxed-ish load short-circuits the (more expensive) store on the common path
// where nothing has been latched, so a busy worker never pays for the exclusive
// cacheline acquire that an unconditional store would force every iteration.
if (pending_idle.load(std::memory_order_acquire)) {
pending_idle.store(false, std::memory_order_release);
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Good catch - reverted this back to pending_idle.exchange(false, std::memory_order_acq_rel) (from the plain store suggested in an earlier review pass). Keeping the consume as an atomic RMW preserves the intended symmetry with register_external_waiter()'s exchange(true) producer-side check, so a concurrent waiter parking right as the worker consumes the latch can't have its signal silently dropped. Added a comment explaining why this must stay an exchange. Verified the fix with a full local ctest run (66/66 passing) plus 20 repeated runs of the deterministic regression test.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Correction to my previous reply: switching to exchange while still discarding its return value doesn't actually add any protection over store - both just place false into the same total modification order for this atomic, so there's no observable difference between them here. My earlier reply overclaimed.

Reverted back to pending_idle.store(false, std::memory_order_release) and rewrote the comment to explain the actual reason this is safe: submit() (when a drained task is enqueued) and unregister_external_waiter() (when external_waiters returns to 0) both call notify_one()/notify_all() unconditionally, under the pool's mutex, on every state transition the wait predicate actually depends on. pending_idle's own notify (in register_external_waiter(), gated to the 0→1 transition) is purely a latency optimization to promptly wake an already-sleeping worker - if it's ever skipped due to a race, correctness is still preserved by those other unconditional notifies. Re-validated with a full local ctest run (66/66 passing).

Comment thread src/threading/scheduler/Pool.cpp Outdated
Comment thread src/threading/scheduler/Pool.hpp Outdated
@TrentHouliston TrentHouliston force-pushed the fix/premature-global-idle-parked-waiter branch 2 times, most recently from d37adca to b5b625e Compare July 3, 2026 03:43
Pool::get_task honoured the pending_idle latch by firing an idle epoch
before attempting to dequeue. When a parked external waiter (e.g. a task
blocked on a Sync group's token) becomes runnable and is drained into the
pool's queue, firing idle up-front dropped the pool's active count and
released its active_pools slot while a runnable task was still queued,
letting a global idle epoch fire prematurely.

This reorders idle-driven reactions and, with shutdown-on-idle, can
quiesce the powerplant while real work is pending. It manifested in the
NUbots Director, which dispatches provider reactions onto the default
pool while still holding its Sync<Director> token; the re-entrant
provider parks, and on token release the premature idle reordered the
Director's idle-driven steps.

Consume the pending_idle latch without firing idle: its only job is to
wake the worker so it re-checks its queue. The existing dequeue-first /
!got path then decides correctly - a drained-runnable waiter is dequeued
and run (no idle), while a still-parked waiter leaves the queue empty so
the !got branch fires idle exactly as before (preserving cross-pool
idle-wake / deadlock-break behaviour).

Add the IdleDirectorPingPong regression test reproducing the Director
topology. It is fully deterministic and uses no sleeps: the provider and
the global idle reaction share a single-worker pool, and priority
ordering (REALTIME idle vs LOW provider) means that if the buggy
scheduler fires idle while the drained provider is still queued, the idle
reaction is dequeued first and observes the pending provider. It fails
deterministically before the fix and passes after.
@TrentHouliston TrentHouliston force-pushed the fix/premature-global-idle-parked-waiter branch from b5b625e to 7235e88 Compare July 3, 2026 03:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants