On Windows, the ThreadPool destructor can deadlock during CUDA model
cleanup. The root cause is a race condition between queue.close() and
the worker's idle() callback:
1. Worker enters idle() -> synchronize_stream() (blocking CUDA call)
2. Main thread calls queue.close() -> sets _request_end, notifies
3. Worker is stuck in synchronize_stream(), misses the notification
4. Worker::join() blocks indefinitely
This manifests as application hangs when unloading Whisper/NLLB models
after transcription or translation completes. Confirmed on RTX 4090
and RTX 5070 (Blackwell) with CUDA 12.x and 13.x.
Changes:
- Add Worker::prepare_shutdown() virtual method, called by
ThreadPool::~ThreadPool() BEFORE queue.close(). This allows
workers to stop blocking idle operations before the queue signals
shutdown.
- ReplicaWorker overrides prepare_shutdown() to set _shutting_down
atomic flag with release semantics. The idle() method checks this
flag with acquire semantics before calling synchronize_stream().
- Worker::join() now accepts a timeout_ms parameter (default 5000ms).
If the worker thread doesn't finish within the timeout, both the
worker thread and the join helper are detached to prevent blocking
the process. This handles the case where finalize() -> _replica.reset()
hangs on CUDA resource deallocation.
- JobQueue::get() before_wait loop now checks _request_end before
calling before_wait(), and releases the lock during the callback
to prevent holding the mutex during blocking CUDA operations.
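
The shutdown flag described in the list above can be sketched roughly as follows. This is an illustration only, not the actual CTranslate2 code: `SketchReplicaWorker`, `fake_synchronize_stream()` and `sync_calls()` are hypothetical names, and the blocking CUDA call is replaced by a short sleep.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the Worker base class; only the new hook is shown.
class Worker {
public:
  virtual ~Worker() = default;
  // Intended to be called by the pool before the queue is closed.
  virtual void prepare_shutdown() {}
};

// Hypothetical stand-in for ReplicaWorker.
class SketchReplicaWorker : public Worker {
public:
  void prepare_shutdown() override {
    // Release store: pairs with the acquire load in idle().
    _shutting_down.store(true, std::memory_order_release);
  }

  // Returns true if the blocking synchronization was performed.
  bool idle() {
    if (_shutting_down.load(std::memory_order_acquire))
      return false;  // shutdown requested: skip the blocking call
    fake_synchronize_stream();
    return true;
  }

  int sync_calls() const { return _sync_calls.load(); }

private:
  // Placeholder for the blocking CUDA stream synchronization (assumption).
  void fake_synchronize_stream() {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    ++_sync_calls;
  }

  std::atomic<bool> _shutting_down{false};
  std::atomic<int> _sync_calls{0};
};
```

Note that a flag like this only helps a worker that has not yet entered the blocking call; a worker already inside it is the case the timed join is meant to cover.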
Tested with CTranslate2 Whisper and NLLB models on Windows 10/11
with CUDA 12.8 and 13.2 (sm_75 through sm_120).
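
For reference, a timed join with a detach fallback can be sketched with std::promise and std::future as below. `join_with_timeout` is a hypothetical free function, not the signature used in this change. Unlike the description above, the sketch moves the worker handle into the helper thread, so detaching the helper alone is enough to let both threads run on.

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper: returns true if the worker finished within the
// timeout, false if it was left running in the background.
inline bool join_with_timeout(std::thread& worker,
                              std::chrono::milliseconds timeout) {
  if (!worker.joinable())
    return true;

  std::promise<void> joined;
  std::future<void> done = joined.get_future();

  // The helper owns the worker handle and performs the blocking join.
  std::thread helper(
      [t = std::move(worker), p = std::move(joined)]() mutable {
        t.join();
        p.set_value();
      });

  if (done.wait_for(timeout) == std::future_status::ready) {
    helper.join();  // the worker has exited, so this returns promptly
    return true;
  }

  // Timed out: stop waiting. The helper (and the worker it owns) keep
  // running until the worker returns or the process exits.
  helper.detach();
  return false;
}
```

The trade-off of any detach fallback is that a stuck thread and the resources it holds are abandoned rather than cleaned up, which is only acceptable when the alternative is blocking the process indefinitely.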
Summary
Fixes an application hang when destroying CTranslate2 models on Windows with CUDA GPU acceleration. The
ThreadPool destructor deadlocks because worker threads get stuck in blocking CUDA synchronization calls during shutdown.
Problem
When a CTranslate2 model is destroyed (e.g., after Whisper transcription or NLLB translation completes), the
ThreadPool destructor calls queue.close() then worker->join(). A race condition causes a deadlock:
1. Worker enters idle() → synchronize_stream() (blocking CUDA call)
2. Main thread calls queue.close() → sets _request_end, notifies condition variable
3. Worker is stuck in synchronize_stream(), not waiting on the condition variable, so the notification is lost
4. Worker::join() blocks indefinitely waiting for the worker to exit
This affects any application using CTranslate2 with CUDA on Windows. Confirmed on RTX 4090 and RTX 5070 (Blackwell) with CUDA 12.x and 13.x.
Related issues: #1782, SYSTRAN/faster-whisper#71
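
The race hinges on what the queue does around its pre-wait callback. A minimal sketch of a queue whose get() skips the callback once the end flag is set, and drops the lock while the callback runs, might look like this. `SketchQueue` is a hypothetical, simplified stand-in (integer jobs, remaining jobs drained after close) and is not the real JobQueue.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical simplified queue, for illustration only.
class SketchQueue {
public:
  void put(int job) {
    {
      std::lock_guard<std::mutex> lock(_mutex);
      _jobs.push(job);
    }
    _cv.notify_one();
  }

  void close() {
    {
      std::lock_guard<std::mutex> lock(_mutex);
      _request_end = true;
    }
    _cv.notify_all();
  }

  // Writes the next job into `job` and returns true, or returns false once
  // the queue is closed and empty.
  bool get(int& job, std::function<void()> before_wait = {}) {
    std::unique_lock<std::mutex> lock(_mutex);
    if (before_wait && _jobs.empty() && !_request_end) {
      // Run the possibly blocking callback without holding the mutex, so
      // put() and close() are never stalled behind it.
      lock.unlock();
      before_wait();
      lock.lock();
    }
    // The predicate re-checks the state, so a close() or put() that happened
    // during the callback is observed instead of being missed.
    _cv.wait(lock, [this] { return !_jobs.empty() || _request_end; });
    if (_jobs.empty())
      return false;
    job = _jobs.front();
    _jobs.pop();
    return true;
  }

private:
  std::mutex _mutex;
  std::condition_variable _cv;
  std::queue<int> _jobs;
  bool _request_end = false;
};
```

Because the wait uses a predicate, a notification sent while the callback is running does not need to be received for the worker to see the shutdown request.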
Changes
include/ctranslate2/thread_pool.h:
- Worker::join() now accepts a timeout_ms parameter (default 5000ms). If the worker doesn't finish in time, both threads are detached to prevent indefinite blocking.
- Adds the Worker::prepare_shutdown() virtual method, called before queue.close().
include/ctranslate2/replica_pool.h:
- ReplicaWorker overrides prepare_shutdown() to set the _shutting_down atomic flag with release semantics.
- idle() checks the flag with acquire semantics before calling synchronize_stream(), preventing the race.
- finalize() also sets the flag before resetting the replica.
- Adds the std::atomic<bool> _shutting_down member.
src/thread_pool.cc:
- ThreadPool::~ThreadPool() calls prepare_shutdown() on all workers BEFORE queue.close().
- JobQueue::get() releases the mutex before calling the before_wait() callback and checks _request_end to avoid calling synchronize_stream() during shutdown.
- Worker::join() implements a timed join using std::promise/std::future with a detach fallback.
Test plan