Fix process-killing integer division by zero when Whisper align() gets a window with no frames #2065
Open
MAkcanca wants to merge 2 commits into
Both the CPU and GPU implementations of MedianFilter compute `input.size() / depth` before the `depth <= rank` short-axis guard. When the filtered axis has size 0, this is an integer division by zero: a native crash (0xC0000094 STATUS_INTEGER_DIVIDE_BY_ZERO on Windows, SIGFPE on Linux) that cannot be caught from Python.

This is reachable from the public API: Whisper::align() halves num_frames for the encoder stride, so num_frames=1 (a ~10ms window, which faster-whisper's VAD path produces on real audio; see SYSTRAN/faster-whisper#1342) yields a zero-size attention axis and kills the host process inside MedianFilter.

Move the guard before the division in both implementations, and make the CPU early return copy the input through as the GPU path does (previously it returned uninitialized output for short axes).
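The reordering described above can be illustrated with a small standalone sketch. This is not the CTranslate2 kernel: `median_filter_last_axis` is a hypothetical function, the flat `std::vector<float>` layout and the reflect padding are assumptions made for illustration, and `width` is assumed to be odd and at least 1. The only point it is meant to show is that the `depth <= rank` check runs before `input.size() / depth`, and that the early return passes the input through.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a CPU median filter over the last axis of a
// row-major buffer. `depth` is the length of that axis, `width` the (odd)
// filter width.
std::vector<float> median_filter_last_axis(const std::vector<float>& input,
                                           std::size_t depth,
                                           std::size_t width) {
  const std::size_t rank = width / 2;

  // Guard first: a short or empty axis is passed through unchanged, so the
  // division below can never see depth == 0.
  if (depth <= rank)
    return input;

  const std::size_t batch = input.size() / depth;  // safe: depth > rank >= 0
  std::vector<float> output(input.size());
  std::vector<float> window(width);

  for (std::size_t b = 0; b < batch; ++b) {
    const float* x = input.data() + b * depth;
    float* y = output.data() + b * depth;
    for (std::size_t i = 0; i < depth; ++i) {
      // Gather the window, reflecting indices that fall off either end
      // (the padding mode is an assumption of this sketch).
      for (std::size_t k = 0; k < width; ++k) {
        long j = static_cast<long>(i) + static_cast<long>(k) - static_cast<long>(rank);
        if (j < 0)
          j = -j;
        if (j >= static_cast<long>(depth))
          j = 2 * static_cast<long>(depth) - 2 - j;
        window[k] = x[j];
      }
      std::nth_element(window.begin(), window.begin() + rank, window.end());
      y[i] = window[rank];
    }
  }
  return output;
}
```

With the guard in this position, calling the function with an empty buffer and `depth == 0` returns an empty buffer instead of dividing by zero, and an axis shorter than the filter half-width comes back unchanged.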
After the encoder-stride halving, num_frames < 2 leaves zero frames to align against. Running the attention post-processing on the resulting zero-size tensors is undefined. Before the MedianFilter fix it was an integer division by zero; with the fix, the empty tensors still fail downstream (thrust parallel_for error on CUDA) or produce meaningless (token, -1) alignment pairs on CPU. Skip the post-processing for such windows and return an empty alignment, which is the honest result: no frames, no alignment.
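The skip described in this commit can be sketched as follows. The names here (`Alignment`, `encoder_frames`, `align_windows`, and the `post_process` callback) are hypothetical and do not mirror the actual `Whisper::align()` code; the sketch assumes only what the text states, namely that each `num_frames` entry is halved for the encoder stride and that windows left with zero frames get an empty alignment without running any post-processing.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical alignment type: (token index, frame index) pairs for one window.
using Alignment = std::vector<std::pair<std::size_t, std::size_t>>;

// Frames left after the stride-2 halving described above
// (integer division, so 0 and 1 both map to 0).
inline std::size_t encoder_frames(std::size_t num_frames) {
  return num_frames / 2;
}

// Hypothetical driver: `post_process` stands in for the attention
// post-processing (median filter, alignment search) of one window and is
// only invoked when the window still has frames.
template <typename PostProcess>
std::vector<Alignment> align_windows(const std::vector<std::size_t>& num_frames,
                                     PostProcess&& post_process) {
  std::vector<Alignment> results;
  results.reserve(num_frames.size());
  for (const std::size_t n : num_frames) {
    const std::size_t frames = encoder_frames(n);
    if (frames == 0) {
      results.emplace_back();  // no frames, no alignment
      continue;
    }
    results.push_back(post_process(frames));
  }
  return results;
}
```

Keeping the check in the device-independent driver, rather than in each kernel, is what would make the result the same on CPU and GPU in a design like this one.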
This was referenced Jun 12, 2026
Problem
`Whisper::align()` halves each `num_frames` entry for the encoder's stride-2 convolution. A window with `num_frames < 2` therefore ends up with 0 frames, and the zero-size time axis flows into `MedianFilter`, whose CPU and GPU implementations both compute `input.size() / depth` before their `depth <= rank` short-axis guard. Integer division by zero is UB; in the CUDA build's host code it deterministically kills the host process with `0xC0000094` (`STATUS_INTEGER_DIVIDE_BY_ZERO`) on Windows, with no catchable exception. (The CPU build happens to survive on MSVC because the optimizer reorders the division past the early return, which is why reports of this are sparse and platform-dependent.)

This is reachable from the public API with real-world input: faster-whisper's VAD path produces ~10ms final speech chunks on audio where Whisper hallucinates past the end of the clip, and calls `align()` with `num_frames=1`. That is the root cause of SYSTRAN/faster-whisper#1342 ("Divide by zero crash on Windows for specific audio files when word level timestamps are enabled").

Reproduction (no audio file needed)
Fix (two commits)
`MedianFilter` (CPU + GPU): move the `depth <= rank` guard above the division. The guard already existed in both files; it just ran after the division it should protect. The CPU early return now also copies input → output as the GPU path does (previously it returned an uninitialized buffer for short axes).

`Whisper::align()`: return empty alignments for windows with no frames left after the stride halving. Even with the MedianFilter fix, running the attention post-processing on zero-size tensors fails downstream (thrust `parallel_for` error on CUDA) or produces meaningless `(token, -1)` pairs on CPU. No frames → no alignment is the honest result, and it is consistent across devices.

Added `OpDeviceFPTest.MedianFilterShortAxis` covering the zero-size axis and the pass-through behavior.

Verification

`ctranslate2_test` suite: 360 tests, 353 pass, 4 skipped. 3 pre-existing failures in `CPU/OpDeviceFPTest.Gemm*/float32` are float32 GEMM numerics under the Ruy fallback backend of my local build (no MKL installed) and are untouched by this change.

`num_frames=[1]` returns an empty alignment instead of killing the process, verified on Windows/CUDA 12.0 (RTX 3090, sm_86) and CPU.

AI assistance disclosure
Per the contribution guidelines: I used an AI assistant (Claude) to help trace the crash to this code and draft the patch. I directed the investigation, and every claim above was verified empirically on my machine: the degenerate input was captured live with `faulthandler` + an instrumented `align()` call; the fix was validated with a local CUDA build, the test suite, the minimal repro, and the original real-world clip. I am responsible for the change and happy to adjust it.