Skip to content

Fix process-killing integer division by zero when Whisper align() gets a window with no frames#2065

Open
MAkcanca wants to merge 2 commits into
OpenNMT:masterfrom
MAkcanca:fix-median-filter-zero-depth
Open

Fix process-killing integer division by zero when Whisper align() gets a window with no frames#2065
MAkcanca wants to merge 2 commits into
OpenNMT:masterfrom
MAkcanca:fix-median-filter-zero-depth

Conversation

@MAkcanca

Copy link
Copy Markdown

Problem

Whisper::align() halves each num_frames entry for the encoder''s stride-2 convolution. A window with num_frames < 2 therefore ends up with 0 frames, and the zero-size time axis flows into MedianFilter, whose CPU and GPU implementations both compute

const dim_t batch_size = input.size() / depth;   // depth == 0 → integer division by zero

before their depth <= rank short-axis guard. Integer division by zero is UB; in the CUDA build''s host code it deterministically kills the host process — 0xC0000094 (STATUS_INTEGER_DIVIDE_BY_ZERO) on Windows — with no catchable exception. (The CPU build happens to survive on MSVC because the optimizer reorders the division past the early return, which is why reports of this are sparse and platform-dependent.)

This is reachable from the public API with real-world input: faster-whisper''s VAD path produces ~10ms final speech chunks on audio where Whisper hallucinates past the end of the clip, and calls align() with num_frames=1. That is the root cause of SYSTRAN/faster-whisper#1342 ("Divide by zero crash on Windows for specific audio files when word level timestamps are enabled").

Reproduction (no audio file needed)

import ctranslate2, numpy as np, huggingface_hub

model_dir = huggingface_hub.snapshot_download("Systran/faster-whisper-tiny")
model = ctranslate2.models.Whisper(model_dir, device="cuda")

features = np.zeros((1, model.n_mels, 3000), dtype=np.float32)
encoded = model.encode(ctranslate2.StorageView.from_array(features))

model.align(encoded, [50258, 50259, 50359], [[1396, 264, 1002]], [3000])  # fine
model.align(encoded, [50258, 50259, 50359], [[1396, 264, 1002]], [1])     # kills the process

Fix (two commits)

  1. MedianFilter (CPU + GPU): move the depth <= rank guard above the division. The guard already existed in both files — it just ran after the division it should protect. The CPU early-return now also copies input → output like the GPU path does (previously it returned an uninitialized buffer for short axes).
  2. Whisper::align(): return empty alignments for windows with no frames left after the stride halving. Even with the MedianFilter fix, running the attention post-processing on zero-size tensors fails downstream (thrust parallel_for error on CUDA) or produces meaningless (token, -1) pairs on CPU. No frames → no alignment is the honest result, and it is consistent across devices.

Added OpDeviceFPTest.MedianFilterShortAxis covering the zero-size axis and the pass-through behavior.

Verification

  • New test passes on CPU and CUDA (float32/float16/bfloat16); full ctranslate2_test suite: 360 tests, 353 pass, 4 skipped. 3 pre-existing failures in CPU/OpDeviceFPTest.Gemm*/float32 are float32 GEMM numerics under the Ruy fallback backend of my local build (no MKL installed) and are untouched by this change.
  • The reproduction above: control window aligns normally (1503 pairs), num_frames=[1] returns an empty alignment instead of killing the process — verified on Windows/CUDA 12.0 (RTX 3090, sm_86) and CPU.
  • End-to-end through faster-whisper: the originally-crashing real-world clip now transcribes completely (the degenerate window is Whisper''s trailing-credits hallucination past the audio end, returned with no word timestamps).

AI assistance disclosure

Per the contribution guidelines: I used an AI assistant (Claude) to help trace the crash to this code and draft the patch. I directed the investigation, and every claim above was verified empirically on my machine: the degenerate input was captured live with faulthandler + an instrumented align() call, the fix was validated with a local CUDA build, the test suite, the minimal repro, and the original real-world clip. I am responsible for the change and happy to adjust it.

MAkcanca added 2 commits June 12, 2026 14:50
Both the CPU and GPU implementations compute `input.size() / depth`
before the `depth <= rank` short-axis guard. When the filtered axis has
size 0, this is an integer division by zero: a native crash
(0xC0000094 STATUS_INTEGER_DIVIDE_BY_ZERO on Windows, SIGFPE on Linux)
that cannot be caught from Python.

This is reachable from the public API: Whisper::align() halves
num_frames for the encoder stride, so num_frames=1 (a ~10ms window,
which faster-whisper''s VAD path produces on real audio; see
SYSTRAN/faster-whisper#1342) yields a zero-size attention axis and
kills the host process inside MedianFilter.

Move the guard before the division in both implementations, and make
the CPU early-return copy the input through like the GPU path does
(previously it returned uninitialized output for short axes).
After the encoder-stride halving, num_frames < 2 leaves zero frames to
align against. Running the attention post-processing on the resulting
zero-size tensors is undefined: before the MedianFilter fix it was an
integer division by zero, with it the empty tensors still fail
downstream (thrust parallel_for error on CUDA) or produce meaningless
(token, -1) alignment pairs on CPU.

Skip the post-processing for such windows and return an empty
alignment, which is the honest result: no frames, no alignment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant