
Remove openai-whisper dependency for log_mel_spectrogram #1846

Open
musselmanjoey wants to merge 1 commit into FunAudioLLM:main from musselmanjoey:remove-whisper-dependency


@musselmanjoey

Summary

  • Replace all whisper.log_mel_spectrogram() calls with a lightweight implementation using torch + torchaudio (already required deps)
  • Add cosyvoice/utils/audio_utils.py with a drop-in log_mel_spectrogram() function
  • Remove openai-whisper==20231117 from requirements.txt
  • Move the legacy whisper.tokenizer.Tokenizer import (used only by CosyVoice v1's get_tokenizer()) to a lazy import so it doesn't break module loading

Motivation

openai-whisper is a ~1.5 GB speech recognition package, but outside the legacy v1 tokenizer CosyVoice uses only one utility function from it: whisper.log_mel_spectrogram(). Installing the package causes widespread installation failures due to dependency conflicts, especially on platforms with pre-installed PyTorch (Kaggle, Colab, etc.).

Related issues: #1844, #1266, #249, #316

Details

log_mel_spectrogram is a standard audio preprocessing operation (STFT → mel filterbank → log scaling). The replacement in audio_utils.py uses torch.stft and torchaudio.functional.melscale_fbanks with the same parameters as Whisper (n_fft=400, hop_length=160, 16 kHz sample rate), so its output is numerically equivalent to that of whisper.log_mel_spectrogram().

Files changed:

  • cosyvoice/utils/audio_utils.py (new) — shared log_mel_spectrogram implementation
  • cosyvoice/cli/frontend.py — use audio_utils.log_mel_spectrogram instead of whisper
  • cosyvoice/dataset/processor.py — same replacement
  • tools/extract_speech_token.py — same replacement
  • cosyvoice/tokenizer/tokenizer.py — lazy import of whisper.tokenizer.Tokenizer (only needed for v1 tokenizer path)
  • requirements.txt — remove openai-whisper
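The lazy import used for the v1 tokenizer path can be sketched as follows. The helper name load_v1_tokenizer_class and the error message are hypothetical and for illustration only. According to the description above, the PR places the import inside get_tokenizer() itself.

```python
import sys


def load_v1_tokenizer_class():
    """Return whisper's Tokenizer class, importing whisper only when called.

    Hypothetical helper: it illustrates the deferred-import pattern, not the
    actual contents of cosyvoice/tokenizer/tokenizer.py.
    """
    try:
        # Deferred import: runs on the first call, not when this module loads.
        from whisper.tokenizer import Tokenizer
    except ImportError as exc:
        raise ImportError(
            "The CosyVoice v1 tokenizer path needs openai-whisper; "
            "install it with `pip install openai-whisper`."
        ) from exc
    return Tokenizer


# Defining the function does not import whisper, so module loading succeeds
# even when openai-whisper is not installed.
print("whisper imported at load time:", "whisper" in sys.modules)
```

The trade-off is that a missing dependency surfaces at call time rather than at import time, which is why the sketch re-raises with an explicit message.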

🤖 Generated with Claude Code

openai-whisper is a heavy (~1.5 GB) speech recognition package, but
CosyVoice mainly uses whisper.log_mel_spectrogram(), a standard
audio preprocessing utility. Installing it causes widespread installation
failures (see FunAudioLLM#1844, FunAudioLLM#1266, FunAudioLLM#249, FunAudioLLM#316) due to dependency conflicts,
especially on platforms with pre-installed PyTorch (Kaggle, Colab).

Replace all whisper.log_mel_spectrogram() calls with a lightweight
implementation in cosyvoice/utils/audio_utils.py that uses only
torch and torchaudio (already required dependencies). The output
is numerically equivalent.

The legacy get_tokenizer() function (CosyVoice v1) still needs
whisper.tokenizer.Tokenizer, so that import is moved to a lazy
import inside the function body — it only triggers if you actually
use the v1 tokenizer path. CosyVoice2/3 tokenizers are unaffected.

Co-Authored-By: Claude Opus 4.6 <[email protected]>