OpenAI Whisper

Text generation

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual data, which makes it more robust to accents, background noise, and technical language. It transcribes speech in multiple languages and translates speech from those languages into English.


Intended Use

  • Whisper is useful as an ASR solution, especially for English speech recognition.

  • The models are primarily trained and evaluated on ASR and speech translation to English.

  • They show strong ASR results in about 10 languages.

  • They may exhibit additional capabilities if fine-tuned for tasks like voice activity detection, speaker classification, or speaker diarization.


Performance

  • Speech recognition and translation accuracy is near state-of-the-art.

  • Performance varies across languages, with lower accuracy on low-resource or low-discoverability languages.

  • Performance also varies across accents and dialects within a given language.
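
The variation described above is usually quantified with word error rate (WER). The following is a minimal sketch of computing WER per language or accent group. It assumes you already have reference and hypothesis transcripts; the function names and the simple lowercase, whitespace-based normalization are illustrative choices, not part of Whisper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of reference words."""
    ref = reference.lower().split()  # assumed normalization: lowercase + split
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def wer_by_group(samples: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Corpus-level WER per group from (group, reference, hypothesis) triples.

    Groups with no reference words are omitted from the result.
    """
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for group, reference, hypothesis in samples:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors[group] += edit_distance(ref, hyp)
        words[group] += len(ref)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}
```

Pooling errors and reference words within each group, rather than averaging per-utterance WER, keeps short utterances from dominating the result.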


Limitations

  • Whisper is trained in a weakly supervised manner on large-scale noisy data, so its output may include text that was not actually spoken in the audio (hallucination).

  • Hallucinations are thought to arise because the models, drawing on their general knowledge of language, combine predicting the next word with transcribing the audio itself.

  • The sequence-to-sequence architecture may generate repetitive text, which can be partially mitigated by beam search and temperature scheduling.

  • These issues may be more pronounced in lower-resource and/or lower-discoverability languages.

  • Word error rates may be higher for some speakers than for others, varying with gender, race, age, or other demographic factors.
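
As an illustration of the temperature scheduling mentioned above, here is a minimal sketch of a fallback loop: decode at a low temperature first, and retry at higher temperatures when the output looks repetitive. Repetition is detected here with a zlib compression ratio. The `decode` callable is a hypothetical stand-in for a real decoding call, and the schedule and the 2.4 threshold are assumptions chosen for illustration, not values taken from this page. Beam search is not shown.

```python
import zlib
from typing import Callable, Sequence


def compression_ratio(text: str) -> float:
    """Raw size over compressed size; highly repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode: Callable[[float], str],  # hypothetical: temperature -> transcript
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # assumed schedule
    max_ratio: float = 2.4,  # assumed repetition threshold
) -> str:
    """Return the first transcript that does not look repetitive.

    If every temperature yields repetitive output, the last attempt is returned.
    """
    text = ""
    for temperature in temperatures:
        text = decode(temperature)
        if compression_ratio(text) <= max_ratio:
            break  # accept this transcript
    return text
```

Starting at temperature 0.0 keeps decoding deterministic whenever the first attempt is acceptable; sampling at higher temperatures is used only to escape repetition loops.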


Citation

https://openai.com/index/whisper/