Audio · December 6, 2022 · OpenAI
Robust Speech Recognition via Large-Scale Weak Supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever
Abstract
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are competitive with prior fully supervised results without the need for any fine-tuning.
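The benchmark comparisons mentioned above are typically reported in word error rate (WER), the standard metric for speech recognition. As a minimal sketch (not from the paper itself), WER can be computed as the word-level edit distance between a reference transcript and a model's hypothesis, divided by the number of reference words:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words
    # (counting substitutions, insertions, and deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)


print(wer("the cat sat on the mat", "the cat sit on mat"))  # one substitution + one deletion over 6 words
```

A perfect transcript yields a WER of 0.0; the example above scores 2/6 ≈ 0.33 (one substituted word, one dropped word).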
Key Findings
1. Trained on 680,000 hours of multilingual audio data from the internet
2. Achieved near-human-level speech recognition without task-specific fine-tuning
3. Supported transcription in 97 languages and translation to English
4. Demonstrated robust performance across accents, noise, and domains
5. Released as open source, enabling broad adoption
Impact & Significance
Whisper democratized high-quality speech recognition by providing an open-source model that works across languages and conditions. It became the go-to choice for transcription in countless applications and inspired numerous fine-tuned variants.
Related Papers
LLM · July 23, 2024
The Llama 3 Herd of Models
Meta AI
LLM · July 15, 2024
Qwen2 Technical Report
Alibaba Cloud / Qwen Team
Efficiency · May 7, 2024
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek AI
LLM · March 4, 2024
The Claude 3 Model Family: Opus, Sonnet, and Haiku
Anthropic