Whisper Large v3 API
State-of-the-art automatic speech recognition and translation model supporting 99 languages, with a 10-20% error reduction over Whisper large-v2.

How to use Whisper Large v3
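A minimal usage sketch with the openly released openai/whisper-large-v3 checkpoint and the Hugging Face transformers automatic-speech-recognition pipeline; the audio path is a placeholder:

    import torch
    from transformers import pipeline

    # Use GPU with half precision when available, otherwise fall back to CPU
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=dtype,
        device=device,
    )

    result = asr("sample_audio.wav")  # placeholder path to a local audio file
    print(result["text"])

The pipeline handles loading and resampling the audio to the model's expected 16 kHz input.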
Model details
Architecture Overview:
• Transformer-based encoder-decoder model optimized for speech recognition and translation
• 1.55 billion parameters with improved spectrogram processing using 128 Mel frequency bins
• Enhanced multilingual capabilities with support for 99 languages, including a new language token for Cantonese
• 30-second audio processing window with chunked long-form algorithm support
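A quick way to confirm the 128-bin spectrogram input and 30-second window, assuming the Hugging Face openai/whisper-large-v3 checkpoint:

    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
    fe = processor.feature_extractor
    print(fe.feature_size)   # 128 Mel frequency bins (up from 80 in large-v2)
    print(fe.chunk_length)   # 30-second processing window, in seconds
    print(fe.sampling_rate)  # 16 kHz expected input sampling rate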
Training Methodology:
• Trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled data
• Large-scale weak supervision approach using diverse multilingual audio datasets
• Trained for 2.0 epochs over this mixed dataset, with the pseudo-labels generated by Whisper large-v2
• Robust performance across accents, background noise, and technical language
Performance Characteristics:
• 10-20% error reduction compared to Whisper large-v2 across multiple languages
• Near state-of-the-art accuracy for speech recognition and translation tasks
• Zero-shot translation capabilities from multiple languages into English
• Compatible with torch.compile for up to 4.5x inference speed-ups (sketched below)
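A hedged sketch of the torch.compile setup behind the quoted speed-up (actual gains depend on hardware and batch shape), following the static-cache pattern from the Transformers documentation:

    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    model_id = "openai/whisper-large-v3"
    model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
    processor = AutoProcessor.from_pretrained(model_id)

    # A static KV cache lets the full forward pass compile into a single graph
    model.generation_config.cache_implementation = "static"
    model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

The first few generate() calls are slow while the graph compiles; later calls see the benefit.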
Prompting Whisper Large v3
Audio Processing:
• Supports automatic language detection for 99 languages
• Manual language specification available for improved accuracy
• Speech transcription mode for same-language output
• Speech translation mode for English output regardless of input language
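A sketch of transcription versus translation via the pipeline's generate_kwargs; the French file name is a placeholder, and omitting the "language" key triggers automatic language detection:

    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

    # Transcription: output stays in the source language (here forced to French)
    transcript = asr(
        "french_interview.wav",  # placeholder path
        generate_kwargs={"language": "french", "task": "transcribe"},
    )

    # Translation: output is English regardless of the input language
    translation = asr(
        "french_interview.wav",
        generate_kwargs={"language": "french", "task": "translate"},
    )

    print(transcript["text"])
    print(translation["text"])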
Timestamp Control:
• Sentence-level timestamps with return_timestamps=True parameter
• Word-level timestamps with return_timestamps="word" parameter
• Flexible timestamp generation for precise audio alignment
• Configurable temperature fallback and beam search strategies
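A sketch of both timestamp modes; each chunk carries a (start, end) tuple in seconds:

    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

    # Sentence/segment-level timestamps
    segments = asr("speech.wav", return_timestamps=True)  # placeholder path
    for chunk in segments["chunks"]:
        print(chunk["timestamp"], chunk["text"])

    # Word-level timestamps
    words = asr("speech.wav", return_timestamps="word")
    for chunk in words["chunks"]:
        print(chunk["timestamp"], chunk["text"])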
Optimization Techniques:
• Chunked processing for files longer than 30 seconds
• Batch processing support for multiple audio files
• Choice of sequential (higher accuracy) or chunked (faster) long-form algorithms depending on accuracy vs. speed requirements
• Flash Attention 2 and SDPA compatibility for memory efficiency
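A sketch combining chunked long-form decoding, batching, and an efficient attention backend; chunk_length_s and batch_size are pipeline arguments, and attn_implementation can be swapped for "flash_attention_2" when the flash-attn package is installed:

    import torch
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,
        device="cuda:0",
        chunk_length_s=30,                             # chunked long-form algorithm
        batch_size=8,                                  # decode several chunks in parallel
        model_kwargs={"attn_implementation": "sdpa"},  # or "flash_attention_2"
    )

    # Batch of files; placeholder paths
    results = asr(["episode_01.wav", "episode_02.wav"])
    for r in results:
        print(r["text"])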
Applications & Use Cases
Speech Recognition Applications:
• Accessibility tools and real-time transcription systems
• Meeting transcription and conference call documentation
• Podcast and media content transcription for searchability
• Voice-to-text applications for content creation and note-taking
Multilingual Translation:
• International business communication and conference translation
• Educational content localization and language learning platforms
• Customer support systems for global organizations
• Media translation and subtitle generation for entertainment content
Professional & Technical:
• Legal proceedings transcription and court reporting
• Medical dictation and healthcare documentation systems
• Academic research transcription and interview analysis
• Voice activity detection and speaker diarization when fine-tuned
Development & Integration:
• API integration for mobile and web applications
• Voice-controlled interfaces and smart home devices
• Call center analytics and conversation intelligence
• Content moderation and audio content analysis platforms