# Whisper-large-v3 Estonian
Fine-tuned OpenAI Whisper-large-v3 for Estonian automatic speech recognition, augmented with TTS-generated synthetic data.
This model is part of the paper: "Synthetic Speech Augmentation for Low-Resource Estonian and Slovenian ASR: Comparing Parakeet-TDT and Whisper" (Interspeech 2026). Paper coming soon.
## Model Description
- Architecture: Encoder-decoder Transformer (Whisper)
- Parameters: 1.55B
- Tokenizer: 51,865-token byte-level BPE
- Base model: openai/whisper-large-v3
- Fine-tuning data: Common Voice 17.0 Estonian + ~5,850 synthetic sentences (LLM-generated text + OpenAI TTS)
- Training config: CV + Synth All (full synthetic corpus with quality filtering)
## Evaluation Results

### Raw WER/CER (no text normalization)
| Test Set | WER | CER |
|---|---|---|
| CommonVoice 17 Test | 26.46 | 6.24 |
| CommonVoice 17 Val | 25.25 | 5.63 |
| FLEURS Test | 36.56 | 7.07 |
### Normalized WER/CER (lowercase + punctuation removal)
| Test Set | WER | CER |
|---|---|---|
| CommonVoice 17 Test | 24.00 | 5.69 |
| CommonVoice 17 Val | 23.11 | 5.15 |
| FLEURS Test | 13.89 | 3.20 |
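The normalization behind the second table (lowercase plus punctuation removal) can be sketched in pure Python. The exact normalizer used in the paper is not specified, so this is an illustrative assumption; `string.punctuation` covers ASCII punctuation only, and the edit-distance WER below is the standard textbook formulation, not necessarily the paper's scorer:

```python
import string

def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def wer(ref: str, hyp: str) -> float:
    """Word error rate via edit-distance dynamic programming over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer(normalize("Tere, maailm!"), normalize("tere maailm")))  # → 0.0
```

Normalization mostly benefits FLEURS here because its references carry full punctuation and casing, which raw scoring counts as word errors.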
### Improvement over baselines
| Comparison | CV17 Test (WER) | FLEURS Test (WER) |
|---|---|---|
| vs. Zero-shot | -7.94 pp | -4.16 pp |
| vs. CV-only fine-tuning | -2.92 pp | -1.95 pp |
All improvements are statistically significant (paired bootstrap, p < 0.001, n = 100,000).
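A paired bootstrap test of this kind resamples test utterances with replacement and checks how often one system's total error beats the other's. The sketch below is a minimal stdlib illustration of the general technique (with a smaller resample count than the paper's n = 100,000), not the authors' exact scoring script:

```python
import random

def paired_bootstrap_p(errors_a, errors_b, n_resamples=10_000, seed=42):
    """One-sided paired bootstrap p-value for "system A beats system B".

    errors_a / errors_b are per-utterance error counts for the SAME
    test utterances under the two systems being compared.
    """
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(errors_a, errors_b)]  # > 0 where A is better
    n = len(diffs)
    wins = 0
    for _ in range(n_resamples):
        # Resample utterances with replacement and re-total the difference
        sample_sum = sum(diffs[rng.randrange(n)] for _ in range(n))
        if sample_sum > 0:
            wins += 1
    return 1.0 - wins / n_resamples

# Toy example: system A makes fewer errors on almost every utterance
a = [1, 0, 2, 1, 0, 1, 0, 2, 1, 0] * 5
b = [2, 1, 3, 1, 1, 2, 1, 3, 2, 1] * 5
print(paired_bootstrap_p(a, b))
```

With per-utterance differences this consistently in A's favor, essentially every resample comes out positive and the p-value is near zero.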
## Usage
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa
import torch

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-estonian")
processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-estonian")

# Load audio (16 kHz mono)
audio, sr = librosa.load("audio.wav", sr=16000)

# Transcribe
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
with torch.no_grad():
    predicted_ids = model.generate(input_features, language="et", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
## Training Details
- Optimizer: AdamW fused (lr=5e-5)
- Schedule: Linear decay with 10% warmup
- Effective batch size: 128 (64 per step × 2 gradient accumulation steps)
- Epochs: 5
- Best model: selected by eval_loss
- Precision: bf16
- Seed: 42
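The hyperparameters above map naturally onto a `transformers` `Seq2SeqTrainingArguments` config. This is a hypothetical reconstruction under stated assumptions; any value not listed above (output dir, eval/save cadence, argument names on older library versions) is a guess:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the run configuration; values beyond those in the list
# above are assumptions, not the authors' exact settings.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-estonian",   # assumed
    optim="adamw_torch_fused",                # AdamW fused
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                         # 10% warmup
    per_device_train_batch_size=64,
    gradient_accumulation_steps=2,            # effective batch size 128
    num_train_epochs=5,
    bf16=True,
    seed=42,
    eval_strategy="epoch",                    # "evaluation_strategy" on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",        # best model selected by eval_loss
    greater_is_better=False,
)
```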
## Synthetic Data Augmentation
The synthetic training data was generated using a three-stage pipeline:
- Text generation: GPT-5-mini generates diverse sentences across paraphrase, domain expansion, and morphological categories
- LLM-as-judge validation: Each sentence validated for grammaticality, naturalness, and language purity
- Speech synthesis: OpenAI gpt-4o-mini-tts with 11-voice rotation
Dataset: yuriyvnv/synthetic_asr_et_sl
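The shape of the stage-2 validation gate can be sketched with cheap heuristics standing in for the LLM judge. Everything here (the character-ratio purity check, the length bounds, the threshold) is an illustrative assumption, not the paper's actual criteria:

```python
# Toy stand-in for the LLM-as-judge validation gate (stage 2).
# The real pipeline scores grammaticality, naturalness, and language
# purity with an LLM; these heuristics are illustrative only.
NATIVE_ESTONIAN = set("abdeghijklmnoprsšzžtuvõäöü")

def looks_estonian(sentence: str, min_ratio: float = 0.9) -> bool:
    """Crude language-purity check: the share of letters drawn from the
    native Estonian alphabet must reach min_ratio."""
    letters = [c for c in sentence.lower() if c.isalpha()]
    if not letters:
        return False
    return sum(c in NATIVE_ESTONIAN for c in letters) / len(letters) >= min_ratio

def validate(sentences):
    """Keep sentences passing basic length and purity checks."""
    kept = []
    for s in sentences:
        if 3 <= len(s.split()) <= 30 and looks_estonian(s):
            kept.append(s)
    return kept

print(validate(["Tere tulemast meie kooli!",
                "The quick brown fox jumps",
                "Jah"]))  # → ['Tere tulemast meie kooli!']
```

The English sentence fails the purity ratio (letters like q, c, w, f, x fall outside the native alphabet) and the one-word sentence fails the length bound; a real judge would additionally catch fluent-looking but ungrammatical candidates.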
## Acknowledgments
- Base model: OpenAI Whisper-large-v3
- Training data: Mozilla Common Voice 17.0
- Evaluation: Google FLEURS