Whisper-large-v3 Estonian

OpenAI's Whisper-large-v3 fine-tuned for Estonian automatic speech recognition, with training data augmented by TTS-generated synthetic speech.

This model accompanies the paper "Synthetic Speech Augmentation for Low-Resource Estonian and Slovenian ASR: Comparing Parakeet-TDT and Whisper" (Interspeech 2026); the paper will be linked once available.

Model Description

  • Architecture: Encoder-decoder Transformer (Whisper)
  • Parameters: 1.55B
  • Tokenizer: 51,865-token byte-level BPE
  • Base model: openai/whisper-large-v3
  • Fine-tuning data: CommonVoice 17.0 Estonian + ~5,850 synthetic sentences (LLM-generated text + OpenAI TTS)
  • Training config: CV + Synth All (full synthetic corpus with quality filtering)

Evaluation Results

Raw WER/CER (no text normalization)

| Test Set | WER | CER |
|---|---|---|
| CommonVoice 17 Test | 26.46 | 6.24 |
| CommonVoice 17 Val | 25.25 | 5.63 |
| FLEURS Test | 36.56 | 7.07 |

Normalized WER/CER (lowercase + punctuation removal)

| Test Set | WER | CER |
|---|---|---|
| CommonVoice 17 Test | 24.00 | 5.69 |
| CommonVoice 17 Val | 23.11 | 5.15 |
| FLEURS Test | 13.89 | 3.20 |
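The normalized scores above apply lowercasing and punctuation removal before scoring. A minimal sketch of that normalization plus a standard edit-distance WER (helper names are my own, not from the paper; production evaluations typically use a library such as jiwer):

```python
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

ref = normalize("Tere, maailm!")
hyp = normalize("Tere maailm")
print(wer(ref, hyp))  # 0.0 — punctuation differences vanish after normalization
```

This illustrates why the FLEURS gap between raw and normalized WER is so large: FLEURS references carry full punctuation and casing, which raw scoring counts as word errors.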

Improvement over baselines

| Comparison | CV17 Test (WER) | FLEURS Test (WER) |
|---|---|---|
| vs. Zero-shot | -7.94 pp | -4.16 pp |
| vs. CV-only fine-tuning | -2.92 pp | -1.95 pp |

All improvements are statistically significant (paired bootstrap, p < 0.001, n = 100,000).
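The significance test is a paired bootstrap over per-utterance errors of the two systems. A self-contained sketch of the idea (my own simplified implementation with a toy example and a smaller resample count than the paper's n = 100,000; the paper's exact procedure may differ):

```python
import random

def paired_bootstrap(errors_a, errors_b, n_resamples=10_000, seed=42):
    """One-sided p-value for 'system A has lower mean error than B'.

    errors_a / errors_b: per-utterance error counts for the SAME test
    utterances, aligned by index (pairing is what gives the test power).
    """
    rng = random.Random(seed)
    n = len(errors_a)
    deltas = [a - b for a, b in zip(errors_a, errors_b)]
    not_better = 0
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping pairs intact
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        if sum(sample) >= 0:  # a resample where A is NOT better than B
            not_better += 1
    return not_better / n_resamples

# Toy example: A makes consistently fewer errors than B on every utterance
a = [1, 0, 2, 1, 0, 1, 0, 0, 1, 1]
b = [2, 1, 3, 2, 1, 2, 1, 1, 2, 2]
print(paired_bootstrap(a, b))  # 0.0 — A wins in every resample
```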

Usage

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa
import torch

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-estonian")
processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-estonian")

# Load audio (librosa resamples to 16 kHz mono)
audio, sr = librosa.load("audio.wav", sr=16000)

# Transcribe
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
with torch.no_grad():
    predicted_ids = model.generate(input_features, language="et", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Training Details

  • Optimizer: fused AdamW (lr = 5e-5)
  • Schedule: linear decay with 10% warmup
  • Effective batch size: 128 (per-device batch 64 × 2 gradient-accumulation steps)
  • Epochs: 5
  • Best checkpoint: selected by eval_loss
  • Precision: bf16
  • Seed: 42
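The learning-rate schedule (linear warmup over the first 10% of steps, then linear decay to zero, as in Hugging Face's linear schedule with warmup) can be sketched as a plain function; the step counts here are illustrative, not the actual run's:

```python
def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_frac=0.10):
    """Linear warmup for the first warmup_frac of steps, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # ramp 0 -> peak
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)  # peak -> 0

total = 1000  # illustrative; the real run's step count depends on dataset size
print(lr_at_step(0, total))     # 0.0
print(lr_at_step(100, total))   # 5e-05 (peak, end of warmup)
print(lr_at_step(1000, total))  # 0.0
```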

Synthetic Data Augmentation

The synthetic training data was generated using a three-stage pipeline:

  1. Text generation: GPT-5-mini generates diverse sentences across paraphrase, domain expansion, and morphological categories
  2. LLM-as-judge validation: Each sentence validated for grammaticality, naturalness, and language purity
  3. Speech synthesis: OpenAI gpt-4o-mini-tts with 11-voice rotation
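The 11-voice rotation in stage 3 amounts to round-robin voice assignment across the generated sentences. A minimal sketch (the voice IDs are placeholders; the card only states that 11 voices were rotated, not which ones):

```python
from itertools import cycle

# Placeholder voice IDs — the actual gpt-4o-mini-tts voice names are not listed here
voices = [f"voice_{i}" for i in range(11)]

def assign_voices(sentences):
    """Round-robin: sentence k gets voice k mod 11, spreading speaker variety evenly."""
    return list(zip(sentences, cycle(voices)))

pairs = assign_voices([f"sent_{k}" for k in range(13)])
print(pairs[0])   # ('sent_0', 'voice_0')
print(pairs[11])  # ('sent_11', 'voice_0') — rotation wraps after 11 sentences
```

Rotating voices gives each synthetic sentence a different speaker timbre, which helps the ASR model generalize instead of overfitting to a single TTS voice.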

Dataset: yuriyvnv/synthetic_asr_et_sl
