Whisper only transcribes the English words and ignores the rest of the Spanish audio

#220 · opened by davera-017

I’m having an issue transcribing a Spanish audio clip (a speech by Octavio Paz). The audio is mostly in Spanish, but the speaker says a couple of words in English. However, Whisper only transcribes that short English fragment and ignores the rest of the Spanish speech.

Here is the code I’m using (Transformers 4.57.3):

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch

model_id = "openai/whisper-medium"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    dtype=torch_dtype,
    low_cpu_mem_usage=True,
    cache_dir="../data/models",
).to(device)

processor = AutoProcessor.from_pretrained(
    model_id,
    cache_dir="../data/models"
)

# Disable forced_decoder_ids to avoid automatic translation
model.config.forced_decoder_ids = None

gen_kwargs = {
    "language": "es",
    "task": "transcribe",
    "compression_ratio_threshold": 1.35,
    "return_timestamps": True,
}

inputs = processor(
    [c.numpy() for c in waveform_tensor],
    sampling_rate=16000,
    return_tensors="pt",
    padding="longest"
)

# Match the model's dtype (float16 on GPU) to avoid a dtype mismatch
input_features = inputs.input_features.to(device, dtype=torch_dtype)
generated_ids = model.generate(input_features, **gen_kwargs)
batch_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
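
For context, waveform_tensor is a list of 1-D audio chunks. Roughly, I prepare it like this (the file path and the 30-second chunk length are placeholders for my actual preprocessing):

import torch
import torchaudio

# Placeholder path; the real clip is the attached Octavio Paz speech
audio, sr = torchaudio.load("speech.wav")

# Collapse to mono and resample to the 16 kHz that Whisper expects
audio = audio.mean(dim=0)
audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=16000)

# Split into 30-second chunks, matching Whisper's window
chunk_samples = 30 * 16000
waveform_tensor = list(torch.split(audio, chunk_samples))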

I'm attaching the audio chunk where the problem occurs.
