Whisper only transcribes the English words and ignores the rest of the Spanish audio

#220 · opened by davera-017

I’m having an issue transcribing a Spanish audio clip (a speech by Octavio Paz). The audio is mostly in Spanish, but the speaker says a couple of words in English. However, Whisper only transcribes that short English fragment and ignores the rest of the Spanish speech.

Here is the code I’m using (Transformers 4.57.3):

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch

model_id = "openai/whisper-medium"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    dtype=torch_dtype,
    low_cpu_mem_usage=True,
    cache_dir="../data/models",
).to(device)

processor = AutoProcessor.from_pretrained(
    model_id,
    cache_dir="../data/models"
)

# Disable forced_decoder_ids to avoid automatic translation
model.config.forced_decoder_ids = None

gen_kwargs = {
    "language": "es",
    "task": "transcribe",
    "compression_ratio_threshold": 1.35,
    "return_timestamps": True,
}

inputs = processor(
    [c.numpy() for c in waveform_tensor],
    sampling_rate=16000,
    return_tensors="pt",
    padding="longest"
)

# Match the model's dtype (float16 on GPU) to avoid a dtype mismatch
input_features = inputs.input_features.to(device, dtype=torch_dtype)
generated_ids = model.generate(input_features, **gen_kwargs)
batch_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
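
For context, waveform_tensor is a list of 1-D audio chunks. Roughly, I prepare it like this (the file path and the 30-second chunk length are placeholders for my actual preprocessing):

import torch
import torchaudio

# Placeholder path; the real clip is the attached Octavio Paz speech
audio, sr = torchaudio.load("speech.wav")

# Collapse to mono and resample to the 16 kHz that Whisper expects
audio = audio.mean(dim=0)
audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=16000)

# Split into 30-second chunks, matching Whisper's window
chunk_samples = 30 * 16000
waveform_tensor = list(torch.split(audio, chunk_samples))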

I'm attaching the audio chunk where the problem occurs.
