Why does a tiny silence at the start of my audio change Whisper’s transcription?

#187

by dylanewbie - opened Apr 16, 2025


Hi everyone, I’m using OpenAI’s Whisper for speech recognition.
My audio says “ABC1234,” but the model sometimes outputs “AVC1234.” If I prepend a short silence (e.g., 10 ms), the output switches to the correct “ABC1234,” but longer silences (20 ms, 30 ms, 40 ms, etc.) make it flip back and forth between “ABC1234” and “AVC1234.”
Prepending white noise instead of silence has the same effect.
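To make the experiment easy to reproduce, here is a minimal sketch of the padding sweep described above. It assumes mono float32 audio at 16 kHz, which is what Whisper works with. The helper names `pad_front` and `sweep`, the noise amplitude, and the offsets are hypothetical choices for illustration. Note that 10 ms is 160 samples at 16 kHz, which is also the hop length of Whisper's log-Mel front end, so each 10 ms step shifts the input by exactly one spectrogram frame.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumption: mono float32 audio at 16 kHz, as Whisper uses


def pad_front(audio, ms, kind="silence", noise_level=1e-3, seed=0):
    """Prepend `ms` milliseconds of silence or low-level white noise to `audio`."""
    n = int(round(SAMPLE_RATE * ms / 1000))  # number of padding samples
    if kind == "silence":
        pad = np.zeros(n, dtype=np.float32)
    elif kind == "noise":
        # Hypothetical noise amplitude; a fixed seed keeps runs repeatable.
        rng = np.random.default_rng(seed)
        pad = (noise_level * rng.standard_normal(n)).astype(np.float32)
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    return np.concatenate([pad, audio.astype(np.float32, copy=False)])


def sweep(audio, transcribe, offsets_ms=(0, 10, 20, 30, 40), kind="silence"):
    """Return {offset_ms: transcribe(padded_audio)} for each padding length.

    `transcribe` is any callable that takes a float32 array and returns text,
    so this helper does not depend on a particular Whisper installation.
    """
    return {ms: transcribe(pad_front(audio, ms, kind)) for ms in offsets_ms}
```

With the `openai-whisper` package, `transcribe` could be something like `lambda a: model.transcribe(a, temperature=0.0)["text"]`, where `model = whisper.load_model("base")` and the audio comes from `whisper.load_audio(path)`. Fixing `temperature=0.0` disables the temperature-fallback sampling, so any remaining flips come from the shifted input rather than from sampling randomness.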

Has anyone else run into this issue? Why does adding a tiny bit of audio cause such unpredictable changes in the transcription?
Any insights or suggestions would be really helpful!
