Existing ASR models for Indic languages are biased toward studio and broadcast recordings and degrade on spontaneous speech. We address this on two fronts: Vividh-ASR, a benchmark that stratifies evaluation by acoustic complexity across four tiers, and a Whisper fine-tuning recipe that systematically improves robustness across all of them.
Most of the performance gain comes from one change: fine-tuning Whisper with a high learning rate (2e-4). This alone consistently outperforms existing public Hindi and Malayalam ASR models across all acoustic conditions.
Training on easier examples first, the standard curriculum approach, provides no benefit and often hurts. Training on harder conditions first helps for Malayalam, producing further gains on spontaneous and noisy speech; for Hindi, the high learning rate alone is already sufficient.
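To make the two orderings concrete, the following is a minimal sketch of easy-first versus hard-first data ordering, not the training pipeline used in this work. The `Utterance` type, the `curriculum_order` helper, and the convention that tier 1 is the cleanest condition and tier 4 the most acoustically complex are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    audio_path: str  # hypothetical field name
    tier: int        # assumed convention: 1 = cleanest, 4 = most complex


def curriculum_order(utterances, hardest_first=True):
    """Return utterances sorted by acoustic tier.

    hardest_first=True puts the most complex tier first (hard-first);
    hardest_first=False gives the standard easy-first curriculum.
    Python's sort is stable, so the original order is kept within a tier.
    """
    return sorted(utterances, key=lambda u: u.tier, reverse=hardest_first)


data = [
    Utterance("a.wav", 1),
    Utterance("b.wav", 4),
    Utterance("c.wav", 2),
    Utterance("d.wav", 4),
]

hard_first = curriculum_order(data, hardest_first=True)
easy_first = curriculum_order(data, hardest_first=False)
```

In practice a curriculum is often applied in stages (for example, one training phase per tier) rather than as a single global sort; the sort is used here only because it is the simplest way to show the difference between the two orderings.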
Together, these findings enable a 244M-parameter Whisper model to outperform publicly available models up to six times its size on overall word error rate (WER), without any architectural changes or proprietary data.
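For readers unfamiliar with tier-stratified evaluation, the sketch below shows one way per-tier and overall WER could be computed. It is an illustration under stated assumptions, not the benchmark's scoring code: the function names, the whitespace tokenization, the tier labels in the example, and the choice to pool errors over all reference words for the overall figure are all assumptions.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer_by_tier(samples):
    """Compute WER per tier plus a pooled "overall" entry.

    samples: iterable of (tier, reference, hypothesis) strings.
    WER is total word errors divided by total reference words.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for tier, ref, hyp in samples:
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        dist = edit_distance(ref_tokens, hyp_tokens)
        for key in (tier, "overall"):
            errors[key] += dist
            words[key] += len(ref_tokens)
    return {k: errors[k] / words[k] for k in words if words[k] > 0}


# Hypothetical tier labels and toy transcripts, for illustration only.
scores = wer_by_tier([
    ("studio", "a b c d", "a b c d"),
    ("spontaneous", "a b c d", "a x c"),
])
```

Pooling errors over reference words weights each tier by its word count; averaging the per-tier rates instead would weight tiers equally, and the two can differ noticeably when tier sizes are unbalanced.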