ML Researcher - Speech
Remote
Adalat AI is building an end-to-end justice tech stack that automates manual and clerical pain points in courtrooms, giving judges back time to focus on what matters most: decision-making and delivering justice. Our solutions - from AI-powered transcription in Indian languages to case-flow management and document navigation - are now deployed across 9 states, covering nearly 20% of India’s judiciary. Backed by leading technology companies and funders, and incubated at MIT and Oxford, Adalat AI is working to eliminate judicial delays and expand access to timely justice. Founded by a team with backgrounds in law, technology, and economics from Harvard, Oxford, MIT, and IIIT Hyderabad, we are scaling rapidly across India and the Global South.
Role Overview
Indian courts produce thousands of hours of audio every day. That audio is noisy, multi-speaker, multi-language, and unlike anything a standard ASR system was trained on. A judge will dictate in Hindi and slip into legal English. A lawyer will respond in a dialect of Kannada specific to a particular district. Two speakers will overlap. The session will run for six hours in a room with a ceiling fan and no acoustic treatment. Word error rate (WER) on a clean benchmark tells you almost nothing about whether your model will work in that room.
In this role, you'll own Adalat's acoustic and speech modeling agenda: what architectures, what training data, what failure modes matter, and how to actually measure them. The work spans ASR for Indic legal speech, diarization, speaker identification, speech-text alignment, and noise robustness. You'll define evaluation methodology that goes beyond standard metrics because the standard metrics aren't enough. And you'll do it in one of the most linguistically diverse, data-scarce, acoustically challenging domains in applied speech research.
This is a Research Scientist role. Your primary output is modeling judgment, experimental insight, and reproducible work. You'll work closely with engineers who take your models and make them deployable in real courtroom workflows.
Key Responsibilities
1. ASR development for Indic legal speech
Fine-tune and adapt state-of-the-art ASR systems (Whisper, wav2vec2, Conformer/Parakeet, or successors) for Indian legal speech.
Model multi-dialect and code-switched speech across 10+ Indian languages.
Build robustness to real courtroom acoustic conditions: noise, reverberation, simultaneous speakers, variable microphone quality.
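One standard way to build that robustness is additive-noise augmentation: mixing recorded or synthetic noise into clean training audio at a controlled signal-to-noise ratio, so the model sees courtroom-like conditions during training. A minimal NumPy sketch of the idea (the function name and the synthetic tone/noise signals are illustrative, not from any Adalat pipeline):

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix noise into a clean waveform at a target signal-to-noise ratio (in dB)."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that clean_power / scaled_noise_power == 10 ** (snr_db / 10).
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
# One second of a 440 Hz tone at 16 kHz stands in for clean speech.
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)  # white noise stands in for room noise
noisy = add_noise_at_snr(clean, noise, snr_db=10)
```

In practice the noise source would be real room recordings (fans, traffic, crowd murmur), and the SNR would be sampled per utterance rather than fixed.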
2. Diarization, speaker identification, and speech-text alignment
Build diarization systems that can reliably separate judges, lawyers, and witnesses in multi-speaker courtroom proceedings.
Develop speaker identification for consistent labeling across long sessions and multiple sittings.
Build speech-text alignment capabilities that make transcripts usable for downstream NLP tasks.
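Diarization quality is usually summarized by diarization error rate (DER). The toy frame-level version below is a sketch of the core idea only: it assumes speaker labels are already aligned and skips the forgiveness collar, overlap handling, and optimal label-mapping step that a real scorer performs. All names and segment values are invented for illustration.

```python
def frame_labels(segments, num_frames):
    """Expand (start_frame, end_frame, speaker) segments into a per-frame label list."""
    labels = [None] * num_frames
    for start, end, speaker in segments:
        for t in range(start, end):
            labels[t] = speaker
    return labels

def toy_der(ref_segments, hyp_segments, num_frames):
    """Fraction of reference-speech frames where the hypothesis label disagrees."""
    ref = frame_labels(ref_segments, num_frames)
    hyp = frame_labels(hyp_segments, num_frames)
    scored = [(r, h) for r, h in zip(ref, hyp) if r is not None]
    errors = sum(1 for r, h in scored if r != h)
    return errors / len(scored)

# Hypothesis switches from judge to lawyer 10 frames late.
ref = [(0, 50, "judge"), (50, 100, "lawyer")]
hyp = [(0, 60, "judge"), (60, 100, "lawyer")]
print(toy_der(ref, hyp, 100))  # → 0.1
```

In a six-hour session with dozens of speaker turns, small boundary errors like this one compound, which is why long-session DER matters more here than per-clip accuracy.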
3. Evaluation methodology for courtroom speech
Define the metrics and test sets that capture what matters in legal transcription: beyond WER, to the errors that are actually consequential.
Build evaluation suites across acoustic conditions, dialects, language pairs, and court tiers.
Conduct structured failure analyses: not just where the model is wrong, but why, and what it would cost to fix.
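The gap between WER and consequence is easy to show. In the invented example below, two hypotheses each differ from the reference by a single word, so plain WER scores them identically, even though only one of them reverses the legal meaning. A minimal pure-Python WER for illustration (real evaluation would use an established scorer):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between ref[:i] and hyp[:j] (rolling row).
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev_diag, d[j] = d[j], min(d[j] + 1,          # deletion
                                        d[j - 1] + 1,      # insertion
                                        prev_diag + cost)  # substitution / match
    return d[len(hyp)] / len(ref)

ref = "the accused is not guilty of the charge"
benign = "the accused is not guilty of charge"    # dropped article
critical = "the accused is guilty of the charge"  # dropped negation

# Both hypotheses are one deletion away from the reference.
print(wer(ref, benign), wer(ref, critical))  # → 0.125 0.125
```

An evaluation suite for legal transcription would weight the second error very differently from the first, which is exactly the kind of distinction this role is expected to formalize.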
4. Data curation and research dissemination
Build labeled speech corpora from real courtroom recordings across languages and court environments.
Design annotation pipelines for ASR, diarization, and speaker ID that are high-quality and scalable.
Document experiments clearly; publish externally where the work warrants it.
Qualifications
Must have
2–6 years in speech research: ASR, diarization, speaker recognition, or closely related areas.
Hands-on experience fine-tuning and evaluating modern end-to-end ASR systems.
Strong Python; proficiency with audio processing libraries (librosa, torchaudio, or equivalent).
Solid understanding of acoustic modeling fundamentals.
Experience designing evaluations that produce insight, not just accuracy numbers.
Strong plus
Publications at Interspeech, ICASSP, NeurIPS, ICLR, or comparable venues.
Experience with Indic speech (ASR or TTS for any Indian language).
Prior work on diarization, speaker identification, or speech-text alignment.
Experience with low-resource ASR: data augmentation, semi-supervised, or self-supervised methods.
What You Will Achieve in a Year
You'll have ASR models running in production across thousands of Indian courtrooms, outperforming general-purpose baselines on the evaluation sets that actually reflect field conditions. You'll have a diarization system that reliably separates speakers across long, multi-speaker sessions. You'll have defined an evaluation methodology for courtroom speech that captures what WER doesn't, and made the case for why that distinction matters. You'll have at least one paper in review. And the team will measure speech quality the way you defined it.
Benefits and Perks
WFH with flexible work hours.
Unlimited PTO and generous vacation.
Contacts within the Harvard / MIT / Oxford ecosystem.
Autonomy and ownership.
Smart, humble, and friendly peers.
Maternity and paternity leave.
Learning and development resources.