About the Company

Adalat AI is building an end-to-end justice tech stack that automates manual and clerical pain points in courtrooms, giving judges back time to focus on what matters most: decision-making and delivering justice. Our solutions - from AI-powered transcription in Indian languages to case-flow management and document navigation - are now deployed across 9 states, covering nearly 20% of India’s judiciary. Backed by leading technology companies and funders, and incubated at MIT and Oxford, Adalat AI is working to eliminate judicial delays and expand access to timely justice. Founded by a team with backgrounds in law, technology, and economics from Harvard, Oxford, MIT, and IIIT Hyderabad, we are scaling rapidly across India and the Global South.

Role Overview

Most ML systems are built for controlled environments. Adalat's are not. Our models run across 5,000+ courtrooms — from the Supreme Court to remote district courts with unreliable connectivity — processing 8–12 hours of live audio every day. A model that works in a data centre has to survive rooms with ceiling fans, simultaneous speakers in three languages, and hardware that hasn't been refreshed in a decade. That gap between the lab and the courtroom is what this role owns.

In this role, you'll be responsible for making our speech and language models fast, cost-effective, and deployable at scale on real-world hardware. You'll work at the intersection of model architecture and systems engineering — applying compression techniques, building compiler-aware inference pipelines, and designing serving infrastructure that holds up under the most demanding conditions in Indian legal AI. This is not model-agnostic infrastructure work: you'll collaborate closely with our Speech and NLP researchers to understand what each model does and find the optimisations that preserve what matters while cutting what doesn't.

Key Responsibilities

1. Optimise inference pipelines

  • Apply quantisation (PTQ, QAT), pruning, and distillation to reduce model size and latency without meaningful quality loss.

  • Build compiler-aware workflows using MLIR, LLVM, or ONNX to squeeze performance from diverse CPU and GPU targets.

  • Profile, benchmark, and continuously improve throughput across the full ML stack.
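As a flavour of the post-training quantisation (PTQ) work this involves, here is a minimal symmetric int8 weight quantiser in plain Python. This is an illustrative toy sketch, not tied to any specific framework; `quantize_int8` and `dequantize` are hypothetical names.

```python
def quantize_int8(values, scale=None):
    """Symmetric PTQ: map floats to int8 codes using a single scale factor."""
    if scale is None:
        scale = max(abs(v) for v in values) / 127.0  # fit the largest weight
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 1.27]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)  # close to the originals, 4x smaller storage
```

The quality loss shows up in the rounding step: small weights like `0.003` collapse to zero, which is exactly the trade-off PTQ/QAT work has to measure and manage per model.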

2. Build and maintain scalable serving infrastructure

  • Design async batching, concurrency, and load-balancing strategies for high-throughput, real-time audio processing.

  • Architect deployment pipelines that span cloud and edge environments — including constrained, low-bandwidth courtroom settings.

  • Own cost efficiency: understand exactly where compute is being spent and drive it down without degrading output quality.
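To make the async-batching idea concrete, here is a minimal micro-batching sketch in `asyncio`: concurrent requests are grouped so a batched inference function runs once per batch rather than once per request. `MicroBatcher` and its parameters are hypothetical names for illustration, not our actual serving code.

```python
import asyncio

class MicroBatcher:
    """Groups concurrent requests into batches for a batched inference fn."""

    def __init__(self, infer_fn, max_batch=8, max_wait_ms=20):
        self.infer_fn = infer_fn      # takes a list of inputs, returns a list of outputs
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.queue = asyncio.Queue()
        self._worker = None

    async def submit(self, item):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        if self._worker is None:
            self._worker = asyncio.create_task(self._drain())
        return await fut

    async def _drain(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]        # block for the first request
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch:      # fill until size or deadline
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            for (_, fut), out in zip(batch, self.infer_fn([x for x, _ in batch])):
                fut.set_result(out)

async def demo():
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs], max_batch=4)
    return await asyncio.gather(*(batcher.submit(i) for i in range(4)))

results = asyncio.run(demo())
```

The `max_wait_ms` knob is the latency/throughput trade-off in miniature: wait longer and batches fill up (better GPU utilisation), wait less and tail latency drops.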

3. Debug and harden production systems

  • Stress-test inference across thousands of hours of real courtroom audio and documents.

  • Build diagnostic tooling for runtime failures — latency spikes, memory pressure, model drift in noisy acoustic environments.

  • Create monitoring and alerting infrastructure that catches regressions before users do.
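One simple shape such regression-catching infrastructure can take is a sliding-window latency check against a known-good baseline. The sketch below is illustrative only; `LatencyMonitor` and its thresholds are hypothetical, not a description of our actual tooling.

```python
from collections import deque

class LatencyMonitor:
    """Flags a regression when the observed p95 latency drifts above baseline."""

    def __init__(self, baseline_p95_ms, window=100, tolerance=1.2, min_samples=20):
        self.baseline = baseline_p95_ms
        self.tolerance = tolerance          # e.g. alert at 20% above baseline
        self.min_samples = min_samples      # avoid alerting on thin data
        self.samples = deque(maxlen=window)

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def p95(self):
        ordered = sorted(self.samples)
        return ordered[max(0, int(0.95 * len(ordered)) - 1)]

    def regressed(self):
        return (len(self.samples) >= self.min_samples
                and self.p95() > self.baseline * self.tolerance)

monitor = LatencyMonitor(baseline_p95_ms=120)
for _ in range(50):
    monitor.record(100)        # healthy traffic
healthy = monitor.regressed()  # still within tolerance
for _ in range(50):
    monitor.record(200)        # a sustained latency spike
spiked = monitor.regressed()   # now over 120 * 1.2 ms at p95
```

Using a percentile rather than a mean matters here: courtroom audio workloads are bursty, and the tail is what a judge waiting on a transcript actually experiences.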

4. Partner with Research and Product

  • Translate system constraints back into model architecture recommendations for Speech and NLP researchers.

  • Work with the Product ML team to define latency and cost targets that are both technically achievable and user-meaningful.

  • Contribute to model design decisions early — before a model is trained in a way that makes efficient deployment impossible.

Qualifications

Must have

  • 3–6 years in ML optimisation, inference engineering, model compression, or compiler development.

  • Strong Python and C/C++ programming.

  • Hands-on experience with quantisation, distillation, or pruning in production.

  • Proficiency with deep learning frameworks (PyTorch or TensorFlow) and at least one ML compiler or runtime stack (LLVM, MLIR, ONNX Runtime, or TensorRT).

  • Experience deploying ML models in resource-constrained or latency-sensitive environments.

Strong plus

  • Experience with advanced LLM serving strategies (vLLM, continuous batching, speculative decoding).

  • Prior work with speech pipelines (ASR, diarization) at the systems level.

  • Experience on edge or embedded hardware targets.

  • Publications at systems or ML venues (MLSys, OSDI, NeurIPS, ICML, Interspeech, or similar).

What You Will Achieve in a Year

In your first year, you'll have optimised Adalat's end-to-end ML stack for deployment across 5,000+ courtrooms running real-time speech and language processing for 10+ hours daily. You'll have hit concrete latency and cost targets that made previously impractical deployments viable. You'll have built runtime monitoring that means the team finds out about inference regressions before judges do. The researchers will think of you not just as the person who makes their models faster, but as someone who shapes how they design models in the first place.

Benefits and Perks

  • WFH with flexible work hours.

  • Unlimited PTO and generous vacation.

  • Contacts within the Harvard / MIT / Oxford ecosystem.

  • Autonomy and ownership.

  • Smart, humble, and friendly peers.

  • Maternity and paternity leave.

  • Learning and development resources.

Contact us

Get in touch

It's so easy

Have questions or ideas? We’d love to hear from you. Reach out to us to learn more about our work or explore collaboration opportunities.
