Data Mining Associate
Remote
Development
About the Company
Adalat AI is reimagining India's judicial infrastructure through cutting-edge AI. From courtroom transcription to case summarization, we're building the end-to-end justice tech stack that powers faster, fairer courts — just like UPI did for payments and Aadhaar for identity.
We've deployed our speech recognition and legal understanding tools in 10 states, with recent launches at the Delhi High Court and partnerships backed by leading global foundations. Our founding team blends deep experience across law, AI, and NLP — with alumni from Harvard, MIT, Oxford, and IIIT-H — and we've been recognized by the world's top accelerators and competitions for our impact.
Role Overview
We’re hiring a Data Mining Associate to power the backbone of our Legal Data Intelligence systems. In this role, you’ll work at the intersection of raw data acquisition and ML engineering, transforming messy, unstructured sources into high-quality, consumable datasets optimised for Large Language Models (LLMs).
You’ll collaborate closely with ML engineers, responding to data requests and building reliable pipelines that ensure they always have access to clean, structured, and context-rich data for training and evaluation.
This isn’t just scraping — it’s about engineering scalable data mining workflows that fuel next-gen AI.
Key Responsibilities
Partner with ML engineers to understand data needs and design mining pipelines that deliver training-ready datasets.
Collect, clean, and normalise unstructured data from legal, regulatory, and public sources into structured formats tailored for LLM workflows.
Maintain and optimise Python-based data mining systems that handle real-world data complexity.
Monitor pipeline health, anticipate changes in source structures, and build resilience against disruptions.
Document data lineage and transformation processes to ensure traceability, reproducibility, and quality control.
Support annotation and feedback loops, enabling iterative improvements to model performance.
Qualifications
Have strong Python and data engineering skills with a hacker’s mindset for extracting insights from difficult sources.
Understand how to make raw data consumable for ML/LLMs, not just collected.
Are comfortable with HTML parsing, selectors, APIs, and dynamic content handling.
Can navigate real-world constraints: proxies, captchas, rate limits, messy edge cases.
Bonus: Have worked with tools like BeautifulSoup, Scrapy, Selenium, or Playwright for mining.
Bonus: Are familiar with LLM data preprocessing (tokenization, deduplication, filtering) and/or cloud data orchestration.
What You Will Achieve in a Year
Built and owned production-grade data mining pipelines that reliably feed critical legal datasets into LLM training workflows.
Transformed millions of raw, unstructured documents into clean, structured, and labeled datasets consumed directly by our ML stack.
Established data quality and lineage standards that allow ML engineers to trace every token in a model back to its source.
Designed reusable data preprocessing components (deduplication, filtering, normalization) that improve model performance at scale.
Partnered closely with ML engineers to shape the feedback loop between mined data and model evaluation, helping steer product-level AI behavior.
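For a sense of what the reusable preprocessing components mentioned above (deduplication, filtering, normalization) might look like, here is a minimal Python sketch. The function names, the length threshold, and the sample text are illustrative assumptions, not a description of Adalat AI's actual stack.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_chars: int = 20) -> bool:
    """Filter out fragments too short to be useful (threshold is an assumption)."""
    return len(text) >= min_chars


def deduplicate(texts):
    """Drop exact duplicates by hashing case-folded text, keeping first occurrences."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.casefold().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def preprocess(raw_docs, min_chars: int = 20):
    """Normalize, filter, then deduplicate a list of raw document strings."""
    cleaned = [normalize(doc) for doc in raw_docs]
    return deduplicate([doc for doc in cleaned if keep(doc, min_chars)])


if __name__ == "__main__":
    # Hypothetical sample input: two near-identical lines and one short fragment.
    sample = [
        "The  petition was\nfiled on 3 March 2021.",
        "The petition was filed on 3 March 2021.",
        "Order.",
    ]
    print(preprocess(sample))
```

This sketch only catches exact duplicates after normalization; near-duplicate detection would need a different technique and is left out here.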
Benefits and Perks
WFH with flexible work hours
Unlimited PTO
Contacts within the Harvard / MIT / Oxford ecosystem
Autonomy and ownership
Smart, humble, and friendly peers
Generous vacation
Maternity and paternity leave
Learning & Development resources
Know more about Adalat AI
Join Our Team
To apply, please send your resume and a cover letter with the subject line: "Data Mining Engineer".