Data Mining Associate
Remote
Development
About the Company
Adalat AI is building an end-to-end justice tech stack that automates manual and clerical pain points in courtrooms, giving judges back time to focus on what matters most: decision-making and delivering justice. Our solutions - from AI-powered transcription in Indian languages to case-flow management and document navigation - are now deployed across 9 states, covering nearly 20% of India’s judiciary. Backed by leading technology companies and funders, and incubated at MIT and Oxford, Adalat AI is working to eliminate judicial delays and expand access to timely justice. Founded by a team with backgrounds in law, technology, and economics from Harvard, Oxford, MIT, and IIIT Hyderabad, we are scaling rapidly across India and the Global South.
Role Overview
We’re hiring a Data Mining Associate to power the backbone of our Legal Data Intelligence systems. In this role, you’ll work at the intersection of raw data acquisition and ML engineering, transforming messy, unstructured sources into high-quality, consumable datasets optimised for Large Language Models (LLMs).
You’ll collaborate closely with ML engineers, responding to data requests and building reliable pipelines that ensure they always have access to clean, structured, and context-rich data for training and evaluation.
This isn’t just scraping — it’s about engineering scalable data mining workflows that fuel next-gen AI.
Key Responsibilities
Partner with ML engineers to understand data needs and design mining pipelines that deliver training-ready datasets.
Collect, clean, and normalise unstructured data from legal, regulatory, and public sources into structured formats tailored for LLM workflows.
Maintain and optimise Python-based data mining systems that handle real-world data complexity.
Monitor pipeline health, anticipate changes in source structures, and build resilience against disruptions.
Document data lineage and transformation processes to ensure traceability, reproducibility, and quality control.
Support annotation and feedback loops, enabling iterative improvements to model performance.
Qualifications
You have strong Python and data engineering skills, with a hacker’s mindset for extracting insights from difficult sources.
You understand how to make raw data consumable for ML/LLMs, not just collected.
You are comfortable with HTML parsing, selectors, APIs, and dynamic content handling.
You can navigate real-world constraints: proxies, captchas, rate limits, and messy edge cases.
Bonus: experience with tools like BeautifulSoup, Scrapy, Selenium, or Playwright for mining.
Bonus: familiarity with LLM data preprocessing (tokenisation, deduplication, filtering) and/or cloud data orchestration.
What You Will Achieve in a Year
Built and owned production-grade data mining pipelines that reliably feed critical legal datasets into LLM training workflows.
Transformed millions of raw, unstructured documents into clean, structured, and labelled datasets consumed directly by our ML stack.
Established data quality and lineage standards that allow ML engineers to trace every token in a model back to its source.
Designed reusable data preprocessing components (deduplication, filtering, normalisation) that improve model performance at scale.
Partnered closely with ML engineers to shape the feedback loop between mined data and model evaluation, helping steer product-level AI behaviour.
Benefits and Perks
WFH with flexible work hours.
Unlimited PTO.
Contacts within the Harvard / MIT / Oxford ecosystem.
Autonomy and ownership.
Smart, humble, and friendly peers.
Generous vacation.
Maternity and paternity leave.
Learning and development resources.
Know more about Adalat AI
Join Our Team
To apply, please send your resume and a cover letter with the subject line: "Data Mining Engineer".