Data Mining Associate

About the Company

Adalat AI is building an end-to-end justice tech stack that automates manual and clerical pain points in courtrooms, giving judges back time to focus on what matters most: decision-making and delivering justice. Our solutions - from AI-powered transcription in Indian languages to case-flow management and document navigation - are now deployed across 9 states, covering nearly 20% of India’s judiciary. Backed by leading technology companies and funders, and incubated at MIT and Oxford, Adalat AI is working to eliminate judicial delays and expand access to timely justice. Founded by a team with backgrounds in law, technology, and economics from Harvard, Oxford, MIT, and IIIT Hyderabad, we are scaling rapidly across India and the Global South.

Role Overview

We’re hiring a Data Mining Associate to power the backbone of our Legal Data Intelligence systems. In this role, you’ll work at the intersection of raw data acquisition and ML engineering, transforming messy, unstructured sources into high-quality, consumable datasets optimised for Large Language Models (LLMs).

You’ll collaborate closely with ML engineers, responding to their data requests and building reliable pipelines that give them consistent access to clean, structured, and context-rich data for training and evaluation.

This isn’t just scraping. It’s about engineering scalable data mining workflows that fuel next-gen AI.

Key Responsibilities

Partner with ML engineers to understand data needs and design mining pipelines that deliver training-ready datasets.
Collect, clean, and normalise unstructured data from legal, regulatory, and public sources into structured formats tailored for LLM workflows.
Maintain and optimise Python-based data mining systems that handle real-world data complexity.
Monitor pipeline health, anticipate changes in source structures, and build resilience against disruptions.
Document data lineage and transformation processes to ensure traceability, reproducibility, and quality control.
Support annotation and feedback loops, enabling iterative improvements to model performance.

Qualifications

Have strong Python and data engineering skills with a hacker’s mindset for extracting insights from difficult sources.
Understand how to make raw data consumable for ML/LLMs, not just collected.
Are comfortable with HTML parsing, selectors, APIs, and dynamic content handling.
Can navigate real-world constraints: proxies, captchas, rate limits, messy edge cases.

Bonus:
Have worked with tools like BeautifulSoup, Scrapy, Selenium, or Playwright for mining.
Are familiar with LLM data preprocessing (tokenization, deduplication, filtering) and/or cloud data orchestration.

What You Will Have Achieved in a Year

Built and owned production-grade data mining pipelines that reliably feed critical legal datasets into LLM training workflows.
Transformed millions of raw, unstructured documents into clean, structured, and labelled datasets consumed directly by our ML stack.
Established data quality and lineage standards that allow ML engineers to trace every token in a training dataset back to its source.
Designed reusable data preprocessing components (deduplication, filtering, normalisation) that improve model performance at scale.
Partnered closely with ML engineers to shape the feedback loop between mined data and model evaluation, helping steer product-level AI behaviour.