Build a Vector-Search Backend for an Enterprise AI Knowledge Assistant
Overview
What this challenge is about.
You receive a corpus of about 20,000 PDFs (a mix of scanned and digital documents) totalling about 30 GB, plus a labeled retrieval set of 200 queries with human-judged ground-truth passages. Build a parsing-and-chunking pipeline (text extraction, OCR fallback for scans, semantic chunking), an embedding pipeline using an open embedding model, and a hybrid (vector + BM25) retrieval API. Success means recall@10 above 0.85 on the labeled set, ingest throughput documented in pages per minute, and p95 per-query latency under 300 ms.
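The two numeric targets above can be checked with a small harness. The following is a minimal sketch, not a prescribed implementation. It assumes recall@10 means the fraction of a query's ground-truth passages found in the top 10 results, averaged over all queries, and that p95 is computed with the nearest-rank method. The brief does not fix either definition, so confirm them against the labeled set's documentation. The names `runs`, `qrels`, and `latencies_ms` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant passages that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10
) -> float:
    """Average recall@k over every labeled query (a missing run scores 0)."""
    if not qrels:
        raise ValueError("qrels must contain at least one query")
    scores = [recall_at_k(runs.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(scores) / len(scores)


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile (an assumed convention, one of several in use)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_targets(
    runs: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
    latencies_ms: Sequence[float],
    recall_target: float = 0.85,
    p95_target_ms: float = 300.0,
) -> bool:
    """True when recall@10 is above the target and p95 latency is below it."""
    recall_ok = mean_recall_at_k(runs, qrels, k=10) > recall_target
    latency_ok = percentile(latencies_ms, 95) < p95_target_ms
    return recall_ok and latency_ok
```

Here `runs` maps each query ID to the ranked passage IDs your API returned, `qrels` maps each query ID to its set of ground-truth passage IDs, and `latencies_ms` holds one end-to-end timing per query.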
The Brief
What you'll do, and what you'll demonstrate.
Build a RAG ingest-and-retrieval backend that reaches recall@10 above 0.85 and p95 query latency under 300 ms on an enterprise PDF corpus.
Earning criteria — what you'll demonstrate
- Design a chunking strategy informed by retrieval evaluation
- Operate an embedding pipeline at corpus scale
- Combine vector and lexical retrieval into a hybrid system
- Measure retrieval quality with standard metrics (recall@k, MRR)
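The last two criteria can be illustrated together. The sketch below fuses a vector ranking and a BM25 ranking with reciprocal rank fusion (RRF), then scores the result with MRR. RRF is one common fusion method, chosen here as an assumption because the brief does not mandate a fusion strategy. The constant `k=60` is the value commonly used with RRF and can be tuned. All function and variable names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 10
) -> List[str]:
    """Fuse several ranked ID lists; each list adds 1 / (k + rank) per item."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


def mean_reciprocal_rank(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Average of 1 / rank of the first relevant result (0 when none is found)."""
    if not qrels:
        raise ValueError("qrels must contain at least one query")
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, doc_id in enumerate(runs.get(qid, []), start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels)
```

Because RRF uses only rank positions, it avoids having to put cosine similarities and BM25 scores on a common scale. A weighted score blend is a reasonable alternative if your evaluation shows one retriever deserves more trust.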
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
AI Engineer
Building production-grade RAG retrieval backends is among the most common responsibilities in AI-engineer job descriptions right now; this challenge ships the load-bearing piece.
This challenge sharpens
- rag
- vector-search
- embeddings
Data Engineer
Corpus-scale ingest with parsing fallbacks and resumability is core data-engineering work that supports any RAG or search team.
This challenge sharpens
- document-parsing
- python
- retrieval-evaluation
Machine Learning Engineer
Owning the retrieval-evaluation harness with recall@k and MRR mirrors how MLEs run model evals at scale.
This challenge sharpens
- retrieval-evaluation
- embeddings
- vector-search