Build a BM25 + Embeddings Hybrid Search for a Legal-Tech Document Portal
Overview
What this challenge is about.
Stand up an OpenSearch cluster with BM25 indexing over the 2.4M-document corpus. Generate dense embeddings (you choose the model and justify the cost and quality trade-offs) and index them in a vector store (OpenSearch's k-NN module is acceptable). Implement a hybrid retriever using reciprocal rank fusion. Curate a 500-query relevance-judgment set with 3 CS team members over 2 weeks. Evaluate BM25-only, dense-only, and hybrid retrieval on MRR (mean reciprocal rank) and Recall@10. Ship the winner behind a feature flag to 10 percent of users for a 1-week telemetry review. Deliver the code, the eval report, and the rollout plan.
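As a rough illustration of the fusion step, here is a minimal sketch of reciprocal rank fusion over two ranked lists of document IDs. The function name, the sample IDs, and the smoothing constant k=60 (a commonly used default) are assumptions for illustration, not part of the brief; in practice the two lists would come from your BM25 and k-NN queries.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties on doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from each retriever.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because the fusion uses only ranks, the BM25 and embedding scores never need to be normalized onto a common scale, which is one reason it is a convenient baseline for hybrid retrieval.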
The Brief
What you'll do, and what you'll demonstrate.
Ship a hybrid BM25-plus-embeddings retrieval system that beats BM25-only on MRR and Recall@10 on a curated 500-query relevance set.
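To make the comparison concrete, the two metrics can be sketched as below. This assumes binary relevance judgments stored as a set of relevant document IDs per query; the names `qrels`, `runs`, and `evaluate` and the sample data are hypothetical, and a graded judgment scheme would need a different treatment.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def evaluate(runs, qrels, k=10):
    """Average both metrics over every judged query.

    Queries with no results in `runs` count as zero rather than being skipped.
    """
    n = len(qrels)
    mrr = sum(reciprocal_rank(runs.get(q, []), rel) for q, rel in qrels.items()) / n
    recall = sum(recall_at_k(runs.get(q, []), rel, k) for q, rel in qrels.items()) / n
    return {"mrr": mrr, f"recall@{k}": recall}


# Hypothetical judgments and one system's ranked output.
qrels = {"q1": {"d1", "d2"}, "q2": {"d9"}}
runs = {"q1": ["d3", "d1", "d2"], "q2": ["d4", "d5"]}

metrics = evaluate(runs, qrels)
print(metrics)  # {'mrr': 0.25, 'recall@10': 0.5}
```

Running `evaluate` once per system (BM25-only, dense-only, hybrid) against the same judgment set gives directly comparable numbers.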
Earning criteria — what you'll demonstrate
- Implement BM25 and dense retrieval and combine them via reciprocal rank fusion
- Curate a relevance-judgment set without burning out the CS partners
- Evaluate retrieval on MRR and Recall@10 and pick a winner defensibly
- Roll out behind a feature flag and read telemetry honestly
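For the rollout criterion, one common way to expose a feature to a fixed share of users is deterministic hash bucketing, sketched below. This is an assumption about how the flag could work, not a description of any particular feature-flag service; the flag name `hybrid_search` and the function `in_rollout` are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, flag: str = "hybrid_search", percent: int = 10) -> bool:
    """Return True if this user falls inside the rollout percentage.

    Hashing flag + user ID gives each user a stable bucket in [0, 100),
    so the same user always sees the same variant for the whole week.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


# Rough check of the exposed share on synthetic user IDs.
users = [f"user-{i}" for i in range(10_000)]
share = sum(in_rollout(u) for u in users) / len(users)
print(f"{share:.1%} of synthetic users are in the rollout")
```

Keeping assignment stable per user means telemetry from the treated group is not diluted by users flipping between BM25 and the new retriever mid-experiment.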
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.