Overview
What this challenge is about.
Use a public multilingual corpus (e.g., MultiEURLEX or a subset of EUR-Lex) plus a small hand-built test set of around 100 cross-lingual query-passage pairs. Fine-tune a multilingual sentence-embedding model (e.g., LaBSE, multilingual-e5), or evaluate one off the shelf. Build a retrieval pipeline with FAISS, measure Recall@10 on cross-lingual queries, and qualitatively probe how the embeddings handle legal-domain terms. Deliver a 4-page memo and a Streamlit demo for the product team.
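To make the retrieval and Recall@10 step concrete, here is a minimal sketch in pure Python. It assumes each query in the test set has exactly one relevant passage, so Recall@k reduces to the share of queries whose gold passage appears in the top k. The function names (`normalize`, `top_k`, `recall_at_k`) and the toy 3-dimensional vectors are illustrative assumptions, not part of the brief. The brute-force cosine search stands in for what an exact inner-product FAISS index over normalized embeddings would compute in the real pipeline.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def top_k(query_vec, passage_vecs, k):
    """Return the indices of the k passages most similar to the query."""
    q = normalize(query_vec)
    scores = []
    for idx, passage in enumerate(passage_vecs):
        p = normalize(passage)
        scores.append((sum(a * b for a, b in zip(q, p)), idx))
    # Highest score first; ties go to the lower index so results are deterministic.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores[:k]]

def recall_at_k(query_vecs, passage_vecs, gold, k=10):
    """Fraction of queries whose single gold passage index is in the top k."""
    hits = sum(
        1 for q, g in zip(query_vecs, gold) if g in top_k(q, passage_vecs, k)
    )
    return hits / len(query_vecs)

# Toy vectors standing in for real model embeddings (assumption for illustration).
passages = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9]]
gold = [0, 2]  # index of the one relevant passage for each query

print(recall_at_k(queries, passages, gold, k=1))
```

With around 100 query-passage pairs, a single query moves Recall@10 by about one percentage point, so small differences between models should be read with caution.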
The Brief
What you'll do, and what you'll demonstrate.
Build and evaluate a cross-lingual legal-passage retrieval system across EN/DE/FR/IT, and show where distributional semantics helps and where it fails.
Earning criteria — what you'll demonstrate
- Apply multilingual sentence embeddings to a real retrieval task
- Evaluate retrieval with appropriate metrics and per-language slices
- Probe distributional-semantics behavior on legal-domain terms
- Communicate retrieval-quality trade-offs to a product audience
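One possible way to produce the per-language slices mentioned in the criteria above is to group results by (query language, passage language) and compute Recall@k within each group. The sketch below assumes a hypothetical record layout (`query_lang`, `passage_lang`, `gold_id`, `ranked_ids`) and made-up passage ids; neither comes from the brief.

```python
from collections import defaultdict

def recall_by_language_pair(results, k=10):
    """Compute Recall@k separately for each (query_lang, passage_lang) pair.

    Each record is a dict with the hypothetical keys query_lang, passage_lang,
    gold_id (the one relevant passage) and ranked_ids (retrieved ids, best first).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in results:
        pair = (record["query_lang"], record["passage_lang"])
        totals[pair] += 1
        if record["gold_id"] in record["ranked_ids"][:k]:
            hits[pair] += 1
    return {pair: hits[pair] / totals[pair] for pair in totals}

# Made-up records for illustration only; real ones would come from the retriever.
results = [
    {"query_lang": "en", "passage_lang": "de", "gold_id": "p1", "ranked_ids": ["p1", "p7"]},
    {"query_lang": "en", "passage_lang": "de", "gold_id": "p2", "ranked_ids": ["p9", "p4"]},
    {"query_lang": "fr", "passage_lang": "it", "gold_id": "p3", "ranked_ids": ["p5", "p3"]},
]

for pair, score in sorted(recall_by_language_pair(results, k=2).items()):
    print(pair, score)
```

Reporting the number of queries behind each slice alongside the score helps a product audience judge how much weight a per-pair figure can bear.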
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
NLP Engineer
Cross-lingual retrieval with multilingual embeddings is day-one NLP engineering work at legal-tech and multilingual product companies.
This challenge sharpens
- multilingual-nlp
- sentence-embeddings
- information-retrieval
ML Researcher
Probing how distributional semantics handles domain-specific terms is the kind of empirical research an ML researcher publishes early in their career.
This challenge sharpens
- distributional-semantics
- evaluation
- sentence-embeddings
AI Engineer
Packaging retrieval research as a working Streamlit demo for a product team is the bridging work AI engineers do between research and product.
This challenge sharpens
- information-retrieval
- sentence-embeddings
- python