Benchmark Long-Context Architectures on a Legal-Doc Retrieval Task
Overview
What this challenge is about.
You receive a public legal-QA dataset (e.g., LongBench's legal split or similar) filtered to documents longer than 50,000 tokens. Implement or wrap three architectures: a sliding-window Transformer baseline, a Mamba-class state-space model, and a hybrid (e.g., Jamba-style). Fine-tune each on the same training split under a shared compute budget, then evaluate on the held-out test split for retrieval accuracy (top-1 and top-5) and on synthetic long-context needle-in-a-haystack probes. Write a 6-page technical report that follows the consultancy's existing style guide.
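The top-1 and top-5 retrieval accuracy named above could be computed along the following lines. This is a minimal sketch, not a prescribed scoring script: it assumes each query has exactly one gold passage and that each model returns a ranked list of passage ids. The function name `top_k_accuracy` and the ids in the example are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_ids: Sequence[Sequence[str]],
    gold_ids: Sequence[str],
    k: int,
) -> float:
    """Fraction of queries whose gold passage id is among the first k ranked ids.

    Assumption: one gold passage per query (hypothetical setup for illustration).
    """
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("one ranking per gold id is required")
    if not gold_ids:
        raise ValueError("empty evaluation set")
    hits = sum(gold in ranking[:k] for ranking, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)


# Toy rankings for three queries (illustrative ids, not real data).
rankings = [["p3", "p1", "p7"], ["p2", "p9", "p4"], ["p5", "p6", "p8"]]
gold = ["p3", "p4", "p0"]

top1 = top_k_accuracy(rankings, gold, k=1)
top5 = top_k_accuracy(rankings, gold, k=5)
```

Using the same function, with the same candidate pool, for all three architectures keeps the metric itself from becoming a source of unfairness.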
The Brief
What you'll do, and what you'll demonstrate.
Determine which long-context architecture family delivers the best accuracy/compute trade-off on real legal documents.
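One possible way to make "best accuracy/compute trade-off" concrete is to keep only the runs that are not dominated on both axes (a Pareto front). The sketch below assumes accuracy is higher-is-better and cost is a single lower-is-better number such as GPU-hours. The `Result` type, the run names, and every number are placeholders for illustration, not measurements.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    accuracy: float  # higher is better
    cost: float      # e.g. GPU-hours; lower is better (assumed unit)


def pareto_front(results: List[Result]) -> List[Result]:
    """Return the runs that no other run beats on accuracy and cost together."""
    front = []
    for r in results:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in results
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r.cost)


# Placeholder numbers only; replace with measured values from your own runs.
runs = [
    Result("arch_a", accuracy=0.60, cost=10.0),
    Result("arch_b", accuracy=0.58, cost=6.0),
    Result("arch_c", accuracy=0.59, cost=12.0),
]
front_names = [r.name for r in pareto_front(runs)]
```

With these placeholder values, `arch_c` drops out because `arch_a` is both more accurate and cheaper; the remaining runs represent genuine trade-offs that the report has to argue between.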
Earning criteria — what you'll demonstrate
- Reason about the long-context trade-off space across architecture families
- Implement a fair multi-architecture benchmark on a non-toy task
- Author a conference-quality technical report
- Communicate architecture trade-offs to a non-research audience (lawyers)
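For the fair-benchmark criterion, it helps if every architecture sees byte-identical synthetic probes. The sketch below shows one way to build a needle-in-a-haystack probe deterministically from a relative depth and a target context length. It assumes the input is already tokenized into a list; `build_probe` and the toy tokens are hypothetical names, not part of any dataset or library.

```python
from typing import List, Sequence, Tuple


def build_probe(
    filler_tokens: Sequence[str],
    needle_tokens: Sequence[str],
    depth: float,
    context_len: int,
) -> Tuple[List[str], int]:
    """Place the needle at a relative depth inside filler text.

    depth=0.0 puts the needle at the start, depth=1.0 at the end.
    Returns the probe tokens and the index where the needle begins.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    n_fill = context_len - len(needle_tokens)
    if n_fill < 0 or n_fill > len(filler_tokens):
        raise ValueError("context_len incompatible with filler/needle sizes")
    filler = list(filler_tokens[:n_fill])
    pos = round(depth * n_fill)
    return filler[:pos] + list(needle_tokens) + filler[pos:], pos


# Toy example: 50-token context with the needle placed halfway in.
filler = ["w"] * 100
needle = ["N1", "N2"]
probe, start = build_probe(filler, needle, depth=0.5, context_len=50)
```

Because the function has no randomness, sweeping a fixed grid of depths and context lengths yields the same probe set for the sliding-window, state-space, and hybrid models.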
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
ML Researcher
Comparing architectures under a fair, shared protocol mirrors the evaluation discipline expected of a first-year ML researcher.
This challenge sharpens
- transformers
- state-space-models
- benchmarking
NLP Engineer
Hands-on long-context evaluation on real legal documents transfers directly to NLP engineering roles at legal-tech and enterprise-search companies.
This challenge sharpens
- long-context-architectures
- transformers
- pytorch