Long-Context QA Evaluation Benchmark for Legal Memoranda
Overview
What this challenge is about.
You receive 25 anonymized legal memoranda (50-90 pages each) and 100 QA pairs whose answers are deliberately spread across document depth: 25 in pages 1-20, 25 in pages 21-40, 25 in pages 41-60, and 25 in pages 61 and beyond. Run three long-context LLMs on the QA set, varying where the answer sits within the prompt to expose 'lost in the middle' effects. Score each model with strict F1 against the gold spans, broken down by answer position. Deliver the benchmark code, a results report that includes a position-vs-accuracy plot, and a 2-page recommendation on which model variant to ship.
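As a starting point, the per-position scoring could be sketched as below. This is a minimal illustration, not the required implementation: it assumes "strict F1" means token-level F1 with only lowercasing and whitespace tokenization (no article or punctuation stripping), and it assumes boundary pages fall into the lower bucket (page 20 counts as 1-20, and so on). The function names and the record fields `gold_page`, `prediction`, and `gold` are hypothetical.

```python
from collections import Counter, defaultdict


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted span and a gold span.

    Assumption: tokens are lowercased and split on whitespace only;
    no other normalization is applied.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def page_bucket(page: int) -> str:
    """Map a gold-answer page number to a position bucket.

    Assumption: upper bounds (20, 40, 60) are inclusive.
    """
    if page <= 20:
        return "1-20"
    if page <= 40:
        return "21-40"
    if page <= 60:
        return "41-60"
    return "61+"


def f1_by_bucket(records: list[dict]) -> dict[str, float]:
    """Average token F1 per position bucket.

    Each record is a dict with hypothetical keys:
    'gold_page' (int), 'prediction' (str), 'gold' (str).
    """
    scores = defaultdict(list)
    for record in records:
        bucket = page_bucket(record["gold_page"])
        scores[bucket].append(token_f1(record["prediction"], record["gold"]))
    return {bucket: sum(vals) / len(vals) for bucket, vals in scores.items()}
```

Running `f1_by_bucket` once per model gives the per-bucket means that a position-vs-accuracy plot would draw on. If your reading of "strict" is exact-span match instead, swap `token_f1` for an equality check and state that choice in the report.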
The Brief
What you'll do, and what you'll demonstrate.
Design a long-context QA benchmark that exposes position-dependent accuracy and recommend one model variant for production.
Earning criteria — what you'll demonstrate
- Design a long-context evaluation that captures position-dependent effects
- Apply strict F1 evaluation on long-document span extraction
- Reason about 'lost in the middle' effects in modern long-context LLMs
- Translate model-benchmark findings into a procurement recommendation
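One way to capture position-dependent effects is to control where the answer-bearing passage lands in the prompt. The sketch below is illustrative only: it assumes each memorandum has already been split into passages, that one passage contains the gold span, and that reordering passages is acceptable for the experiment. The names `place_answer`, `build_prompt`, and the `depth` parameter are hypothetical, and the prompt template is a placeholder.

```python
def place_answer(distractors: list[str], answer_passage: str, depth: float) -> list[str]:
    """Insert the answer-bearing passage at a relative depth in the context.

    depth=0.0 puts it first, depth=1.0 puts it last, and values in
    between place it proportionally among the distractor passages.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(distractors))
    return distractors[:index] + [answer_passage] + distractors[index:]


def build_prompt(distractors: list[str], answer_passage: str,
                 question: str, depth: float) -> str:
    """Assemble a QA prompt with the answer passage at the given depth.

    The template here is a placeholder; adapt it to each model's
    expected chat or completion format.
    """
    passages = place_answer(distractors, answer_passage, depth)
    context = "\n\n".join(passages)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


# Example sweep: the same question asked with the answer at five depths.
if __name__ == "__main__":
    filler = [f"Filler passage {i}." for i in range(4)]
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_prompt(filler, "The lease ends in 2031.",
                              "When does the lease end?", depth)
        print(depth, prompt.index("The lease ends in 2031."))
```

Sweeping `depth` over a fixed grid for every QA pair keeps the question and content constant, so any change in score can be attributed to position rather than to the question itself.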
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
ML Researcher
Designing a long-context benchmark that exposes position effects is genuine applied-research work and a strong interview portfolio piece.
This challenge sharpens
- benchmark-design
- long-context-qa
- experiment-design
Applied AI Scientist
Turning a model benchmark into a procurement recommendation for a CTO is exactly what applied AI scientists do at vertical AI startups.
This challenge sharpens
- model-evaluation
- long-context-qa
- experiment-design
AI Safety Researcher
Surfacing capability-failure modes like 'lost in the middle' is the kind of evaluation work safety researchers do on long-context systems.
This challenge sharpens
- lost-in-the-middle
- model-evaluation
- benchmark-design