Long-Context QA Evaluation Benchmark for Legal Memoranda

Free · Verified credential · 3 weeks · Expert

Overview

What this challenge is about.

You receive 25 anonymized legal memoranda (50-90 pages each) and 100 QA pairs whose answers are deliberately spread across document depth: 25 in pages 1-20, 25 in pages 21-40, 25 in pages 41-60, and 25 in pages 61 and beyond. Run three long-context LLMs on the QA set, varying where the answer-bearing text sits within the prompt to expose 'lost in the middle' effects. Score with strict F1 against gold spans, broken down by answer position. Deliver the benchmark code, a results report that includes a position-vs-accuracy plot, and a 2-page recommendation on which model variant to ship.
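One way to vary the answer position is to insert the answer-bearing passage at a controlled relative depth among the other passages of the prompt. The following is a minimal sketch under that assumption; the function names, the `depth` parameter, and the prompt template are hypothetical and not part of the challenge materials.

```python
from typing import List


def insert_at_depth(gold_passage: str, distractors: List[str], depth: float) -> List[str]:
    """Return the passages with the gold passage placed at a relative depth.

    depth=0.0 puts the gold passage first, depth=1.0 puts it last,
    and depth=0.5 puts it in the middle of the distractors.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    idx = int(depth * len(distractors))  # insertion index among distractors
    return distractors[:idx] + [gold_passage] + distractors[idx:]


def build_prompt(gold_passage: str, distractors: List[str], depth: float, question: str) -> str:
    """Assemble a prompt (hypothetical template) with the gold passage at the given depth."""
    context = "\n\n".join(insert_at_depth(gold_passage, distractors, depth))
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```

Sweeping `depth` over a few fixed values (for example 0.0, 0.25, 0.5, 0.75, 1.0) for every QA pair gives one accuracy point per position, which is the raw material for a position-vs-accuracy plot.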

Credential: Blockchain-anchored

Shareable: LinkedIn-ready

Language: English

Pace: Self-paced

The Brief

What you'll do, and what you'll demonstrate.

Design a long-context QA benchmark that exposes position-dependent accuracy and recommend one model variant for production.

Earning criteria — what you'll demonstrate

Design a long-context evaluation that captures position-dependent effects
Apply strict F1 evaluation on long-document span extraction
Reason about 'lost in the middle' in modern long-context LLMs
Translate model-benchmark findings into a procurement recommendation
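The scoring criterion can be sketched as follows. The brief does not define "strict F1" precisely, so this sketch assumes token-level F1 over lowercased, whitespace-split tokens with no further normalization. The helper names, the bucket labels, and the choice to assign boundary pages to the earlier bucket are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted span and a gold span (assumed definition)."""
    pred = prediction.lower().split()
    ref = gold.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty scores zero.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def page_bucket(page: int) -> str:
    """Map the page of the gold answer to one of four position buckets."""
    if page <= 20:
        return "<=20"
    if page <= 40:
        return "<=40"
    if page <= 60:
        return "<=60"
    return ">60"


def f1_by_bucket(rows: Iterable[Tuple[int, str, str]]) -> Dict[str, float]:
    """Average F1 per bucket from (gold_page, prediction, gold_span) rows."""
    scores = defaultdict(list)
    for page, prediction, gold in rows:
        scores[page_bucket(page)].append(token_f1(prediction, gold))
    return {bucket: sum(vals) / len(vals) for bucket, vals in scores.items()}
```

Running `f1_by_bucket` once per model yields the per-position averages that a position-vs-accuracy plot would display. If the benchmark intends a stricter notion, such as exact span match, `token_f1` is the only piece that would need to change.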

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

Question Answering and Conversational Systems

Master · AI/ML

Fit score: 1

Skills

Skills you'll demonstrate.

Each one shows up on your verified credential.

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

Career paths this builds toward

Canonical roles

Machine Learning Engineer
AI Engineering

ML Researcher

Designing a long-context benchmark that exposes position effects is genuine applied-research work and a strong interview portfolio piece.

This challenge sharpens

benchmark-design
long-context-qa
experiment-design

Applied AI Scientist

Turning a model benchmark into a procurement recommendation for a CTO is exactly what applied AI scientists do at vertical AI startups.

This challenge sharpens

model-evaluation
long-context-qa
experiment-design

AI Safety Researcher

Surfacing capability-failure modes like 'lost in the middle' is the kind of evaluation work safety researchers do on long-context systems.

This challenge sharpens

lost-in-the-middle
model-evaluation
benchmark-design

One more thing

You can put a credential on your CV by Friday.
