Research

Long-Context QA Evaluation Benchmark for Legal Memoranda

Free · Verified credential · 3 weeks · Expert

Overview

What this challenge is about.

You receive 25 anonymized legal memoranda (50-90 pages each) and 100 QA pairs whose answers are deliberately spread across the documents (25 in pages 1-20, 25 in pages 20-40, 25 in pages 40-60, 25 in pages 60+). Run 3 long-context LLMs on the QA set, varying the position of the answer within the prompt to expose 'lost in the middle' effects. Score with strict F1 against gold spans, broken down by answer position. Deliver the benchmark code, a results report including a position-vs-accuracy plot, and a 2-page recommendation on which model variant to ship.
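The scoring step described above can be sketched as follows. This is a minimal illustration, not the challenge's reference scorer: the brief does not define "strict F1", so the sketch assumes a SQuAD-style token-overlap F1 with only lowercasing and whitespace splitting. The function names (`token_f1`, `page_bucket`, `f1_by_bucket`) and the record layout are hypothetical, and because the stated page ranges share endpoints (20, 40, 60), the sketch assumes each shared page belongs to the earlier bucket.

```python
from collections import Counter, defaultdict


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted span and a gold span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def page_bucket(page: int) -> str:
    """Map the gold answer's page to a bucket (boundary handling is assumed)."""
    if page <= 20:
        return "1-20"
    if page <= 40:
        return "20-40"
    if page <= 60:
        return "40-60"
    return "60+"


def f1_by_bucket(records):
    """Mean F1 per bucket; each record is (prediction, gold, answer_page)."""
    scores = defaultdict(list)
    for prediction, gold, page in records:
        scores[page_bucket(page)].append(token_f1(prediction, gold))
    return {bucket: sum(vals) / len(vals) for bucket, vals in scores.items()}


# Illustrative records only; these are not benchmark data.
example = [
    ("the limitation period is six years", "six years", 12),
    ("six years", "six years", 45),
    ("no answer found", "six years", 70),
]
print(f1_by_bucket(example))
```

Running one such aggregation per model gives the per-bucket numbers that the position-vs-accuracy plot needs. A stricter variant, such as exact match on the normalized span, would slot into the same structure by swapping out `token_f1`.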

Credential: Blockchain-anchored
Shareable: LinkedIn-ready
Language: English
Pace: Self-paced

The Brief

What you'll do, and what you'll demonstrate.

Design a long-context QA benchmark that exposes position-dependent accuracy and recommend one model variant for production.

Earning criteria — what you'll demonstrate

  • Design a long-context evaluation that captures position-dependent effects
  • Apply strict F1 evaluation on long-document span extraction
  • Reason about 'lost in the middle' in modern long-context LLMs
  • Translate model-benchmark findings into a procurement recommendation
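One way to capture position-dependent effects is to place the passage containing the gold answer at controlled depths within the prompt and ask the same question each time. The sketch below assumes the memoranda have already been split into passages; the brief does not say how, and `build_prompt`, `position_sweep`, the depth values, and the instruction wording are all hypothetical choices for illustration.

```python
def build_prompt(question, gold_passage, distractors, depth):
    """Insert gold_passage among distractors at a relative depth in [0, 1].

    depth=0.0 puts the gold passage first, depth=1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(distractors))
    passages = distractors[:index] + [gold_passage] + distractors[index:]
    context = "\n\n".join(passages)
    return (
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer with the exact span from the text."
    )


def position_sweep(question, gold_passage, distractors,
                   depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Build one prompt per depth so only the answer position varies."""
    return {
        depth: build_prompt(question, gold_passage, distractors, depth)
        for depth in depths
    }


# Placeholder passages; real inputs would be chunks of the memoranda.
prompts = position_sweep(
    question="What is the limitation period?",
    gold_passage="[passage containing the gold span]",
    distractors=["[passage A]", "[passage B]", "[passage C]", "[passage D]"],
)
for depth, prompt in prompts.items():
    print(depth, prompt.splitlines()[0])
```

Keeping the question, the distractors, and their order fixed while moving only the gold passage means any accuracy change across depths can be attributed to position rather than to content.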

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

Skills

Skills you'll demonstrate.

Each one shows up on your verified credential.

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

ML Researcher

Designing a long-context benchmark that exposes position effects is genuine applied-research work and a strong interview portfolio piece.

This challenge sharpens

  • benchmark-design
  • long-context-qa
  • experiment-design

Applied AI Scientist

Turning a model benchmark into a procurement recommendation for a CTO is exactly what applied AI scientists do at vertical AI startups.

This challenge sharpens

  • model-evaluation
  • long-context-qa
  • experiment-design

AI Safety Researcher

Surfacing capability-failure modes like 'lost in the middle' is the kind of evaluation work safety researchers do on long-context systems.

This challenge sharpens

  • lost-in-the-middle
  • model-evaluation
  • benchmark-design

One more thing

You can put a credential on your CV by Friday.