Overview
What this challenge is about.
Choose one open LLM benchmark (e.g., MMLU, GPQA, BIG-Bench-Hard, MATH). Read the benchmark paper plus at least three follow-up critiques. Audit (1) data contamination risk against common pretraining corpora; (2) label quality on a sample of about 100 items, re-labeled independently by you and one peer; (3) coverage gaps relative to the benchmark's claimed construct. Quantify inter-annotator agreement (Cohen's kappa) on your re-labeling. Produce a 6-page report with concrete recommendations and a 1-page summary suitable for posting as a GitHub issue to the benchmark maintainers.
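The kappa step can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function name `cohens_kappa` and the toy label lists are hypothetical, and it assumes two annotators who each assigned exactly one categorical label to the same items in the same order.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected by chance from each
    annotator's own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")

    n = len(labels_a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of P(A picks label) * P(B picks label).
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # formally undefined here. Returning 1.0 is an assumption of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): "A" = original label judged correct, "B" = judged wrong.
you = ["A", "B", "A", "A", "B", "B", "A", "B", "A", "A"]
peer = ["A", "B", "A", "B", "B", "B", "A", "A", "A", "A"]
kappa = cohens_kappa(you, peer)
print(f"kappa = {kappa:.3f}")
```

On the toy data the two annotators agree on 8 of 10 items, but because chance agreement is already 0.52, kappa is noticeably lower than the raw agreement rate. Reporting both numbers in the audit makes that gap visible.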
The Brief
What you'll do, and what you'll demonstrate.
Audit one prominent open LLM benchmark for validity threats and publish a structured, citable report with recommendations.
Earning criteria — what you'll demonstrate
- Identify the main validity threats to LLM benchmarks
- Run a small structured re-labeling exercise with kappa statistics
- Detect plausible data contamination paths in a public benchmark
- Write a constructive audit report engineers will actually act on
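For the contamination criterion above, one common first-pass signal is word n-gram overlap between a benchmark item and a document drawn from a pretraining corpus. The sketch below is an assumption-laden illustration: the helper names `word_ngrams` and `overlap_fraction`, the default window of 8 tokens, and the sample strings are all hypothetical, and a real audit would need to choose its own window size, tokenization, and corpus sampling strategy.

```python
import re


def word_ngrams(text, n=8):
    """Return the set of lowercase word n-grams in `text`.

    Tokenization here is a simple regex split on word characters; this is
    an assumption of the sketch, not a standard.
    """
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(item_text, corpus_text, n=8):
    """Fraction of the item's n-grams that also occur in the corpus text.

    Returns 0.0 when the item is shorter than n tokens, since no n-gram
    can be formed from it.
    """
    item_grams = word_ngrams(item_text, n)
    if not item_grams:
        return 0.0
    shared = item_grams & word_ngrams(corpus_text, n)
    return len(shared) / len(item_grams)


# Hypothetical example strings, not taken from any real benchmark or corpus.
item = "Which of the following best describes the role of mitochondria in a cell"
scraped_page = (
    "Quiz answers: which of the following best describes the role of "
    "mitochondria in a cell? The correct option is C."
)
unrelated_page = "Release notes for version 2 of the billing service API."

print(overlap_fraction(item, scraped_page))
print(overlap_fraction(item, unrelated_page))
```

A high overlap fraction only flags an item for manual inspection. It does not show that any particular model was trained on the matching document, and paraphrased or translated copies of an item would go undetected by exact n-gram matching.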
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Career paths this builds toward
Canonical roles
AI Safety Researcher
Independent benchmark audits with proper kappa statistics are a recognizable AI safety research contribution and a direct hiring signal.
This challenge sharpens
- benchmark-evaluation
- data-contamination-analysis
- llm-evaluation
Research Scientist
Designing a re-labeling exercise with inter-annotator statistics is the research scientist's first-week deliverable inside an eval-focused lab.
This challenge sharpens
- annotation-methodology
- inter-annotator-agreement
- research-writing
ML Researcher
Understanding benchmark validity threats is foundational for any ML researcher choosing what to optimize against.
This challenge sharpens
- benchmark-evaluation
- llm-evaluation
- research-writing