Research

Audit a Public LLM Benchmark for Validity Threats

Free · Verified credential · 3 weeks · Advanced

Overview

What this challenge is about.

Choose one open LLM benchmark (e.g., MMLU, GPQA, BIG-Bench-Hard, MATH). Read the benchmark paper plus at least three follow-up critiques. Then audit three things: (1) data contamination risk against common pretraining corpora; (2) label quality on a sampled subset of around 100 items, re-labeled independently by you and one peer; (3) coverage gaps relative to the benchmark's claimed construct. Quantify inter-annotator agreement on your re-labeling with Cohen's kappa. Produce a 6-page report with concrete recommendations and a 1-page summary suitable for posting as a GitHub issue to the benchmark maintainers.
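For orientation, here is a minimal sketch of the agreement statistic, assuming each annotator's labels sit in a plain Python list in the same item order. The function name `cohens_kappa` and the toy labels are illustrative, not part of the challenge materials. It computes kappa as (observed agreement − chance agreement) / (1 − chance agreement).

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items (hypothetical helper)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal rates of using it, summed over all labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined (0/0), and this sketch reports it as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy example with four items (made-up labels, not real benchmark data).
you = ["A", "A", "B", "B"]
peer = ["A", "B", "B", "B"]
print(cohens_kappa(you, peer))  # 0.5
```

With around 100 items the estimate is noisy, so it is worth reporting the raw agreement rate and the per-label counts alongside kappa.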

Credential: Blockchain-anchored
Shareable: LinkedIn-ready
Language: English
Pace: Self-paced

The Brief

What you'll do, and what you'll demonstrate.

Audit one prominent open LLM benchmark for validity threats and publish a structured, citable report with recommendations.

Earning criteria — what you'll demonstrate

  • Identify the main validity threats to LLM benchmarks
  • Run a small structured re-labeling exercise with kappa statistics
  • Detect plausible data contamination paths in a public benchmark
  • Write a constructive audit report engineers will actually act on
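One simple way to start looking for contamination paths is verbatim n-gram overlap between a benchmark item and a candidate source text. The sketch below is only a starting point under stated assumptions: the helper names `word_ngrams` and `overlap_fraction`, the whitespace tokenization, and the default window of 8 words are all illustrative choices, not a prescribed method. A real audit would search indexed pretraining corpora and would not catch paraphrased leaks this way.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in text (whitespace tokenization, an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(item_text, corpus_text, n=8):
    """Share of the item's n-grams that also appear verbatim in corpus_text."""
    item_grams = word_ngrams(item_text, n)
    if not item_grams:
        # Item is shorter than n words, so there is nothing to compare.
        return 0.0
    corpus_grams = word_ngrams(corpus_text, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy strings for illustration only; n=3 keeps the example short.
item = "the quick brown fox jumps"
page = "a the quick brown fox jumps over"
print(overlap_fraction(item, page, n=3))  # 1.0
```

A high overlap fraction flags an item for manual inspection; it does not by itself show that any model was trained on the source.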

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

Career paths this builds toward

Canonical roles

AI Safety Researcher

Independent benchmark audits with proper kappa statistics are a recognizable AI safety research contribution and a direct hiring signal.

This challenge sharpens

  • benchmark-evaluation
  • data-contamination-analysis
  • llm-evaluation

Research Scientist

Designing a re-labeling exercise with inter-annotator statistics is the research scientist's first-week deliverable inside an eval-focused lab.

This challenge sharpens

  • annotation-methodology
  • inter-annotator-agreement
  • research-writing

ML Researcher

Understanding benchmark validity threats is foundational for any ML researcher choosing what to optimize against.

This challenge sharpens

  • benchmark-evaluation
  • llm-evaluation
  • research-writing

One more thing

You can put a credential on your CV by Friday.