Audit a Public LLM Benchmark for Validity Threats

Free · Verified credential · 3 weeks · Advanced

Overview

What this challenge is about.

Audit an LLM benchmark for data contamination and label quality, re-label 100 items with a peer, and report findings. Earn a verifiable certificate.
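As an illustration of what a first-pass contamination check might look like, the sketch below measures how many of a benchmark item's word n-grams also appear in a reference corpus. The function names (`ngrams`, `overlap_fraction`), the whitespace tokenization, and the default window of 8 tokens are assumptions made for this example; they are not part of the challenge specification, and a real audit would likely need stronger normalization and a much larger corpus.

```python
def ngrams(text, n=8):
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(item, corpus_docs, n=8):
    """Fraction of the item's n-grams that also occur in any corpus document.

    Hypothetical helper: a high value suggests the item may have leaked into
    the corpus, but it is not proof of contamination.
    """
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0  # item too short to form a single n-gram
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    item = "the quick brown fox jumps"
    corpus = ["a quick brown fox sat"]
    print(overlap_fraction(item, corpus, n=3))
```

Exact n-gram matching only catches verbatim leakage; paraphrased or translated copies of a benchmark item would pass this check unnoticed, which is worth stating in the report.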

Credential: Blockchain-anchored

Shareable: LinkedIn-ready

Language: English

Pace: Self-paced

The Brief

Audit one prominent open LLM benchmark for validity threats and publish a structured, citable report with recommendations.

Program Fit

Sharpens the same skills your degree expects you to demonstrate.

Master · AI/ML

Fit score: 1

Skills

Each one shows up on your verified credential.

Careers

Real titles. Real skill bridges. Pick the one closest to your trajectory.

Canonical roles

Independent benchmark audits with proper kappa statistics are a recognizable AI safety research contribution and a direct hiring signal.

Designing a re-labeling exercise with inter-annotator agreement statistics is the kind of deliverable a research scientist is asked for in the first weeks at an eval-focused lab.
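For a two-annotator re-labeling exercise, one common agreement statistic is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The following is a minimal sketch assuming two equal-length lists of categorical labels; the function name `cohens_kappa` and the choice to return NaN when chance agreement is 1 are assumptions for this example, not requirements of the challenge.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected if both annotators labeled independently
    according to their own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Counter returns 0 for labels the second annotator never used.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, so this sketch returns NaN (an assumption).
        return math.nan
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    annotator_1 = ["y", "y", "n", "n"]
    annotator_2 = ["y", "n", "n", "n"]
    print(cohens_kappa(annotator_1, annotator_2))
```

With 100 re-labeled items the same function applies unchanged; reporting the raw agreement alongside kappa helps readers judge cases where label frequencies are heavily skewed.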

Understanding benchmark validity threats is foundational for any ML researcher choosing what to optimize against.

One more thing