Overview
What this challenge is about.
You receive 8,000 labeled preference pairs from real support conversations; each pair consists of two model responses and a human-chosen winner. Fine-tune a small open-weights base model (e.g., Llama-3.1-8B) into a reward model using a pairwise Bradley-Terry loss. Hold out 1,000 pairs for validation and train on the remaining 7,000. Report (a) pairwise accuracy on the held-out pairs, (b) score-distribution diagnostics showing that the model has not collapsed to near-constant scores, and (c) per-category accuracy across the 5 tagged categories (helpfulness, tone, refusals, hallucination, brevity). Success means overall held-out accuracy above 72 percent with no category below 65 percent.
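As a rough illustration of the pairwise Bradley-Terry loss named above, here is a minimal sketch in plain Python. It assumes the reward model has already produced one scalar score per response; the function name `bradley_terry_loss` and its list-based inputs are hypothetical and chosen only for this example. The per-pair loss is -log(sigmoid(r_chosen - r_rejected)), computed here in a numerically stable form.

```python
import math


def bradley_terry_loss(chosen_scores, rejected_scores):
    """Mean Bradley-Terry loss over a batch of preference pairs.

    chosen_scores[i] and rejected_scores[i] are the scalar rewards the
    model assigned to the human-preferred and the other response of pair i.
    (Hypothetical helper for illustration; not part of any library.)
    """
    if len(chosen_scores) != len(rejected_scores) or not chosen_scores:
        raise ValueError("need two non-empty score lists of equal length")

    total = 0.0
    for r_chosen, r_rejected in zip(chosen_scores, rejected_scores):
        margin = r_chosen - r_rejected
        # -log(sigmoid(m)) == softplus(-m) == max(-m, 0) + log1p(exp(-|m|)),
        # which avoids overflow for large positive or negative margins.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen_scores)


if __name__ == "__main__":
    # A pair the model ranks correctly costs less than one it ranks wrongly.
    print(bradley_terry_loss([2.0], [0.0]))
    print(bradley_terry_loss([0.0], [2.0]))
```

In an actual fine-tuning loop the same quantity would typically be computed on tensors so that gradients flow back into the model, for example with PyTorch's `torch.nn.functional.logsigmoid` applied to the score difference.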
The Brief
What you'll do, and what you'll demonstrate.
Train a reward model on customer-support preference pairs that meets accuracy targets across all 5 categories.
Earning criteria — what you'll demonstrate
- Implement Bradley-Terry pairwise preference loss for reward modeling
- Fine-tune a base LLM as a reward model and validate it correctly
- Diagnose reward-model pathologies (degenerate scores, category gaps)
- Communicate reward-model methodology to a post-training team
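The validation and diagnosis criteria above could be checked with something like the following sketch. It assumes held-out results are available as `(category, chosen_score, rejected_score)` tuples; the function name `evaluate_reward_model`, the report keys, and the `min_std` threshold for flagging near-constant scores are all assumptions made for this example, not values specified by the challenge. The 72 percent and 65 percent targets come from the overview.

```python
from collections import defaultdict
from statistics import pstdev


def evaluate_reward_model(pairs, overall_target=0.72, category_floor=0.65,
                          min_std=1e-3):
    """Summarize held-out results for a reward model.

    pairs: iterable of (category, chosen_score, rejected_score).
    min_std is an assumed cutoff below which scores count as degenerate.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    all_scores = []
    for category, r_chosen, r_rejected in pairs:
        # A pair is correct only if the chosen response scores strictly
        # higher, so a constant-score model gets zero accuracy, not 50%.
        counts[category][0] += int(r_chosen > r_rejected)
        counts[category][1] += 1
        all_scores.extend([r_chosen, r_rejected])

    if not all_scores:
        raise ValueError("no pairs to evaluate")

    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    overall = correct / total
    per_category = {cat: c / t for cat, (c, t) in counts.items()}
    score_std = pstdev(all_scores)
    degenerate = score_std < min_std

    passed = (
        overall > overall_target
        and all(acc >= category_floor for acc in per_category.values())
        and not degenerate
    )
    return {
        "overall_accuracy": overall,
        "per_category_accuracy": per_category,
        "score_std": score_std,
        "degenerate_scores": degenerate,
        "passed": passed,
    }


if __name__ == "__main__":
    demo = [("tone", 1.0, 0.0), ("tone", 0.2, 0.5), ("brevity", 2.0, -1.0)]
    print(evaluate_reward_model(demo))
```

Reporting the score standard deviation alongside accuracy is one simple way to surface collapse; a histogram of scores per category would give a fuller picture.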
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
ML Researcher
Reward-model training is a common entry point into RLHF research at foundation-model labs hiring in 2024-25.
This challenge sharpens
- reward-modeling
- preference-learning
- bradley-terry-loss
AI Safety Researcher
Per-category diagnostics and degenerate-score detection are core alignment-team skills.
This challenge sharpens
- reward-modeling
- evaluation
- preference-learning
Research Scientist
Multi-seed reporting and methodology documentation are the rigor signals research-scientist roles screen for.
This challenge sharpens
- model-finetuning
- evaluation
- reward-modeling