Overview
What this challenge is about.
You receive an instruction-tuned, open-weights base model checkpoint (roughly 7-13B parameters) and 2,500 preference pairs drawn from editorial reviews. Each pair contains two grant-application paragraphs, with the editor-preferred one labeled as the winner. Hold out 300 pairs for evaluation and run DPO on the rest.
Your model is scored on three measures:
- DPO accuracy on the 300 held-out pairs
- Head-to-head win rate against the base model, judged by 3 in-house editors on 50 fresh prompts
- A sanity check of general capability on MMLU-style questions
Success means held-out DPO accuracy above 68 percent, a win rate above 60 percent, and a capability drop of no more than 2 points.
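The hold-out split and the three success thresholds above can be sketched in a few lines of plain Python. This is a minimal illustration and not the challenge's official scoring code. It makes several assumptions: all function names are hypothetical, "DPO accuracy" is read as the share of held-out pairs where the DPO implicit reward ranks the editor-preferred paragraph higher, log-probabilities are summed sequence log-probs from the tuned policy and the frozen reference model, `beta=0.1` is a placeholder value, and capability scores are in percentage points.

```python
import random


def split_pairs(pairs, n_holdout=300, seed=0):
    """Shuffle once with a fixed seed, then return (train, holdout)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def implicit_reward_margin(policy_chosen, policy_rejected,
                           ref_chosen, ref_rejected, beta=0.1):
    """DPO implicit reward of the winner minus that of the loser.

    Each argument is a sequence log-probability (assumed precomputed).
    beta is an assumed placeholder; it scales the margin but not its sign.
    """
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    return chosen_reward - rejected_reward


def heldout_accuracy(rows, beta=0.1):
    """Fraction of held-out pairs where the winner gets the higher reward."""
    correct = sum(1 for row in rows
                  if implicit_reward_margin(*row, beta=beta) > 0)
    return correct / len(rows)


def meets_success_criteria(dpo_accuracy, win_rate,
                           base_capability, tuned_capability):
    """Check the three thresholds stated in the overview."""
    return {
        "dpo_accuracy": dpo_accuracy > 0.68,
        "win_rate": win_rate > 0.60,
        "capability": (base_capability - tuned_capability) <= 2.0,
    }


if __name__ == "__main__":
    train, holdout = split_pairs(range(2500))
    print(len(train), len(holdout))
    print(meets_success_criteria(0.70, 0.62, 55.0, 53.5))
```

Fixing the seed keeps the 300 held-out pairs identical across runs, so accuracy numbers from different training configurations stay comparable.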
The Brief
What you'll do, and what you'll demonstrate.
Run DPO on a fundraising-writing model so that it beats the base model in an editor-blind preference study without degrading general capability.
Earning criteria — what you'll demonstrate
- Implement DPO training with the TRL library
- Design and run an editor-blind head-to-head win-rate study
- Detect capability regressions during preference fine-tuning
- Communicate post-training results to a non-ML founder
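One way the editor-blind head-to-head study from the criteria above might be organized is sketched below. It is an illustration under stated assumptions, not a prescribed protocol: the function names are hypothetical, the tuned and base outputs are randomly assigned to sides "A" and "B" so editors cannot tell which model wrote which, and the three editors' votes per prompt are combined by simple majority (the brief does not specify how votes are aggregated).

```python
import random


def blind_pair(prompt_id, base_text, tuned_text, rng):
    """Randomly assign the two outputs to sides A and B.

    Returns the record shown to editors and the hidden side of the
    tuned model, which is kept out of the editors' view.
    """
    tuned_side = rng.choice(["A", "B"])
    if tuned_side == "A":
        shown = {"prompt_id": prompt_id, "A": tuned_text, "B": base_text}
    else:
        shown = {"prompt_id": prompt_id, "A": base_text, "B": tuned_text}
    return shown, tuned_side


def prompt_winner(votes, tuned_side):
    """Majority vote over editors; each vote is 'A' or 'B'."""
    tuned_votes = sum(1 for vote in votes if vote == tuned_side)
    return "tuned" if tuned_votes * 2 > len(votes) else "base"


def win_rate(all_votes, key):
    """Share of prompts where the tuned model wins the majority vote.

    all_votes maps prompt_id -> list of editor votes.
    key maps prompt_id -> hidden side of the tuned model.
    """
    wins = sum(1 for prompt_id, votes in all_votes.items()
               if prompt_winner(votes, key[prompt_id]) == "tuned")
    return wins / len(all_votes)


if __name__ == "__main__":
    rng = random.Random(0)
    shown, side = blind_pair("p1", "base draft", "tuned draft", rng)
    key = {"p1": side}
    votes = {"p1": [side, side, "A" if side == "B" else "B"]}
    print(win_rate(votes, key))
```

With 3 editors a majority always exists, so no tie-breaking rule is needed. With only 50 prompts the win rate is a noisy estimate, so it is worth reporting the raw counts alongside the percentage.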
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
ML Researcher
Running DPO on a real product with a proper win-rate study is day-one work for post-training researchers at AI startups that ship fine-tuned models.
This challenge sharpens
- dpo
- preference-learning
- win-rate-eval
AI Engineer
Wiring DPO training, capability checks, and a win-rate harness into a reusable pipeline is core AI-engineer work at fine-tuning shops.
This challenge sharpens
- model-finetuning
- evaluation
- pytorch
Applied AI Scientist
Turning preference data into a shipped product release, with measured wins and capability checks, is exactly the applied-AI-scientist craft.
This challenge sharpens
- dpo
- win-rate-eval
- evaluation