Run a Human-Preference Study Comparing Two Coding Assistants
Overview
What this challenge is about.
Design a blinded paired-comparison study. Twelve developer participants each receive the same 8 realistic coding tasks (refactor, write a function, debug, test). Both assistants solve every task, and participants choose the output they prefer, giving up to 96 paired judgments. Randomize the order in which the two assistants' outputs are presented, and pre-register a primary outcome (proportion preferring Assistant A) together with a sample-size justification. Analyze the choices with a paired binomial test (a sign test on the paired preferences) and a small Bayesian alternative. Report the effect size with confidence intervals. Produce a 5-page report plus a 30-minute founder briefing.
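The analysis described above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed method. It assumes the primary outcome is the share of individual judgments favoring Assistant A, treats the judgments as independent (a simplification, since choices are clustered within participants), uses a Wilson interval for the effect size, and uses a uniform Beta(1, 1) prior for the Bayesian alternative. The function names and the example counts are hypothetical.

```python
from math import comb, sqrt

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def exact_binomial_p_value(k, n, p0=0.5):
    """Two-sided exact test: sum the probabilities of every outcome
    that is no more likely under p0 than the observed count k."""
    pmfs = [binom_pmf(i, n, p0) for i in range(n + 1)]
    threshold = pmfs[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(pm for pm in pmfs if pm <= threshold))

def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion k / n."""
    p_hat = k / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def posterior_prob_a_preferred(k, n):
    """P(true preference rate > 0.5 | data) under a uniform Beta(1, 1) prior.
    The posterior is Beta(k + 1, n - k + 1); for integer parameters its CDF
    equals a binomial tail, so the answer is P(Bin(n + 1, 0.5) <= k)."""
    return sum(binom_pmf(i, n + 1, 0.5) for i in range(k + 1))

# Hypothetical counts for illustration only: 60 of 96 judgments favor A.
k_prefers_a, n_judgments = 60, 96
low, high = wilson_interval(k_prefers_a, n_judgments)
print(f"proportion preferring A: {k_prefers_a / n_judgments:.3f}")
print(f"exact two-sided p-value: {exact_binomial_p_value(k_prefers_a, n_judgments):.4f}")
print(f"Wilson 95% interval: ({low:.3f}, {high:.3f})")
print(f"P(rate > 0.5 | data): {posterior_prob_a_preferred(k_prefers_a, n_judgments):.4f}")
```

Because each participant contributes 8 judgments, a pre-registration might also name a participant-level sensitivity analysis, for example a sign test on the 12 per-participant majorities.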
The Brief
What you'll do, and what you'll demonstrate.
Run a pre-registered human-preference study comparing two coding assistants and produce a vendor-decision recommendation.
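One way to prepare the blinded presentation order before any data is collected is to generate it from a fixed seed and file the schedule with the pre-registration. The sketch below assumes 12 participants and 8 tasks as in the overview, balances which assistant appears first within each participant, and shuffles task order. The function name, field names, and seed are hypothetical.

```python
import random

def build_schedule(n_participants=12, n_tasks=8, seed=2024):
    """Return one row per (participant, task) with a randomized,
    within-participant balanced presentation order."""
    rng = random.Random(seed)  # fixed seed so the schedule is reproducible
    schedule = []
    for participant in range(1, n_participants + 1):
        # Half of the tasks show Assistant A first, the rest show B first.
        first_shown = ["A"] * (n_tasks // 2) + ["B"] * (n_tasks - n_tasks // 2)
        rng.shuffle(first_shown)
        task_order = list(range(1, n_tasks + 1))
        rng.shuffle(task_order)
        for position, (task, first) in enumerate(zip(task_order, first_shown), start=1):
            schedule.append({
                "participant": participant,
                "position": position,
                "task": task,
                # Participants should only see neutral labels such as
                # "Output 1" and "Output 2"; this mapping stays with the analyst.
                "output_1": first,
                "output_2": "B" if first == "A" else "A",
            })
    return schedule

schedule = build_schedule()
print(len(schedule), "rows; first row:", schedule[0])
```

Keeping the assistant-to-label mapping out of the participant-facing materials is what preserves the blinding; the seed makes the schedule auditable afterwards.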
Earning criteria — what you'll demonstrate
- Design a pre-registered human-preference study
- Justify sample size before collecting data
- Analyze paired-comparison data with frequentist and Bayesian methods
- Present a vendor decision under statistical uncertainty
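For the sample-size criterion, one possible justification is to compute the power of a two-sided exact binomial test before collecting data. The sketch below is illustrative only. It assumes a null preference rate of 0.5, a significance level of 0.05, and a hypothetical true preference rate of 0.65, and it treats every judgment as independent, which overstates the effective sample size when choices are clustered within participants.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p_value(k, n):
    """Exact two-sided p-value against a null preference rate of 0.5."""
    pmfs = [binom_pmf(i, n, 0.5) for i in range(n + 1)]
    threshold = pmfs[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(pm for pm in pmfs if pm <= threshold))

def power(n, p_true, alpha=0.05):
    """Probability that the exact test rejects the null when the
    true preference rate is p_true."""
    rejection_region = [k for k in range(n + 1) if two_sided_p_value(k, n) <= alpha]
    return sum(binom_pmf(k, n, p_true) for k in rejection_region)

# Hypothetical planning values: compare a few judgment counts
# at an assumed true preference rate of 0.65.
for n_judgments in (48, 96, 144):
    print(n_judgments, round(power(n_judgments, 0.65), 3))
```

A more conservative plan would repeat the calculation with n set to the number of participants, or inflate the judgment count by a design effect that reflects within-participant correlation.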
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Applied AI Scientist
Designing a pre-registered evaluation for a real vendor decision is the applied AI scientist's contribution to product orgs.
This challenge sharpens
- experiment-design
- statistical-evaluation
- llm-evaluation
Data Scientist
Paired-comparison analysis with honest uncertainty is bread-and-butter data-scientist craft.
This challenge sharpens
- statistical-evaluation
- experiment-design
- pre-registration
AI Product Manager
Turning an evaluation into a defensible vendor decision is exactly the AI PM's contribution to the procurement conversation.
This challenge sharpens
- stakeholder-communication
- human-evaluation
- llm-evaluation