Overview
What this challenge is about.
You receive 2,000 labeled code snippets (human-rater consensus scores from 1 to 5) and a budget of at most 8,000 API calls for the whole optimization run. Run a factorial sweep of 3 prompt strategies x 3 output schemas x 2 model tiers = 18 configurations on a 500-snippet subset, then evaluate the top 3 configurations on the full 2,000-snippet set. For each configuration, compute the Spearman correlation with the human-rater scores and the per-call cost. Identify the Pareto frontier (cost vs. correlation) and recommend one configuration from it. Success means a recommendation that cuts cost by at least 40 percent while correlation drops by less than 2 points.
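Before spending any calls, it can help to enumerate the grid and check the plan against the call budget. The sketch below does that arithmetic. The strategy names come from the earning criteria; the schema and tier labels are placeholders, and the accounting assumes one API call per snippet per configuration with no caching, batching, or retries, which the brief does not specify.

```python
from itertools import product

# Strategy names come from the brief; schema and tier labels are placeholders.
STRATEGIES = ["zero-shot", "few-shot", "cot"]
SCHEMAS = ["schema_a", "schema_b", "schema_c"]  # hypothetical labels
TIERS = ["small", "large"]                      # hypothetical labels

CALL_BUDGET = 8_000
SUBSET_SIZE = 500
FULL_SIZE = 2_000
TOP_K = 3

# Full factorial grid: every strategy x schema x tier combination.
configs = list(product(STRATEGIES, SCHEMAS, TIERS))

# Assumption: one API call per snippet per configuration.
sweep_calls = len(configs) * SUBSET_SIZE
final_calls = TOP_K * FULL_SIZE
total_calls = sweep_calls + final_calls

# Largest per-configuration subset if the sweep alone had to fit the budget.
max_subset_for_sweep = CALL_BUDGET // len(configs)

print(len(configs), sweep_calls, final_calls, total_calls, max_subset_for_sweep)
# -> 18 9000 6000 15000 444
```

Under the one-call-per-snippet assumption the plan needs more calls than the stated budget, so it is worth settling early how calls are counted (for example, several snippets per call, or reusing subset results in the final evaluation) and sizing the subset accordingly.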
The Brief
What you'll do, and what you'll demonstrate.
Find a Pareto-optimal prompt + model configuration that cuts spend by at least 40 percent on a 2M-call/week scoring pipeline while losing less than 2 points of correlation with human raters.
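A minimal sketch of the selection step, under stated assumptions: a configuration is on the Pareto frontier if no other configuration is at least as cheap and at least as well correlated while being strictly better on one of the two. The `Result` type, the `meets_target` helper, and all numbers below are illustrative, not results from the challenge, and "2 points" is read here as 0.02 in Spearman correlation, which is an interpretation rather than something the brief states.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    cost_per_call: float  # e.g. USD per call
    spearman: float       # correlation with human-rater scores


def pareto_frontier(results):
    """Keep configurations that no other configuration dominates."""
    frontier = []
    for r in results:
        dominated = any(
            o.cost_per_call <= r.cost_per_call
            and o.spearman >= r.spearman
            and (o.cost_per_call < r.cost_per_call or o.spearman > r.spearman)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost_per_call)


def meets_target(candidate, baseline, min_saving=0.40, max_drop=0.02):
    """Check the success criterion against a baseline configuration."""
    saving = 1 - candidate.cost_per_call / baseline.cost_per_call
    drop = baseline.spearman - candidate.spearman
    return saving >= min_saving and drop < max_drop


# Illustrative numbers only.
baseline = Result("baseline", 0.010, 0.80)
results = [
    baseline,
    Result("a", 0.005, 0.79),
    Result("b", 0.004, 0.75),
    Result("c", 0.007, 0.78),  # dominated by "a": pricier and less correlated
]

frontier = pareto_frontier(results)
print([r.name for r in frontier])
print([r.name for r in frontier if meets_target(r, baseline)])
```

With these made-up values, "c" falls off the frontier, and of the frontier points only "a" clears both thresholds. The frontier narrows the choice; the recommendation still has to be argued from the thresholds.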
Earning criteria — what you'll demonstrate
- Design a factorial prompt + model experiment under a fixed call budget
- Quantify the cost-quality trade-off rigorously (correlation, CIs, cost-per-call)
- Choose between prompt strategies (zero-shot, few-shot, CoT) on evidence
- Communicate optimization findings to an infrastructure-review audience
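One way to attach confidence intervals to the correlation figure is a percentile bootstrap over snippets. The sketch below uses only the standard library: Spearman correlation is computed as Pearson correlation on average ranks, and the interval comes from resampling snippet indices with replacement. The function names, the toy scores, and the defaults (1,000 resamples, 95 percent interval) are assumptions for illustration; a statistics library would normally replace the hand-written parts.

```python
import random
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(human, model, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Spearman correlation."""
    rng = random.Random(seed)
    n = len(human)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        m = [model[i] for i in idx]
        if len(set(h)) > 1 and len(set(m)) > 1:  # skip constant resamples
            stats.append(spearman(h, m))
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi


# Toy scores on the 1-5 scale, for illustration only.
human_scores = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
model_scores = [1, 2, 3, 5, 4, 2, 1, 3, 4, 5]
rho = spearman(human_scores, model_scores)
ci_low, ci_high = bootstrap_ci(human_scores, model_scores)
print(rho, ci_low, ci_high)
```

Reporting the interval alongside the point estimate makes it clearer whether a small drop in correlation between two configurations is distinguishable from resampling noise.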
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Prompt Engineer
Running structured prompt + model sweeps under a real production budget is exactly what senior prompt engineers do at AI-heavy companies.
This challenge sharpens
- prompt-optimization
- cost-quality-tradeoff
- experiment-design
MLOps Engineer
Owning the optimization + monitoring loop on a 2M-call/week pipeline is MLOps-engineer territory at any company spending serious money on LLM APIs.
This challenge sharpens
- cost-quality-tradeoff
- experiment-design
- evaluation
Applied AI Scientist
Factorial experiment design and rigorous cost-quality reporting are what applied AI scientists bring to internal-tooling decisions.
This challenge sharpens
- experiment-design
- evaluation
- ab-testing