Lab Project: Compare Three Architectures on Your Own Mini-Benchmark
Overview
What this challenge is about.
Scope the problem yourself (suggested examples: sentiment classification on a niche domain, tabular anomaly detection, or time-series forecasting on a public dataset). Define the train/validation/test split and a separate held-out distribution-shift evaluation set. Implement three architectures from different families (e.g., an MLP, a transformer, and gradient-boosted trees), giving each the same hyperparameter-tuning budget and running each with 5 random seeds. For every metric, report the mean and 95% confidence interval across seeds, plus a paired statistical test between the top two architectures. Write up the results in a 4-page lab report in NeurIPS-style format.
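The reporting step described above (mean, 95% confidence interval, paired test) can be sketched with the Python standard library alone. This is a minimal sketch, not a prescribed method. The function names `mean_ci95` and `paired_sign_flip_p` and the example scores are hypothetical. The interval assumes per-seed scores are roughly normal (Student's t), and the exact sign-flip permutation test is just one possible paired test, assuming both architectures were run on the same seeds.

```python
import itertools
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
# Hardcoded so the sketch needs no third-party packages.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_ci95(scores):
    """Return (mean, half_width) of a t-based 95% CI over per-seed scores."""
    n = len(scores)
    if n - 1 not in T_CRIT_95:
        raise ValueError("this sketch only covers 2 to 10 seeds")
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)  # standard error of the mean
    return mean, T_CRIT_95[n - 1] * sem


def paired_sign_flip_p(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Assumes scores_a[i] and scores_b[i] come from the same seed.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate every way of flipping the sign of each paired difference.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return extreme / total


if __name__ == "__main__":
    # Hypothetical per-seed accuracies; index i is the same seed for both models.
    model_a = [0.81, 0.83, 0.80, 0.84, 0.82]
    model_b = [0.78, 0.80, 0.79, 0.80, 0.79]
    mean, half = mean_ci95(model_a)
    print(f"model A: {mean:.3f} +/- {half:.3f}")
    print(f"paired p-value: {paired_sign_flip_p(model_a, model_b):.4f}")
```

With only 5 seeds there are 2^5 = 32 sign patterns, so this test cannot return a two-sided p-value below 2/32 = 0.0625. It is worth stating that limit in the report and showing the per-seed differences alongside the p-value.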
The Brief
What you'll do, and what you'll demonstrate.
Design and run a fair three-architecture mini-benchmark with honest statistical reporting and a written lab report.
Earning criteria — what you'll demonstrate
- Design a fair benchmark across architecture families
- Apply statistical testing to ML results (no single-seed claims)
- Distinguish in-distribution from distribution-shift performance
- Write a publication-style lab report
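For the in-distribution versus distribution-shift criterion above, one way to keep the two evaluations separate is to hold out an entire group (a domain, site, or time period) before making the usual random split. The sketch below assumes each record is a dict carrying a hypothetical `domain` field. The function name `split_with_shift`, the field name, and the 70/15/15 fractions are illustrative choices, not requirements of the challenge.

```python
import random


def split_with_shift(records, shift_key, shift_values, seed=0,
                     fractions=(0.7, 0.15)):
    """Split records into train/val/test plus a held-out shift set.

    Records whose `shift_key` value is in `shift_values` never appear in
    train, val, or test; they form the distribution-shift evaluation set.
    `fractions` gives the train and val shares; test takes the remainder.
    """
    shifted = [r for r in records if r[shift_key] in shift_values]
    in_dist = [r for r in records if r[shift_key] not in shift_values]

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(in_dist)

    n_train = int(fractions[0] * len(in_dist))
    n_val = int(fractions[1] * len(in_dist))
    return {
        "train": in_dist[:n_train],
        "val": in_dist[n_train:n_train + n_val],
        "test": in_dist[n_train + n_val:],
        "shift": shifted,
    }


if __name__ == "__main__":
    # Hypothetical toy data: 100 records spread over four domains.
    domains = ["news", "reviews", "forums", "legal"]
    data = [{"id": i, "domain": domains[i % 4]} for i in range(100)]
    splits = split_with_shift(data, "domain", {"legal"})
    for name, part in splits.items():
        print(name, len(part))
```

Reporting each metric on both `test` and `shift` then makes the gap between in-distribution and shifted performance explicit for every architecture.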
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
ML Researcher
Designing fair benchmarks and reporting results with confidence intervals are daily hygiene for a junior ML researcher, especially at labs that take reproducibility seriously.
This challenge sharpens
- experiment-design
- statistical-testing
- benchmarking
Research Scientist
Multi-seed runs, paired statistical tests, and workshop-style writing mirror the rigor expected of a research scientist's first ablation study.
This challenge sharpens
- statistical-testing
- scientific-writing
- experiment-design
Applied AI Scientist
The discipline of distribution-shift evaluation translates directly to applied AI work where deployment data never matches training data.
This challenge sharpens
- benchmarking
- pytorch
- deep-learning