Benchmark Reward-from-Feedback Methods on a Tabletop Pick-and-Place Task
Overview
What this challenge is about.
You will benchmark three reward-from-feedback methods on a simulated Franka Panda arm in PyBullet, using a 4-object pick-and-place task. For each method, train a reward model and a downstream policy until convergence or until a 6-hour budget runs out. Generate the bulk of the feedback with a scripted oracle, which is cheap, then run a 4-person pilot to estimate real human-operator time per feedback unit. Report sample efficiency (return vs. number of queries), policy quality (success rate), and operator burden (seconds per feedback unit). Deliver a 6-page benchmark note plus code and trained checkpoints.
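As a rough illustration of the scripted-oracle idea, the sketch below labels pairs of trajectory segments by comparing their ground-truth returns and scores a reward model's predictions with a Bradley-Terry loss. It assumes that one of the three methods is pairwise preference comparison, which the brief does not state explicitly; the function names, the `noise_temp` parameter, and the choice of Bradley-Terry are illustrative assumptions, not part of the challenge specification.

```python
import math
import random
from typing import Optional, Sequence


def segment_return(rewards: Sequence[float]) -> float:
    """Ground-truth return of a segment: the sum of its per-step rewards."""
    return float(sum(rewards))


def oracle_preference(
    rewards_a: Sequence[float],
    rewards_b: Sequence[float],
    noise_temp: float = 0.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Hypothetical scripted oracle: returns 0 if segment A is preferred, 1 if B.

    With noise_temp == 0 the oracle is deterministic (ties go to A).
    With noise_temp > 0 it samples from a logistic model of the return gap,
    a crude stand-in for human inconsistency (an assumption, not measured).
    """
    gap = segment_return(rewards_a) - segment_return(rewards_b)
    if noise_temp <= 0.0:
        return 0 if gap >= 0.0 else 1
    rng = rng or random.Random()
    z = max(-50.0, min(50.0, gap / noise_temp))  # clamp to avoid overflow in exp
    p_a = 1.0 / (1.0 + math.exp(-z))
    return 0 if rng.random() < p_a else 1


def bradley_terry_loss(pred_a: float, pred_b: float, label: int) -> float:
    """Negative log-likelihood of the label under a Bradley-Terry model.

    pred_a and pred_b are the reward model's predicted returns for the two
    segments; label is 0 if A was preferred, 1 if B was preferred.
    """
    z = pred_a - pred_b
    # Numerically stable log(sigmoid(z)).
    if z >= 0.0:
        log_p_a = -math.log1p(math.exp(-z))
    else:
        log_p_a = z - math.log1p(math.exp(z))
    log_p_b = log_p_a - z  # log(1 - sigmoid(z))
    return -(log_p_a if label == 0 else log_p_b)
```

In a real run, each call to `oracle_preference` would count as one query against the feedback budget, and the loss would be minimized over a batch of labeled pairs with whatever optimizer the reward model uses.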
The Brief
What you'll do, and what you'll demonstrate.
Rank three reward-from-feedback methods by sample efficiency, policy quality, and operator burden on a single controlled task.
Earning criteria — what you'll demonstrate
- Implement and compare reward-from-feedback methods in a controlled task
- Design a benchmark that fairly compares methods despite different feedback shapes
- Quantify operator burden alongside policy quality
- Write an internal research note appropriate for a lab audience
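One way to put the three reported quantities on a comparable footing is sketched below: a normalized area under the return-vs-queries curve for sample efficiency, a plain success rate for policy quality, and a projected operator time derived from the pilot's seconds-per-feedback-unit estimate. The specific summary statistics and all function names are assumptions chosen for illustration; the brief only names the three quantities, not how to aggregate them.

```python
from typing import Sequence


def normalized_auc(queries: Sequence[float], returns: Sequence[float]) -> float:
    """Trapezoidal area under the return-vs-queries curve, divided by the
    query span, so methods evaluated over the same query range are comparable."""
    if len(queries) != len(returns) or len(queries) < 2:
        raise ValueError("need at least two (queries, return) points of equal length")
    area = 0.0
    for i in range(1, len(queries)):
        width = queries[i] - queries[i - 1]
        if width <= 0:
            raise ValueError("query counts must be strictly increasing")
        area += 0.5 * width * (returns[i] + returns[i - 1])
    return area / (queries[-1] - queries[0])


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of evaluation episodes that succeeded."""
    if not outcomes:
        raise ValueError("no evaluation episodes")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def seconds_per_feedback(total_seconds: float, n_units: int) -> float:
    """Mean operator time per feedback unit, measured in the human pilot."""
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    return total_seconds / n_units


def projected_operator_hours(n_queries: int, sec_per_unit: float) -> float:
    """Operator time the oracle's queries would have cost a human, in hours."""
    return n_queries * sec_per_unit / 3600.0
```

For example, `normalized_auc([0, 100, 200], [0.0, 1.0, 1.0])` evaluates to 0.75, and 1200 oracle queries at 6 seconds per unit project to 2 operator hours. Because the methods consume different feedback shapes, the projected operator time is arguably the fairer x-axis than the raw query count.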
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Research Scientist
Owning a controlled benchmark across feedback methods and writing the internal note is the entry-level work of a research scientist at an AI lab.
This challenge sharpens
- reward-learning
- preference-comparison
- experiment-design
ML Researcher
Sample-efficiency reporting with multiple seeds and honest caveats is the methodological core of ML research.
This challenge sharpens
- reinforcement-learning
- benchmarking
- experiment-design
AI Safety Researcher
Reward-from-feedback methods sit squarely in alignment-and-safety research; this benchmark gives you a credible safety-research artefact.
This challenge sharpens
- reward-learning
- preference-comparison
- benchmarking