Embodied Visual Reasoning for a Warehouse Pick Assistant
Overview
What this challenge is about.
Use an embodied simulator (Habitat 3.0 or Isaac Sim; pick one and justify the choice) to render 300 cluttered-bin scenarios, each with a target item label. For each scenario, build two reasoning modules:
- A heuristic baseline that picks the topmost unoccluded object closest to the target.
- A vision-language reasoner that takes the RGB-D view and the target label and outputs the recommended pick with a short explanation. Use an open VLM (PaliGemma, Qwen-VL, or LLaVA); no fine-tuning is required.
Evaluate pick-correctness rate against simulator ground truth, explanation quality (3-point rubric, judged by a peer), and latency. Document where each method fails.
Success: the VLM pipeline beats the heuristic by at least 15 percentage points on pick-correctness on a 100-scenario holdout.
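The heuristic baseline described above could be sketched roughly as follows. This is one possible reading of "topmost unoccluded object closest to the target", not a prescribed implementation: it keeps objects whose occluded fraction is below a threshold, ranks them by horizontal distance to the target, and breaks ties in favor of the higher object. The `BinObject` fields, the `max_occlusion` threshold, and the z-up convention are all assumptions for illustration; in practice these values would come from the simulator's ground-truth state.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BinObject:
    """Hypothetical per-object record exported from the simulator."""
    obj_id: str
    label: str
    x: float              # horizontal position in the bin, metres
    y: float
    z: float              # height of the object's top surface (assumes z-up)
    occluded_frac: float  # fraction hidden from the camera, 0.0 to 1.0


def heuristic_pick(
    objects: Sequence[BinObject],
    target_label: str,
    max_occlusion: float = 0.05,  # assumed cutoff for "unoccluded"
) -> Optional[BinObject]:
    """Return the unoccluded object nearest the target, preferring higher ones."""
    targets = [o for o in objects if o.label == target_label]
    if not targets:
        raise ValueError(f"no object labelled {target_label!r} in scene")
    target = targets[0]

    candidates = [o for o in objects if o.occluded_frac <= max_occlusion]
    if not candidates:
        return None  # nothing is safely pickable under this threshold

    def rank(o: BinObject):
        dist = math.hypot(o.x - target.x, o.y - target.y)
        return (dist, -o.z)  # nearest first, then topmost

    return min(candidates, key=rank)
```

If the target itself is unoccluded, its distance is zero and it is returned directly; otherwise the sketch recommends clearing the nearest exposed neighbor first.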
The Brief
What you'll do, and what you'll demonstrate.
Show whether an open vision-language model can outperform a heuristic baseline at high-level pick-order reasoning in cluttered warehouse bins.
Earning criteria — what you'll demonstrate
- Use an embodied simulator to construct a controlled visual-reasoning evaluation
- Apply an open vision-language model as a high-level reasoning module
- Compare a VLM-based reasoner against a strong heuristic baseline fairly
- Translate research findings into a sprint-ready research question list
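For the fair-comparison criterion, a minimal scoring sketch might look like the following. It assumes both methods are scored on the same holdout scenarios against the same simulator ground truth, and that a pick is represented by an object ID string. The function names and the `required_pp` parameter are hypothetical. The margin is computed from hit counts so that the threshold comparison is not affected by subtracting two rounded rates.

```python
from typing import Sequence


def count_hits(predictions: Sequence[str], ground_truth: Sequence[str]) -> int:
    """Number of scenarios where the predicted pick matches ground truth."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must be the same length")
    return sum(p == g for p, g in zip(predictions, ground_truth))


def pick_correctness(predictions: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Pick-correctness rate in [0, 1]."""
    if not ground_truth:
        raise ValueError("holdout set is empty")
    return count_hits(predictions, ground_truth) / len(ground_truth)


def margin_pp(
    vlm_preds: Sequence[str],
    heuristic_preds: Sequence[str],
    ground_truth: Sequence[str],
) -> float:
    """VLM rate minus heuristic rate, in percentage points."""
    if not ground_truth:
        raise ValueError("holdout set is empty")
    diff = count_hits(vlm_preds, ground_truth) - count_hits(heuristic_preds, ground_truth)
    return 100.0 * diff / len(ground_truth)


def meets_success(
    vlm_preds: Sequence[str],
    heuristic_preds: Sequence[str],
    ground_truth: Sequence[str],
    required_pp: float = 15.0,  # threshold taken from the challenge brief
) -> bool:
    """True if the VLM beats the heuristic by at least `required_pp` points."""
    return margin_pp(vlm_preds, heuristic_preds, ground_truth) >= required_pp
```

On a 100-scenario holdout, a 15 percentage point margin corresponds to the VLM pipeline getting at least 15 more picks correct than the heuristic.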
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Research Scientist
Designing a controlled simulator benchmark, running a fair model-vs-baseline comparison, and producing a research-question list is the daily texture of a junior research scientist's job in any robotics lab.
This challenge sharpens
- embodied-vision
- visual-reasoning
- evaluation
ML Researcher
Applying a vision-language model as a high-level reasoning module and characterizing its failure modes is precisely the kind of applied-research work ML researchers ship for embodied-AI startups.
This challenge sharpens
- vision-language-models
- visual-reasoning
- simulation
Computer Vision Engineer
Constructing simulated RGB-D scenes and wiring up perception inputs to a downstream consumer is the kind of pipeline plumbing CV engineers own at robotics companies.
This challenge sharpens
- embodied-vision
- simulation
- pytorch