Overview
What this challenge is about.
You will take a small open-source vision-language model (e.g., LLaVA-1.5-7B or PaliGemma), prompt-engineer it for a warehouse visual question answering (VQA) task, and build a Gradio web demo around it. Construct a 200-question evaluation set with reference answers, covering counting, presence, condition, and location questions. Report accuracy per question type and surface failure modes. Deliver the demo, an evaluation notebook, and a 10-slide deck for the client presentation.
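As a rough illustration of the per-question-type accuracy report described above, here is a minimal sketch in Python. It assumes each evaluation record is a dict with `type`, `reference`, and `prediction` keys and scores answers by normalized exact match. Both the record layout and the scoring rule are assumptions for illustration, not requirements of the challenge, and the sample records are hypothetical.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase and trim whitespace and trailing punctuation before comparing."""
    return answer.strip().lower().rstrip(".!?")


def is_correct(record: dict) -> bool:
    """Normalized exact match between the model prediction and the reference."""
    return normalize(record["prediction"]) == normalize(record["reference"])


def accuracy_by_type(records: list[dict]) -> dict[str, float]:
    """Return the fraction of correct answers for each question type."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["type"]] += 1
        correct[record["type"]] += int(is_correct(record))
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


def failures(records: list[dict]) -> list[dict]:
    """Collect the misses so failure modes can be inspected by hand."""
    return [record for record in records if not is_correct(record)]


# Hypothetical records; real ones would come from running the model
# over the evaluation set.
sample = [
    {"type": "counting", "reference": "3", "prediction": "3"},
    {"type": "counting", "reference": "5", "prediction": "4"},
    {"type": "presence", "reference": "yes", "prediction": "Yes."},
]
print(accuracy_by_type(sample))
print(len(failures(sample)))
```

Exact match is a deliberately strict choice. Free-form condition and location answers may need a looser comparison, such as keyword matching or manual review, which would replace `is_correct` without changing the rest of the report.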
The Brief
What you'll do, and what you'll demonstrate.
Build a working warehouse-VQA demo on a small vision-language model and quantify accuracy per question type.
Earning criteria — what you'll demonstrate
- Apply a small open-source vision-language model to a domain VQA task
- Prompt-engineer for grounded multimodal reasoning
- Construct a balanced VQA evaluation set across question types
- Present a working multimodal demo to a mixed client audience
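The prompt-engineering and balanced-evaluation criteria above can be sketched together in a few lines of Python. The prompt wording, the `PROMPT_TEMPLATES` and `is_balanced` names, and the idea of an even split across the four question types (50 each for a 200-question set) are all assumptions made for illustration. The brief itself does not prescribe a split or a prompt format.

```python
from collections import Counter

QUESTION_TYPES = ("counting", "presence", "condition", "location")

# Hypothetical prompt wording: each template asks the model to stay
# grounded in the image and constrains the answer format per type.
PROMPT_TEMPLATES = {
    "counting": "Look only at what is visible in the image. {question} Answer with a single number.",
    "presence": "Look only at what is visible in the image. {question} Answer yes or no.",
    "condition": "Look only at what is visible in the image. {question} Answer in a few words.",
    "location": "Look only at what is visible in the image. {question} Name the area of the image.",
}


def build_prompt(question_type: str, question: str) -> str:
    """Fill the template for the given question type."""
    return PROMPT_TEMPLATES[question_type].format(question=question)


def is_balanced(questions: list[dict], tolerance: int = 0) -> bool:
    """True if every known type is present and the per-type counts
    differ by at most `tolerance`. Unknown types are ignored."""
    counts = Counter(q["type"] for q in questions)
    per_type = [counts.get(qtype, 0) for qtype in QUESTION_TYPES]
    return min(per_type) > 0 and max(per_type) - min(per_type) <= tolerance


# Hypothetical 200-question set with an even split across types.
questions = [
    {"type": qtype, "question": "placeholder"}
    for qtype in QUESTION_TYPES
    for _ in range(50)
]
print(is_balanced(questions))
print(build_prompt("presence", "Is there a forklift in the aisle?"))
```

Constraining the answer format per type also makes scoring easier, since a yes/no or single-number answer can be compared against the reference directly.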
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Career paths this builds toward
Canonical roles
AI Engineer
Shipping a working multimodal demo for a real client meeting is a signature AI engineer deliverable at consultancies and AI-forward product teams.
This challenge sharpens
- vision-language-models
- demo-development
- prompt-engineering
Prompt Engineer
Designing and iterating on prompts for grounded multimodal reasoning is the core prompt-engineering skill that hiring managers screen for.
This challenge sharpens
- prompt-engineering
- vision-language-models
- evaluation
Applied AI Scientist
Constructing a balanced VQA evaluation set and reporting per-question-type accuracy is core applied AI scientist work when shipping new capabilities.
This challenge sharpens
- multimodal-perception
- evaluation
- vision-language-models