Fine-Tune a Vision-Language Model for Image Captioning
Overview
What this challenge is about.
Start from BLIP-2 or LLaVA-1.6 as the base model. Fine-tune it (LoRA is fine) on a 4,000-image accessibility-curated dataset in which each image has a useful caption written by an annotator with low-vision experience. Use an instruction-following caption style. Evaluate with automated metrics (CIDEr, SPICE) and with a 30-user study that compares fine-tuned and base captions on perceived usefulness. Report the comparison on a Likert scale. Write a 4-page memo with a go/no-go recommendation on shipping the fine-tune.
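To get a feel for why LoRA suits a 4,000-image dataset, it can help to count what actually trains. The sketch below is plain arithmetic, not a training script: a LoRA adapter on a weight matrix of shape d_out x d_in adds two low-rank factors, A (rank x d_in) and B (d_out x rank), and leaves the original weights frozen. The function names, the 4096 hidden size, and the rank of 16 are illustrative assumptions, not values taken from BLIP-2, LLaVA-1.6, or this brief.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a (d_out x d_in) weight.

    A has shape (rank x d_in) and B has shape (d_out x rank),
    so the adapter adds rank * (d_in + d_out) trainable values.
    """
    if min(d_in, d_out, rank) <= 0:
        raise ValueError("d_in, d_out and rank must be positive")
    return rank * (d_in + d_out)


def frozen_params(d_in: int, d_out: int) -> int:
    """Parameters in the original weight matrix, which LoRA leaves frozen."""
    return d_in * d_out


if __name__ == "__main__":
    hidden = 4096  # hypothetical projection width, for illustration only
    rank = 16      # hypothetical LoRA rank
    added = lora_trainable_params(hidden, hidden, rank)
    frozen = frozen_params(hidden, hidden)
    print(f"{added:,} trainable vs {frozen:,} frozen ({added / frozen:.2%})")
```

Multiplying the per-matrix figure by the number of adapted projections gives a rough trainable-parameter budget to quote in the memo. The real count depends on which modules you choose to adapt.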
The Brief
What you'll do, and what you'll demonstrate.
Fine-tune a vision-language model so its captions are actually useful for low-vision users, validated by a 30-user study.
Earning criteria — what you'll demonstrate
- Fine-tune a vision-language model with parameter-efficient methods
- Design a user study that measures real downstream usefulness
- Balance automated metrics with human judgment
- Make a ship/no-ship call on a model fine-tune
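One possible way to turn the user-study criterion into a number is a paired comparison: each participant rates the base caption and the fine-tuned caption for the same image on the same Likert scale. The sketch below uses an exact two-sided sign test, chosen here only because it needs nothing beyond the standard library and makes no interval-scale assumption. The brief does not prescribe a test, and a Wilcoxon signed-rank test would be a common alternative. The function names are hypothetical.

```python
from math import comb
from statistics import mean, median


def sign_test_p(base: list[int], tuned: list[int]) -> float:
    """Exact two-sided sign test on paired ratings; tied pairs are dropped."""
    wins = sum(t > b for b, t in zip(base, tuned))
    losses = sum(t < b for b, t in zip(base, tuned))
    n = wins + losses
    if n == 0:
        return 1.0  # every pair tied: no evidence either way
    k = min(wins, losses)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for two sides, capped at 1
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def summarize(base: list[int], tuned: list[int]) -> dict:
    """Summarize paired Likert ratings (one pair per participant)."""
    if len(base) != len(tuned):
        raise ValueError("ratings must be paired one-to-one")
    diffs = [t - b for b, t in zip(base, tuned)]
    return {
        "n": len(diffs),
        "mean_base": mean(base),
        "mean_tuned": mean(tuned),
        "median_diff": median(diffs),
        "p_value": sign_test_p(base, tuned),
    }


if __name__ == "__main__":
    # Made-up ratings for illustration only, not study results.
    base_ratings = [3, 3, 2, 4, 3, 2]
    tuned_ratings = [4, 4, 3, 4, 4, 3]
    print(summarize(base_ratings, tuned_ratings))
```

Reporting the median difference next to CIDEr and SPICE deltas keeps the human judgment and the automated metrics side by side, which is the comparison the go/no-go memo has to weigh.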
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Applied AI Scientist
Fine-tuning vision-language models for a specific user need and validating the result with a real user study is the day job of applied AI scientists at consumer AI startups.
This challenge sharpens
- vision-language-models
- lora-fine-tuning
- user-study-design
ML Researcher
Balancing CIDEr/SPICE against human judgments is the kind of methodological rigor that ML research teams need for any captioning or generation evaluation.
This challenge sharpens
- image-captioning
- evaluation
- lora-fine-tuning
AI Product Designer
Working with low-vision users to define what 'useful' means, and designing the comparison study, is the AI product designer's craft on accessibility-focused products.
This challenge sharpens
- user-study-design
- image-captioning
- evaluation