Build a Multimodal Generation Pipeline for a Tourism Operator
Overview
What this challenge is about.
You receive 40 sample 30-second videos shot by tour guides, the operator's brand voice doc, and SEO keyword lists for EN/PT/ES. Build a pipeline that (1) extracts a representative frame, (2) generates a captioned still image (the text overlay can be applied downstream), (3) generates a 220-character caption in each language, and (4) suggests 3 hashtags. Use open models (e.g., LLaVA or Qwen2-VL for vision-language tasks, Whisper to transcribe any voiceover). Evaluate the output with a 4-person guide panel that rates on-brand fit and SEO usefulness. Finally, write the weekly operations playbook for guides.
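One way to approach step (1) is to pick the frame that looks most like the clip as a whole. The sketch below is a minimal illustration, not a prescribed method: it assumes frames have already been decoded upstream (for example with OpenCV or ffmpeg) into flat sequences of 0-255 grayscale values, and the names `gray_histogram` and `pick_representative_frame` are hypothetical. It selects the frame whose brightness histogram is closest to the clip's mean histogram.

```python
from typing import List, Sequence


def gray_histogram(pixels: Sequence[int], bins: int = 8) -> List[float]:
    """Normalized histogram of 0-255 grayscale pixel values."""
    counts = [0] * bins
    for p in pixels:
        # Map a 0-255 value to a bin index, clamping at the last bin.
        counts[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels) or 1  # avoid division by zero on an empty frame
    return [c / total for c in counts]


def pick_representative_frame(frames: Sequence[Sequence[int]], bins: int = 8) -> int:
    """Return the index of the frame closest to the clip's mean histogram.

    Assumption: each frame is a flat sequence of grayscale values that was
    decoded and downsampled by an upstream step not shown here.
    """
    if not frames:
        raise ValueError("no frames to choose from")
    hists = [gray_histogram(f, bins) for f in frames]
    mean_hist = [sum(h[i] for h in hists) / len(hists) for i in range(bins)]

    def distance(h: List[float]) -> float:
        # Squared Euclidean distance to the mean histogram.
        return sum((a - b) ** 2 for a, b in zip(h, mean_hist))

    # min() keeps the earliest frame when several are equally close.
    return min(range(len(hists)), key=lambda i: distance(hists[i]))


# Toy clip: one dark frame, two mid-gray frames, one bright frame.
clip = [[0] * 16, [128] * 16, [128] * 16, [255] * 16]
chosen = pick_representative_frame(clip)
```

A histogram heuristic is cheap and needs no model, but it ignores sharpness and composition; a vision-language model could re-rank a handful of candidates if the panel finds the picks unrepresentative.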
The Brief
What you'll do, and what you'll demonstrate.
Build a multimodal generation pipeline that turns a 30-second tour video into a publish-ready social post bundle in EN/PT/ES.
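A "publish-ready bundle" is easier to hand to guides if its shape is checked before anyone sees it. The sketch below shows one possible output contract. The `PostBundle` fields and the `validate_bundle` helper are hypothetical names, and two readings of the brief are assumed: 220 characters is treated as an upper bound, and the 3 hashtags are suggested once per bundle rather than once per language.

```python
from dataclasses import dataclass
from typing import Dict, List

LANGUAGES = ("en", "pt", "es")
MAX_CAPTION_CHARS = 220  # assumption: the brief's 220 is an upper bound
HASHTAG_COUNT = 3        # assumption: 3 hashtags per bundle, not per language


@dataclass
class PostBundle:
    video_id: str
    frame_path: str           # path to the extracted still image
    captions: Dict[str, str]  # language code -> caption text
    hashtags: List[str]


def validate_bundle(bundle: PostBundle) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    for lang in LANGUAGES:
        caption = bundle.captions.get(lang, "").strip()
        if not caption:
            problems.append(f"missing {lang} caption")
        elif len(caption) > MAX_CAPTION_CHARS:
            # len() counts Unicode code points, which is adequate for
            # precomposed PT/ES accents but not for every emoji sequence.
            problems.append(
                f"{lang} caption is {len(caption)} chars (max {MAX_CAPTION_CHARS})"
            )
    if len(bundle.hashtags) != HASHTAG_COUNT:
        problems.append(
            f"expected {HASHTAG_COUNT} hashtags, got {len(bundle.hashtags)}"
        )
    for tag in bundle.hashtags:
        if not tag.startswith("#") or " " in tag or len(tag) < 2:
            problems.append(f"malformed hashtag: {tag!r}")
    return problems


bundle = PostBundle(
    video_id="clip-001",
    frame_path="frames/clip-001.jpg",
    captions={
        "en": "Sunrise over the old town.",
        "pt": "Nascer do sol na cidade velha.",
        "es": "Amanecer sobre el casco antiguo.",
    },
    hashtags=["#sunrise", "#oldtown", "#walkingtour"],
)
issues = validate_bundle(bundle)  # empty list when the bundle passes
```

Returning a list of problems rather than raising on the first one lets the playbook show a guide everything that needs fixing in a single pass.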
Earning criteria — what you'll demonstrate
- Compose a multimodal generation pipeline from open models
- Apply vision-language models to a real product use case
- Evaluate multilingual generation against brand and SEO criteria
- Communicate a generative pipeline as an operational tool
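For the evaluation criterion above, the panel's scores need to be rolled up per language so weak spots are visible. The sketch below is one minimal way to do that. The 1-5 rating scale, the `Rating` record, and the `summarize` function are assumptions for illustration; the brief only specifies a 4-person panel rating on-brand fit and SEO usefulness.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, NamedTuple


class Rating(NamedTuple):
    rater: str
    video_id: str
    lang: str
    brand_fit: int  # assumed 1-5 scale
    seo: int        # assumed 1-5 scale


def summarize(ratings: Iterable[Rating]) -> Dict[str, Dict[str, float]]:
    """Mean brand-fit and SEO score per language, plus the rating count."""
    by_lang: Dict[str, List[Rating]] = defaultdict(list)
    for r in ratings:
        by_lang[r.lang].append(r)
    return {
        lang: {
            "brand_fit": mean(r.brand_fit for r in rs),
            "seo": mean(r.seo for r in rs),
            "n": len(rs),
        }
        for lang, rs in by_lang.items()
    }


# Four hypothetical guides rating the English caption of one clip.
panel = [
    Rating("guide_a", "clip-001", "en", brand_fit=4, seo=2),
    Rating("guide_b", "clip-001", "en", brand_fit=5, seo=3),
    Rating("guide_c", "clip-001", "en", brand_fit=3, seo=3),
    Rating("guide_d", "clip-001", "en", brand_fit=4, seo=4),
]
summary = summarize(panel)
```

With only four raters, means are noisy, so it may be worth reporting the per-rater spread alongside them before drawing conclusions about a language.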
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
AI Engineer
Composing a multimodal pipeline into an operational tool a non-engineer can run is typical day-one work for an AI engineer at a consumer-AI or content-tech firm.
This challenge sharpens
- multimodal-generation
- vision-language-models
- llm-inference
NLP Engineer
Multilingual caption generation under length constraints with brand rules is core NLP-engineer work in content and marketing-AI tools.
This challenge sharpens
- llm-inference
- prompt-engineering
- evaluation
AI Product Designer
Designing the per-language output contract and writing the guide-facing playbook is core to the AI product designer's craft: building tools that real operators trust.
This challenge sharpens
- prompt-engineering
- evaluation
- multimodal-generation