Design an Eval Suite for a Multimodal Brainstorming Assistant
Overview
What this challenge is about.
You receive three inputs: (1) the assistant's current API, (2) a list of 6 launch user personas, and (3) the product team's quality target ("beat the previous model on 4 of 6 personas"). Design an evaluation suite with four components:
- Per-modality quality metrics: BLEU/ROUGE plus a rubric for text, CLIPScore for image, and WER for voice
- Factuality probes: 50 fact-check questions
- Safety probes: 200 prompts spanning categories
- Creativity rubric: 3 raters, 1-5 scale
Implement the suite as a Python harness runnable with one command. Write a 3-page eval-suite spec doc, then run the harness against two model versions to demonstrate that it surfaces real differences.
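The quality target ("beat the previous model on 4 of 6 personas") can be turned into an explicit pass/fail check. The following is a minimal sketch, not a prescribed implementation: it assumes each persona already has a list of numeric scores per model (for example rubric ratings on the 1-5 scale), that "beat" means a strictly higher mean score, and that ties do not count as wins. The function names, the `margin` parameter, and the data layout are hypothetical choices made for illustration.

```python
from statistics import mean


def persona_wins(candidate, baseline, margin=0.0):
    """Return the personas on which the candidate model beats the baseline.

    candidate / baseline: dict mapping persona name -> list of numeric scores.
    Assumption: a "win" means the candidate's mean score exceeds the
    baseline's mean score by more than `margin`; ties are not wins.
    """
    return [
        persona
        for persona in candidate
        if mean(candidate[persona]) > mean(baseline[persona]) + margin
    ]


def passes_launch_gate(candidate, baseline, required_wins=4, margin=0.0):
    """Apply the 'beat the previous model on N personas' target (N=4 by default)."""
    return len(persona_wins(candidate, baseline, margin)) >= required_wins
```

A non-zero `margin` is one way to avoid counting differences that are within rater noise as wins; what value is appropriate depends on how much the scores vary between raters and runs.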
The Brief
What you'll do, and what you'll demonstrate.
Design and prototype a CI-runnable evaluation suite for a multimodal brainstorming assistant covering quality, factuality, safety, and creativity.
Earning criteria — what you'll demonstrate
- Design a multimodal evaluation suite balancing automated + rubric scores
- Build safety + factuality probe sets that survive future model changes
- Engineer eval as code, runnable in CI
- Communicate eval-suite trade-offs to product + safety leadership
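Two of the criteria above lend themselves to a short illustration: keeping probe sets stable across model changes, and making the suite usable as a CI gate. The sketch below shows one possible approach under stated assumptions: probes are dicts with a unique string `id` field, a probe set is pinned by a SHA-256 fingerprint of its canonical JSON form, and CI treats a non-zero exit code as failure. All function names and the probe layout are hypothetical.

```python
import hashlib
import json
import sys


def probe_set_fingerprint(probes):
    """Hash a probe set so that silent edits are detectable.

    Assumption: each probe is a JSON-serializable dict with a unique "id".
    Probes are sorted by id and keys are sorted, so the fingerprint does
    not depend on file order or key order.
    """
    canonical = json.dumps(
        sorted(probes, key=lambda p: p["id"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ci_exit_code(checks):
    """Map named boolean checks to a process exit code (0 = all passed)."""
    failed = [name for name, ok in checks.items() if not ok]
    for name in failed:
        print(f"FAILED: {name}", file=sys.stderr)
    return 1 if failed else 0
```

In a harness entry point, the result could be passed to `sys.exit(ci_exit_code(checks))`, with one check comparing the current fingerprint to the value recorded in the spec doc, so that two model versions are always scored against the same probes.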
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Career paths this builds toward
Canonical roles
AI Product Manager
Designing the eval suite that gates a consumer launch is exactly the day-one work of an AI PM at any consumer-AI company.
This challenge sharpens
- rubric-design
- llm-evaluation
- safety-evaluation
AI Safety Researcher
Building versioned safety + factuality probe sets that survive model swaps is core AI safety work in product-led organizations.
This challenge sharpens
- safety-evaluation
- rubric-design
- multimodal-evaluation
MLOps Engineer
Shipping evaluation as a CI-runnable harness is the MLOps craft of making model-quality gates automatic and reliable.
This challenge sharpens
- ci-integration
- python
- llm-evaluation