
Design Eval Suite for a Multimodal Brainstorming Assistant

Free · Verified credential · 4 weeks · Expert

Overview

What this challenge is about.

You receive (1) the assistant's current API, (2) a list of 6 launch user personas, and (3) the product team's quality target ('beat the previous model on 4 of 6 personas'). Design an evaluation suite with:

  • per-modality quality metrics: BLEU/ROUGE plus a rubric for text, CLIPScore for images, WER for voice
  • factuality probes: 50 fact-check questions
  • safety probes: 200 prompts spanning categories
  • a creativity rubric: 3 raters, 1-5 scale

Implement the suite as a Python harness runnable with one command. Write a 3-page eval-suite spec doc, and run the harness against two model versions to demonstrate that it surfaces real differences.
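Of the metrics named above, WER (word error rate) is the easiest to pin down in code: the word-level edit distance between a reference transcript and the model's transcript, divided by the number of reference words. The sketch below is one minimal way to compute it, not the challenge's required implementation. The function name and the lowercase-and-split normalization are assumptions; a real harness would likely need stricter text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words, so it should not be treated as a percentage bounded at 100.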

Credential: Blockchain-anchored
Shareable: LinkedIn-ready
Language: English
Pace: Self-paced

The Brief

What you'll do, and what you'll demonstrate.

Design and prototype a CI-runnable evaluation suite for a multimodal brainstorming assistant covering quality, factuality, safety, and creativity.

Earning criteria — what you'll demonstrate

  • Design a multimodal evaluation suite balancing automated + rubric scores
  • Build safety + factuality probe sets that survive future model changes
  • Engineer eval as code, runnable in CI
  • Communicate eval-suite trade-offs to product + safety leadership
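To make "runnable in CI" concrete, here is a rough sketch of a gate that turns the product team's target (beat the previous model on 4 of 6 personas) into a process exit code. Everything named here is hypothetical: the function names, the assumption that each persona reduces to a single higher-is-better score, and the choice that a tie does not count as a win.

```python
REQUIRED_WINS = 4  # from the brief: beat the previous model on 4 of 6 personas


def personas_won(candidate: dict[str, float], baseline: dict[str, float]) -> list[str]:
    """Return the personas where the candidate scores strictly higher."""
    if candidate.keys() != baseline.keys():
        raise ValueError("both models must be scored on the same personas")
    # Assumption: a tie is not a win, since the target says "beat".
    return sorted(p for p in candidate if candidate[p] > baseline[p])


def gate(candidate: dict[str, float], baseline: dict[str, float],
         required_wins: int = REQUIRED_WINS) -> int:
    """Return 0 (pass) or 1 (fail), suitable for use as a process exit code."""
    wins = personas_won(candidate, baseline)
    print(f"candidate wins {len(wins)}/{len(candidate)} personas: {wins}")
    return 0 if len(wins) >= required_wins else 1
```

A one-command entry point could end with `sys.exit(gate(candidate_scores, baseline_scores))`, so that most CI systems mark the job as failed whenever the candidate falls short of the target.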

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

Career paths this builds toward

Canonical roles

AI Product Manager

Designing the eval suite that gates a consumer launch is exactly the day-one work of an AI PM at any consumer-AI company.

This challenge sharpens

  • rubric-design
  • llm-evaluation
  • safety-evaluation

AI Safety Researcher

Building versioned safety + factuality probe sets that survive model swaps is core AI safety work in product-led organizations.

This challenge sharpens

  • safety-evaluation
  • rubric-design
  • multimodal-evaluation
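One possible way to keep a probe set "versioned" across model swaps is to fingerprint its content, so that two eval runs are only compared when they used identical probes. The sketch below assumes a hypothetical `Probe` record with three fields; the field names and the choice of SHA-256 over canonical JSON are illustrative, not prescribed by the challenge.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    # Hypothetical schema: one safety or factuality probe.
    probe_id: str
    category: str
    prompt: str


def fingerprint(probes: list[Probe]) -> str:
    """SHA-256 hex digest over a canonical encoding of the probe set.

    Rows are sorted first, so reordering the probes does not change
    the fingerprint, while editing any field does.
    """
    rows = sorted([p.probe_id, p.category, p.prompt] for p in probes)
    canonical = json.dumps(rows, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the fingerprint next to each run's results makes it visible when a score change came from an edited probe set rather than from the model.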

MLOps Engineer

Shipping evaluation as a CI-runnable harness is the MLOps craft of making model-quality gates automatic and reliable.

This challenge sharpens

  • ci-integration
  • python
  • llm-evaluation
