Overview
What this challenge is about.
Build the pipeline: face detection and active-speaker detection on video, voice-activity detection and speaker embeddings on audio, then a fusion step that ties audio speaker tracks to detected faces. Use open models (RetinaFace + TalkNet for AV, ECAPA-TDNN or pyannote for audio). Evaluate on a 50-session held-out test set with manual ground-truth labels. Report Diarization Error Rate (DER) and per-speaker accuracy against an audio-only baseline. Demonstrate a working web demo for one session. Write a 4-page handoff doc for the platform team.
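One way to picture the fusion step is as an overlap vote: each audio speaker cluster is linked to the face that active-speaker detection marks as talking for the most shared time. The sketch below is only an illustration under assumptions not stated in the brief: both inputs are already reduced to `(start, end, label)` tuples in seconds, and the function names `overlap` and `link_speakers_to_faces` are hypothetical, not part of RetinaFace, TalkNet, or pyannote.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def link_speakers_to_faces(audio_segments, active_face_segments):
    """Map each audio speaker label to the face it overlaps with most.

    audio_segments: (start, end, speaker_id) tuples from audio diarization.
    active_face_segments: (start, end, face_id) tuples for intervals where
        active-speaker detection flags that face as talking (assumed format).
    Speakers that never overlap an active face are left out of the result.
    """
    shared_time = defaultdict(lambda: defaultdict(float))
    for a_start, a_end, speaker in audio_segments:
        for f_start, f_end, face in active_face_segments:
            shared = overlap(a_start, a_end, f_start, f_end)
            if shared > 0:
                shared_time[speaker][face] += shared
    return {
        speaker: max(faces, key=faces.get)
        for speaker, faces in shared_time.items()
    }


audio = [(0.0, 5.0, "spk0"), (5.0, 9.0, "spk1"), (9.0, 12.0, "spk0")]
faces = [(0.5, 4.5, "face_a"), (5.2, 8.8, "face_b"), (9.1, 11.0, "face_a")]
links = link_speakers_to_faces(audio, faces)
print(links)
```

A speaker who is off-screen for the whole session gets no face in this sketch, which is one reason to keep the audio-only labels as a fallback rather than discarding them.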
The Brief
What you'll do, and what you'll demonstrate.
Cut Diarization Error Rate on tutoring sessions by fusing audio and video, and prove the win on a 50-session held-out test set.
Earning criteria — what you'll demonstrate
- Combine audio and video modalities at the right granularity
- Apply active-speaker detection to disambiguate similar voices
- Evaluate diarization with standard metrics (DER, JER)
- Hand off a multimodal pipeline to a non-ML platform team
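For the evaluation criterion above, DER is the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. The following is a simplified frame-based sketch for intuition, not a replacement for a standard scorer: it assumes a 0.1 s frame step, at most one speaker per frame (no overlapped speech), no forgiveness collar, and a brute-force speaker mapping that only scales to a handful of speakers. The helper names are hypothetical. For reported numbers, an established implementation such as the `DiarizationErrorRate` metric in `pyannote.metrics` is the safer choice.

```python
from collections import Counter
from itertools import permutations


def frame_labels(segments, n_frames, step):
    """Rasterize (start, end, speaker) segments into per-frame labels."""
    labels = [None] * n_frames  # None means silence
    for start, end, speaker in segments:
        for i in range(round(start / step), min(round(end / step), n_frames)):
            labels[i] = speaker
    return labels


def best_match_frames(pairs):
    """Frames correctly attributed under the best one-to-one speaker mapping."""
    counts = Counter(pairs)  # (ref_speaker, hyp_speaker) -> frame count
    ref_ids = sorted({r for r, _ in pairs})
    hyp_ids = sorted({h for _, h in pairs})
    # Pad with None so surplus hypothesis speakers can stay unmatched.
    padded = ref_ids + [None] * max(0, len(hyp_ids) - len(ref_ids))
    best = 0
    for perm in permutations(padded, len(hyp_ids)):
        best = max(best, sum(counts[(r, h)] for r, h in zip(perm, hyp_ids)))
    return best


def der(reference, hypothesis, duration, step=0.1):
    """Simplified DER: (missed + false alarm + confusion) / reference speech."""
    n = round(duration / step)
    ref = frame_labels(reference, n, step)
    hyp = frame_labels(hypothesis, n, step)
    ref_speech = sum(r is not None for r in ref)
    missed = sum(r is not None and h is None for r, h in zip(ref, hyp))
    false_alarm = sum(r is None and h is not None for r, h in zip(ref, hyp))
    both = [(r, h) for r, h in zip(ref, hyp) if r is not None and h is not None]
    confusion = len(both) - best_match_frames(both)
    return (missed + false_alarm + confusion) / ref_speech


reference = [(0.0, 5.0, "tutor"), (5.0, 10.0, "student")]
hypothesis = [(0.0, 4.0, "A"), (4.0, 10.0, "B")]
score = der(reference, hypothesis, duration=10.0)
print(f"DER = {score:.2f}")
```

In this toy session the hypothesis hands one second of the tutor's speech to the wrong speaker, so the only error is confusion. Running the same function on the audio-only and the fused outputs gives the side-by-side comparison the brief asks for.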
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
ML Researcher
Fusing audio + video for diarization with honest DER evaluation is the applied multimodal research that edtech and conferencing AI teams hire for.
This challenge sharpens
- audio-visual-fusion
- speaker-diarization
- evaluation
Applied AI Scientist
Translating open research models into a working multimodal pipeline with a demo and handoff is core applied-AI-scientist work at AI-first startups.
This challenge sharpens
- audio-visual-fusion
- active-speaker-detection
- pytorch
Machine Learning Engineer
Shipping a production-shaped AV pipeline that the platform team adopts is exactly the MLE work that edtech AI teams need on their roadmap.
This challenge sharpens
- pytorch
- pyannote
- speaker-diarization