Build an Audio-Visual Speaker Diarization Pipeline

Free · Verified credential · 4 weeks · Advanced

Overview

What this challenge is about.

Build the pipeline: face detection and active-speaker detection on video, voice-activity detection and speaker embeddings on audio, then a fusion step that ties audio speaker tracks to detected faces. Use open models (RetinaFace + TalkNet for the audio-visual side, ECAPA-TDNN or pyannote for audio). Evaluate on a 50-session held-out test set with manual ground-truth labels. Report Diarization Error Rate (DER) and per-speaker accuracy against an audio-only baseline. Demonstrate a working web demo for one session, and write a 4-page handoff doc for the platform team.
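The fusion step can be pictured with a minimal sketch. It assumes the audio side has already produced (start, end, speaker label) segments and the video side has produced (start, end, face track id) intervals in which a face was judged to be speaking. The function names, tuple layout, and labels are hypothetical and chosen for illustration only. They are not the output format of RetinaFace, TalkNet, ECAPA-TDNN, or pyannote.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def fuse(audio_segments, active_face_intervals):
    """Map each audio speaker label to the face track it overlaps most.

    audio_segments: list of (start, end, speaker_label) from audio diarization.
    active_face_intervals: list of (start, end, face_track_id) where the
        active-speaker detector judged that face to be talking.
    Returns {speaker_label: face_track_id}, with None for a speaker that never
    overlaps an active face (for example, someone who stays off camera).
    """
    totals = defaultdict(lambda: defaultdict(float))
    for s_start, s_end, speaker in audio_segments:
        totals[speaker]  # touch the key so every speaker appears in the result
        for f_start, f_end, face in active_face_intervals:
            shared = overlap(s_start, s_end, f_start, f_end)
            if shared > 0:
                totals[speaker][face] += shared
    return {
        speaker: max(faces, key=faces.get) if faces else None
        for speaker, faces in totals.items()
    }


# Hypothetical toy session: two voices, two faces.
audio = [(0.0, 4.0, "spk0"), (4.0, 7.0, "spk1"), (7.0, 9.0, "spk0")]
faces = [(0.2, 3.8, "face_A"), (4.1, 6.5, "face_B"), (7.2, 8.9, "face_A")]
mapping = fuse(audio, faces)
```

This is late fusion by accumulated temporal overlap, which is only one possible design. A real pipeline would likely also weigh detector confidence and handle overlapping speech, and both are left out here.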

Credential: Blockchain-anchored

Shareable: LinkedIn-ready

Language: English

Pace: Self-paced

The Brief

What you'll do, and what you'll demonstrate.

Cut Diarization Error Rate on tutoring sessions by fusing audio and video, and prove the win on a 50-session held-out test set.

Earning criteria — what you'll demonstrate

Combine audio and video modalities at the right granularity
Apply active-speaker detection to disambiguate similar voices
Evaluate diarization with standard metrics (DER, JER)
Hand off a multimodal pipeline to a non-ML platform team
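For the evaluation criterion, DER is conventionally the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. The sketch below is a simplified frame-level version, assuming 100 ms frames, no forgiveness collar, and a brute-force search over label mappings that is only practical for a handful of speakers. The function names and the example labels are hypothetical. For reported numbers, an established scoring tool is the safer choice.

```python
from itertools import permutations


def to_frames(segments, n_frames, step):
    """Rasterize (start, end, label) segments into one label set per frame."""
    frames = [set() for _ in range(n_frames)]
    for start, end, label in segments:
        first = int(round(start / step))
        last = min(n_frames, int(round(end / step)))
        for i in range(first, last):
            frames[i].add(label)
    return frames


def frame_der(reference, hypothesis, duration, step=0.1):
    """Simplified frame-level DER with an exhaustive optimal label mapping."""
    n = int(round(duration / step))
    ref = to_frames(reference, n, step)
    hyp = to_frames(hypothesis, n, step)
    total = sum(len(r) for r in ref)  # reference speech, in speaker-frames
    if total == 0:
        raise ValueError("reference contains no speech")

    # Miss and false alarm depend only on speaker counts, not on the mapping.
    miss = sum(max(0, len(r) - len(h)) for r, h in zip(ref, hyp))
    false_alarm = sum(max(0, len(h) - len(r)) for r, h in zip(ref, hyp))

    ref_labels = sorted({label for r in ref for label in r})
    hyp_labels = sorted({label for h in hyp for label in h})
    # Pad with None so a hypothesis speaker may stay unmatched.
    candidates = ref_labels + [None] * len(hyp_labels)

    best_correct = 0
    for perm in permutations(candidates, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        correct = sum(len({mapping[x] for x in h} & r) for r, h in zip(ref, hyp))
        best_correct = max(best_correct, correct)

    confusion = sum(min(len(r), len(h)) for r, h in zip(ref, hyp)) - best_correct
    return {
        "miss": miss,
        "false_alarm": false_alarm,
        "confusion": confusion,
        "der": (miss + false_alarm + confusion) / total,
    }


# Hypothetical 10-second session: the system switches speakers one second late.
reference = [(0.0, 5.0, "tutor"), (5.0, 10.0, "student")]
hypothesis = [(0.0, 6.0, "A"), (6.0, 10.0, "B")]
result = frame_der(reference, hypothesis, duration=10.0)
```

Running the same function on the audio-only output and on the fused output for each held-out session gives the paired comparison the brief asks for.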

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

Multimodal Machine Learning

Master · AI/ML

Fit score: 1

Skills

Skills you'll demonstrate.

Each one shows up on your verified credential.

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

Career paths this builds toward

Canonical roles

Machine Learning Engineer
AI Engineering

ML Researcher

Fusing audio + video for diarization with honest DER evaluation is the applied multimodal research that edtech and conferencing AI teams hire for.

This challenge sharpens

audio-visual-fusion
speaker-diarization
evaluation

Applied AI Scientist

Translating open research models into a working multimodal pipeline with a demo and handoff is core applied-AI-scientist work at AI-first startups.

This challenge sharpens

audio-visual-fusion
active-speaker-detection
pytorch

Machine Learning Engineer

Shipping a production-shaped AV pipeline that the platform team adopts is exactly the MLE work that edtech AI teams need on their roadmap.

This challenge sharpens

pytorch
pyannote
speaker-diarization

One more thing

You can put a credential on your CV by Friday.
