Code

Build an Audio-Visual Speaker Diarization Pipeline

Free · Verified credential · 4 weeks · Advanced

Overview

What this challenge is about.

Build the pipeline: face detection and active-speaker detection on video, voice-activity detection and speaker embeddings on audio, then a fusion step that ties audio speaker tracks to detected faces. Use open models (RetinaFace + TalkNet for the audio-visual side, ECAPA-TDNN or pyannote for audio). Evaluate on a 50-session held-out test set with manual ground-truth labels. Report Diarization Error Rate (DER) and per-speaker accuracy against an audio-only baseline. Deliver a working web demo for one session. Write a 4-page handoff doc for the platform team.
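As an illustration of the fusion step, here is a minimal sketch that maps audio speaker clusters to face tracks by total temporal overlap. It assumes the audio side has already produced `(start, end, cluster_id)` segments and the video side has produced `(start, end, track_id)` intervals in which a face was judged to be actively speaking. The tuple layout and the function names `overlap` and `map_clusters_to_faces` are hypothetical and are not part of RetinaFace, TalkNet, or pyannote.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def map_clusters_to_faces(audio_segments, active_face_intervals):
    """Assign each audio cluster to the face track it overlaps with most.

    audio_segments: iterable of (start, end, cluster_id)   # assumed format
    active_face_intervals: iterable of (start, end, track_id)  # assumed format
    Clusters with no overlapping active face are left out of the result.
    """
    # Accumulate overlap seconds for every (cluster, track) pair.
    totals = defaultdict(float)
    for a_start, a_end, cluster in audio_segments:
        for f_start, f_end, track in active_face_intervals:
            totals[(cluster, track)] += overlap(a_start, a_end, f_start, f_end)

    # Keep the track with the largest accumulated overlap per cluster.
    best = {}
    for (cluster, track), seconds in totals.items():
        if seconds > 0 and seconds > best.get(cluster, (None, 0.0))[1]:
            best[cluster] = (track, seconds)
    return {cluster: track for cluster, (track, _) in best.items()}


audio = [(0.0, 4.0, "spk0"), (4.0, 9.0, "spk1"), (9.0, 12.0, "spk0")]
faces = [(0.5, 3.5, "face_A"), (4.2, 8.8, "face_B"), (9.5, 11.0, "face_A")]
print(map_clusters_to_faces(audio, faces))
```

This is a greedy per-cluster assignment, so two clusters can end up on the same face. A real pipeline might instead solve a one-to-one assignment or fuse scores at the frame level; which granularity works best is part of what the challenge asks you to find out.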

Credential: Blockchain-anchored
Shareable: LinkedIn-ready
Language: English
Pace: Self-paced

The Brief

What you'll do, and what you'll demonstrate.

Cut Diarization Error Rate on tutoring sessions by fusing audio and video, and show the improvement over an audio-only baseline on a 50-session held-out test set.

Earning criteria — what you'll demonstrate

  • Combine audio and video modalities at the right granularity
  • Apply active-speaker detection to disambiguate similar voices
  • Evaluate diarization with standard metrics: Diarization Error Rate (DER) and Jaccard Error Rate (JER)
  • Hand off a multimodal pipeline to a non-ML platform team
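For the DER criterion above, the metric is conventionally the sum of missed speech, false-alarm speech, and speaker confusion, divided by total reference speech time. The sketch below computes a simplified frame-level version. It assumes one label per fixed-length frame, at most one speaker per frame (no overlapped speech), no forgiveness collar, and hypothesis labels that have already been mapped onto the reference speaker names. The function name `frame_der` is hypothetical; a dedicated library such as pyannote.metrics handles the optimal speaker mapping, overlap, and collars that this sketch leaves out.

```python
def frame_der(reference, hypothesis):
    """Simplified frame-level Diarization Error Rate.

    reference, hypothesis: equal-length sequences with one entry per frame,
    either a speaker label or None for non-speech (assumed format).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must cover the same frames")

    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1        # speech in reference, silence in hypothesis
            elif hyp != ref:
                confusion += 1     # speech attributed to the wrong speaker
        elif hyp is not None:
            false_alarm += 1       # silence in reference, speech in hypothesis

    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / speech


ref = ["A", "A", "A", None, "B", "B", "B", "B", None, None]
hyp = ["A", "A", None, None, "B", "A", "B", "B", "B", None]
print(frame_der(ref, hyp))  # 1 miss + 1 false alarm + 1 confusion over 7 speech frames
```

Running the same function on the audio-only and the fused outputs for each held-out session gives the paired comparison the brief asks for.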

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

Skills

Skills you'll demonstrate.

Each one shows up on your verified credential.

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

ML Researcher

Fusing audio + video for diarization with honest DER evaluation is the applied multimodal research that edtech and conferencing AI teams hire for.

This challenge sharpens

  • audio-visual-fusion
  • speaker-diarization
  • evaluation

Applied AI Scientist

Translating open research models into a working multimodal pipeline with a demo and handoff is core applied-AI-scientist work at AI-first startups.

This challenge sharpens

  • audio-visual-fusion
  • active-speaker-detection
  • pytorch

Machine Learning Engineer

Shipping a production-shaped AV pipeline that the platform team adopts is exactly the MLE work that edtech AI teams need on their roadmap.

This challenge sharpens

  • pytorch
  • pyannote
  • speaker-diarization

One more thing

You can put a credential on your CV by Friday.