Approximate Inference for a Topic Model on Customer Tickets
Overview
What this challenge is about.
You receive 180,000 tickets (subject + body) spanning the last 18 months. Preprocess them into a bag-of-words representation with sensible stopwords and bigrams. Fit a 20-topic LDA model via stochastic variational inference (SVI) and via collapsed Gibbs sampling. Compare the two on (a) wall-clock training time, (b) held-out per-word perplexity on a 10 percent test split, and (c) topic stability across two consecutive weekly snapshots, measured by best-matching topic-word Jaccard overlap. Wrap the winner in a Monday-morning refresh job and write a 1-page note explaining why the topics had previously drifted.
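One possible reading of the stability metric in (c) is sketched below. It assumes each topic is summarized by its top-N words and that "best-matching" means each topic from the earlier snapshot is paired with whichever later topic overlaps most (a greedy match, not a one-to-one assignment). The function names and the toy word lists are illustrative, not part of the challenge materials.

```python
from typing import List, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match_stability(
    prev_topics: Sequence[Sequence[str]],
    curr_topics: Sequence[Sequence[str]],
) -> List[float]:
    """For each topic in the earlier snapshot, return its highest Jaccard
    overlap with any topic in the later snapshot.

    Assumption: greedy matching, so two earlier topics may pair with the
    same later topic. A one-to-one matching would need an assignment step.
    """
    curr_sets = [set(t) for t in curr_topics]
    return [
        max((jaccard(set(p), c) for c in curr_sets), default=0.0)
        for p in prev_topics
    ]


# Toy top-word lists for two weekly snapshots (made up for illustration).
week1 = [["refund", "invoice", "charge"], ["login", "password", "reset"]]
week2 = [["password", "login", "sso"], ["refund", "charge", "billing"]]

scores = best_match_stability(week1, week2)
mean_stability = sum(scores) / len(scores)
```

Reporting both the per-topic scores and their mean is useful here: a decent mean can hide one or two topics that changed completely between snapshots.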
The Brief
What you'll do, and what you'll demonstrate.
Compare variational and Gibbs inference for a weekly-refreshed LDA topic model on support tickets, and recommend one with documented trade-offs.
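To make the Gibbs side of the comparison concrete, here is a minimal pure-Python sketch of a collapsed Gibbs sampler for LDA on a toy corpus. It assumes documents are already encoded as lists of integer word ids and uses symmetric priors; the hyperparameter values, function name, and corpus are illustrative assumptions. At 180,000 tickets you would want a vectorized or library implementation rather than this loop.

```python
import random
from typing import List, Tuple


def collapsed_gibbs_lda(
    docs: List[List[int]],
    n_topics: int,
    vocab_size: int,
    n_iters: int = 50,
    alpha: float = 0.1,   # symmetric document-topic prior (assumed value)
    beta: float = 0.01,   # symmetric topic-word prior (assumed value)
    seed: int = 0,
) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (topic_word_counts, doc_topic_counts) after n_iters sweeps."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments: List[List[int]] = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional p(z = j | all other assignments).
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return topic_word, doc_topic


# Toy corpus: word ids 0-2 and 3-5 tend to co-occur (made up for illustration).
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
topic_word, doc_topic = collapsed_gibbs_lda(toy_docs, n_topics=2, vocab_size=6)
```

Each sweep touches every token, which is why Gibbs wall-clock time grows with corpus size, whereas SVI updates from minibatches. That difference is one of the trade-offs the comparison is meant to surface.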
Earning criteria — what you'll demonstrate
- Implement and compare stochastic variational inference vs. collapsed Gibbs sampling
- Measure topic-model quality with held-out perplexity and stability metrics
- Diagnose and explain topic drift in production
- Translate a probabilistic-inference choice into a business-readable note
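Held-out per-word perplexity is the exponential of the negative average log-likelihood per token. The sketch below assumes you already have a log-likelihood estimate for each held-out document. For LDA that quantity is itself approximated (for example, by a variational bound), so both methods should be scored with the same estimator for the comparison to be fair. The function name and the toy numbers are illustrative.

```python
import math
from typing import Sequence


def per_word_perplexity(
    doc_log_likelihoods: Sequence[float],
    doc_token_counts: Sequence[int],
) -> float:
    """exp(-(sum of document log-likelihoods) / (total held-out tokens))."""
    total_tokens = sum(doc_token_counts)
    if total_tokens == 0:
        raise ValueError("held-out set contains no tokens")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)


# Sanity check: a model that spreads probability uniformly over 8 words
# assigns log(1/8) to every token, so its perplexity should equal 8.
token_counts = [4, 6]
log_liks = [n * math.log(1 / 8) for n in token_counts]
ppl = per_word_perplexity(log_liks, token_counts)
```

Lower is better. A uniform-model sanity check like this one is a cheap way to confirm the normalization uses token counts and not document counts.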
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Machine Learning Engineer
Choosing an inference algorithm under real production constraints (weekly refresh, stability, latency) is the kind of MLE judgement call hiring managers look for.
This challenge sharpens
- variational-inference
- python
- model-evaluation
NLP Engineer
Topic modeling on real support text, plus text preprocessing at scale, is core NLP-engineer territory at any product-led SaaS company.
This challenge sharpens
- latent-dirichlet-allocation
- text-processing
- model-evaluation
Data Scientist
Diagnosing why a probabilistic model drifted week-over-week and communicating the fix is exactly what data scientists do when dashboards lose trust.
This challenge sharpens
- approximate-inference
- model-evaluation
- text-processing