Overview
What this challenge is about.
You receive 6 months of crawled public posts (~400,000 posts, with usernames already removed) and access to a UMLS API for normalisation. Build a pipeline that (1) extracts symptom mentions with scispaCy, (2) normalises them to UMLS CUIs, (3) aggregates mentions weekly per symptom, and (4) detects anomalies (Prophet, or a simple z-score over rolling baselines). Then build a small Streamlit dashboard showing the top anomalies per week, and write a 4-page methodology and ethics memo covering the data source, anonymisation, and what the signal does and does not say.
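As a rough illustration of step (3), the sketch below counts normalised mentions per symptom and ISO week using only the Python standard library. It assumes the earlier steps have already produced `(post_date, cui)` pairs; the function name `weekly_counts`, the input shape, and the CUI strings are hypothetical placeholders, not part of the challenge materials.

```python
from collections import Counter
from datetime import date


def weekly_counts(mentions):
    """Count mentions per (CUI, ISO year, ISO week).

    `mentions` is an iterable of (post_date, cui) pairs, assumed to come
    from the extraction and normalisation steps (hypothetical input shape).
    """
    counts = Counter()
    for post_date, cui in mentions:
        iso = post_date.isocalendar()  # (ISO year, ISO week, ISO weekday)
        counts[(cui, iso[0], iso[1])] += 1
    return counts


# Placeholder CUIs for illustration only; real values come from UMLS.
mentions = [
    (date(2024, 1, 1), "C0000001"),  # Monday of ISO week 1, 2024
    (date(2024, 1, 3), "C0000001"),  # same ISO week
    (date(2024, 1, 8), "C0000001"),  # ISO week 2
    (date(2024, 1, 8), "C0000002"),
]
counts = weekly_counts(mentions)
```

Keying on the ISO year as well as the ISO week avoids merging the same week number across different years. Whether to count raw mentions or distinct posts per symptom is a design choice worth stating in the memo.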
The Brief
What you'll do, and what you'll demonstrate.
Surface statistically meaningful symptom-cluster anomalies from public health-forum posts, backed by a rigorous methodology and ethics framing.
Earning criteria — what you'll demonstrate
- Apply scispaCy and UMLS normalisation to health-related text
- Build a simple anomaly-detection layer on top of weekly aggregates
- Reason about ethics framing for public-data analysis
- Communicate text-mining signals with appropriate caveats
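One way the simple anomaly-detection layer over weekly aggregates might look is a rolling z-score: each week is compared with the mean and sample standard deviation of the preceding weeks. This is a minimal sketch under stated assumptions. The 8-week window, the threshold of 3.0, and the function names are illustrative choices, and the sample counts are made up.

```python
from statistics import mean, stdev


def rolling_zscores(counts, window=8):
    """Z-score of each week against the preceding `window` weeks.

    Returns None for weeks without a full baseline or where the
    baseline has zero variance (the z-score is undefined there).
    """
    scores = []
    for i, value in enumerate(counts):
        baseline = counts[max(0, i - window):i]  # excludes the current week
        if len(baseline) < window:
            scores.append(None)
            continue
        sd = stdev(baseline)
        scores.append(None if sd == 0 else (value - mean(baseline)) / sd)
    return scores


def flag_anomalies(counts, window=8, threshold=3.0):
    """Indices of weeks whose z-score exceeds `threshold` (upward spikes only)."""
    zs = rolling_zscores(counts, window)
    return [i for i, z in enumerate(zs) if z is not None and z > threshold]


# Made-up weekly counts for one symptom, with a spike in the ninth week.
weekly = [10, 12, 11, 9, 10, 13, 11, 12, 40, 11]
flagged = flag_anomalies(weekly, window=8, threshold=3.0)
```

Note that a flagged spike enters the baseline of the following weeks and inflates their standard deviation, which can mask a sustained rise. Excluding flagged weeks from later baselines, or using a median-based baseline, are possible refinements. A flag here means "unusual relative to recent posting volume", not a confirmed change in disease incidence.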
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Career paths this builds toward
Canonical roles
Data Scientist
Text-mining for public-health signals, framed with rigorous ethics, is day-to-day work for data scientists at health-data companies.
This challenge sharpens
- text-mining
- anomaly-detection
- ethics-framing
NLP Engineer
Biomedical NER plus UMLS normalisation is the NLP-engineer skillset that healthtech vendors hire for.
This challenge sharpens
- biomedical-nlp
- umls-normalization
- scispacy
AI Safety Researcher
Framing what a signal does and does not claim is core safety-research work on public-data AI products.
This challenge sharpens
- ethics-framing
- anomaly-detection
- text-mining