Overview
What this challenge is about.
Stand up a Python (pandas + DuckDB) audit notebook that ingests the 14M-record extract. Define and run quality checks across four dimensions: completeness (required-field missingness rates by practice), consistency (alignment between free-text and SNOMED-coded diagnoses), plausibility (vitals within human physiological bounds, patient ages 0-115), and temporal correctness (encounter timestamps after birth and before the extract date, with valid and consistent timezones). Produce per-practice and aggregate scorecards. Deliver a 16-page data-quality report with severity-ranked findings, a per-practice remediation list, and reproducible audit code.
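As a rough illustration of the four check dimensions described above, here is a minimal pandas-only sketch. The brief does not publish the extract's schema, so every column name (`practice_id`, `birth_date`, `encounter_ts`, `diagnosis_text`, `snomed_code`, `heart_rate_bpm`), the two-entry term lookup, and the heart-rate bounds are assumptions made for the example. Timestamps without an offset are treated as UTC, which is also an assumption.

```python
import pandas as pd

# Hypothetical schema: the brief does not name the extract's columns.
REQUIRED = ["birth_date", "encounter_ts", "snomed_code"]
# Illustrative lookup only; verify concept IDs against a licensed SNOMED CT release.
TERM_TO_SNOMED = {"hypertension": "38341003", "asthma": "195967001"}
FLAG_COLS = ["incomplete", "inconsistent", "implausible", "bad_time"]


def flag_records(df: pd.DataFrame, extract_date: str) -> pd.DataFrame:
    """Return a copy of df with one boolean failure flag per quality dimension."""
    out = df.copy()
    # Unparseable dates become NaT; naive timestamps are assumed to be UTC.
    birth = pd.to_datetime(out["birth_date"], errors="coerce", utc=True)
    enc = pd.to_datetime(out["encounter_ts"], errors="coerce", utc=True)
    cutoff = pd.Timestamp(extract_date, tz="UTC")

    # Completeness: any required field is null.
    out["incomplete"] = out[REQUIRED].isna().any(axis=1)

    # Consistency: free text maps to a known concept that differs from the coded one.
    expected = out["diagnosis_text"].str.strip().str.lower().map(TERM_TO_SNOMED)
    coded = out["snomed_code"]
    out["inconsistent"] = expected.notna() & coded.notna() & (expected != coded)

    # Plausibility: age at encounter outside 0-115 years, or heart rate out of bounds.
    age_years = (enc - birth).dt.days / 365.25
    bad_age = age_years.notna() & ~age_years.between(0, 115)
    hr = out["heart_rate_bpm"]
    bad_hr = hr.notna() & ~hr.between(20, 300)  # assumed bounds, not clinical guidance
    out["implausible"] = bad_age | bad_hr

    # Temporal: encounter precedes birth or follows the extract date.
    out["bad_time"] = (enc < birth) | (enc > cutoff)
    return out


def scorecard(flags: pd.DataFrame) -> pd.DataFrame:
    """Failure rate per practice for each dimension (mean of the boolean flags)."""
    return flags.groupby("practice_id")[FLAG_COLS].mean()
```

Calling `scorecard(flag_records(df, "2024-01-01"))` on a frame with these columns yields one row per practice and one failure rate per dimension. An encounter dated before birth is flagged under both plausibility (negative age) and temporal correctness in this sketch; a real audit would likely decide which dimension owns that finding.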
The Brief
What you'll do, and what you'll demonstrate.
Audit 14M de-identified EHR records across completeness, consistency, plausibility, and temporal correctness, and publish a severity-ranked report with remediation guidance.
Earning criteria — what you'll demonstrate
- Design data-quality checks across multiple dimensions on real EHR data
- Map free-text clinical fields against SNOMED CT codings
- Communicate data-quality findings to a non-engineering audience
- Make scientific audits reproducible without leaking patient identifiers
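One way to keep audit outputs reproducible without exposing raw identifiers is to replace them with a keyed hash before anything leaves the notebook. The brief does not prescribe a method, so the sketch below is an assumption: the helper name `pseudonymize`, the 16-character truncation, and the idea of a project-held secret key are all illustrative.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Return a stable keyed hash (HMAC-SHA256) of an identifier.

    The same identifier and key always give the same token, so reruns of the
    audit line up, while the raw identifier never appears in the output.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    # Truncated to 16 hex characters for readable scorecards (an assumed choice).
    return digest.hexdigest()[:16]
```

The key would be stored outside the published code and report; anyone rerunning the audit with the same key gets identical tokens, and a different key gives unlinkable ones.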
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Product Manager
Product managers in HealthTech who understand the realities of EHR data quality ship realistic roadmaps instead of demos that break on their first day against production data.
This challenge sharpens
- health-informatics
- data-quality
- snomed-ct