Train a VAE for Synthetic Tabular Data at a Healthtech Startup
Overview
What this challenge is about.
You receive a synthetic-but-realistic clinical-trial table (around 50,000 patients, 35 columns, mixed continuous and categorical). Train a tabular VAE (or TVAE/CTGAN as alternatives). Evaluate utility via downstream-model fidelity: train on synthetic data, test on real data, and compare the result to a model trained and tested on real data. Evaluate privacy with a basic membership-inference attack. Tune the privacy/utility trade-off and recommend a release setting. The deliverables are the trained model, the evaluation report, and a 3-page data-sharing decision memo.
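The utility check described above (train on synthetic, test on real, compare to real-on-real) can be sketched as follows. This is a minimal illustration, not the challenge's reference implementation. The helper names (`fit_logreg`, `tstr_report`, `make_data`), the toy data, and the choice of a small NumPy logistic regression as the downstream model are all assumptions made for the example. Any classifier and any metric could be substituted.

```python
import numpy as np

def fit_logreg(X, y, lr=0.1, epochs=500):
    """Gradient-descent logistic regression; a stand-in for any downstream model."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))          # predicted probabilities
        w -= lr * Xb.T @ (p - y) / len(y)          # mean log-loss gradient step
    return w

def accuracy(w, X, y):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean(((Xb @ w) > 0).astype(int) == y))

def tstr_report(real_train, real_test, synthetic):
    """Each argument is an (X, y) tuple; both models are scored on the same real test set."""
    trtr = accuracy(fit_logreg(*real_train), *real_test)  # train real, test real
    tstr = accuracy(fit_logreg(*synthetic), *real_test)   # train synthetic, test real
    return {"trtr": trtr, "tstr": tstr, "gap": trtr - tstr}

# Toy data (hypothetical): the "synthetic" set mimics a generator with noisier labels.
rng = np.random.default_rng(0)

def make_data(n, noise):
    X = rng.normal(size=(n, 5))
    logits = X[:, 0] - 0.5 * X[:, 1]
    y = (logits + rng.normal(scale=noise, size=n) > 0).astype(int)
    return X, y

real_train = make_data(2000, 0.5)
real_test = make_data(1000, 0.5)
synthetic = make_data(2000, 1.0)

report = tstr_report(real_train, real_test, synthetic)
print(report)
```

A small `gap` suggests the synthetic data preserves the signal the downstream task needs. The gap is only meaningful when both models are evaluated on the same held-out real rows, and those rows must not have been used to train the generator.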
The Brief
What you'll do, and what you'll demonstrate.
Train a VAE-based synthetic data generator that meets utility and privacy thresholds acceptable for academic data-sharing.
Earning criteria — what you'll demonstrate
- Adapt VAE training to mixed-type tabular data
- Evaluate synthetic data utility via downstream-model fidelity
- Implement and interpret a membership-inference attack
- Reason about the privacy/utility trade-off for real data-sharing decisions
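One simple way to approach the membership-inference criterion is a distance-to-closest-record attack: a record is guessed to be a training member if some synthetic row lies unusually close to it. The sketch below is one possible baseline, not the attack the challenge prescribes. The function names and toy data are hypothetical, and it assumes all features are already numeric and on comparable scales (for example, standardized continuous columns and one-hot encoded categoricals).

```python
import numpy as np

def min_distance_to_synthetic(records, synthetic):
    """Euclidean distance from each record to its nearest synthetic row."""
    diffs = records[:, None, :] - synthetic[None, :, :]   # (n_records, n_synth, d)
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

def attack_auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    m = member_scores[:, None]
    n = nonmember_scores[None, :]
    return float(np.mean((m > n) + 0.5 * (m == n)))

def membership_inference_auc(train, holdout, synthetic):
    # Attack score: closer to a synthetic row means "more likely a training member".
    member_scores = -min_distance_to_synthetic(train, synthetic)
    nonmember_scores = -min_distance_to_synthetic(holdout, synthetic)
    return attack_auc(member_scores, nonmember_scores)

# Toy data (hypothetical): one generator nearly copies training rows, one does not.
rng = np.random.default_rng(0)
train = rng.normal(size=(300, 4))      # rows the generator was trained on
holdout = rng.normal(size=(300, 4))    # rows the generator never saw
leaky = train + rng.normal(scale=0.01, size=train.shape)
fresh = rng.normal(size=(300, 4))

auc_leaky = membership_inference_auc(train, holdout, leaky)
auc_fresh = membership_inference_auc(train, holdout, fresh)
print(f"leaky generator AUC: {auc_leaky:.3f}")
print(f"fresh generator AUC: {auc_fresh:.3f}")
```

An attack AUC near 0.5 means this particular attack cannot tell members from non-members, while values approaching 1.0 indicate the generator is memorizing training rows. A low AUC against one basic attack is evidence, not a privacy guarantee, which is worth stating in the decision memo.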
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Research Scientist
Synthetic-data work with formal utility and privacy evaluation is a strong portfolio piece for any privacy-ML or generative research role.
This challenge sharpens
- vae
- synthetic-data
- privacy-evaluation
ML Researcher
Tabular VAEs and their utility/privacy trade-offs are an active research area; this challenge produces a credible first artifact that can be developed toward publication.
This challenge sharpens
- vae
- tabular-generation
- utility-evaluation
AI Safety Researcher
Privacy evaluation and membership-inference attacks are standard methods for auditing what a trained model reveals about its training data, a recurring concern in AI safety research.
This challenge sharpens
- privacy-evaluation
- synthetic-data
- vae