Variational Autoencoder for Synthetic Tabular Banking Data
Overview
What this challenge is about.
You receive a 500K-row anonymized transaction dataset with 25 columns, a mix of numerical and categorical. Train a VAE (TabVAE or a small custom model) with a likelihood suited to each column type, then generate a 500K-row synthetic dataset. Evaluate utility via the 'train-on-synthetic, test-on-real' (TSTR) accuracy of a downstream gradient-boosted classifier that predicts a held-out fraud label. Evaluate privacy via membership inference attack (MIA) AUC and nearest-neighbor distance ratio. Compare the VAE to the histogram baseline on both axes and recommend, in a 2-page report, whether the VAE is good enough to ship to partners.
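As a rough illustration of what a likelihood per column type can look like, the sketch below computes a per-row negative ELBO with a Gaussian likelihood for numerical columns and a categorical (softmax) likelihood for categorical ones. It is written in plain Python rather than PyTorch so the arithmetic stays visible. The Gaussian/categorical pairing, the `schema` layout, the toy column names, and all function names are assumptions made for illustration, not part of the challenge materials.

```python
import math

def gaussian_nll(x, mean, log_var):
    """Negative log-density of x under N(mean, exp(log_var))."""
    return 0.5 * (math.log(2 * math.pi) + log_var + (x - mean) ** 2 / math.exp(log_var))

def categorical_nll(index, logits):
    """Negative log-probability of category `index` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[index]

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)) for a diagonal Gaussian posterior."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def negative_elbo(row, decoder_out, schema, mu, log_var):
    """Sum of per-column reconstruction NLLs plus the KL term, for one row."""
    recon = 0.0
    for name, kind in schema.items():
        if kind == "numeric":
            mean, lv = decoder_out[name]  # decoder emits (mean, log-variance)
            recon += gaussian_nll(row[name], mean, lv)
        else:  # "categorical": the row stores an integer category index
            recon += categorical_nll(row[name], decoder_out[name])
    return recon + kl_to_standard_normal(mu, log_var)

# Hypothetical two-column example: one numeric amount, one 3-way category.
schema = {"amount": "numeric", "channel": "categorical"}
row = {"amount": 0.3, "channel": 2}
decoder_out = {"amount": (0.0, 0.0), "channel": [0.1, 0.2, 1.5]}
loss = negative_elbo(row, decoder_out, schema, mu=[0.0, 0.0], log_var=[0.0, 0.0])
```

In a real model the same sum would be computed on batched tensors, with the decoder producing one output head per column.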
The Brief
What you'll do, and what you'll demonstrate.
Train a VAE on banking transactions and demonstrate that it generates synthetic data that is more useful than, and at least as private as, a histogram baseline.
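The brief does not define the histogram baseline. One plausible reading, assumed here, is a generator that samples every column independently from its own empirical histogram, which preserves the marginals and discards all cross-column structure. The sketch below follows that assumption; the function names, the `schema` layout, and the choice of 20 equal-width bins are hypothetical.

```python
import random
from collections import Counter

def fit_histogram_baseline(rows, schema, n_bins=20):
    """Fit one independent histogram per column (equal-width bins for numerics)."""
    model = {}
    for name, kind in schema.items():
        values = [r[name] for r in rows]
        if kind == "numeric":
            lo, hi = min(values), max(values)
            width = (hi - lo) / n_bins
            counts = [0] * n_bins
            for v in values:
                # A constant column (width 0) puts everything in bin 0.
                idx = 0 if width == 0 else min(int((v - lo) / width), n_bins - 1)
                counts[idx] += 1
            model[name] = ("numeric", lo, width, counts)
        else:
            freq = Counter(values)
            model[name] = ("categorical", list(freq.keys()), list(freq.values()))
    return model

def sample_histogram_baseline(model, n_rows, seed=0):
    """Draw rows by sampling each column independently from its histogram."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_rows):
        row = {}
        for name, spec in model.items():
            if spec[0] == "numeric":
                _, lo, width, counts = spec
                b = rng.choices(range(len(counts)), weights=counts)[0]
                row[name] = lo + (b + rng.random()) * width  # uniform within the bin
            else:
                _, cats, weights = spec
                row[name] = rng.choices(cats, weights=weights)[0]
        out.append(row)
    return out
```

Because columns are sampled independently, any relationship between, say, amount and fraud label is lost, which is the gap the VAE is expected to close on the utility axis.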
Earning criteria — what you'll demonstrate
- Build and train a VAE with per-column likelihoods on mixed-type tabular data
- Apply utility metrics (TSTR) and privacy metrics (MIA AUC, nearest-neighbor distance ratio) to evaluate synthetic data
- Reason about the privacy/utility trade-off in generative models
- Communicate generative-model results to a non-ML data-sharing committee
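The utility and privacy metrics named in these criteria can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: rows are already encoded as numeric tuples on comparable scales, `fit` and `predict` are hypothetical callables standing in for the gradient-boosted classifier, the membership inference attack is a simple distance-to-nearest-synthetic-row attack (one of several possible attacks), and the nearest-neighbor distance ratio is taken as nearest distance divided by second-nearest distance. None of these choices is prescribed by the challenge.

```python
import math

def nearest_distances(query, reference, k=2):
    """Sorted Euclidean distances from `query` to its k nearest rows in `reference`."""
    return sorted(math.dist(query, r) for r in reference)[:k]

def tstr_accuracy(fit, predict, synth_X, synth_y, real_X, real_y):
    """Train on synthetic rows, then measure accuracy on real held-out rows."""
    model = fit(synth_X, synth_y)
    preds = predict(model, real_X)
    return sum(p == y for p, y in zip(preds, real_y)) / len(real_y)

def auc(member_scores, non_member_scores):
    """Chance that a random member outscores a random non-member (ties count half)."""
    wins = sum((m > n) + 0.5 * (m == n)
               for m in member_scores for n in non_member_scores)
    return wins / (len(member_scores) * len(non_member_scores))

def mia_auc(synthetic, train_rows, holdout_rows):
    """Distance-based attack: rows closer to the synthetic data are guessed members."""
    def score(row):
        return -nearest_distances(row, synthetic, k=1)[0]
    return auc([score(r) for r in train_rows], [score(r) for r in holdout_rows])

def nndr(synthetic, train_rows):
    """Per synthetic row: distance to nearest real row / distance to second-nearest."""
    ratios = []
    for s in synthetic:
        d1, d2 = nearest_distances(s, train_rows, k=2)
        # Both distances zero means an exact copy; 0.0 here is an arbitrary
        # convention chosen to flag it as the worst case.
        ratios.append(d1 / d2 if d2 > 0 else 0.0)
    return ratios

# Toy data: the synthetic rows sit almost on top of the training rows.
synthetic = [(0.0, 0.0), (1.0, 1.0)]
train_rows = [(0.0, 0.1), (1.0, 0.9)]
holdout_rows = [(5.0, 5.0), (6.0, 6.0)]
leak = mia_auc(synthetic, train_rows, holdout_rows)
ratios = nndr(synthetic, train_rows)
```

An MIA AUC near 0.5 means the attacker cannot tell training rows from held-out rows. Ratios near 0 mean a synthetic row sits much closer to one real record than to any other, which is worth flagging in the report. The pairwise loops are quadratic, so at 500K rows an indexed nearest-neighbor search or a subsample would be needed.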
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
ML Researcher
Designing a privacy-aware generative model with rigorous utility/privacy evaluation is the kind of project that opens doors at applied-research teams in finance, health, and government.
This challenge sharpens
- variational-inference
- deep-generative-models
- synthetic-data
Applied AI Scientist
Trading off privacy and utility on real banking data is the day-to-day reality of applied AI scientists at startups in regulated industries.
This challenge sharpens
- deep-generative-models
- synthetic-data
- privacy-evaluation
Machine Learning Engineer
Productionizing a VAE training and evaluation pipeline that another engineer can rerun is core MLE craft.
This challenge sharpens
- pytorch
- tabular-data
- synthetic-data