Variational Autoencoder for Synthetic Tabular Banking Data

Free · Verified credential · 2 weeks · Advanced

Overview

What this challenge is about.

You receive a 500K-row anonymized transaction dataset with 25 columns (a mix of numerical and categorical). Train a VAE (TabVAE or a small custom model) with a likelihood matched to each column type. Generate a 500K-row synthetic dataset. Evaluate utility by 'train-on-synthetic, test-on-real' (TSTR) accuracy: train a gradient-boosted classifier on the synthetic data and test it on real data, predicting a held-out fraud label. Evaluate privacy by membership inference attack (MIA) AUC and nearest-neighbor distance ratio. Compare the VAE with the histogram baseline on both axes, then recommend in a 2-page report whether it is good enough to ship to partners.
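One way to read "likelihoods per column type" is as a reconstruction term that sums a Gaussian negative log-likelihood over the numerical columns and a categorical cross-entropy over each categorical column, plus the usual KL term. The sketch below computes that loss in NumPy for one batch. It is an illustration under assumptions, not the challenge's reference implementation: the function names, the Gaussian choice for numerical columns, and the standard-normal prior are all assumed here, and a trainable version would use an autograd framework.

```python
import numpy as np

def gaussian_nll(x, mean, log_var):
    """Negative log-likelihood of x under N(mean, exp(log_var)), summed over columns."""
    per_column = log_var + (x - mean) ** 2 / np.exp(log_var) + np.log(2 * np.pi)
    return 0.5 * np.sum(per_column, axis=1)

def categorical_nll(labels, logits):
    """Cross-entropy of integer labels under softmax(logits), one value per row."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, exp(log_var)) || N(0, I)), one value per row."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)

def negative_elbo(x_num, x_cat, dec_mean, dec_log_var, dec_logits, z_mu, z_log_var):
    """Mean negative ELBO over a batch.

    x_cat is a list of integer label arrays and dec_logits a matching list of
    (batch, n_categories) arrays: one pair per categorical column (assumed layout).
    """
    recon = gaussian_nll(x_num, dec_mean, dec_log_var)
    for labels, logits in zip(x_cat, dec_logits):
        recon = recon + categorical_nll(labels, logits)
    return float(np.mean(recon + kl_to_standard_normal(z_mu, z_log_var)))
```

Keeping the two likelihoods as separate functions makes it easy to swap in a different choice for a given column, for example a mixture model for heavy-tailed amounts.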

Credential: Blockchain-anchored

Shareable: LinkedIn-ready

Language: English

Pace: Self-paced

The Brief

What you'll do, and what you'll demonstrate.

Train a VAE on banking transactions and demonstrate that it generates synthetic data that is more useful than, and at least as private as, a histogram baseline.
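The brief does not define the histogram baseline. A common reading, assumed here, is a generator that samples every column independently from its empirical marginal: binned histograms for numerical columns, value frequencies for categorical ones. The sketch below follows that reading. The function names, the dict-of-arrays input, and the default of 20 bins are hypothetical choices for illustration.

```python
import numpy as np

def fit_histogram_baseline(columns, n_bins=20):
    """Fit one independent marginal per column of a {name: array} table."""
    model = {}
    for name, values in columns.items():
        values = np.asarray(values)
        if values.dtype.kind in "fi":  # float or int dtype: treated as numerical
            counts, edges = np.histogram(values, bins=n_bins)
            model[name] = ("num", edges, counts / counts.sum())
        else:  # anything else (e.g. strings): treated as categorical
            cats, counts = np.unique(values, return_counts=True)
            model[name] = ("cat", cats, counts / counts.sum())
    return model

def sample_histogram_baseline(model, n_rows, seed=0):
    """Draw n_rows synthetic rows, sampling each column independently."""
    rng = np.random.default_rng(seed)
    out = {}
    for name, (kind, support, probs) in model.items():
        idx = rng.choice(len(probs), size=n_rows, p=probs)
        if kind == "num":
            # pick a bin by its frequency, then a uniform point inside it
            out[name] = rng.uniform(support[idx], support[idx + 1])
        else:
            out[name] = support[idx]
    return out
```

Because the columns are sampled independently, this baseline preserves marginals but discards every cross-column relationship, which is why a VAE has room to beat it on utility. Note that integer-coded categorical columns would be treated as numerical here and would need to be cast to strings first.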

Earning criteria — what you'll demonstrate

Build and train a VAE with per-column likelihoods on mixed-type tabular data
Apply utility metrics (TSTR) and privacy metrics (MIA) to evaluate synthetic data
Reason about the privacy/utility trade-off in generative models
Communicate generative-model results to a non-ML data-sharing committee
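The two privacy metrics named in this challenge can be sketched with plain nearest-neighbor distances. The code below assumes rows are already numeric (for example standardized numerical columns and one-hot categoricals), defines the nearest-neighbor distance ratio as the distance from a synthetic row to its closest real row divided by the distance to its second-closest, and uses a simple distance-based membership inference attack: a real row is scored as more likely a training member the closer it sits to some synthetic row. These definitions and function names are assumptions for illustration; the challenge may specify different ones.

```python
import numpy as np

def nearest_distances(queries, reference, k=1):
    """Euclidean distances from each query row to its k nearest reference rows (ascending)."""
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return np.sort(dists, axis=1)[:, :k]

def nn_distance_ratio(synthetic, real):
    """Per synthetic row: distance to nearest real row / distance to second nearest."""
    d = nearest_distances(synthetic, real, k=2)
    return d[:, 0] / np.maximum(d[:, 1], 1e-12)

def auc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count half)."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def distance_mia_auc(train, holdout, synthetic):
    """AUC of an attack that guesses 'member' for rows close to the synthetic data."""
    member_scores = -nearest_distances(train, synthetic)[:, 0]
    nonmember_scores = -nearest_distances(holdout, synthetic)[:, 0]
    return auc(member_scores, nonmember_scores)
```

An attack AUC near 0.5 means the attacker cannot tell training rows from held-out rows, while ratios near 0 flag synthetic rows that sit almost on top of a single real record. The pairwise distance matrix here is quadratic in memory, so on 500K rows it would need subsampling or an approximate nearest-neighbor index.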

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

Probabilistic Machine Learning

Master · AI/ML

Fit score: 1

Skills

Skills you'll demonstrate.

Each one shows up on your verified credential.

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

Career paths this builds toward

Canonical roles

Machine Learning Engineer
AI Engineering

ML Researcher

Designing a privacy-aware generative model with rigorous utility/privacy evaluation is the kind of project that opens doors at applied-research teams in finance, health, and government.

This challenge sharpens

variational-inference
deep-generative-models
synthetic-data

Applied AI Scientist

Trading off privacy and utility on real banking data is the day-to-day reality of applied AI scientists at regulated startups.

This challenge sharpens

deep-generative-models
synthetic-data
privacy-evaluation

Machine Learning Engineer

Productionizing a VAE training + evaluation pipeline that another engineer can rerun is core MLE craft.

This challenge sharpens

pytorch
tabular-data
synthetic-data

One more thing

You can put a credential on your CV by Friday.
