Predict Catalyst Properties for a Green-Hydrogen Pharma Spinout
Overview
What this challenge is about.
Use an open catalyst dataset (for example, a subset of the Open Catalyst Project or a pull from the Materials Project) in which each candidate has descriptors and a target activity property. Train a tabular model (CatBoost, XGBoost, or a small graph neural network if you are feeling ambitious), plus a companion quantile-regression model for uncertainty. Evaluate root mean squared error (RMSE), the calibration of the predicted intervals, and ranking quality (precision-at-10 against the true top performers). Wrap the model in a CSV-in, ranked-CSV-out command-line tool that the bench team can run unsupervised. Deliver a two-page memo for the next prioritization meeting.
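The three evaluation quantities named above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed implementation: the function names and toy numbers are hypothetical, and it assumes that a higher activity value means a better candidate.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted activity."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def interval_coverage(y_true, lower, upper):
    """Fraction of measured values that fall inside their predicted interval.
    For a nominal 80% interval, a well-calibrated model scores close to 0.8."""
    hits = sum(lo <= t <= hi for t, lo, hi in zip(y_true, lower, upper))
    return hits / len(y_true)

def precision_at_k(y_true, y_score, k=10):
    """Share of the model's top-k candidates that are also in the true top-k.
    Assumes higher values are better; flip the sign first if lower is better."""
    def top_k(values):
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        return set(order[:k])
    return len(top_k(y_true) & top_k(y_score)) / k

# Toy numbers for illustration only (not real catalyst data).
activity = [0.9, 0.1, 0.5, 0.7, 0.3]
predicted = [0.8, 0.2, 0.6, 0.4, 0.3]
lower = [p - 0.2 for p in predicted]
upper = [p + 0.2 for p in predicted]

print("RMSE:", rmse(activity, predicted))
print("Coverage:", interval_coverage(activity, lower, upper))
print("Precision@2:", precision_at_k(activity, predicted, k=2))
```

Reporting coverage next to the nominal interval level makes miscalibration visible at a glance, and precision-at-k speaks directly to the question the bench team cares about: how many of the suggested candidates are truly among the best.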
The Brief
What you'll do, and what you'll demonstrate.
Ship a ranking tool that prioritizes catalyst candidates for synthesis using a calibrated ML model.
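One possible shape for such a tool is sketched below using only the Python standard library. Everything specific here is an assumption made for illustration: the column names `candidate_id` and `descriptor_1`, the output column names, and above all `stub_predict`, which is a hypothetical stand-in for the trained model and its quantile companion.

```python
import argparse
import csv
import sys

def stub_predict(row):
    """Hypothetical placeholder for the trained models.
    Returns (lower, median, upper) predicted activity for one candidate."""
    x = float(row["descriptor_1"])
    return x - 0.5, x, x + 0.5

def rank_candidates(rows, predict=stub_predict):
    """Attach predictions to each row and sort best-first by predicted activity."""
    scored = []
    for row in rows:
        lower, median, upper = predict(row)
        scored.append({**row, "pred_lower": lower,
                       "pred_activity": median, "pred_upper": upper})
    scored.sort(key=lambda r: r["pred_activity"], reverse=True)
    for rank, r in enumerate(scored, start=1):
        r["rank"] = rank
    return scored

def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Rank catalyst candidates from a CSV of descriptors.")
    parser.add_argument("input_csv", help="CSV with one candidate per row")
    parser.add_argument("output_csv", help="where to write the ranked CSV")
    args = parser.parse_args(argv)

    with open(args.input_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        sys.exit("No candidates found in the input CSV.")

    ranked = rank_candidates(rows)
    # Put the rank first so the most important column is the first one read.
    fieldnames = ["rank", *rows[0].keys(),
                  "pred_lower", "pred_activity", "pred_upper"]
    with open(args.output_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(ranked)
    return 0
```

To run it as a script, add the usual `if __name__ == "__main__": raise SystemExit(main())` guard at the bottom and invoke it with an input path and an output path. Writing the interval bounds next to each prediction lets the bench team see how confident the model is about each candidate without opening a notebook.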
Earning criteria — what you'll demonstrate
- Apply tabular ML to a real scientific dataset with chemistry descriptors
- Quantify and calibrate predictive uncertainty for ranking decisions
- Translate a model into a tool a non-ML bench scientist can run
- Communicate ranking quality with metrics chemists understand
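For the calibration criterion, one common option is a split-conformal adjustment of the quantile model's intervals (often called conformalized quantile regression). The brief does not prescribe this technique; the sketch below is an assumption about how calibration might be approached, with hypothetical function names and a held-out calibration set assumed to exist.

```python
import math

def conformal_margin(y_cal, lower_cal, upper_cal, alpha=0.1):
    """Margin to add to each side of the raw intervals, estimated on a
    held-out calibration set, targeting roughly (1 - alpha) coverage."""
    # Score = how far each measured value falls outside its interval
    # (negative when it sits comfortably inside).
    scores = sorted(max(lo - y, y - hi)
                    for y, lo, hi in zip(y_cal, lower_cal, upper_cal))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        # Too few calibration points to support this coverage level.
        return math.inf
    return scores[rank - 1]

def calibrate_intervals(lower, upper, margin):
    """Widen (or, for a negative margin, shrink) every interval by the margin."""
    return [lo - margin for lo in lower], [hi + margin for hi in upper]

# Toy calibration data for illustration only.
y_cal = [10, 10, 10, 10]
lower_cal = [8, 8, 8, 8]
upper_cal = [9, 11, 7, 12]
margin = conformal_margin(y_cal, lower_cal, upper_cal, alpha=0.5)
print(calibrate_intervals([0, 5], [1, 6], margin))
```

The calibration set must be separate from the training data, otherwise the margin is estimated on points the model has already seen and the adjusted intervals will tend to be too narrow.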
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Data Scientist
Owning a tabular regression model end-to-end for a domain audience is the daily reality of a data scientist embedded in an R&D team.
This challenge sharpens
- tabular-modeling
- feature-engineering
- ranking-evaluation
Applied AI Scientist
Quantifying uncertainty for ranking decisions in a scientific context is exactly the applied-AI bridge into a chemistry or materials team.
This challenge sharpens
- uncertainty-quantification
- scientific-ml
- ranking-evaluation
Machine Learning Engineer
Packaging the model behind a CLI tool with clean inputs and outputs is the core productionization craft of a machine learning engineer.
This challenge sharpens
- python
- tabular-modeling
- feature-engineering