Build a Small Transformer from Scratch and Train It on Code
Overview
What this challenge is about.
Implement multi-head self-attention, RMSNorm, rotary positional embeddings (RoPE), and a causal LM head from scratch, with no Hugging Face shortcuts for the model code (you may use Hugging Face Tokenizers for BPE). Train on a small code corpus (a subset of The Stack or a curated Python-only dump, around 800M tokens) on a single A100 for one GPU-day. Report training loss curves, evaluation perplexity on a held-out split, and one sample generation per epoch. Write a 4-page learnings note covering one attention-head visualization, one positional-encoding experiment, and one fix for a training instability you encountered.
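As a rough illustration of how the components named above might fit together, here is a minimal PyTorch sketch of RMSNorm, rotary embeddings, and causal multi-head self-attention. It is not a reference solution. The class and function names, the toy sizes (`d_model=64`, `n_heads=4`), the RoPE base of 10000, and the half-split rotation layout are all assumptions made for this example; the feed-forward block and LM head are left out.

```python
import math

import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Divide each feature vector by its root-mean-square, then apply a learned gain."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms


def rope_tables(seq_len: int, head_dim: int, base: float = 10000.0):
    """Precompute cos/sin tables of shape (seq_len, head_dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)
    return angles.cos(), angles.sin()


def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate feature pairs (x1[i], x2[i]) by a position-dependent angle.

    x has shape (batch, heads, seq_len, head_dim); this sketch pairs the first
    half of the features with the second half (an assumed layout).
    """
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, heads, t, head_dim)
        q, k, v = (
            z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            for z in (q, k, v)
        )
        cos, sin = rope_tables(t, self.head_dim)
        cos = cos.to(device=x.device, dtype=x.dtype)
        sin = sin.to(device=x.device, dtype=x.dtype)
        q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)

        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        # Mask out future positions so token i only attends to tokens <= i.
        future = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1
        )
        scores = scores.masked_fill(future, float("-inf"))
        weights = scores.softmax(dim=-1)  # (b, heads, t, t)

        y = (weights @ v).transpose(1, 2).reshape(b, t, d)
        return self.out(y), weights


torch.manual_seed(0)
x = torch.randn(2, 8, 64)  # (batch, seq_len, d_model)
attn = CausalSelfAttention(d_model=64, n_heads=4)
y, w = attn(RMSNorm(64)(x))
print(y.shape, w.shape)
```

Returning the attention weights alongside the output is a convenience for the attention-head visualization: each `w[b, h]` slice is a lower-triangular `(seq_len, seq_len)` matrix that can be plotted as a heatmap.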
The Brief
What you'll do, and what you'll demonstrate.
Implement and train a 30M-parameter decoder-only transformer from scratch on a code corpus, and demonstrate that you understand both attention and training dynamics.
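To get a feel for what a 30M-parameter budget implies, here is a back-of-the-envelope parameter count. The configuration is hypothetical and was chosen only because it lands near 30M: a vocabulary of 8192, `d_model` of 512, 8 layers, a two-matrix MLP with 4x expansion, no biases, and input/output embeddings that share weights. The brief does not prescribe any of these values.

```python
def count_params(vocab_size: int, d_model: int, n_layers: int, mlp_mult: int = 4) -> int:
    """Rough parameter count for a bias-free decoder-only transformer.

    Assumes tied input/output embeddings, RoPE (so no learned position table),
    a two-matrix MLP, and two RMSNorm gains per layer plus one final gain.
    """
    embedding = vocab_size * d_model         # shared with the LM head
    attention = 4 * d_model * d_model        # Q, K, V, and output projections
    mlp = 2 * mlp_mult * d_model * d_model   # up- and down-projection
    norms = 2 * d_model                      # pre-attention and pre-MLP gains
    per_layer = attention + mlp + norms
    return embedding + n_layers * per_layer + d_model  # + final norm gain


total = count_params(vocab_size=8192, d_model=512, n_layers=8)
print(f"{total:,} parameters")  # 29,368,832 parameters
```

Under these assumptions the embedding table accounts for about 4.2M of the total, so vocabulary size is a meaningful lever at this scale. For a sense of the time budget, a single pass over roughly 800M tokens in one GPU-day requires the training loop to sustain about 9,300 tokens per second.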
Earning criteria — what you'll demonstrate
- Implement self-attention, RoPE, and an LM head from first principles
- Train a transformer end-to-end on a non-toy corpus
- Visualize and interpret attention patterns
- Diagnose and fix training instabilities (loss spikes, NaNs)
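For the last criterion, one common first line of defence against loss spikes and NaNs is to clip gradients and skip any update whose loss or gradient norm is not finite. The sketch below shows that idea in PyTorch. The helper name `guarded_step`, the clipping threshold of 1.0, and the toy linear model are assumptions for illustration; this is one possible mitigation, not the fix the challenge expects.

```python
import torch
import torch.nn as nn


def guarded_step(model, optimizer, loss, max_grad_norm: float = 1.0) -> dict:
    """Backprop, clip, and step only if the loss and gradient norm are finite."""
    optimizer.zero_grad(set_to_none=True)
    if not torch.isfinite(loss):
        return {"stepped": False, "reason": "non-finite loss"}

    loss.backward()
    # clip_grad_norm_ returns the total gradient norm measured before clipping.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    if not torch.isfinite(grad_norm):
        optimizer.zero_grad(set_to_none=True)  # discard the corrupted gradients
        return {"stepped": False, "reason": "non-finite gradient"}

    optimizer.step()
    return {"stepped": True, "loss": loss.item(), "grad_norm": grad_norm.item()}


torch.manual_seed(0)
model = nn.Linear(4, 1)  # stand-in for the transformer
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 4), torch.randn(8, 1)

info = guarded_step(model, opt, nn.functional.mse_loss(model(x), y))
print(info["stepped"])  # True

bad = guarded_step(model, opt, model(x).sum() * float("nan"))
print(bad)  # {'stepped': False, 'reason': 'non-finite loss'}
```

Logging the returned `grad_norm` at every step is useful for diagnosis as well: a sharp rise in the pre-clipping norm often shows up a few steps before a visible loss spike.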
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
ML Researcher
A from-scratch transformer implementation is a canonical research-team initiation; this challenge gives you exactly that portfolio piece.
This challenge sharpens
- transformers
- self-attention
- language-modeling
Research Scientist
Implementing and ablating positional encodings is the kind of foundational work that research scientists on architecture-research teams do routinely.
This challenge sharpens
- self-attention
- rope
- training-debugging
Applied AI Scientist
Deep PyTorch fluency at the layer-implementation level translates directly into applied AI work where standard frameworks aren't enough.
This challenge sharpens
- pytorch
- transformers
- training-debugging