Design a Distributed Training Job for a 13B-Parameter Model
Overview
What this challenge is about.
Decide whether to use Fully Sharded Data Parallel (FSDP), tensor parallelism (TP), pipeline parallelism (PP), or a hybrid, and justify the choice for a 13B-parameter model on 32 H100 GPUs. Calculate per-GPU memory, microbatch and global batch sizes, and expected throughput in tokens per second. Pick a checkpoint strategy that survives a single-node failure with under 30 minutes of lost work. Validate the plan on a smaller proxy (e.g., a 1.3B model on 4 GPUs) and report measured versus predicted throughput. Write the 6-page runbook the research team will execute against.
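As a starting point for the memory and throughput calculations, here is a minimal back-of-the-envelope sketch. It rests on several assumptions that are not part of the brief: mixed-precision Adam at roughly 16 bytes per parameter, model state fully sharded across all GPUs, about 6 FLOPs per parameter per token for a forward and backward pass, an H100 dense BF16 peak of about 989 TFLOP/s, and a model FLOPs utilization (MFU) of 40%. Activations, communication buffers, and fragmentation are ignored, so treat the result as a lower bound on memory and a rough guess at throughput, to be checked against the proxy run.

```python
# Back-of-the-envelope budget for a fully sharded training job.
# All constants below are assumptions for illustration, not measured values.

# 2 B bf16 weights + 2 B bf16 grads + 12 B fp32 master weights and Adam moments
BYTES_PER_PARAM = 16


def sharded_state_gb(n_params: float, n_gpus: int,
                     bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Per-GPU model state (weights, grads, optimizer) in GB when fully sharded."""
    return n_params * bytes_per_param / n_gpus / 1e9


def tokens_per_second(n_params: float, n_gpus: int,
                      peak_flops_per_gpu: float, mfu: float) -> float:
    """Estimated training throughput using the ~6 * N FLOPs-per-token rule."""
    flops_per_token = 6 * n_params
    return n_gpus * peak_flops_per_gpu * mfu / flops_per_token


if __name__ == "__main__":
    n_params, n_gpus = 13e9, 32
    state_gb = sharded_state_gb(n_params, n_gpus)
    tps = tokens_per_second(n_params, n_gpus,
                            peak_flops_per_gpu=989e12,  # assumed H100 BF16 dense peak
                            mfu=0.40)                   # assumed utilization
    print(f"model state per GPU: {state_gb:.1f} GB (activations excluded)")
    print(f"estimated throughput: {tps:,.0f} tokens/s")
```

Running the same functions with the proxy configuration (1.3B parameters, 4 GPUs) gives the predicted figure to compare against the measured one.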
The Brief
What you'll do, and what you'll demonstrate.
Pick and justify a distributed-training strategy for a 13B-parameter model on 32 H100s, validate it on a smaller proxy, and write the runbook.
Earning criteria — what you'll demonstrate
- Choose between FSDP, TP, PP, and hybrid parallelism for a real workload
- Calculate per-GPU memory and throughput budgets from first principles
- Design fault-tolerant checkpoint cadence and storage
- Communicate distributed-training design to a non-systems audience
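For the checkpoint-cadence criterion, one way to reason about the 30-minute loss budget is sketched below. It assumes the worst case is a failure just before an in-flight checkpoint finishes, so lost work is roughly one checkpoint interval plus one write time. The checkpoint size (12 bytes per parameter for fp32 master weights and Adam moments), the 5 GB/s aggregate write bandwidth, and the 10-second step time are hypothetical placeholders; replace them with measured values. The sketch also assumes checkpoints land on storage that outlives the failed node.

```python
import math


def checkpoint_write_seconds(n_params: float, bytes_per_param: int,
                             write_gb_per_s: float) -> float:
    """Time to write one full checkpoint at a given aggregate bandwidth."""
    return n_params * bytes_per_param / 1e9 / write_gb_per_s


def max_steps_between_checkpoints(loss_budget_s: float, write_s: float,
                                  step_s: float) -> int:
    """Largest interval (in steps) keeping worst-case lost work within budget.

    Worst case assumed: the job fails just before a checkpoint write completes,
    losing one full interval plus the time spent on the unfinished write.
    """
    usable_s = loss_budget_s - write_s
    if usable_s <= 0:
        raise ValueError("checkpoint write time alone exceeds the loss budget")
    return math.floor(usable_s / step_s)


if __name__ == "__main__":
    write_s = checkpoint_write_seconds(13e9, bytes_per_param=12,
                                       write_gb_per_s=5.0)  # hypothetical bandwidth
    steps = max_steps_between_checkpoints(loss_budget_s=30 * 60,
                                          write_s=write_s,
                                          step_s=10.0)      # hypothetical step time
    print(f"checkpoint write: {write_s:.1f} s, checkpoint every <= {steps} steps")
```

Restart and reload time is not counted here; if the budget is meant to cover total downtime and not only recomputed steps, subtract it from the budget as well.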
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Machine Learning Engineer
Designing a real distributed-training plan with throughput modeling and a runbook is the work that staff-track MLEs lead on any team training 10B+ models.
This challenge sharpens
- distributed-training
- fsdp
- throughput-modeling
AI Solutions Architect
Choosing parallelism strategies and writing the runbook that a client research team executes against is core AI solutions architecture work at cloud and consulting orgs.
This challenge sharpens
- distributed-training
- checkpointing
- gpu-systems
MLOps Engineer
Checkpoint cadence, failure recovery, and infrastructure-aware training plans are the daily concerns of MLOps engineers on training-platform teams.
This challenge sharpens
- checkpointing
- gpu-systems
- pytorch