Design a Distributed Training Job for a 13B-Parameter Model

Free · Verified credential · 3 weeks · Expert

Overview

What this challenge is about.

Decide whether to use Fully Sharded Data Parallel (FSDP), tensor parallelism (TP), pipeline parallelism (PP), or a hybrid, and justify the choice against the 13B-parameter model and the 32-H100 cluster. Calculate memory per GPU, optimal microbatch and global batch sizes, and the expected tokens-per-second throughput. Pick a checkpoint strategy that survives a single-node failure with under 30 minutes of lost work. Validate the plan on a smaller proxy (e.g., a 1.3B model on 4 GPUs) and report measured vs. predicted throughput. Write the 6-page runbook that the research team will execute against.
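As a rough illustration of the first-principles budgeting described above, here is a minimal sketch, not the challenge's reference solution. It assumes mixed-precision Adam (about 16 bytes of model state per parameter), full sharding of that state across all GPUs, the common approximation of 6 FLOPs per parameter per token, a dense BF16 peak of roughly 989 TFLOPS per H100, and a hypothetical 40% model FLOPs utilization (MFU). Activations and communication buffers are ignored. The function names are made up for illustration.

```python
# Back-of-envelope budget for a sharded training job.
# Every constant below is an assumption to be replaced with measured values.

# bf16 weights (2) + bf16 grads (2) + fp32 master weights and two Adam moments (12)
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gb_per_gpu(n_params, n_gpus, bytes_per_param=BYTES_PER_PARAM):
    """Model state (weights, grads, optimizer) per GPU when fully sharded."""
    return n_params * bytes_per_param / n_gpus / 1e9


def tokens_per_second(n_params, n_gpus, peak_flops_per_gpu, mfu):
    """Cluster throughput, using ~6 FLOPs per parameter per training token."""
    flops_per_token = 6 * n_params
    return n_gpus * peak_flops_per_gpu * mfu / flops_per_token


N_PARAMS = 13e9           # 13B-parameter model from the brief
N_GPUS = 32               # 32 H100s from the brief
PEAK_BF16_FLOPS = 989e12  # assumed dense BF16 peak per GPU
ASSUMED_MFU = 0.40        # hypothetical utilization, to be measured on the proxy

state_gb = model_state_gb_per_gpu(N_PARAMS, N_GPUS)
tps = tokens_per_second(N_PARAMS, N_GPUS, PEAK_BF16_FLOPS, ASSUMED_MFU)

print(f"model state per GPU: {state_gb:.1f} GB")
print(f"predicted throughput: {tps:,.0f} tokens/s")
```

Calling the same two functions with 1.3e9 parameters and 4 GPUs gives the prediction to compare against the measured proxy run; the gap between the two is a direct check on the assumed MFU.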

Credential: Blockchain-anchored

Shareable: LinkedIn-ready

Language: English

Pace: Self-paced

The Brief

What you'll do, and what you'll demonstrate.

Pick and justify a distributed-training strategy for a 13B-param model on 32 H100s, validated on a smaller proxy, and write the runbook.

Earning criteria — what you'll demonstrate

Choose between FSDP, TP, PP, and hybrid parallelism for a real workload
Calculate per-GPU memory and throughput budgets from first principles
Design fault-tolerant checkpoint cadence and storage
Communicate distributed-training design to a non-systems audience
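To make the checkpoint-cadence criterion concrete, the sketch below works out a maximum checkpoint interval from a lost-work budget. It is a simplified model under stated assumptions: the checkpoint holds fp32 master weights plus two Adam moments (12 bytes per parameter), the aggregate storage write bandwidth is a hypothetical 2 GB/s, and a failure just before a checkpoint finishes writing loses the whole interval plus the in-progress write. Restart and rescheduling time are not modeled.

```python
# Checkpoint cadence from a lost-work budget.
# The bandwidth figure and the 12 bytes/param layout are assumptions.


def checkpoint_write_minutes(n_params, write_gb_per_s, bytes_per_param=12):
    """Time to persist one full training-state checkpoint."""
    checkpoint_gb = n_params * bytes_per_param / 1e9
    return checkpoint_gb / write_gb_per_s / 60


def max_checkpoint_interval_minutes(lost_work_budget_min, write_min):
    """Longest interval that keeps worst-case lost work within the budget."""
    interval = lost_work_budget_min - write_min
    if interval <= 0:
        raise ValueError("checkpoint write time alone exceeds the lost-work budget")
    return interval


N_PARAMS = 13e9       # 13B-parameter model from the brief
BUDGET_MIN = 30.0     # "under 30 minutes of lost work" from the brief
WRITE_GB_PER_S = 2.0  # hypothetical aggregate write bandwidth to storage

write_min = checkpoint_write_minutes(N_PARAMS, WRITE_GB_PER_S)
interval_min = max_checkpoint_interval_minutes(BUDGET_MIN, write_min)

print(f"checkpoint write time: {write_min:.1f} min")
print(f"checkpoint at least every {interval_min:.1f} min")
```

In a real runbook the interval would typically be set well below this ceiling, and the write time would come from a measured checkpoint on the target storage rather than an assumed bandwidth.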

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

Machine Learning Systems

Master · AI/ML

Fit score: 1

Skills

Skills you'll demonstrate.

Each one shows up on your verified credential.

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

Career paths this builds toward

Canonical roles

Machine Learning Engineer
AI Engineering

Machine Learning Engineer

Designing a real distributed-training plan with throughput modeling and a runbook is the work that staff-track MLEs lead on any team training 10B+ models.

This challenge sharpens

distributed-training
fsdp
throughput-modeling

AI Solutions Architect

Choosing parallelism strategies and writing the runbook that a client research team executes against is core AI solutions architecture work at cloud and consulting orgs.

This challenge sharpens

distributed-training
checkpointing
gpu-systems

MLOps Engineer

Checkpoint cadence, failure recovery, and infrastructure-shaped training plans are the daily concerns of MLOps engineers on training-platform teams.

This challenge sharpens

checkpointing
gpu-systems
pytorch

One more thing

You can put a credential on your CV by Friday.
