Scale Feature Pipelines for a Hyperscaler Search-Ranking Team
Overview
What this challenge is about.
You receive a synthetic but realistic 80 GB sample of the ranking events, the existing PySpark pipeline, and a Spark UI snapshot from a recent production run. Use the snapshot to profile the production run, identify three concrete bottlenecks (e.g., a skewed join key, an exploded UDF, an oversized shuffle), and prototype fixes on the sample. Measure each fix's effect on the sample, then estimate the production effect with a documented extrapolation. Deliver a pre-RFC, the prototype branch, and a 1-page exec summary.
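One simple way to document the extrapolation step is an Amdahl-style estimate: only the stage you tuned gets faster, and the rest of the run is unchanged. The sketch below is illustrative, not the challenge's required method. The function name `extrapolate_runtime`, the 50% stage share, and the 4x stage speedup are hypothetical values; only the 14-hour baseline comes from the brief. It also assumes the speedup measured on the sample carries over to production, which is exactly the assumption the pre-RFC should state and defend.

```python
def extrapolate_runtime(total_hours, stage_fraction, stage_speedup):
    """Estimate end-to-end runtime after speeding up a single stage.

    total_hours:    current wall-clock time of the whole pipeline.
    stage_fraction: share of that time spent in the tuned stage
                    (hypothetical here; read it off the Spark UI).
    stage_speedup:  speedup of the stage measured on the sample
                    (hypothetical here; assumed to hold in production).
    """
    if not 0.0 <= stage_fraction <= 1.0:
        raise ValueError("stage_fraction must be between 0 and 1")
    if stage_speedup <= 0:
        raise ValueError("stage_speedup must be positive")
    untouched = total_hours * (1.0 - stage_fraction)       # time the fix does not affect
    tuned = total_hours * stage_fraction / stage_speedup   # time left in the tuned stage
    return untouched + tuned


# 14-hour run, stage takes half the time, fix makes that stage 4x faster.
estimate = extrapolate_runtime(14.0, 0.5, 4.0)
print(f"estimated runtime: {estimate:.2f} h")  # estimated runtime: 8.75 h
```

With these made-up numbers a single fix does not reach the 7-hour target, which is one reason the brief asks for three bottlenecks rather than one.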
The Brief
What you'll do, and what you'll demonstrate.
Halve a 14-hour Spark feature pipeline's wall-clock time at unchanged compute spend.
Earning criteria — what you'll demonstrate
- Read and interpret a Spark UI to find real bottlenecks
- Apply skew-handling, partition-tuning, and UDF-elimination patterns
- Extrapolate sample-scale measurements to production reliably
- Write a pre-RFC that engineering peers can review
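As a rough illustration of the first two criteria, the sketch below computes a straggler ratio from per-task durations, the kind of number you might read off a stage page in the Spark UI. It then shows the idea behind key salting on a toy key list. This is plain Python rather than PySpark so it runs anywhere. The durations, the key distribution, the salt count of 8, and the helper names `skew_ratio`, `salt_key`, and `largest_group` are all hypothetical.

```python
import random
from collections import Counter
from statistics import median


def skew_ratio(task_seconds):
    """Slowest task divided by the median task; values far above 1 suggest skew."""
    return max(task_seconds) / median(task_seconds)


def salt_key(key, num_salts, rng):
    """Append a random salt so rows for one hot key spread over num_salts sub-keys."""
    return f"{key}#{rng.randrange(num_salts)}"


def largest_group(keys):
    """Row count of the most frequent key, a stand-in for the biggest shuffle partition."""
    return max(Counter(keys).values())


# Hypothetical task durations in seconds: five normal tasks and one straggler.
durations = [40, 42, 38, 41, 39, 900]
print(f"skew ratio: {skew_ratio(durations):.1f}")

# Hypothetical join keys: one hot key dominates the data.
keys = ["hot"] * 9000 + [f"q{i}" for i in range(1000)]
rng = random.Random(0)  # fixed seed so the demo is repeatable
salted = [salt_key(k, 8, rng) for k in keys]

print("largest group before salting:", largest_group(keys))
print("largest group after salting: ", largest_group(salted))
```

In a real join the other side has to be replicated once per salt value so that every salted key still finds its match, and that replication cost should be measured as part of the fix. On Spark 3.x, adaptive query execution's skew-join handling (`spark.sql.adaptive.skewJoin.enabled`) may be worth trying before hand-rolled salting.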
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Career paths this builds toward
Canonical roles
Data Engineer
Profiling and tuning a production-grade Spark pipeline with a written pre-RFC is the textbook senior-data-engineer task at any hyperscaler.
This challenge sharpens
- spark
- data-pipelines
- performance-profiling
Machine Learning Engineer
Owning feature-pipeline performance is increasingly part of the MLE remit because slow features starve model iteration.
This challenge sharpens
- data-pipelines
- performance-profiling
- python
MLOps Engineer
Cost-aware infrastructure work on shared compute platforms is core MLOps territory; this challenge practices the discipline.
This challenge sharpens
- spark
- cost-aware-engineering
- distributed-systems