Cost-Profile a Spark Job at Scale and Cut the Bill in Half
Overview
What this challenge is about.
You receive the PySpark job (around 1,800 lines), five nights of Spark UI and EMR metrics, and the EMR cluster configuration. Profile the job to find its top three cost drivers; likely candidates are skewed joins on customer ID, oversized shuffles, instance-type misallocation, and missing predicate pushdown. Prototype three optimizations on a representative 1TB subset and measure the change in cost and runtime for each. Extrapolate the results to full scale with a defensible cost model, then run a full-scale dry run of the best combination to confirm it meets the 6-hour SLA. Deliver the profiling report, the optimization branches, a 1TB benchmark table, the extrapolation model, and a 5-page recommendation memo with a rollout order.
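One way to start the profiling step is to look for skew in per-task durations taken from the Spark UI. The sketch below is illustrative only: the stage names, the durations, and the threshold of 5x the median are all made-up assumptions, not values from the actual job.

```python
from statistics import median


def skew_ratio(task_durations_s):
    """Ratio of the slowest task to the median task within one stage."""
    if not task_durations_s:
        raise ValueError("stage has no task durations")
    med = median(task_durations_s)
    # Guard against a zero median (all tasks effectively instantaneous).
    return max(task_durations_s) / med if med > 0 else float("inf")


def flag_skewed_stages(stages, threshold=5.0):
    """Return (stage_id, ratio) pairs at or above the threshold, worst first.

    The default threshold is an assumed starting point, not a Spark default.
    """
    ratios = [(stage_id, skew_ratio(d)) for stage_id, d in stages.items()]
    flagged = [(stage_id, r) for stage_id, r in ratios if r >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-task durations in seconds, as you might copy them
# from the Spark UI stage detail page.
example_stages = {
    "stage_12_join_customer": [40, 42, 38, 41, 39, 900],
    "stage_07_scan": [60, 62, 58, 61, 59, 63],
}

skewed = flag_skewed_stages(example_stages)
for stage_id, ratio in skewed:
    print(f"{stage_id}: slowest task is {ratio:.1f}x the median")
```

A stage where one task runs far longer than the median is a candidate for a skewed join key; confirm it against shuffle read sizes before treating it as a cost driver.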
The Brief
What you'll do, and what you'll demonstrate.
Cut the bill for a 6TB-per-day Spark job in half without breaking its 6-hour SLA, and document the optimizations defensibly enough for finance to fund the engineering time.
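Both targets can be checked numerically once you have subset measurements. The sketch below assumes a deliberately simple model: cost is runtime times node count times a per-node hourly rate, and runtime grows with data size raised to an exponent `alpha` on an unchanged cluster. The class, function names, rates, and run figures are hypothetical, and `alpha` in particular should be fitted from measurements rather than assumed.

```python
from dataclasses import dataclass


@dataclass
class SubsetRun:
    name: str
    runtime_hours: float    # wall-clock time measured on the subset
    node_count: int         # cluster size, assumed unchanged at full scale
    hourly_rate_usd: float  # assumed all-in rate per node


def extrapolate(run, subset_tb=1.0, full_tb=6.0, alpha=1.0):
    """Scale a subset run to full size; alpha=1.0 means linear scaling."""
    scale = (full_tb / subset_tb) ** alpha
    runtime = run.runtime_hours * scale
    cost = runtime * run.node_count * run.hourly_rate_usd
    return runtime, cost


def check_targets(baseline, candidate, sla_hours=6.0, **model_kwargs):
    """Compare a candidate against the baseline at full scale."""
    _, base_cost = extrapolate(baseline, **model_kwargs)
    cand_runtime, cand_cost = extrapolate(candidate, **model_kwargs)
    return {
        "runtime_hours": cand_runtime,
        "cost_usd": cand_cost,
        "savings_fraction": 1 - cand_cost / base_cost,
        "within_sla": cand_runtime <= sla_hours,
        "halves_bill": cand_cost <= 0.5 * base_cost,
    }


# Hypothetical 1TB measurements, for illustration only.
baseline = SubsetRun("current job", 0.9, 40, 1.00)
candidate = SubsetRun("broadcast join + fewer nodes", 0.5, 30, 1.00)

result = check_targets(baseline, candidate)
print(result)
```

Rerunning `check_targets` with a larger `alpha` (for example `alpha=1.2`) shows how sensitive the conclusion is to superlinear shuffle growth, which is the kind of caveat a finance reader will ask about.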
Earning criteria — what you'll demonstrate
- Profile a real Spark job using Spark UI + cloud metrics
- Apply standard Spark optimizations (broadcast joins, partition tuning, adaptive query execution)
- Build a defensible cost extrapolation from subset to full scale
- Communicate cost trade-offs to finance + engineering stakeholders
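The standard optimizations named in the criteria above mostly come down to a handful of Spark SQL settings. The sketch below collects them in one place, assuming Spark 3.x. The configuration keys are real Spark settings, but the helper names and the default values (2,000 shuffle partitions, a 64 MB broadcast threshold) are illustrative assumptions to tune against your own benchmark.

```python
def tuning_conf(shuffle_partitions=2000, broadcast_threshold_mb=64):
    """Build a dict of Spark SQL settings for AQE, skew handling,
    partition tuning, and broadcast joins. Values are assumed defaults."""
    return {
        # Adaptive query execution: re-plans stages using runtime statistics.
        "spark.sql.adaptive.enabled": "true",
        # Merge small shuffle partitions after each stage.
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        # Split oversized partitions in sort-merge joins.
        "spark.sql.adaptive.skewJoin.enabled": "true",
        # Initial shuffle partition count before AQE coalescing.
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
        # Tables below this size (in bytes) are broadcast instead of shuffled.
        "spark.sql.autoBroadcastJoinThreshold": str(
            broadcast_threshold_mb * 1024 * 1024
        ),
    }


def apply_conf(builder, conf):
    """Apply each setting to a SparkSession.builder-style object."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


conf = tuning_conf()
for key, value in conf.items():
    print(f"{key}={value}")

# With PySpark installed, usage would look like:
#   from pyspark.sql import SparkSession
#   spark = apply_conf(SparkSession.builder.appName("nightly-job"), conf).getOrCreate()
```

Keeping the settings in a plain dict makes each optimization branch easy to diff, and the same values can be recorded next to the benchmark table so every measured run is reproducible.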
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.