Cost-Optimize a Large-Scale Spark Job for an Ad-Tech Platform
Overview
What this challenge is about.
You receive the Spark job source (PySpark), the EMR cluster config, and five nights of job-history JSON. Profile the job with the Spark UI and EMR metrics, then identify the top three cost drivers (likely candidates: shuffle volume, skewed joins, and instance-mix mistakes). Prototype an optimization for each driver on a 200 GB representative subset, measure the impact on cost and runtime, and write a 5-page memo recommending which optimizations to ship to production and in what order.
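As a rough illustration of the profiling step, the sketch below ranks stages by shuffle volume and flags likely skew from per-task durations. The record layout (`stage_id`, `shuffle_write_bytes`, `task_durations_s`) is a hypothetical, simplified shape, not the actual job-history JSON schema, so the field names would need to be mapped to whatever the provided files contain.

```python
from statistics import median


def skew_ratio(task_durations_s):
    """Return max / median task duration for one stage.

    A ratio far above 1 suggests a few straggler tasks, which is a
    common symptom of a skewed join or aggregation key.
    """
    if not task_durations_s:
        raise ValueError("stage has no tasks")
    med = median(task_durations_s)
    return max(task_durations_s) / med if med > 0 else float("inf")


def rank_stages(stages, top_n=3):
    """Rank stages by shuffle bytes written, largest first.

    `stages` is a list of dicts using the hypothetical field names
    described above; each result also carries the stage's skew ratio.
    """
    ranked = sorted(stages, key=lambda s: s["shuffle_write_bytes"], reverse=True)
    return [
        {
            "stage_id": s["stage_id"],
            "shuffle_write_gb": s["shuffle_write_bytes"] / 1e9,
            "skew_ratio": skew_ratio(s["task_durations_s"]),
        }
        for s in ranked[:top_n]
    ]
```

Ranking by shuffle write is only one possible ordering. Sorting by total task time or by spill size may surface different candidates, so it can be worth comparing several rankings across the five nights before settling on the top three.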
The Brief
What you'll do, and what you'll demonstrate.
Find and prove a 40 percent cost reduction on a 4 TB nightly Spark job without breaking the 4-hour SLA.
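The two acceptance conditions in the brief can be written down as a small check. This is a minimal sketch: the function name and the dollar figures used with it are illustrative assumptions, and only the 40 percent threshold and the 4-hour SLA come from the brief.

```python
def evaluate_run(baseline_cost_usd, optimized_cost_usd, optimized_runtime_h,
                 min_reduction=0.40, sla_h=4.0):
    """Check a candidate configuration against the brief's two targets.

    Returns the fractional cost reduction plus one flag per target, so a
    run that is cheap but too slow (or fast but too expensive) is visible.
    """
    if baseline_cost_usd <= 0:
        raise ValueError("baseline cost must be positive")
    reduction = (baseline_cost_usd - optimized_cost_usd) / baseline_cost_usd
    return {
        "reduction": reduction,
        "meets_cost_target": reduction >= min_reduction,
        "meets_sla": optimized_runtime_h <= sla_h,
    }


# Illustrative numbers only: a $1,000 baseline night cut to $550 in 3.5 hours.
result = evaluate_run(1000.0, 550.0, 3.5)
```

Keeping the two flags separate makes it easier to discuss trade-offs, for example an instance mix that saves more money but leaves little headroom under the SLA.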
Earning criteria: what you'll demonstrate
- Profile a real Spark job with the Spark UI and cloud-platform metrics
- Apply standard Spark optimizations (broadcast joins, partition tuning, instance mix)
- Build a defensible cost extrapolation from subset to full data
- Communicate cost trade-offs to finance and engineering stakeholders
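One way to approach the subset-to-full extrapolation named in the criteria above is to scale the measured subset cost by the data-size ratio and report a range instead of a single number. The sketch below assumes a power-law model, cost proportional to size raised to an exponent. That model, the function names, and the default exponents are assumptions for illustration, and 4 TB is taken as 4000 GB. Shuffle-heavy stages often scale worse than linearly, so the exponent should be fitted from measurements at more than one subset size where possible.

```python
def extrapolate_cost(subset_cost_usd, subset_gb, full_gb, exponent=1.0):
    """Scale a measured subset cost to the full data size.

    exponent=1.0 is plain linear scaling; values above 1.0 model
    super-linear growth (for example, shuffle that grows faster than input).
    """
    if subset_gb <= 0 or full_gb <= 0:
        raise ValueError("data sizes must be positive")
    return subset_cost_usd * (full_gb / subset_gb) ** exponent


def cost_range(subset_cost_usd, subset_gb=200.0, full_gb=4000.0,
               exponents=(1.0, 1.1)):
    """Return (low, high) full-size estimates across assumed exponents."""
    estimates = [
        extrapolate_cost(subset_cost_usd, subset_gb, full_gb, e)
        for e in exponents
    ]
    return min(estimates), max(estimates)


# Illustrative only: a subset run that cost $5 scales to $100 under
# linear scaling, because 4000 GB / 200 GB = 20.
low, high = cost_range(5.0)
```

Presenting the low and high figures side by side, together with the exponent each one assumes, gives finance a number to plan against and gives engineering a clear statement of what would invalidate it.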
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Data Engineer
Spark cost optimization on a real EMR workload is the kind of project a data engineer ships in the first quarter at any ad-tech or large-data company.
This challenge sharpens
- spark-optimization
- cost-engineering
- etl-pipelines
MLOps Engineer
Profiling and cost-optimizing large compute workloads draws on the same skill set MLOps engineers use to tame training-cluster bills.
This challenge sharpens
- profiling
- cloud-services
- benchmarking
AI Solutions Architect
Translating profiling and optimization results into a recommendation the finance team can defend is core AI solutions architect work.
This challenge sharpens
- cost-engineering
- cloud-services
- spark-optimization