Cost-Profile a Spark Job at Scale and Cut the Bill in Half

Free · Verified credential · 3 weeks · Advanced

Overview

What this challenge is about.

You receive the PySpark job (around 1,800 lines), 5 nights of Spark UI and EMR metrics, and the EMR cluster config. Profile the job to find the top 3 cost drivers; likely candidates are skewed joins on customer ID, oversized shuffles, instance-type misallocation, and missing predicate pushdown. Prototype 3 optimizations on a representative 1TB subset and measure the change in cost and runtime for each. Extrapolate to full scale with a defensible cost model, then run a full-scale dry run of the best combination to confirm it meets the 6-hour SLA. Deliver the profiling report, the optimization branches, a 1TB benchmark table, the extrapolation model, and a 5-page recommendation memo with rollout order.
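
The overview lists skewed joins as a likely cost driver. Below is a minimal sketch of one way to flag skew in a stage, assuming you have exported per-task shuffle-read sizes (for example, copied from the Spark UI stage page). The function name `skew_report`, the threshold of 5, and the sample numbers are all illustrative assumptions, not part of the challenge materials.

```python
from statistics import median


def skew_report(task_bytes, threshold=5.0):
    """Summarise skew for one stage from per-task shuffle-read sizes (bytes).

    A stage is flagged when the largest task reads at least `threshold`
    times the median task. The threshold is an assumed starting point.
    """
    if not task_bytes:
        raise ValueError("task_bytes must not be empty")
    med = median(task_bytes)
    largest = max(task_bytes)
    if med == 0:
        # Many empty partitions: any non-empty task counts as extreme skew.
        ratio = float("inf") if largest > 0 else 1.0
    else:
        ratio = largest / med
    return {
        "tasks": len(task_bytes),
        "median_bytes": med,
        "max_bytes": largest,
        "max_over_median": ratio,
        "skewed": ratio >= threshold,
    }


# Hypothetical per-task sizes for two stages (not real measurements).
balanced_stage = [100, 110, 95, 105, 102]
skewed_stage = [95, 100, 105, 110, 4000]

for name, sizes in [("balanced", balanced_stage), ("skewed", skewed_stage)]:
    print(name, skew_report(sizes))
```

Comparing the maximum against the median rather than the mean keeps a single hot key from inflating the reference value it is measured against.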

Credential: Blockchain-anchored

Shareable: LinkedIn-ready

Language: English

Pace: Self-paced

The Brief

What you'll do, and what you'll demonstrate.

Cut a 6TB-per-day Spark job's bill in half without breaking the 6-hour SLA, and document the optimizations rigorously enough that finance will fund the engineering time.
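
The two goals in the brief (half the cost, within the 6-hour SLA) can be checked against a subset benchmark with a simple extrapolation. This sketch assumes runtime in node-hours grows as a power of the data-size ratio (`alpha = 1.0` means linear) and that the work spreads evenly across nodes. Both are simplifications that a real model would need to validate. The `Benchmark` class, the node counts, and the prices are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    name: str
    runtime_hours: float    # measured wall-clock time on the 1TB subset
    nodes: int              # cluster size used for the subset run
    node_hourly_usd: float  # assumed blended per-node price


def extrapolate(bench, full_nodes, scale=6.0, alpha=1.0):
    """Project a subset run to full scale.

    scale: full data size / subset size (6TB / 1TB in the brief).
    alpha: assumed scaling exponent; values above 1 model shuffle-heavy
           stages that grow faster than linearly.
    Returns (runtime_hours, cost_usd) for a cluster of `full_nodes` nodes.
    """
    node_hours = bench.runtime_hours * bench.nodes * scale ** alpha
    runtime = node_hours / full_nodes  # assumes even work distribution
    cost = node_hours * bench.node_hourly_usd
    return runtime, cost


def meets_goals(baseline_cost, cost, runtime, sla_hours=6.0):
    """True when cost is at most half the baseline and runtime is within the SLA."""
    return cost <= 0.5 * baseline_cost and runtime <= sla_hours


# Hypothetical subset measurements, for illustration only.
baseline = Benchmark("baseline", runtime_hours=1.0, nodes=10, node_hourly_usd=0.50)
optimized = Benchmark("optimized", runtime_hours=0.4, nodes=10, node_hourly_usd=0.50)

base_runtime, base_cost = extrapolate(baseline, full_nodes=12)
opt_runtime, opt_cost = extrapolate(optimized, full_nodes=12)
print(f"baseline:  {base_runtime:.1f} h, ${base_cost:.2f}")
print(f"optimized: {opt_runtime:.1f} h, ${opt_cost:.2f}")
print("goals met:", meets_goals(base_cost, opt_cost, opt_runtime))
```

Rerunning the projection with a pessimistic exponent such as `alpha=1.2` gives a rough sensitivity range, which tends to be easier to defend to finance than a single point estimate.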

Earning criteria — what you'll demonstrate

Profile a real Spark job using Spark UI + cloud metrics
Apply standard Spark optimizations (broadcast joins, partition tuning, AQE)
Build a defensible cost extrapolation from subset to full scale
Communicate cost trade-offs to finance + engineering stakeholders
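
The standard optimizations named above (broadcast joins, partition tuning, AQE) are largely driven by Spark SQL configuration. A hedged sketch of a candidate configuration follows. The keys are standard Spark SQL settings, but the values are assumed starting points to benchmark, not recommendations, and `to_submit_flags` is a hypothetical helper.

```python
# Candidate settings to benchmark on the subset. Values are assumptions.
CANDIDATE_CONF = {
    # Adaptive Query Execution: re-plans stages using runtime statistics.
    "spark.sql.adaptive.enabled": "true",
    # Let AQE split oversized partitions in skewed sort-merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Let AQE merge small shuffle partitions after a stage completes.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Broadcast tables up to this many bytes (64 MiB here, an assumed value).
    "spark.sql.autoBroadcastJoinThreshold": str(64 * 1024 * 1024),
    # Initial shuffle partition count (assumed; tune against the benchmark).
    "spark.sql.shuffle.partitions": "2000",
}


def to_submit_flags(conf):
    """Render a config dict as spark-submit style ['--conf', 'k=v', ...] arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


print(" ".join(to_submit_flags(CANDIDATE_CONF)))
```

Keeping each candidate configuration as data makes it straightforward to run one benchmark per variant and record exactly which settings produced each row of the benchmark table.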

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

Big Data and Data-Intensive Systems

Master · CS SE

Fit score: 1

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

Career paths this builds toward

Canonical roles

Backend Engineer
Software Engineering

One more thing

You can put a credential on your CV by Friday.
