Overview
What this challenge is about.
Profile current usage from a 24-hour trace with a per-team breakdown. Pick a cost-optimization mix from: time-based autoscaling, spot/preemptible instances with graceful drain, smarter continuous batching (vLLM tuning), KV-cache-aware request routing, model quantization to FP8 or AWQ, and request-class-based routing (a cheaper model for short queries). Prototype the top two on a small replica cluster. Validate that the SLA (p99 latency under 600 ms) still holds. Deliver a 4-page memo with projected USD savings and a 90-day rollout plan.
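The validation step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names, the nearest-rank percentile definition, and the sample numbers are all assumptions made for the example. Only the 600 ms p99 limit and the 30% savings target come from the brief.

```python
SLA_P99_MS = 600.0       # p99 latency limit stated in the brief
TARGET_SAVINGS = 0.30    # 30% cost-reduction goal stated in the brief


def p99(latencies_ms):
    """Nearest-rank 99th percentile (one common definition; an assumption here)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.99 * n), which avoids floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def projected_savings(baseline_usd, optimized_usd):
    """Return (absolute USD saved, fraction of the baseline saved)."""
    saved = baseline_usd - optimized_usd
    return saved, saved / baseline_usd


def meets_goal(latencies_ms, baseline_usd, optimized_usd):
    """True only if the SLA holds AND the savings target is reached."""
    _, fraction = projected_savings(baseline_usd, optimized_usd)
    return p99(latencies_ms) < SLA_P99_MS and fraction >= TARGET_SAVINGS


if __name__ == "__main__":
    # Hypothetical prototype results: 100 latency samples and monthly costs.
    samples = [120.0] * 99 + [750.0]
    print(p99(samples), meets_goal(samples, 1000.0, 650.0))
```

Checking both conditions in one function keeps a cost win from being reported when the latency tail has regressed, which is the failure mode the brief rules out.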
The Brief
What you'll do, and what you'll demonstrate.
Cut LLM cluster cost by at least 30% with a prototyped optimization mix, without breaking the p99 latency SLA.
Earning criteria — what you'll demonstrate
- Profile real LLM-API usage to find cost-optimization levers
- Apply autoscaling, batching, and routing techniques to LLM serving
- Prove cost wins without breaking latency SLAs
- Translate engineering wins into a CFO-readable savings story
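As an illustration of the profiling criterion, here is a rough sketch that aggregates a usage trace by team and by hour. The record layout `(hour, team, tokens)` and the function name `profile` are hypothetical; a real trace will have its own schema. A high peak-to-mean ratio is only a hint that time-based autoscaling may have headroom, not proof of savings.

```python
from collections import defaultdict


def profile(trace):
    """Summarize a 24-hour trace of (hour, team, tokens) records.

    Returns (per-team share of total tokens, peak-hour-to-mean-hour ratio).
    The record layout is an assumption made for this example.
    """
    per_team = defaultdict(int)
    per_hour = defaultdict(int)
    for hour, team, tokens in trace:
        per_team[team] += tokens
        per_hour[hour] += tokens

    total = sum(per_team.values())
    if total == 0:
        raise ValueError("trace contains no usage")

    shares = {team: tokens / total for team, tokens in per_team.items()}
    hourly = [per_hour.get(h, 0) for h in range(24)]  # idle hours count as 0
    peak_to_mean = max(hourly) / (total / 24)
    return shares, peak_to_mean


if __name__ == "__main__":
    # Hypothetical records: two teams, traffic concentrated at 09:00.
    trace = [(9, "search", 300), (9, "support", 100), (3, "search", 80)]
    shares, ratio = profile(trace)
    print(shares, ratio)
```

The per-team shares point at where request-class routing could matter most, while the hourly ratio indicates how much capacity sits idle off-peak.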
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
MLOps Engineer
Cost-optimizing LLM serving while holding SLAs is the platform MLOps work that every AI startup eventually depends on once the cloud bill grows faster than its COGS line can absorb.
This challenge sharpens
- llm-serving
- autoscaling
- cost-optimization
AI Engineer
Hands-on vLLM + Ray tuning is the AI-engineer skill set that startups hire for when they want one person to own model serving end to end.
This challenge sharpens
- vllm
- ray
- llm-serving
AI Solutions Architect
Designing the LLM serving topology and the cost-vs-SLA rollout plan is core AI solutions architecture work at any cloud provider or AI consultancy.
This challenge sharpens
- llm-serving
- kubernetes
- cost-optimization