Cut Latency and Cost on a High-Volume Summarization Service
Overview
What this challenge is about.
You receive 30 days of anonymized request logs (prompt token counts, completion token counts, latencies, and models used). Profile the cost and latency distributions, then design and benchmark four optimizations: (1) prompt compression and system-prompt slimming, (2) routing short articles to a smaller model, (3) request batching where applicable, and (4) caching for duplicate articles. Validate quality with a 200-article LLM-as-judge eval, calibrated against 30 human ratings. Deliverables: a benchmark notebook, recommended changes (PR-style), and a 4-page before/after write-up.
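As a rough illustration of optimizations (2) and (4), the sketch below routes short articles to a smaller model and serves duplicate articles from an in-memory cache. The word threshold, the model names, the `summarize` callable, and the choice to normalize case and whitespace before hashing are all assumptions made for the example, not part of the challenge materials.

```python
import hashlib

# Hypothetical threshold and model names; tune them from the profiled logs.
SHORT_ARTICLE_WORDS = 400
SMALL_MODEL = "small-model"
LARGE_MODEL = "large-model"


def article_key(text: str) -> str:
    """Hash case- and whitespace-normalized text so trivial re-posts collide."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def pick_model(text: str) -> str:
    """Route articles below the word threshold to the smaller model."""
    return SMALL_MODEL if len(text.split()) < SHORT_ARTICLE_WORDS else LARGE_MODEL


class SummaryCache:
    """In-memory cache keyed by article hash, with hit/miss counters."""

    def __init__(self, summarize):
        # summarize is a hypothetical callable: (model, text) -> summary string
        self._summarize = summarize
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> str:
        key = article_key(text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        summary = self._summarize(pick_model(text), text)
        self._store[key] = summary
        return summary
```

The hit and miss counters give a cache hit rate to report in the benchmark notebook. A production cache would also need eviction and expiry, which this sketch leaves out.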
The Brief
What you'll do, and what you'll demonstrate.
Cut LLM cost by 30% and bring p95 latency under 1.8 s on a news-summarization service without losing quality.
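One way to check the two targets above is to compute total cost and p95 latency from the logs before and after the changes. The sketch below assumes each log row is a dict with `model`, `prompt_tokens`, `completion_tokens`, and `latency_s` fields, and it uses made-up per-1K-token prices; the real log schema and rates may differ.

```python
import math

# Hypothetical per-1K-token prices in USD; substitute your provider's rates.
PRICES = {
    "large-model": {"prompt": 0.0030, "completion": 0.0150},
    "small-model": {"prompt": 0.0005, "completion": 0.0015},
}


def request_cost(row):
    """Cost of one request from its token counts and model price."""
    price = PRICES[row["model"]]
    return (row["prompt_tokens"] * price["prompt"]
            + row["completion_tokens"] * price["completion"]) / 1000


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def profile(rows):
    """Summarize a set of log rows as total cost and p95 latency."""
    return {
        "total_cost": sum(request_cost(r) for r in rows),
        "p95_latency_s": percentile([r["latency_s"] for r in rows], 95),
    }


def meets_targets(before, after, cost_cut=0.30, p95_limit_s=1.8):
    """True if cost fell by at least cost_cut and p95 is under the limit."""
    saved = 1 - after["total_cost"] / before["total_cost"]
    return saved >= cost_cut and after["p95_latency_s"] < p95_limit_s
```

Calling `meets_targets(profile(baseline_rows), profile(optimized_rows))` gives a single pass/fail signal, where `baseline_rows` and `optimized_rows` stand for whatever before/after log slices you compare. The write-up should still show the full distributions rather than only this check.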
Earning criteria — what you'll demonstrate
- Profile LLM cost and latency distributions from real logs
- Apply prompt compression, model tiering, and caching as cost levers
- Calibrate LLM-as-judge against human ratings
- Communicate optimization trade-offs to product stakeholders
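For the calibration criterion, a simple starting point is to compare judge scores with human scores on the same articles. The sketch below assumes both are integer ratings on a shared scale (for example 1 to 5), which the brief does not specify, and reports exact agreement, agreement within one point, and the judge's mean bias.

```python
def calibration_report(judge, human):
    """Compare LLM-judge scores with human scores for the same articles."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need equal-length, non-empty score lists")
    diffs = [j - h for j, h in zip(judge, human)]
    n = len(diffs)
    return {
        # Share of articles where judge and human gave the same score.
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        # Share of articles where the scores differ by at most one point.
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        # Positive means the judge scores higher than humans on average.
        "mean_bias": sum(diffs) / n,
    }
```

A large mean bias or low agreement on the human-rated articles suggests revising the judge prompt or rubric before trusting it on the full eval set. Rank correlation or Cohen's kappa are common alternatives to these three numbers.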
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
AI Engineer
Profiling, optimizing, and shipping cost/latency wins on a real LLM service is the day-to-day work of AI engineers on teams scaling AI products.
This challenge sharpens
- cost-optimization
- latency-optimization
- prompt-compression
MLOps Engineer
Model tiering and request-level caching are core MLOps work on inference platforms.
This challenge sharpens
- model-tiering
- response-caching
- cost-optimization
AI Product Manager
Owning the quality-vs-cost trade-off and the board-facing write-up is the AI PM's daily job.
This challenge sharpens
- cost-optimization
- llm-evaluation
- model-tiering