Overview
What this challenge is about.
You receive a 90-day RED (Rate, Errors, Duration) metrics export for the subscriber API, covering 6 endpoints, along with 38 weeks of paging history. Define an SLO for each endpoint (for example, 99.9 percent availability over 30 days, or p95 latency under 220 ms). Compute the error budgets and implement multi-window, multi-burn-rate alerts, following the patterns in chapter 5 of the Google SRE Workbook. Author a Prometheus alert rule library and an Alertmanager routing tree, with severity tiers tied to burn-rate speed. Deliverables: a 12-page SLO catalog, an alert rule library (YAML), an 8-page on-call runbook, and a 4-week evaluation comparing paging volume and false-positive rates before and after the change.
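As a rough illustration of the error-budget and burn-rate arithmetic this challenge asks for, the Python sketch below uses the example target of 99.9 percent availability over 30 days. The window pairs and budget fractions (2 percent of the budget in 1 hour, 5 percent in 6 hours, 10 percent in 3 days) are the defaults commonly cited from the SRE Workbook's multi-window approach and are an assumption here, not values given in this brief. All class and function names are hypothetical.

```python
from dataclasses import dataclass

SLO_TARGET = 0.999          # example target from the brief
PERIOD_HOURS = 30 * 24      # 30-day SLO window


@dataclass(frozen=True)
class BurnWindow:
    """One alert tier: a long window paired with a short confirmation window."""
    severity: str
    long_hours: float
    short_hours: float
    budget_fraction: float  # share of the period's budget spent in the long window


# Assumed defaults (not from the brief): commonly cited workbook-style tiers.
WINDOWS = [
    BurnWindow("page", long_hours=1, short_hours=5 / 60, budget_fraction=0.02),
    BurnWindow("page", long_hours=6, short_hours=0.5, budget_fraction=0.05),
    BurnWindow("ticket", long_hours=72, short_hours=6, budget_fraction=0.10),
]


def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail over the SLO period."""
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, period_hours: float) -> float:
    """Error budget expressed as minutes of full outage over the period."""
    return error_budget(slo_target) * period_hours * 60


def burn_rate_threshold(window: BurnWindow, period_hours: float) -> float:
    """Burn rate at which `budget_fraction` of the budget is spent in the long window."""
    return window.budget_fraction * period_hours / window.long_hours


def should_fire(long_ratio: float, short_ratio: float,
                threshold: float, slo_target: float) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    limit = threshold * error_budget(slo_target)
    return long_ratio > limit and short_ratio > limit


for w in WINDOWS:
    rate = burn_rate_threshold(w, PERIOD_HOURS)
    print(f"{w.severity}: {w.long_hours}h window, burn rate {rate:.1f}")
```

Requiring the short window to exceed the same threshold lets the alert reset soon after the error rate recovers, which is one reason this pattern tends to produce fewer stale pages than a single long window.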
The Brief
What you'll do, and what you'll demonstrate.
Redesign alerting around SLOs and multi-window burn-rate alerts to cut non-actionable pages in half without missing real incidents.
Earning criteria — what you'll demonstrate
- Define SLOs grounded in user-experience signals, not infrastructure metrics
- Compute error budgets and burn-rate alerts at multiple windows
- Author Prometheus + Alertmanager rule libraries that survive review
- Honestly evaluate alerting changes against actual paging history
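One way to frame the before/after evaluation in the last criterion is to normalize paging volume per week, since the historical record and the evaluation period differ in length. The sketch below is a minimal example under stated assumptions: the `Page` record, its `actionable` label, and the sample counts are made up for illustration, and real labels would have to come from reviewing the paging history itself.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Page:
    """Hypothetical paging record; `actionable` is a label assigned during review."""
    week: int
    actionable: bool


@dataclass(frozen=True)
class PagingSummary:
    pages_per_week: float
    non_actionable_per_week: float
    false_positive_rate: float


def summarize(pages: Iterable[Page], weeks: int) -> PagingSummary:
    """Normalize paging volume per week so periods of different length compare fairly."""
    if weeks <= 0:
        raise ValueError("weeks must be positive")
    records = list(pages)
    total = len(records)
    noise = sum(1 for p in records if not p.actionable)
    fp_rate = noise / total if total else 0.0
    return PagingSummary(total / weeks, noise / weeks, fp_rate)


def meets_goal(before: PagingSummary, after: PagingSummary,
               incidents_total: int, incidents_paged: int) -> bool:
    """Non-actionable pages per week at least halved, and no real incident missed."""
    halved = after.non_actionable_per_week <= 0.5 * before.non_actionable_per_week
    none_missed = incidents_paged == incidents_total
    return halved and none_missed


# Illustrative toy data only: 5 pages per week before (1 actionable),
# 2 pages per week after (1 actionable).
before_pages = [Page(week=w, actionable=(i == 0)) for w in range(38) for i in range(5)]
after_pages = [Page(week=w, actionable=(i == 0)) for w in range(4) for i in range(2)]

before = summarize(before_pages, weeks=38)
after = summarize(after_pages, weeks=4)
print(before)
print(after)
print("goal met:", meets_goal(before, after, incidents_total=4, incidents_paged=4))
```

Tracking missed incidents alongside the false-positive rate keeps the comparison honest: a rule set that pages less only because it also misses real incidents would fail the second check.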
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Career mappings coming soon.