Overview
What this challenge is about.
Design and prototype three things: (1) a primary-region deployment of the RAG service (vector database, LLM inference, and retrieval API); (2) a passive secondary region with a replicated vector store; and (3) a documented failover procedure with explicit Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets. Then run a chaos test that simulates a primary-region outage and measure the actual RTO. Deliverables: Terraform infrastructure-as-code, a 5-page failover runbook, and a chaos-test report.
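One way to turn "measure actual RTO" into a number is to probe the retrieval API at a fixed interval during the chaos test and take the time from the first failed probe to the next successful one. The sketch below assumes probe results are recorded as timestamp/ok pairs; `Probe` and `measured_rto` are hypothetical names chosen for illustration, and the sample timeline is made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Probe:
    """One health-check sample from the chaos test (hypothetical record shape)."""
    at: datetime  # when the probe was sent
    ok: bool      # True if the retrieval API answered successfully


def measured_rto(probes: Sequence[Probe]) -> Optional[timedelta]:
    """Time from the first failed probe to the next successful probe.

    Returns None if no outage was observed or service never recovered.
    """
    outage_start = None
    for probe in sorted(probes, key=lambda p: p.at):
        if outage_start is None:
            if not probe.ok:
                outage_start = probe.at  # first observed failure
        elif probe.ok:
            return probe.at - outage_start  # first success after the outage
    return None


# Illustrative timeline: one probe every 30 seconds, outage at the third probe.
t0 = datetime(2024, 1, 1, 3, 0, 0)
results = [True, True, False, False, False, True, True]
probes = [Probe(t0 + timedelta(seconds=30 * i), ok) for i, ok in enumerate(results)]
rto = measured_rto(probes)
print(rto)  # 0:01:30
```

Because the outage is only noticed at the first failed probe, this measurement can understate the true RTO by up to one probe interval, so a short interval gives a tighter figure for the chaos-test report.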
The Brief
What you'll do, and what you'll demonstrate.
Design and prove a multi-region failover for a RAG service that achieves a measured RTO under 30 minutes and an RPO under 5 minutes.
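The two targets can be checked mechanically once the test produces timestamps. The sketch below treats RPO as the gap between the last write the primary accepted and the last write the secondary had replicated when the outage began. That definition, the function names, and the sample timestamps are assumptions for illustration; only the 30-minute and 5-minute thresholds come from the brief.

```python
from datetime import datetime, timedelta

RTO_TARGET = timedelta(minutes=30)  # from the brief
RPO_TARGET = timedelta(minutes=5)   # from the brief


def measured_rpo(last_primary_write: datetime,
                 last_replicated_write: datetime) -> timedelta:
    """Data-loss window: writes the primary accepted but the secondary never received."""
    return max(last_primary_write - last_replicated_write, timedelta(0))


def meets_targets(rto: timedelta, rpo: timedelta) -> dict:
    """Strict comparison, since the brief says 'under' each target."""
    return {"rto_ok": rto < RTO_TARGET, "rpo_ok": rpo < RPO_TARGET}


# Illustrative values only, not results from a real test.
rpo = measured_rpo(
    last_primary_write=datetime(2024, 1, 1, 3, 0, 0),
    last_replicated_write=datetime(2024, 1, 1, 2, 57, 30),
)
result = meets_targets(rto=timedelta(minutes=18), rpo=rpo)
print(rpo, result)
```

For a vector store, the replicated-write timestamp would typically come from whatever replication or snapshot mechanism the design uses, so the accuracy of the RPO figure depends on how that mechanism reports its progress.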
Earning criteria — what you'll demonstrate
- Design a multi-region active-passive architecture for an AI service
- Reason about RTO/RPO targets for stateful AI workloads (vector stores)
- Run a chaos test that proves the design under failure
- Author an SRE runbook that holds up at 3am
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
AI Solutions Architect
Multi-region failover design for stateful AI workloads is the kind of project that defines an AI solutions architect's first year at an enterprise-AI company.
This challenge sharpens
- multi-region-architecture
- disaster-recovery
- vector-databases
MLOps Engineer
Chaos engineering and runbook authoring are the MLOps disciplines that keep AI services up under real-world stress.
This challenge sharpens
- chaos-engineering
- infrastructure-as-code
- cloud-services
AI Engineer
Designing stateful RAG architectures across regions is the AI-engineer-meets-platform skillset that early-stage AI companies hire for.
This challenge sharpens
- vector-databases
- cloud-services
- disaster-recovery