Design a Continuous Eval Pipeline for an Enterprise RAG Product
Overview
What this challenge is about.
Design (and partially build) a continuous-eval pipeline for a RAG system with four parts:
- A structured eval set with at least 50 queries grouped by query class
- Automated scoring (LLM-as-judge plus a smaller exact-match component) for answer accuracy, citation correctness, and hallucination rate
- A dashboard view (Streamlit or similar) showing scores over the last N deploys
- An alerting threshold definition for when to block a deploy
Build a working slice on around 200 public legal-policy documents (e.g., EU regulations from EUR-Lex). Produce a 3-page customer-facing commitment document plus an internal engineering proposal.
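As a rough illustration of what a structured eval set might look like, the Python sketch below defines one record type and a basic validity check. The `EvalQuery` class, its field names, and the per-class minimum are assumptions made for this example; only the 50-query minimum and the grouping by query class come from the brief.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple

MIN_QUERIES = 50  # from the brief: at least 50 queries


@dataclass(frozen=True)
class EvalQuery:
    """One eval-set record. All field names are hypothetical."""
    query_id: str
    query_class: str                 # e.g. "definition_lookup" (illustrative label)
    question: str
    expected_answer: str
    expected_citations: Tuple[str, ...] = ()  # IDs of documents that should be cited


def validate_eval_set(
    queries: List[EvalQuery],
    min_total: int = MIN_QUERIES,
    min_per_class: int = 3,          # assumed floor, not specified in the brief
) -> List[str]:
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []

    # Query IDs must be unique so scores can be tracked across deploys.
    id_counts = Counter(q.query_id for q in queries)
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    if duplicates:
        problems.append(f"duplicate query ids: {duplicates}")

    if len(queries) < min_total:
        problems.append(f"only {len(queries)} queries, need at least {min_total}")

    # Every query class needs enough examples for its score to mean something.
    class_counts = Counter(q.query_class for q in queries)
    for query_class, n in sorted(class_counts.items()):
        if n < min_per_class:
            problems.append(f"class '{query_class}' has {n} queries, need {min_per_class}")

    return problems
```

Running a check like this in CI whenever the eval set changes is one way to keep the set from quietly shrinking or losing coverage of a query class.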
The Brief
What you'll do, and what you'll demonstrate.
Design and build a working slice of a continuous-eval pipeline for an enterprise RAG product, plus a customer-facing commitment document.
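One possible shape for the deploy-blocking threshold in the working slice is a small gate function that compares a deploy's aggregate scores against fixed floors and against the previous deploy. This is a sketch under stated assumptions: the `GateThresholds` class, the metric key names, and every numeric value are placeholders for illustration, not recommended settings.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class GateThresholds:
    # Placeholder values only; real thresholds would be set from baseline
    # runs and agreed with stakeholders.
    min_answer_accuracy: float = 0.85
    min_citation_correctness: float = 0.90
    max_hallucination_rate: float = 0.05
    max_accuracy_drop: float = 0.03  # allowed drop versus the previous deploy


def blocking_reasons(
    current: Dict[str, float],
    previous: Optional[Dict[str, float]] = None,
    thresholds: GateThresholds = GateThresholds(),
) -> List[str]:
    """Return the reasons to block a deploy; an empty list means it may ship."""
    reasons = []
    if current["answer_accuracy"] < thresholds.min_answer_accuracy:
        reasons.append("answer accuracy below floor")
    if current["citation_correctness"] < thresholds.min_citation_correctness:
        reasons.append("citation correctness below floor")
    if current["hallucination_rate"] > thresholds.max_hallucination_rate:
        reasons.append("hallucination rate above ceiling")

    # A regression check catches a drop that still clears the absolute floor.
    if previous is not None:
        drop = previous["answer_accuracy"] - current["answer_accuracy"]
        if drop > thresholds.max_accuracy_drop:
            reasons.append("answer accuracy regressed versus previous deploy")
    return reasons
```

Returning reasons rather than a bare boolean makes the same function usable for both the CI gate and the alert message.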
Earning criteria — what you'll demonstrate
- Design an eval set with realistic query-class coverage for RAG
- Combine LLM-as-judge with deterministic checks for honest scoring
- Build a continuous-eval pipeline architecture
- Translate eval commitments into customer-facing prose
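To make the second criterion concrete, here is a minimal sketch of combining deterministic checks with an LLM judge. The judge is passed in as a plain callable so that no particular model API is assumed; `score_response`, the normalization rule, and the citation rule (every cited document must have been retrieved) are all illustrative choices rather than a prescribed method.

```python
import re
from typing import Callable, Dict, List

# Hypothetical judge signature: (question, answer, reference) -> True if correct.
Judge = Callable[[str, str, str], bool]


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def citations_valid(cited: List[str], retrieved: List[str]) -> bool:
    """Deterministic check: at least one citation, and none outside the retrieved set."""
    return bool(cited) and set(cited) <= set(retrieved)


def score_response(
    question: str,
    answer: str,
    expected: str,
    cited: List[str],
    retrieved: List[str],
    judge: Judge,
) -> Dict[str, bool]:
    """Score one response; the judge is consulted only when exact match fails."""
    exact = normalize(answer) == normalize(expected)
    # Exact match is cheap and unambiguous, so it takes priority over the judge.
    accurate = True if exact else judge(question, answer, expected)
    return {
        "exact_match": exact,
        "judge_used": not exact,
        "accurate": accurate,
        "citations_valid": citations_valid(cited, retrieved),
    }
```

Keeping the citation check deterministic means that part of the score cannot drift when the judge model or its prompt changes.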
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
AI Engineer
Building the working slice end-to-end is the AI engineer's bread and butter on any team that ships RAG products.
This challenge sharpens
- retrieval-augmented-generation
- python
- llm-evaluation
Prompt Engineer
Designing and validating LLM-as-judge prompts is exactly what a prompt engineer contributes to a serious eval pipeline.
This challenge sharpens
- llm-evaluation
- continuous-evaluation
- stakeholder-communication
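One common way to validate an LLM-as-judge prompt is to measure its agreement with human labels on a small sample, corrected for chance. The sketch below computes Cohen's kappa in plain Python; the choice of kappa as the validation metric and the label encoding (1 for correct, 0 for incorrect) are assumptions for illustration, not requirements of the challenge.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(human)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Expected agreement if both raters labeled independently at their own base rates.
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    expected = sum(human_counts[label] * judge_counts[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label, so agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 indicates the judge tracks human graders closely, while a value near 0 means its agreement is no better than chance and the prompt likely needs revision before its scores gate a deploy.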