RAG Faithfulness Evaluation for a Medical-Education Assistant
Overview
What this challenge is about.
You receive 200 student-style questions, two RAG configurations, and the medical-textbook corpus they retrieve from. Config A uses vector-only retrieval with a GPT-class generator; config B uses hybrid retrieval plus reranking with a GPT-class generator. Build a faithfulness eval harness with three methods: (1) an LLM judge driven by a careful prompt with claim decomposition, (2) an NLI entailment classifier applied per claim (e.g., bart-large-mnli), and (3) manual rubric scoring that you perform yourself on 30 of the questions. Run both configs through all three methods. Report per-method scores, inter-method agreement, and per-question disagreements. Recommend one config and an ongoing eval cadence.
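To make the per-claim scoring concrete, here is a minimal sketch of how one answer could be turned into a faithfulness score. Everything named here is an assumption for illustration: `decompose` is a naive sentence splitter standing in for real claim decomposition, `toy_is_supported` is a substring check standing in for an NLI model or LLM judge, and the score definition (supported claims divided by total claims) is one common choice, not something the challenge prescribes.

```python
import re
from typing import Callable, List, Optional


def decompose(answer: str) -> List[str]:
    """Hypothetical placeholder for claim decomposition.

    Splits on sentence-ending punctuation. A real harness would likely
    use an LLM prompt or a dedicated splitter to get atomic claims.
    """
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def faithfulness_score(
    answer: str,
    passages: List[str],
    is_supported: Callable[[str, str], bool],
) -> Optional[float]:
    """Fraction of claims supported by at least one retrieved passage.

    `is_supported(passage, claim)` is pluggable, so the same loop can
    serve an NLI classifier or an LLM judge. Returns None when the
    answer yields no claims, so empty answers are not scored as 0 or 1.
    """
    claims = decompose(answer)
    if not claims:
        return None
    supported = sum(
        1 for claim in claims
        if any(is_supported(passage, claim) for passage in passages)
    )
    return supported / len(claims)


def toy_is_supported(passage: str, claim: str) -> bool:
    """Toy verifier (assumption): case-insensitive substring match."""
    return claim.lower().rstrip(".!?") in passage.lower()


if __name__ == "__main__":
    answer = "Insulin lowers blood glucose. Insulin is made in the liver."
    passages = ["Insulin lowers blood glucose by promoting uptake into cells."]
    print(faithfulness_score(answer, passages, toy_is_supported))
```

In this toy run the first sentence is found in the passage and the second is not, so one of two claims counts as supported. Swapping `toy_is_supported` for an entailment model or a judge call changes the verifier without touching the scoring loop, which keeps the three methods comparable.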
The Brief
What you'll do, and what you'll demonstrate.
Build a multi-method faithfulness eval that lets a medical advisory board sign off on a RAG study assistant.
Earning criteria — what you'll demonstrate
- Design a multi-method faithfulness evaluation for RAG outputs
- Implement claim decomposition for fine-grained scoring
- Reason about LLM-judge bias and triangulate with non-LLM methods
- Translate evaluation results into a memo for a non-ML advisory board
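Triangulating across methods comes down to two computations: how often the methods agree beyond chance, and which questions they disagree on. The sketch below assumes each method has already been reduced to a pass/fail verdict per question (for example by thresholding a faithfulness score, which is itself an assumption), and uses Cohen's kappa as the agreement statistic. The method names and the `verdicts` layout are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Verdicts = Dict[str, Dict[str, bool]]  # method name -> question id -> pass/fail


def cohen_kappa(a: List[bool], b: List[bool]) -> float:
    """Cohen's kappa for two equal-length label lists."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in set(a) | set(b))
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def pairwise_kappa(verdicts: Verdicts) -> Dict[Tuple[str, str], float]:
    """Kappa for every pair of methods, over the questions both scored."""
    result = {}
    for m1, m2 in combinations(sorted(verdicts), 2):
        shared = sorted(set(verdicts[m1]) & set(verdicts[m2]))
        if shared:  # manual scoring covers only a subset of questions
            result[(m1, m2)] = cohen_kappa(
                [verdicts[m1][q] for q in shared],
                [verdicts[m2][q] for q in shared],
            )
    return result


def per_question_disagreements(verdicts: Verdicts) -> List[str]:
    """Question ids scored by every method where the verdicts differ."""
    shared = set.intersection(*(set(v) for v in verdicts.values()))
    return sorted(q for q in shared if len({v[q] for v in verdicts.values()}) > 1)


if __name__ == "__main__":
    verdicts = {
        "llm_judge": {"q1": True, "q2": True, "q3": False},
        "nli": {"q1": True, "q2": False, "q3": False},
        "manual": {"q1": True, "q2": True},
    }
    print(pairwise_kappa(verdicts))
    print(per_question_disagreements(verdicts))
```

Restricting each comparison to shared question ids matters here because the manual rubric covers only 30 of the 200 questions. Questions where the LLM judge passes an answer that both the entailment check and the manual score fail are the natural place to look for judge bias.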
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
AI Safety Researcher
Multi-method faithfulness evaluation with claim decomposition is exactly the eval work safety researchers do on production LLM systems.
This challenge sharpens
- faithfulness
- llm-as-judge
- evaluation-harness
AI Engineer
Standing up a reusable RAG eval harness is core infrastructure work for AI engineers on any RAG product team.
This challenge sharpens
- rag-evaluation
- evaluation-harness
- python
Applied AI Scientist
Triangulating LLM-judge with entailment and manual scoring is the kind of methodological rigor applied AI scientists bring to high-stakes deployments.
This challenge sharpens
- llm-as-judge
- entailment
- faithfulness