RAG Faithfulness Evaluation for a Medical-Education Assistant

Free · Verified credential · 2 weeks · Advanced

Overview

What this challenge is about.

You receive 200 student-style questions, two RAG configurations (config A: vector-only retrieval + GPT-class generator; config B: hybrid retrieval + reranking + GPT-class generator), and the medical-textbook corpus they retrieve from. Build a faithfulness eval harness with three methods: (1) an LLM judge using a careful prompt with claim decomposition, (2) an NLI entailment classifier applied per claim (e.g., bart-large-mnli), and (3) manual rubric scoring, done by you, on 30 of the questions. Run both configs through all three methods. Report per-method scores, inter-method agreement, and per-question disagreements. Recommend one config plus an ongoing eval cadence.
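As a rough illustration of the "inter-method agreement" and "per-question disagreements" deliverables, here is a minimal sketch that assumes each method's output has already been reduced to a binary faithful/unfaithful label per question. The function names, the method names, and the dictionary layout are hypothetical choices for this example, not part of the challenge materials. Cohen's kappa is one common agreement statistic; the brief does not prescribe a specific one.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary labels (0 or 1)."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Chance agreement: both say 1, or both say 0.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both methods gave the same constant label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


def agreement_report(labels):
    """Pairwise kappa and disagreeing question ids.

    labels: {method_name: {question_id: 0 or 1}} (assumed layout).
    Only questions scored by both methods in a pair are compared, which
    matters when manual scoring covers a 30-question subset.
    """
    report = {}
    for m1, m2 in combinations(sorted(labels), 2):
        shared = sorted(set(labels[m1]) & set(labels[m2]))
        a = [labels[m1][q] for q in shared]
        b = [labels[m2][q] for q in shared]
        report[(m1, m2)] = {
            "kappa": cohen_kappa(a, b),
            "disagreements": [q for q, x, y in zip(shared, a, b) if x != y],
        }
    return report
```

Running `agreement_report` once per config gives a table of pairwise kappas and a list of question ids to inspect by hand. If per-method scores are continuous rather than binary, a rank correlation would be a more natural choice than kappa.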

Credential: Blockchain-anchored

Shareable: LinkedIn-ready

Language: English

Pace: Self-paced

The Brief

What you'll do, and what you'll demonstrate.

Build a multi-method faithfulness eval that lets a medical advisory board sign off on a RAG study assistant.

Earning criteria — what you'll demonstrate

Design a multi-method faithfulness evaluation for RAG outputs
Implement claim decomposition for fine-grained scoring
Reason about LLM-judge bias and triangulate with non-LLM methods
Translate evaluation results into a non-ML advisory board memo
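To make the claim-decomposition criterion concrete, the sketch below scores an answer as the fraction of its claims supported by at least one retrieved passage. Everything here is an assumption for illustration: `split_into_claims` is a naive sentence splitter standing in for an LLM-based decomposer, `is_supported` is a pluggable callable that would wrap an NLI classifier or an LLM judge in a real harness, and the "any passage supports the claim" rule is just one reasonable aggregation choice.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ClaimResult:
    claim: str
    supported: bool


def split_into_claims(answer: str) -> List[str]:
    """Placeholder decomposer (assumption): treat each sentence as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def score_answer(
    answer: str,
    passages: List[str],
    is_supported: Callable[[str, str], bool],
    decompose: Callable[[str], List[str]] = split_into_claims,
) -> Tuple[Optional[float], List[ClaimResult]]:
    """Return (fraction of supported claims, per-claim results).

    is_supported(passage, claim) is a hypothetical hook; a claim counts as
    supported if any retrieved passage supports it. The score is None when
    the answer yields no claims.
    """
    results = [
        ClaimResult(claim, any(is_supported(p, claim) for p in passages))
        for claim in decompose(answer)
    ]
    if not results:
        return None, results
    return sum(r.supported for r in results) / len(results), results


if __name__ == "__main__":
    # Toy check only: substring matching is NOT a real entailment test.
    def toy(passage, claim):
        return claim.lower().rstrip(".") in passage.lower()

    passages = ["Beta blockers lower heart rate and blood pressure."]
    answer = "Beta blockers lower heart rate. They cure diabetes."
    print(score_answer(answer, passages, toy))
```

Keeping the per-claim results alongside the score makes it easier to show reviewers exactly which statement in an answer lacked support.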

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

Retrieval-Augmented Generation

Master · AI/ML

Fit score: 1

Skills

Skills you'll demonstrate.

Each one shows up on your verified credential.

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

Career paths this builds toward

Canonical roles

AI Safety Researcher

Multi-method faithfulness evaluation with claim decomposition is exactly the eval work safety researchers do on production LLM systems.

This challenge sharpens

faithfulness
llm-as-judge
evaluation-harness

AI Engineer

Standing up a reusable RAG eval harness is core AI-engineer infrastructure work in any RAG product team.

This challenge sharpens

rag-evaluation
evaluation-harness
python

Applied AI Scientist

Triangulating LLM-judge with entailment and manual scoring is the kind of methodological rigor applied AI scientists bring to high-stakes deployments.

This challenge sharpens

llm-as-judge
entailment
faithfulness

One more thing

You can put a credential on your CV by Friday.