Build an Evaluation Harness for an Internal LLM Assistant

Free · Verified credential · 3 weeks · Advanced

Overview

What this challenge is about.

You will design and implement an evaluation harness in Python that runs four test suites: (1) helpfulness (LLM-as-judge with rubric), (2) factual grounding (compare cited sources to retrieved sources), (3) refusal of harmful content (use a small public harmful-prompt benchmark), (4) prompt-injection resistance (curated attacks). Populate ~120 cases per suite. Run it against an open-weight model (e.g., Qwen2.5 14B) and a frontier-API model. Deliver: harness code, test cases, scored results, and a 4-page selection memo with caveats and re-run instructions.
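As a rough illustration of the shape such a harness could take, the sketch below routes each test case to a per-suite scorer and averages the scores per suite. Every name here (`Case`, `ModelOutput`, `grounding_score`, `run_suites`, the stub model, the `doc-*` identifiers) is hypothetical and not part of the challenge materials. Only the grounding scorer is filled in, under the assumption that "compare cited sources to retrieved sources" means the fraction of cited sources that were actually retrieved.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Case:
    suite: str                      # e.g. "helpfulness", "grounding", "refusal", "injection"
    prompt: str
    retrieved_sources: List[str] = field(default_factory=list)  # only used by grounding


@dataclass
class ModelOutput:
    text: str
    cited_sources: List[str] = field(default_factory=list)


# Assumption: a "model" is any callable that maps a case to an output.
Model = Callable[[Case], ModelOutput]
Scorer = Callable[[Case, ModelOutput], float]


def grounding_score(case: Case, output: ModelOutput) -> float:
    """Fraction of cited sources that were actually retrieved (0.0 if nothing is cited)."""
    cited = set(output.cited_sources)
    if not cited:
        return 0.0
    return len(cited & set(case.retrieved_sources)) / len(cited)


def run_suites(model: Model, cases: List[Case], scorers: Dict[str, Scorer]) -> Dict[str, float]:
    """Score every case with its suite's scorer and return the mean score per suite."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for case in cases:
        score = scorers[case.suite](case, model(case))
        totals[case.suite] = totals.get(case.suite, 0.0) + score
        counts[case.suite] = counts.get(case.suite, 0) + 1
    return {suite: totals[suite] / counts[suite] for suite in totals}


def stub_model(case: Case) -> ModelOutput:
    """Stand-in for a real model client; always cites the same two sources."""
    return ModelOutput(text="stub answer", cited_sources=["doc-1", "doc-9"])


cases = [
    Case("grounding", "Q1", retrieved_sources=["doc-1", "doc-2"]),
    Case("grounding", "Q2", retrieved_sources=["doc-1", "doc-9", "doc-3"]),
]
results = run_suites(stub_model, cases, {"grounding": grounding_score})
print(results)
```

Keeping the model behind a plain callable is one way to run the same cases against both an open-weight model and an API model; the other three suites would each register their own scorer in the same dictionary.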

Credential: Blockchain-anchored

Shareable: LinkedIn-ready

Language: English

Pace: Self-paced

The Brief

What you'll do, and what you'll demonstrate.

Build a reusable LLM evaluation harness that covers helpfulness, grounding, refusal, and prompt-injection resistance, and use it to pick a base model.

Earning criteria — what you'll demonstrate

Design an evaluation harness that covers safety and quality dimensions
Apply LLM-as-judge with rubrics and inter-rater calibration
Test for prompt injection with a meaningful threat model
Communicate evaluation results as a model-selection decision
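One common way to check inter-rater calibration for an LLM judge is to compare its labels with human labels on a shared sample and report Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a minimal pure-Python version; the function name and the example labels are illustrative and not taken from the challenge, and treating the degenerate single-label case as perfect agreement is a convention chosen here.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over labels of the product of each rater's label rate.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; treated here as kappa = 1.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only: human rubric verdicts vs. judge-model verdicts.
human = ["pass", "pass", "fail", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

Here the raters agree on 5 of 6 items and chance agreement is 0.5, so kappa is (5/6 - 1/2) / (1/2) = 2/3. A low value would suggest revising the rubric or the judge prompt before trusting the helpfulness scores.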

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

Large Language Models

Master · AI/ML

Fit score: 1

Skills

Skills you'll demonstrate.

Each one shows up on your verified credential.

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

Career paths this builds toward

Canonical roles

AI Safety Researcher

Building a multi-dimensional LLM evaluation harness is core safety-research work at any enterprise-AI vendor.

This challenge sharpens

llm-evaluation
prompt-injection-testing
grounding-evaluation

ML Researcher

Designing test cases and judge calibration is the methodological core of LLM-as-judge research.

This challenge sharpens

llm-as-judge
benchmark-design
llm-evaluation

AI Engineer

Wiring a reusable evaluation harness into the engagement workflow is the AI-engineer skillset that consultancies hire for.

This challenge sharpens

python
llm-evaluation
benchmark-design

One more thing

You can put a credential on your CV by Friday.
