Overview
What this challenge is about.
You will design and implement a Python evaluation harness that runs four test suites: (1) helpfulness (LLM-as-judge scored against a rubric), (2) factual grounding (compare the sources a response cites to the sources that were retrieved), (3) refusal of harmful content (using a small public harmful-prompt benchmark), and (4) prompt-injection resistance (curated attacks). Populate each suite with roughly 120 cases. Run the harness against one open-weight model (e.g., Qwen2.5 14B) and one frontier API model. Deliverables: harness code, test cases, scored results, and a 4-page model-selection memo with caveats and re-run instructions.
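One possible shape for the harness core is sketched below. It is a minimal illustration, not a prescribed design: the `Case` record, the `grounding_score` and `run_suite` functions, and the assumption that a model is a callable returning a dict with a `citations` list are all hypothetical names introduced here. The grounding scorer is one simple reading of "compare cited sources to retrieved sources": the fraction of cited source IDs that were actually retrieved.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Case:
    """One test case. All field names are illustrative assumptions."""
    suite: str                      # e.g. "helpfulness", "grounding", "refusal", "injection"
    prompt: str
    retrieved: List[str] = field(default_factory=list)  # source IDs given to the model

# Assumed interface: a model maps a prompt to a dict such as
# {"text": "...", "citations": ["doc-1", "doc-7"]}.
Model = Callable[[str], Dict]
Scorer = Callable[[Case, Dict], float]

def grounding_score(case: Case, response: Dict) -> float:
    """Fraction of cited source IDs that appear among the retrieved sources.

    A response with no citations scores 0.0 under this (assumed) convention.
    """
    cited = set(response.get("citations", []))
    if not cited:
        return 0.0
    return len(cited & set(case.retrieved)) / len(cited)

def run_suite(model: Model, cases: List[Case], scorer: Scorer) -> float:
    """Run every case through the model and return the mean score."""
    scores = [scorer(case, model(case.prompt)) for case in cases]
    return sum(scores) / len(scores) if scores else 0.0
```

Keeping the model behind a plain callable means the same suites can be pointed at a locally served open-weight model or an API client by swapping one function, which also keeps re-runs straightforward.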
The Brief
What you'll do, and what you'll demonstrate.
Build a reusable LLM evaluation harness that covers helpfulness, grounding, refusal, and prompt-injection resistance, and use it to pick a base model.
Earning criteria — what you'll demonstrate
- Design an evaluation harness that covers safety and quality dimensions
- Apply LLM-as-judge with rubrics and inter-rater calibration
- Test for prompt injection with a meaningful threat model
- Communicate evaluation results as a model-selection decision
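For the judge-calibration criterion, one common way to check an LLM judge against human labels is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an assumption about how calibration might be measured, not a requirement of the brief; the function name `cohens_kappa` and the example labels are illustrative only.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa between two raters who labeled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)

# Made-up labels for illustration: 1 = "meets rubric", 0 = "does not".
human_labels = [1, 1, 0, 0]
judge_labels = [1, 0, 0, 0]
kappa = cohens_kappa(human_labels, judge_labels)  # 0.5 for these labels
```

A low kappa on a hand-labeled subset suggests the rubric or judge prompt needs revision before the judge's scores are used in the selection memo.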
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
AI Safety Researcher
Building a multi-dimensional LLM evaluation harness is core safety-research work at enterprise-AI vendors.
This challenge sharpens
- llm-evaluation
- prompt-injection-testing
- grounding-evaluation
ML Researcher
Designing test cases and calibrating judges form the methodological core of LLM-as-judge research.
This challenge sharpens
- llm-as-judge
- benchmark-design
- llm-evaluation
AI Engineer
Wiring a reusable evaluation harness into the engagement workflow is the AI-engineer skillset that consultancies hire for.
This challenge sharpens
- python
- llm-evaluation
- benchmark-design