Overview
What this challenge is about.
You receive the lab's written refusal policy (version 2.3) and a starter set of 60 red-team prompts (10 per category). Extend the set to 240 prompts (40 per category) using documented elicitation patterns (direct, indirect, role-play, multi-turn build-up). Run the prompts against the candidate model and the previous release. Score each response on a 4-level rubric (full refusal, soft refusal with explanation, partial compliance, full compliance) blind-graded by 2 raters. Report per-category rates, inter-rater agreement, and 10 worst-case failure traces. Recommend ship/no-ship plus monitoring metrics.
The Brief
What you'll do, and what you'll demonstrate.
Run a structured red-team eval across 6 harm categories with per-category quantitative and qualitative findings, and recommend ship/no-ship.
Earning criteria — what you'll demonstrate
- Design red-team prompts across multiple elicitation patterns
- Apply a multi-level scoring rubric with inter-rater reliability
- Detect regressions vs. a previous release
- Communicate safety findings to a release-decision review
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
AI Safety Researcher
Running structured red-team evals across multiple harm categories is the day-one job of AI safety researchers at every frontier lab.
This challenge sharpens
- red-teaming
- alignment-evaluation
- refusal-policy
ML Researcher
Inter-rater methodology and regression analysis are core post-training research skills.
This challenge sharpens
- alignment-evaluation
- inter-rater-reliability
- rubric-design
Applied AI Scientist
Translating eval results into a defensible ship/no-ship recommendation is applied-AI-scientist judgement work in safety-sensitive deployments.
This challenge sharpens
- red-teaming
- responsible-ai
- alignment-evaluation