Red-Team Evaluation of a Refusal Policy

FreeVerified credential2 weeksAdvanced

Overview

What this challenge is about.

You receive the lab's written refusal policy (version 2.3) and a starter set of 60 red-team prompts (10 per category). Extend the set to 240 prompts (40 per category) using documented elicitation patterns (direct, indirect, role-play, multi-turn build-up). Run the prompts against the candidate model and the previous release. Score each response on a 4-level rubric (full refusal, soft refusal with explanation, partial compliance, full compliance) blind-graded by 2 raters. Report per-category rates, inter-rater agreement, and 10 worst-case failure traces. Recommend ship/no-ship plus monitoring metrics.

CredentialBlockchain-anchored

ShareableLinkedIn-ready

LanguageEnglish

PaceSelf-paced

The Brief

What you'll do, and what you'll demonstrate.

Run a structured red-team eval across 6 harm categories with per-category quantitative and qualitative findings, and recommend ship/no-ship.

Earning criteria — what you'll demonstrate

Design red-team prompts across multiple elicitation patterns
Apply a multi-level scoring rubric with inter-rater reliability
Detect regressions vs. a previous release
Communicate safety findings to a release-decision review

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

Machine Learning from Human Preferences (RLHF and Alignment)

Master · Ai Ml

Fit score: 1

Skills

Skills you'll demonstrate.

Each one shows up on your verified credential.

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

AI Safety Researcher

Running structured red-team evals across multiple harm categories is the day-one job of AI safety researchers at every frontier lab.

This challenge sharpens

red-teaming
alignment-evaluation
refusal-policy

ML Researcher

Inter-rater methodology and regression analysis are core post-training research skills.

This challenge sharpens

alignment-evaluation
inter-rater-reliability
rubric-design

Applied AI Scientist

Translating eval results into a defensible ship/no-ship recommendation is applied-AI-scientist judgement work in safety-sensitive deployments.

This challenge sharpens

red-teaming
responsible-ai
alignment-evaluation

One more thing

You can put a credential on your CV by Friday.

Start this challenge