Overview
What this challenge is about.
You will design 240 probe prompts across three classes: (1) over-refusal (innocuous coding requests the model should fulfill), (2) insecure code patterns (requests where the model should warn about SQL injection, hardcoded secrets, and similar risks), and (3) training-data leakage (oblique prompts that attempt to elicit memorized snippets). You will then run the prompts against the model API, hand-score the outputs against a clear rubric, and write a 6-page red-team report that follows a model-card-friendly structure (executive summary, methodology, findings, severity table, recommendations).
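One way to keep the probe set organized before any prompts are sent is to store each probe as a small record and validate the set as a whole. The sketch below is only an illustration under stated assumptions: the `Probe` fields, the class identifiers, and the `check_probe_set` helper are hypothetical names, and the brief does not say how the 240 prompts should be divided among the three classes, so the check only requires that every class is represented.

```python
from collections import Counter
from dataclasses import dataclass

# Identifiers for the three classes named in the brief (names are illustrative).
PROBE_CLASSES = ("over_refusal", "insecure_code", "data_leakage")
TOTAL_PROBES = 240  # total stated in the brief


@dataclass(frozen=True)
class Probe:
    probe_id: str      # hypothetical unique identifier, e.g. "OR-001"
    probe_class: str   # one of PROBE_CLASSES
    prompt: str        # the text sent to the model

    def __post_init__(self):
        # Reject records whose class is not one of the three in the brief.
        if self.probe_class not in PROBE_CLASSES:
            raise ValueError(f"unknown probe class: {self.probe_class}")


def class_counts(probes):
    """Count probes per class, reporting zero for classes with no probes."""
    counts = Counter(p.probe_class for p in probes)
    return {name: counts.get(name, 0) for name in PROBE_CLASSES}


def check_probe_set(probes, total=TOTAL_PROBES):
    """Return a list of problems; an empty list means the set passes."""
    problems = []
    if len(probes) != total:
        problems.append(f"expected {total} probes, found {len(probes)}")
    ids = [p.probe_id for p in probes]
    if len(set(ids)) != len(ids):
        problems.append("duplicate probe_id values")
    for name, n in class_counts(probes).items():
        if n == 0:
            problems.append(f"no probes in class {name}")
    return problems
```

Running `check_probe_set` before the API run catches a short or lopsided set early, when it is still cheap to fix.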
The Brief
What you'll do, and what you'll demonstrate.
Probe a 14B coding assistant for over-refusal, insecure code generation, and data leakage, and publish a model-card-ready red-team report.
Earning criteria — what you'll demonstrate
- Design probe prompts for distinct alignment failure modes
- Apply consistent rubrics for hand-scoring LLM outputs
- Build a severity ranking that combines likelihood and impact
- Write a public-facing red-team report
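A severity ranking that combines likelihood and impact could be sketched as follows. The brief does not prescribe a formula, so this assumes one common risk-matrix convention: likelihood is the fraction of probes hand-scored as failures, impact is an analyst-assigned ordinal score, and severity is their product. The `Finding` fields, the 1-to-4 impact scale, and `rank_findings` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    name: str
    failures: int  # probes hand-scored as failing the rubric
    trials: int    # probes run for this finding
    impact: int    # analyst-assigned, 1 (minor) to 4 (critical); assumed scale

    @property
    def likelihood(self) -> float:
        # Observed failure rate; zero when no probes were run.
        return self.failures / self.trials if self.trials else 0.0

    @property
    def severity(self) -> float:
        # Assumed convention: severity = likelihood x impact.
        return self.likelihood * self.impact


def rank_findings(findings):
    """Sort by severity (highest first), breaking ties by impact, then name."""
    return sorted(findings, key=lambda f: (-f.severity, -f.impact, f.name))
```

The tie-break on impact is a design choice: when two findings have the same product, the one with the worse consequence is listed first in the severity table. A multiplicative score is easy to explain in a public-facing report, but it can understate rare, high-impact failures, so the impact column is worth showing alongside the score.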
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
AI Safety Researcher
Designing and running an alignment red-team exercise is core day-to-day work for safety researchers at frontier AI labs.
This challenge sharpens
- red-teaming
- alignment-evaluation
- risk-assessment
ML Researcher
Disciplined probe design and inter-rater scoring are the methodological foundation of empirical LLM research.
This challenge sharpens
- prompt-design
- llm-evaluation
- alignment-evaluation
Research Scientist
Writing a model-card-ready red-team report is the kind of publishable output expected of a junior research scientist at an AI lab.
This challenge sharpens
- report-writing
- red-teaming
- alignment-evaluation