Design a Capability Evaluation for an Open-Weights Coding Model
Overview
What this challenge is about.
Pick a recent open-weights coding model (e.g., a Qwen, DeepSeek, or Llama variant). Design an evaluation set of about 40 coding tasks across four buckets: standard benign coding, dual-use security-adjacent coding, refusal-bait coding, and 'capability ceiling' hard coding. Run the model with documented decoding parameters. Score pass rate and refusal rate for each bucket, and report the results with bootstrap confidence intervals. Produce a 6-page public report covering methodology, results, limitations, and 3 policy-relevant observations. Maintain a strict no-harm policy: dual-use tasks must be drawn from already-published evaluation sets, not invented.
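One way the per-bucket scoring and bootstrap intervals could be computed is sketched below. It is a minimal illustration, not a prescribed method: it assumes each task result has already been reduced to a hypothetical `(bucket, passed, refused)` tuple, and the function names, the 2,000-resample default, and the 95% percentile interval are all assumptions chosen for the example.

```python
import random
from collections import defaultdict


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a list of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    n = len(outcomes)
    # Resample the tasks with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def summarize(results):
    """Group (bucket, passed, refused) tuples and score each bucket.

    The tuple layout is a hypothetical record format for this sketch.
    """
    by_bucket = defaultdict(list)
    for bucket, passed, refused in results:
        by_bucket[bucket].append((int(passed), int(refused)))

    report = {}
    for bucket, rows in by_bucket.items():
        passes = [p for p, _ in rows]
        refusals = [r for _, r in rows]
        report[bucket] = {
            "n": len(rows),
            "pass_rate": sum(passes) / len(rows),
            "pass_ci": bootstrap_ci(passes),
            "refusal_rate": sum(refusals) / len(rows),
            "refusal_ci": bootstrap_ci(refusals),
        }
    return report
```

If the roughly 40 tasks are split evenly, each bucket holds only about 10 tasks, so the resulting intervals will be wide. Reporting them alongside the point estimates, together with the bucket size `n`, keeps that uncertainty visible to the reader.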
The Brief
What you'll do, and what you'll demonstrate.
Run a documented capability evaluation of an open-weights coding model across benign, dual-use, refusal-bait, and hard buckets.
Earning criteria — what you'll demonstrate
- Design a multi-bucket capability evaluation with refusal tracking
- Use only public, ethics-safe task sources for dual-use buckets
- Report capability results with statistical honesty
- Translate evaluation results into policy-relevant observations
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
AI Safety Researcher
Public capability evaluations with policy framing are a core way this role contributes to the AI governance conversation.
This challenge sharpens
- capability-evaluation
- safety-evaluation
- policy-communication
ML Researcher
Designing a multi-bucket evaluation set with rigorous statistics is bread-and-butter ML research work.
This challenge sharpens
- llm-evaluation
- capability-evaluation
- research-writing
Applied AI Scientist
Translating evaluation results into policy-relevant observations is how applied AI scientists bridge into policy and governance teams.
This challenge sharpens
- llm-evaluation
- policy-communication
- research-writing