Design a Capability Evaluation for an Open-Weights Coding Model

Free · Verified credential · 3 weeks · Advanced

Overview

What this challenge is about.

Pick a recent open-weights coding model (e.g., a Qwen, DeepSeek, or Llama variant). Design an evaluation set of about 40 coding tasks split across four buckets: standard benign coding, dual-use security-adjacent coding, refusal-bait coding, and 'capability ceiling' hard coding. Run the model with documented decoding parameters, and score pass rate and refusal rate for each bucket. Report results with bootstrap confidence intervals. Produce a 6-page public report covering methodology, results, limitations, and three policy-relevant observations. Maintain a strict no-harm policy: dual-use tasks must be drawn from already-published evaluation sets, not invented.
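As a rough illustration of the scoring step, the sketch below computes per-bucket pass and refusal rates with percentile bootstrap intervals using only the Python standard library. The function names (`bootstrap_ci`, `summarize`), the `(bucket, passed, refused)` record layout, and the defaults (2,000 resamples, a 95% interval, a fixed seed) are assumptions made for this example, not requirements of the challenge.

```python
import random
from collections import defaultdict


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a list of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the reported interval is reproducible
    n = len(outcomes)
    # Resample tasks with replacement and record the rate in each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def summarize(results):
    """results: iterable of (bucket, passed, refused) tuples, one per task."""
    by_bucket = defaultdict(list)
    for bucket, passed, refused in results:
        by_bucket[bucket].append((int(passed), int(refused)))

    summary = {}
    for bucket, rows in by_bucket.items():
        passes = [p for p, _ in rows]
        refusals = [r for _, r in rows]
        summary[bucket] = {
            "n": len(rows),
            "pass_rate": sum(passes) / len(rows),
            "pass_ci": bootstrap_ci(passes),
            "refusal_rate": sum(refusals) / len(rows),
            "refusal_ci": bootstrap_ci(refusals),
        }
    return summary
```

With about 40 tasks over four buckets, each bucket holds roughly 10 tasks, so the intervals will be wide. Reporting them alongside the point estimates keeps small-sample differences between buckets from being over-read.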

Credential: Blockchain-anchored

Shareable: LinkedIn-ready

Language: English

Pace: Self-paced

The Brief

What you'll do, and what you'll demonstrate.

Run a documented capability evaluation of an open-weights coding model across benign, dual-use, refusal-bait, and hard buckets.

Earning criteria — what you'll demonstrate

Design a multi-bucket capability evaluation with refusal tracking
Use only public, ethics-safe task sources for dual-use buckets
Report capability results with statistical honesty
Translate evaluation results into policy-relevant observations

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

AI Safety and Alignment

Master · AI/ML

Fit score: 1

Skills

Skills you'll demonstrate.

Each one shows up on your verified credential.

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

Career paths this builds toward

Canonical roles

AI Safety Researcher

Public capability evaluations with policy framing are a core part of how this role contributes to the AI governance conversation.

This challenge sharpens

capability-evaluation
safety-evaluation
policy-communication

ML Researcher

Designing a multi-bucket evaluation set with rigorous statistics is bread-and-butter ML research work.

This challenge sharpens

llm-evaluation
capability-evaluation
research-writing

Applied AI Scientist

Translating evaluations into policy-relevant observations is the applied-AI bridge into policy and governance teams.

This challenge sharpens

llm-evaluation
policy-communication
research-writing

One more thing

You can put a credential on your CV by Friday.
