Research

Design a Capability Evaluation for an Open-Weights Coding Model

Free · Verified credential · 3 weeks · Advanced

Overview

What this challenge is about.

Pick a recent open-weights coding model (e.g., a Qwen, DeepSeek, or Llama variant). Design an evaluation set of around 40 coding tasks across 4 buckets: standard benign coding, dual-use security-adjacent coding, refusal-bait coding, and "capability ceiling" hard coding. Run the model with documented decoding parameters. Score pass rate and refusal rate for each bucket, and report both with bootstrap confidence intervals. Produce a 6-page public report covering methodology, results, limitations, and 3 policy-relevant observations. Maintain a strict no-harm policy: dual-use tasks must be drawn from already-published evaluation sets, not invented.
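As a rough illustration of the scoring step, the sketch below computes per-bucket pass and refusal rates with a percentile bootstrap interval. It is one possible approach, not a required method: the function names `bootstrap_ci` and `summarize`, the `(bucket, passed, refused)` record shape, the 10,000 resamples, and the 95% level are all assumptions made for this example.

```python
import random
from collections import defaultdict


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a list of 0/1 outcomes.

    Assumption: tasks within a bucket are resampled with replacement,
    treating each task as one independent observation.
    """
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


def summarize(results):
    """Aggregate (bucket, passed, refused) records into per-bucket rates."""
    by_bucket = defaultdict(list)
    for bucket, passed, refused in results:
        by_bucket[bucket].append((int(passed), int(refused)))

    summary = {}
    for bucket, rows in by_bucket.items():
        passes = [p for p, _ in rows]
        refusals = [r for _, r in rows]
        summary[bucket] = {
            "n": len(rows),
            "pass_rate": sum(passes) / len(rows),
            "pass_ci": bootstrap_ci(passes),
            "refusal_rate": sum(refusals) / len(rows),
            "refusal_ci": bootstrap_ci(refusals),
        }
    return summary
```

With roughly 10 tasks per bucket the intervals will be wide, which is worth stating plainly in the limitations section of the report rather than hiding behind point estimates.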

Credential: Blockchain-anchored
Shareable: LinkedIn-ready
Language: English
Pace: Self-paced

The Brief

What you'll do, and what you'll demonstrate.

Run a documented capability evaluation of an open-weights coding model across benign, dual-use, refusal-bait, and hard buckets.

Earning criteria — what you'll demonstrate

  • Design a multi-bucket capability evaluation with refusal tracking
  • Use only public, ethics-safe task sources for dual-use buckets
  • Report capability results with statistical honesty
  • Translate evaluation results into policy-relevant observations
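One way to make the first two criteria checkable is to keep the task set in a small manifest and validate it before any model run. The sketch below is a hypothetical layout, not a prescribed format: the `Task` fields, the bucket identifiers, and the `validate` helper are assumptions for illustration. It flags dual-use tasks that lack a citation to a published evaluation set, unknown bucket names, duplicate IDs, and empty buckets.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical bucket identifiers mirroring the four buckets in the brief.
BUCKETS = {"benign", "dual_use", "refusal_bait", "capability_ceiling"}


@dataclass(frozen=True)
class Task:
    task_id: str
    bucket: str
    prompt: str
    source: Optional[str] = None  # citation of the published eval set, if any


def validate(tasks):
    """Return a list of problems; an empty list means the manifest passes."""
    problems = []
    seen_ids = set()
    for t in tasks:
        if t.task_id in seen_ids:
            problems.append(f"{t.task_id}: duplicate task_id")
        seen_ids.add(t.task_id)
        if t.bucket not in BUCKETS:
            problems.append(f"{t.task_id}: unknown bucket {t.bucket!r}")
        # No-harm policy: dual-use tasks must cite an already-published set.
        if t.bucket == "dual_use" and not (t.source and t.source.strip()):
            problems.append(f"{t.task_id}: dual_use task has no published source")
    present = {t.bucket for t in tasks}
    for bucket in sorted(BUCKETS - present):
        problems.append(f"bucket {bucket!r} has no tasks")
    return problems
```

Running `validate` on the manifest and including its (empty) output in the methodology section gives readers a simple audit trail for the no-harm policy.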

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

Skills

Skills you'll demonstrate.

Each one shows up on your verified credential.

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

AI Safety Researcher

Public capability evaluations with policy framing are exactly the role's contribution to the AI governance conversation.

This challenge sharpens

  • capability-evaluation
  • safety-evaluation
  • policy-communication

ML Researcher

Designing a multi-bucket evaluation set with rigorous statistics is bread-and-butter ML research work.

This challenge sharpens

  • llm-evaluation
  • capability-evaluation
  • research-writing

Applied AI Scientist

Translating evaluations into policy-relevant observations is the applied-AI bridge into policy and governance teams.

This challenge sharpens

  • llm-evaluation
  • policy-communication
  • research-writing

One more thing

You can put a credential on your CV by Friday.