Analysis

Evaluate an Agent Suite on the SWE-Bench-Style Coding Benchmark

Free · Verified credential · 2 weeks · Advanced

Overview

What this challenge is about.

You receive a sandboxed set of 50 small repository-modification tasks, where a task counts as solved when its tests pass. Run three open-source agent frameworks (e.g., OpenHands, SWE-agent, and Aider) under identical compute and model budgets (Anthropic Claude or a GPT-4-class model). Report pass@1, mean cost per task, mean wall-clock time per task, and a failure-mode taxonomy across the three frameworks. Write a 5-page architecture decision record (ADR) that recommends one framework and states the explicit conditions under which the recommendation would flip.
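The three headline numbers can be aggregated from per-task run records. The sketch below assumes one attempt per task, so pass@1 reduces to the fraction of tasks whose tests pass. The `TaskResult` record and its field names are hypothetical and not part of the challenge materials.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """One agent run on one task (hypothetical record layout)."""
    framework: str
    task_id: str
    passed: bool          # True if the task's tests passed on the single attempt
    cost_usd: float       # model/API spend for this run
    wall_clock_s: float   # elapsed seconds for this run


def summarize(results):
    """Group results by framework and compute the three reported metrics."""
    by_framework = {}
    for r in results:
        by_framework.setdefault(r.framework, []).append(r)

    summary = {}
    for framework, rows in by_framework.items():
        summary[framework] = {
            # With one attempt per task, pass@1 is the plain pass rate.
            "pass_at_1": mean(1.0 if r.passed else 0.0 for r in rows),
            "mean_cost_usd": mean(r.cost_usd for r in rows),
            "mean_wall_clock_s": mean(r.wall_clock_s for r in rows),
        }
    return summary
```

If you run several attempts per task, the pass-rate line above no longer equals pass@1 and needs a per-task estimator instead.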

Credential: Blockchain-anchored
Shareable: LinkedIn-ready
Language: English
Pace: Self-paced

The Brief

What you'll do, and what you'll demonstrate.

Pick the open-source coding-agent framework that gives the organization the best pass@1 per dollar on the benchmark, and back the choice with an ADR that leadership can adopt.
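The brief does not define "pass@1 per dollar" precisely. One reasonable reading, assumed here, is pass@1 divided by mean cost per task, which equals solved tasks per dollar spent when every framework runs the same task set. The function names and the summary layout below are illustrative.

```python
def pass_at_1_per_dollar(pass_at_1, mean_cost_usd):
    """Assumed definition: pass@1 divided by mean cost per task (USD)."""
    if mean_cost_usd <= 0:
        raise ValueError("mean cost per task must be positive")
    return pass_at_1 / mean_cost_usd


def rank_frameworks(summary):
    """Return framework names ordered from best to worst pass@1 per dollar.

    `summary` maps a framework name to a dict holding "pass_at_1" and
    "mean_cost_usd" (a hypothetical layout, not a prescribed format).
    """
    return sorted(
        summary,
        key=lambda name: pass_at_1_per_dollar(
            summary[name]["pass_at_1"], summary[name]["mean_cost_usd"]
        ),
        reverse=True,
    )
```

A ranking on this ratio alone can favor a cheap framework with a low absolute pass rate, so the ADR may also want a minimum pass@1 threshold as one of its flip conditions.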

Earning criteria — what you'll demonstrate

  • Benchmark agent frameworks on a real coding task suite
  • Reason about agent cost and latency, not just accuracy
  • Author an ADR that survives technical leadership review
  • Diagnose where agent failures actually come from (planning vs. tool-use vs. model)
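A failure-mode taxonomy can start as a simple tally of hand-assigned labels per framework. The sketch below uses the three categories named in the last criterion (planning, tool-use, model). The label format and function names are assumptions, and a real taxonomy will likely need finer categories.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    PLANNING = "planning"   # wrong or incomplete plan for the change
    TOOL_USE = "tool-use"   # bad edits, commands, or file navigation
    MODEL = "model"         # the underlying model produced wrong code or reasoning


def tally_failures(labels):
    """Count failure modes per framework.

    `labels` is an iterable of (framework_name, FailureMode) pairs,
    one pair per failed task, assigned by a human reviewer.
    """
    table = {}
    for framework, mode in labels:
        table.setdefault(framework, Counter())[mode] += 1
    return table


def failure_shares(counts):
    """Convert one framework's counts into fractions of its failures."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mode: counts[mode] / total for mode in FailureMode}
```

Comparing shares rather than raw counts keeps the taxonomy readable when frameworks fail on different numbers of tasks.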

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

Skills

Skills you'll demonstrate.

Each one shows up on your verified credential.

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

AI Engineer

Cross-framework agent benchmarking with an ADR-quality writeup is the kind of project that signals senior-AI-engineer judgment in interviews.

This challenge sharpens

  • llm-agents
  • agent-evaluation
  • benchmarking

One more thing

You can put a credential on your CV by Friday.

Evaluate an Agent Suite on the SWE-Bench-Style Coding Benchmark | Ewance Challenge