Evaluate an Agent Suite on the SWE-Bench-Style Coding Benchmark
Overview
What this challenge is about.
You receive a sandboxed set of 50 small repo-modification tasks; a task counts as solved when its tests pass. Run three open-source agent frameworks (e.g., OpenHands, SWE-agent, and Aider) under identical compute and model budgets (an Anthropic Claude or GPT-4-class model). Report pass@1, mean cost per task, mean wall-clock time per task, and a failure-mode taxonomy across the three frameworks. Write a 5-page architecture decision record (ADR) that recommends one framework and states the explicit conditions under which that recommendation would flip.
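As a rough illustration of the reporting step, the sketch below aggregates per-task results into pass@1, mean cost per task, and mean wall-clock time per task for each framework. The `TaskRun` record, its field names, and the `summarize` helper are assumptions made for this example and are not part of the challenge materials; pass@1 is taken here as the fraction of tasks solved on a single attempt.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRun:
    """One attempt by one framework on one task (hypothetical record shape)."""
    framework: str
    task_id: str
    passed: bool         # True when the task's tests pass after the agent's edit
    cost_usd: float      # model/API spend attributed to this attempt
    wall_clock_s: float  # elapsed seconds for this attempt


def summarize(runs):
    """Group runs by framework and compute the three headline metrics."""
    by_framework = {}
    for run in runs:
        by_framework.setdefault(run.framework, []).append(run)

    summary = {}
    for name, fw_runs in by_framework.items():
        summary[name] = {
            "tasks": len(fw_runs),
            # One attempt per task, so pass@1 is simply solved / attempted.
            "pass_at_1": sum(r.passed for r in fw_runs) / len(fw_runs),
            "mean_cost_usd": mean(r.cost_usd for r in fw_runs),
            "mean_wall_clock_s": mean(r.wall_clock_s for r in fw_runs),
        }
    return summary


if __name__ == "__main__":
    # Made-up placeholder rows, only to show the output shape.
    demo = [
        TaskRun("framework_a", "task-01", True, 0.50, 60.0),
        TaskRun("framework_a", "task-02", False, 1.50, 120.0),
    ]
    print(summarize(demo))
```

Keeping one record per (framework, task) pair makes it easy to confirm that every framework was scored on the same 50 tasks before any averages are compared.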
The Brief
What you'll do, and what you'll demonstrate.
Pick the open-source coding-agent framework that gives the organization the best pass@1 per dollar on the benchmark, and back the choice with an ADR that leadership can adopt.
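One possible reading of "pass@1 per dollar" is pass@1 divided by mean cost per task, which works out to expected solved tasks per dollar spent. The sketch below ranks frameworks under that assumption and computes one simple flip condition: how many times more expensive the leader would have to become before the runner-up ties it. The function names and the input shape are hypothetical, and the brief does not prescribe this exact formula.

```python
def pass_per_dollar(pass_at_1, mean_cost_usd):
    """Expected solved tasks per dollar, assuming cost is a per-task mean."""
    if mean_cost_usd <= 0:
        raise ValueError("mean cost per task must be positive")
    return pass_at_1 / mean_cost_usd


def rank_frameworks(metrics):
    """metrics maps framework name -> (pass_at_1, mean_cost_usd).

    Returns (name, score) pairs sorted from best to worst score.
    """
    scored = [(name, pass_per_dollar(p, c)) for name, (p, c) in metrics.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def flip_cost_multiplier(ranked):
    """Factor by which the leader's cost must rise for the runner-up to tie."""
    if len(ranked) < 2:
        raise ValueError("need at least two frameworks to compare")
    best_score, second_score = ranked[0][1], ranked[1][1]
    if second_score == 0:
        return float("inf")  # runner-up solves nothing; cost alone cannot flip it
    return best_score / second_score


if __name__ == "__main__":
    # Placeholder numbers, not benchmark results.
    ranked = rank_frameworks({"framework_a": (0.5, 1.0), "framework_b": (0.4, 2.0)})
    print(ranked, flip_cost_multiplier(ranked))
```

A multiplier close to 1 suggests the recommendation is fragile to pricing changes, which is the kind of condition an ADR can state explicitly.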
Earning criteria — what you'll demonstrate
- Benchmark agent frameworks on a real coding task suite
- Reason about agent cost and latency, not just accuracy
- Author an ADR that survives technical leadership review
- Diagnose where agent failures actually come from (planning vs. tool-use vs. model)
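The failure diagnosis in the last criterion can be made concrete by labeling each failed task with one root cause and tallying the labels per framework. The sketch below assumes a reviewer has already assigned each failure to one of the three categories named above (planning, tool-use, model); the `FailureMode` enum and both helper functions are hypothetical names chosen for this example, and a real taxonomy may need finer categories.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    PLANNING = "planning"  # wrong or incomplete plan for the change
    TOOL_USE = "tool-use"  # bad edits, commands, or file navigation
    MODEL = "model"        # underlying model error despite a sound plan and tools


def failure_table(labels):
    """labels: iterable of (framework, FailureMode) pairs, one per failed task."""
    table = {}
    for framework, mode in labels:
        table.setdefault(framework, Counter())[mode] += 1
    return table


def failure_shares(counts):
    """Convert one framework's counts into each mode's share of its failures."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mode.value: counts[mode] / total for mode in FailureMode}


if __name__ == "__main__":
    # Placeholder labels, not observed failures.
    labels = [
        ("framework_a", FailureMode.PLANNING),
        ("framework_a", FailureMode.TOOL_USE),
        ("framework_b", FailureMode.MODEL),
    ]
    for name, counts in failure_table(labels).items():
        print(name, failure_shares(counts))
```

Reporting shares alongside raw counts keeps frameworks with very different failure totals comparable.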
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
AI Engineer
Cross-framework agent benchmarking with an ADR-quality writeup is the kind of project that signals senior-level AI engineering judgment in interviews.
This challenge sharpens
- llm-agents
- agent-evaluation
- benchmarking