← Default view
⚔ CAGE MATCH 2026
JUNE 9, 2026  ·  30-MINUTE TIMED RUN  ·  WORKSHOP BLUEPRINT
103 CITATIONS $8.44 5 THREADS
Live Workshop Blueprint · AI Coding Tools Benchmark

CODING AGENTS
CAGE MATCH

4 TOOLS  ·  1 TASK  ·  REAL SCORES  ·  LOCK TESTS BEFORE CLOCK STARTS
AI CODING WORKSHOP BENCHMARKING FRONTIER VS LOCAL LIVE SCORING
CONTESTANT ROSTER
Claude Code — Anthropic
ANTHROPIC
CLI ⭐ 131k
Context
1M tokens
SWE-bench
88.6%[1]
Input /MTok
$5.00[2]
Model
Opus 4.8
Cursor — Anysphere
ANYSPHERE
IDE
Autocomplete
72% accept[3]
DX Rank
#1 IDE
Model
Sonnet / GPT
Price
$20/mo
GitHub Copilot — Microsoft
MICROSOFT / GITHUB
Extension
Model
GPT-4.1 / Sonnet
MCP servers
10K+[4]
Price
$10/mo
Audience
Enterprise
Codex CLI — OpenAI
OPENAI
CLI ⭐ 90k
Context
128K tokens
85.0%[1]
SWE-bench
Model
GPT-5.3 Codex
Price
API / token
⚡ OPTIONAL 5th TEAM — LOCAL MODEL
Running a local-model team (Qwen3-Coder-480B-A35B ⭐ 16.6k via Ollama[5]) requires an RTX 4090 or Mac M3 Ultra — not a standard conference laptop. Frontier-vs-local gap on hard coding tasks: ~27 points SWE-bench (88.6% closed vs 60.5% best open-weight Nemotron[6]). Cost crossover is not reached in a single 30-min workshop session — local wins on privacy and latency optics, not raw cost.[7]
SCORING RUBRIC
40%
25%
15%
10%
10%
Correctness
40%
Completeness
25%
Code Quality
15%
Edge Cases
10%
Speed Bonus
10%
Post-event pass@3 sweep
Live score = initial signal only. Each team runs again twice after the audience leaves. Only pass@3 separates fluky passes from reliably capable tools — that sweep produces the publishable comparison table.[8]
CRITICAL CONSTRAINTS
CONSTRAINT 01 · BENCHMARK INTEGRITY
Lock the tests before the clock starts
Use a repo forked after all tools' training cutoffs. Lock tests/ before any agent touches the codebase — the cage match's core integrity rule that vendors cannot game.[9]
SWE-bench contamination: 35-point gap on Claude Opus 4.5
CONSTRAINT 02 · STATISTICAL VALIDITY
Results are directional, not definitive
A live event is a single trial. For the 4-tool axis, within-tool pass@1 variance may exceed the between-tool gap. Announce this framing upfront so the audience doesn't over-index on who "won" in the room.[8]
pass@3 sweep post-event = publishable data
CONSTRAINT 03 · #1 LIVE FAILURE MODE
Spec misalignment beats model quality
41.86% of all coding agent failures trace to agents misunderstanding requirements — not model incapability.[10] Put a REQUIREMENTS.md in the repo. Use SMART spec. Define 3 explicit milestones.
41.86% of failures = spec misalignment
BENCHMARK INTEL · SWE-BENCH VERIFIED
Claude Mythos Preview
closed · preview
93.9%[1]
2
Claude Opus 4.8
closed · cage match tool
88.6%
3
Claude Opus 4.7 (Adaptive)
closed
87.6%
4
GPT-5.3 Codex
closed · cage match tool
85.0%
▼ OPEN-WEIGHT FRONTIER ▼
DeepSeek V4 Pro
open-weight · MIT · 1.6T MoE (49B active)
80.6%[11]
Kimi K2.6
open-weight · Moonshot AI · 1T MoE
80.2%[12]
Nemotron 3 Super 120B
open-weight · best local cage-match pick
60.5%[6]
27pt
frontier-vs-local gap on hard coding tasks
Epoch AI: open-weight lags closed frontier by ~4 months[13]
⚠ CONTAMINATION WARNING
Claude Opus 4.5: 80.9% Verified → 45.9% SWE-bench Pro (35-point gap). Honest uncontaminated ceiling ≈ 69%.[6] HumanEval & MBPP fully saturated, no longer discriminate between models.[14]