Live Workshop Blueprint · AI Coding Tools Benchmark

CODING AGENTS
CAGE MATCH

4 TOOLS · 1 TASK · REAL SCORES · LOCK TESTS BEFORE CLOCK STARTS

AI CODING WORKSHOP · BENCHMARKING · FRONTIER VS LOCAL · LIVE SCORING

CONTESTANT ROSTER

Claude Code

ANTHROPIC

CLI ⭐ 131k

Context

1M tokens

SWE-bench

88.6%^[1]

Input /MTok

$5.00^[2]

Model

Opus 4.8

Cursor

ANYSPHERE

IDE

Autocomplete

72% accept^[3]

DX Rank

#1 IDE

Model

Sonnet / GPT

Price

$20/mo

GitHub Copilot

MICROSOFT / GITHUB

Extension

Model

GPT-4.1 / Sonnet

MCP servers

10K+^[4]

Price

$10/mo

Audience

Enterprise

Codex CLI

OPENAI

CLI ⭐ 90k

Context

128K tokens

SWE-bench

85.0%^[1]

Model

GPT-5.3 Codex

Price

API / token

⚡ OPTIONAL 5th TEAM — LOCAL MODEL

Running a local-model team (Qwen3-Coder-480B-A35B ⭐ 16.6k via Ollama^[5]) requires an RTX 4090 or Mac M3 Ultra, not a standard conference laptop. Frontier-vs-local gap on hard coding tasks: ~27 points SWE-bench (88.6% closed vs 60.5% for Nemotron, the best open-weight pick that runs locally^[6]). Cost crossover is not reached in a single 30-min workshop session; local wins on privacy and latency optics, not raw cost.^[7]

SCORING RUBRIC

Correctness

40%

Completeness

25%

Code Quality

15%

Edge Cases

10%

Speed Bonus

10%
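
A minimal sketch of how the rubric weights above could be combined into one live score. The weights come from the rubric; the 0.0 to 1.0 scale per criterion, the dictionary keys, and the function name `rubric_score` are assumptions made for illustration, not part of the blueprint.

```python
# Weights copied from the scoring rubric; they sum to 100%.
WEIGHTS = {
    "correctness": 0.40,
    "completeness": 0.25,
    "code_quality": 0.15,
    "edge_cases": 0.10,
    "speed_bonus": 0.10,
}


def rubric_score(scores: dict[str, float]) -> float:
    """Return the weighted total on a 0-100 scale.

    Assumption: judges score each criterion between 0.0 and 1.0.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return round(100 * total, 2)


# Hypothetical team: all tests pass, most features done, no speed bonus.
example = {
    "correctness": 1.0,
    "completeness": 0.8,
    "code_quality": 0.5,
    "edge_cases": 0.5,
    "speed_bonus": 0.0,
}
print(rubric_score(example))
```

Keeping the weights in one table means the judges' sheet and the scoreboard cannot drift apart during the event.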

Post-event pass@3 sweep

The live score is an initial signal only. Each team repeats the task twice more after the audience leaves, for three runs per tool. Only the pass@3 sweep separates fluky passes from reliably capable tools, and it produces the publishable comparison table.^[8]
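
One way to tabulate the sweep is the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of runs and c the number that passed. The sketch below assumes three runs per tool; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n runs of which c passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Too few failures to fill k draws, so every draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Three runs per tool: compare one lucky pass with three clean passes.
for passed in (0, 1, 3):
    print(passed, pass_at_k(3, passed, 1), pass_at_k(3, passed, 3))
```

With only three runs, pass@3 is 1.0 as soon as any run passes, so reporting the raw pass count (or pass@1) next to it is what distinguishes one lucky pass from three clean ones.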

CRITICAL CONSTRAINTS

CONSTRAINT 01 · BENCHMARK INTEGRITY

Lock the tests before the clock starts

Use a repo forked after all tools' training cutoffs. Lock tests/ before any agent touches the codebase. This is the cage match's core integrity rule: a test suite frozen in advance is one that vendors cannot game.^[9]

SWE-bench contamination: 35-point gap on Claude Opus 4.5

CONSTRAINT 02 · STATISTICAL VALIDITY

Results are directional, not definitive

A live event is a single trial. For the 4-tool axis, within-tool pass@1 variance may exceed the between-tool gap. Announce this framing upfront so the audience doesn't over-index on who "won" in the room.^[8]

pass@3 sweep post-event = publishable data
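
To make the single-trial caveat concrete, the sketch below compares a between-tool gap against binomial sampling noise. It assumes, purely for illustration, that each tool's SWE-bench score can stand in for its chance of passing the workshop task, and it uses a normal approximation that is crude at such small run counts. The function names are hypothetical.

```python
from math import sqrt


def pass_rate_stderr(p: float, n: int) -> float:
    """Standard error of an observed pass rate over n independent runs."""
    return sqrt(p * (1.0 - p) / n)


def gap_is_resolvable(p_a: float, p_b: float, n: int, z: float = 1.96) -> bool:
    """True if the gap between two pass rates exceeds z standard errors."""
    se_diff = sqrt(pass_rate_stderr(p_a, n) ** 2 + pass_rate_stderr(p_b, n) ** 2)
    return abs(p_a - p_b) > z * se_diff


# Illustrative inputs: the 88.6% and 85.0% roster scores as pass probabilities.
for runs in (1, 3, 100):
    print(runs, gap_is_resolvable(0.886, 0.850, runs))
```

Under these assumptions a 3.6-point gap is far below the noise floor at one or three runs, which is why the live result is framed as directional.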

CONSTRAINT 03 · #1 LIVE FAILURE MODE

Spec misalignment beats model quality

41.86% of all coding agent failures trace to agents misunderstanding requirements, not to model incapability.^[10] Put a REQUIREMENTS.md in the repo, write it as a SMART spec, and define 3 explicit milestones.

41.86% of failures = spec misalignment

BENCHMARK INTEL · SWE-BENCH VERIFIED

★

Claude Mythos Preview

closed · preview

93.9%^[1]

Claude Opus 4.8

closed · cage match tool

88.6%

Claude Opus 4.7 (Adaptive)

closed

87.6%

GPT-5.3 Codex

closed · cage match tool

85.0%

▼ OPEN-WEIGHT FRONTIER ▼

—

DeepSeek V4 Pro

open-weight · MIT · 1.6T MoE (49B active)

80.6%^[11]

—

Kimi K2.6

open-weight · Moonshot AI · 1T MoE

80.2%^[12]

—

Nemotron 3 Super 120B

open-weight · best local cage-match pick

60.5%^[6]

27pt

frontier-vs-local gap on hard coding tasks
Epoch AI: open-weight lags closed frontier by ~4 months^[13]

⚠ CONTAMINATION WARNING

Claude Opus 4.5: 80.9% on SWE-bench Verified → 45.9% on SWE-bench Pro (35-point gap). Honest uncontaminated ceiling ≈ 69%.^[6] HumanEval and MBPP are fully saturated and no longer discriminate between models.^[14]

RESEARCH BRIEFINGS · 5 THREADS

survey

Tool Capabilities & 2026 State-of-the-Art

7 tools dominating AI-assisted coding in mid-2026: architectures, benchmark scores, pricing, and a pick-your-stack guide.

18 citations · 6 min

survey

Task Design & Success Criteria

SMART spec, 3-milestone structure, pass@k eval methodology, and the scoring rubric — built for a timed live-audience head-to-head.

15 citations · 5 min

expedition

Frontier vs Local Model Shootout

Open weights trail closed frontier by ~4 months on Epoch's index & ~6 pts on Artificial Analysis; hardware costs, cost crossover, and where local still wins.