Decision. Default to a frontier API (Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro) for the hardest coding, long-horizon agents, and low-volume exploratory work — the gap is small but real and you pay nothing when idle [7][29]. Run a local open-weight model (Qwen3-Coder, DeepSeek V4, Kimi K2.6, GLM-5) when privacy/compliance forces it, when you sustain high volume (~500K+ tokens/day), or when you need full fine-tuning control [41][3][4]. The mature answer is hybrid: route 60-80% of traffic to local, escalate the hardest 20-40% to a frontier API [43].
The headline of mid-2026 is convergence. Epoch AI’s Capabilities Index puts the best open-weight models about four months and ~8 ECI points behind the closed frontier. Notably, that lag widened slightly from ~3 months in late 2025, a caution against assuming the gap closes on its own [27][28]. On aggregate intelligence the gap is now close to benchmark noise; on the hardest coding and long-horizon agentic work, frontier still wins clearly.
The frontier scoreboard
A four-way race. Scores from the Artificial Analysis Intelligence Index (June 2026) and SWE-bench Verified; pricing per million tokens (input/output).
| Model | AA Intelligence | SWE-bench Verified | Price in/out, $/M tokens (standard) | Notes |
|---|---|---|---|---|
| Claude Opus 4.8 | 61.4 [[7]] | 88.6% [[8]] | $5 / $25 [[10]] | Coding co-leader; Fast mode $10/$50 [[11]] |
| GPT-5.5 | 60.2 [[7]] | — | $5 / $30 [[12]] | 2× price jump over 5.4, but less verbose [[13]] |
| Gemini 3.1 Pro | 57 [[7]] | — | $2 / $12 [[16]] | Cheapest frontier-Pro tier [[16]] |
| Grok 4.3 | 53 [[14]] | ~74% [[15]] | $1.25 / $2.50 [[14]] | Value + agentic; 1M ctx; trails Opus by ~14pts [[15]] |
⚠ Treat SWE-bench Verified cautiously: contamination inflates it. Claude Opus 4.5 scores 80.9% on Verified but only 45.9% on the cleaner SWE-bench Pro, a 35-point gap suggesting that models partly recall memorized solutions [9]. Other cross-vendor price points for reference: GPT-5.2 $1.75/$14 [40], Claude Sonnet 4.6 $3/$15, Haiku 4.5 $1/$5 [39], GPT-5.5 Pro $30/$180 [12].
The open-weight contenders
Chinese labs lead the open frontier. Most entries are MoE, with sizes given as total / active params; dense models are marked as such.
| Model | Params (total/active) | License | SWE-bench Verified | Local fit |
|---|---|---|---|---|
| DeepSeek V4 Pro | 1.6T / 49B | MIT [[18]] | 80.6% [[18]] | Datacenter only; V4-Flash 284B lighter |
| Kimi K2.6 | 1T / 32B | mod. MIT [[19]] | 80.2% [[19]] | Datacenter; agentic-coding tuned |
| GLM-5 / 5.1 | 744B / 40-44B | MIT [[20]] | ~94.6% of Opus 4.6 coding (vendor) [[20]] | Multi-GPU; trained on Ascend silicon |
| Qwen3-Coder | 480B/35B & 30B/3B | Apache-2.0 [[22]] | ≈ Claude Sonnet [[22]] | 30B-A3B runs on a single 24GB GPU |
| Qwen3.6-27B | 27B dense | Apache-2.0 [[23]] | beats 397B on coding [[23]] | Flagship coding on a consumer GPU |
| gpt-oss-120b | 120B / 5.1B | Apache-2.0 [[21]] | reasoning-focused [[21]] | Fits 80GB VRAM (MXFP4) [[21]] |
| Gemma 4 26B | 26B / 3.8B | Apache-2.0 [[24]] | — | Top 24GB-VRAM pick [[24]] |
| Mistral Small 4 | 119B / ~6B | Apache-2.0 [[26]] | — | Single consumer GPU w/ quant [[26]] |
| Llama 4 Scout | 109B / 17B | Llama (restrictive) [[25]] | — | 32GB+ unified mem; EU-limited [[25]] |
Overall open-weight ranking (benchlm.ai, mid-2026): DeepSeek V4 Pro 87, Kimi K2.6 84, GLM-5.1 83, Qwen3.5 79 [17]. Phi-4 14B (MIT, 16GB) and DeepSeek-R1 distills (1.5B–70B) fill the small-model tiers [24].
How wide is the gap?
It depends entirely on which benchmark you look at — narrow on aggregate, wider on hard coding.
| Benchmark | Frontier leader | Best open weight | Gap |
|---|---|---|---|
| Epoch Capabilities Index (time) | closed frontier | ~4 months behind [[28]] | widening [[28]] |
| AA Intelligence Index (Apr) | 57 (Opus 4.7 / Gem 3.1 / GPT-5.4) [[31]] | GLM-5.1 = 51 [[31]] | 6 pts |
| LMArena Elo | GPT-5.5-high 1506 [[32]] | GLM-5.1 / DeepSeek-V4-Pro 1467 [[32]] | 39 Elo |
| SWE-bench Verified | Claude Opus 4.7 87.6% [[29]] | Nemotron 3 Super 120B 60.5% [[29]] | ~27 pts |
| Reasoning benches (avg) | frontier | open | 3–8 pts [[29]] |
| LiveCodeBench (contam-free) | GPT-5.2 / Opus 4.5 >85% [[33]] | GLM-4.7 Thinking [[33]] | small [[33]] |
| Aider polyglot | GPT-5 88.0%, Opus 4.6 82.1% [[34]] | trails on Go/Rust/Java [[34]] | language-dependent [[34]] |
Read it this way: on a chat-style aggregate (AA Index, LMArena) the best open model is within ~6 points / ~39 Elo, close enough that the difference is unlikely to decide a procurement choice [31][32]. The reasoning-benchmark gap has collapsed from 30+ points in 2024 to 3-8 points today [29]. But the open-weight agentic-coding leaders cited as flagship-grade (DeepSeek V4, Kimi K2.6, GLM-5) are datacenter models in the 0.7-1.6T-parameter range. The SWE-bench gap to a model you can actually fit on one box is much larger, and it persists most on non-Python languages [34]. Artificial Analysis now tracks this explicitly via its Openness Index [30].
Hardware & cost reality
What “local” actually demands, at the Q4_K_M quant (the consensus sweet spot, ~3-5% quality loss vs FP16) [35]:
| Model size | VRAM (Q4_K_M) | Runs on | Throughput (Q4) |
|---|---|---|---|
| 8B | ~4-7 GB [[36]] | any 8GB+ GPU | very fast |
| 32B | ~22-24 GB [[36]] | single RTX 4090 (24GB) | ~650 tok/s (4090) → ~1,100 (5090), batched [[37]] |
| 70B | ~45 GB [[35]] | dual 3090/4090, or Mac | 16-25 tok/s, dual-GPU single-user [[35]] |
| 70B (FP16) | ~143 GB [[35]] | multi-GPU server | — |
Neither the RTX 4090 (24GB) nor the 5090 (32GB) can host a 70B Q4 — that’s the key consumer ceiling, forcing dual-GPU rigs or Apple unified memory [37]. Apple Silicon trades raw speed for capacity: M3 Ultra ~41 tok/s on a 27B Q4, M4 Max ~70 tok/s on 70B-class via MLX [38].
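The VRAM column can be roughly reproduced with back-of-envelope arithmetic: weight size is parameter count times bits per weight, divided by eight. The sketch below is illustrative only. It assumes ~4.85 bits per weight for Q4_K_M and a flat 2 GB allowance for KV cache and buffers; neither figure comes from the cited sources, and real usage varies with context length and runtime.

```python
# Rough VRAM estimate: weights = params * bits_per_weight / 8, plus overhead.
# ASSUMPTIONS (not from the article): ~4.85 bits/weight for Q4_K_M and a flat
# overhead allowance for KV cache and runtime buffers.

BITS_PER_WEIGHT = {"fp16": 16.0, "q4_k_m": 4.85}  # q4_k_m value is approximate


def weights_gb(params_billion: float, quant: str) -> float:
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str, vram_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """True if weights plus a fixed overhead allowance fit in vram_gb."""
    return weights_gb(params_billion, quant) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for size in (8, 32, 70):
        print(f"{size}B q4_k_m: ~{weights_gb(size, 'q4_k_m'):.1f} GB of weights")
    print("70B q4_k_m on one 32 GB card:", fits(70, "q4_k_m", 32))
    print("70B q4_k_m on 2 x 24 GB:", fits(70, "q4_k_m", 48))
```

Under these assumptions a 70B Q4 model needs about 42 GB for weights alone, which is consistent with the ~45 GB table entry and with the point that neither a 24 GB nor a 32 GB card can hold it. The dual-GPU check ignores the extra cost of splitting tensors across cards, so treat a marginal "fits" as optimistic.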
The cost crossover is all about utilization. Naive math says self-hosting beats APIs above ~500K tokens/day sustained: a small team spending $850/mo on API calls recoups $1,500 of hardware in ~1.8 months [41]. One March-2026 batch deployment cut a $6,700/mo cloud bill to $1,280/mo on owned hardware [50]. Below ~70% GPU utilization, however, the cloud wins, because an idle GPU costs the same as a busy one; at 10% utilization the real cost per token is 10× the headline rate [42][2]. Add DevOps labor (10-20 hrs/mo) and electricity (~$190/yr per 5090) and the hobbyist break-even moves further out [42]. MindStudio’s estimate of a 5-15M-tokens/day break-even captures the same idea at scale [1].
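The payback and utilization arithmetic above can be written out as a small calculator. This is a minimal sketch, not a full TCO model: the function names are made up for illustration, and the $75/hr labor rate in the usage example is a hypothetical figure that does not come from the cited sources.

```python
def payback_months(hardware_cost: float, monthly_api_spend: float,
                   monthly_local_opex: float = 0.0) -> float:
    """Months until avoided API spend covers the hardware purchase."""
    monthly_saving = monthly_api_spend - monthly_local_opex
    if monthly_saving <= 0:
        return float("inf")  # running costs eat the saving; never pays back
    return hardware_cost / monthly_saving


def effective_cost_per_token(headline_cost: float, utilization: float) -> float:
    """Cost per served token when the GPU is busy only part of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return headline_cost / utilization


if __name__ == "__main__":
    # Naive case from the text: $1,500 of hardware vs $850/mo of API spend.
    print(f"naive payback: {payback_months(1500, 850):.1f} months")

    # Same case with running costs. The electricity figure is from the text;
    # the labor rate is a HYPOTHETICAL $75/hr at the low end of 10-20 hrs/mo.
    opex = 190 / 12 + 10 * 75
    print(f"with opex: {payback_months(1500, 850, opex):.1f} months")

    # An idle GPU still costs money: 10% utilization means 10x per token.
    print(f"cost multiplier at 10%: {effective_cost_per_token(1.0, 0.10):.0f}x")
```

With any realistic labor rate, the monthly saving shrinks quickly, which is why the break-even for hobbyists and small teams sits well beyond the naive 1.8 months.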
When to pick which
| Factor | Local wins when… | Frontier wins when… |
|---|---|---|
| Cost | sustained high volume (≥500K tok/day, ≥70% util) [[41]][[42]] | spiky/low volume; idle time is free [[2]] |
| Privacy/compliance | PHI, GDPR, air-gapped — VPC self-host is the clean path [[3]][[5]] | data already permitted to leave; BAA in place |
| Capability ceiling | task is “good enough” tier (most day-to-day) [[29]] | hardest refactors, deep reasoning, long-horizon agents [[44]][[49]] |
| Latency | agentic loops (10-30 calls × round-trips compound) [[6]] | single-shot; you lack local hardware |
| Customization | need full SFT / LoRA / custom RLHF [[4]] | prompt-engineering is enough |
| Lock-in risk | want portability; vendors swing latency 42% mid-quarter [[6]] | you accept managed-vendor dependency [[4]] |
Where local clearly falls short: local 7B models trail GPT-5.5 by 10-20 points on reasoning and hit only 45-55% on HumanEval vs ~90% [44]. Long-horizon planning, meanwhile, remains weak for every model: research suggests chain-of-thought is greedy step-scoring, not planning, so “just use a smarter local model” doesn’t fix multi-step agents [49].
Practitioner verdict
The 2026 consensus has settled on tiered hybrid, not either/or. Route the bulk of agent traffic to a self-hosted Qwen3-Coder or Kimi K2.6, and escalate the hardest slice to Claude Opus 4.7 / Gemini 3.1 Pro [43]. A Qwen3-Coder 32B on one high-end GPU is “genuinely productive for day-to-day coding” but “will not match Claude Opus on a hard refactor” [43].
- r/LocalLLaMA / HN consensus: open models are “now good enough” for the broad industry, runnable on a $200 used 16GB GPU [46]. The real 2026 bottleneck shifted from generation speed to verification capacity — bounded, orchestrated workflows beat raw autonomy [45].
- Highest-ROI local wins: batch/real-time pipelines — content moderation at sub-500ms, ticket summarization, embeddings — where local beats cloud on cost and control [48].
- Indie hackers are cancelling API subscriptions for local agents; the OllamaClaude pattern (Claude Code as orchestrator, local models doing the coding) claims up to 98.75% API-token reduction [47].
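The tiered hybrid pattern described in this section can be sketched as a small router that scores each task and escalates only the hard ones. Everything here is hypothetical: the `Task` fields, the `difficulty` heuristic, and the 0.5 threshold are placeholders, and the two model callables stand in for whatever local server and frontier API client you actually use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str
    files_touched: int = 1            # hypothetical difficulty signal
    needs_long_horizon: bool = False  # hypothetical difficulty signal


def difficulty(task: Task) -> float:
    """Toy heuristic in [0, 1]; a real router would be tuned on its own traffic."""
    score = min(task.files_touched / 10, 1.0) * 0.6
    if task.needs_long_horizon:
        score += 0.4
    return min(score, 1.0)


def route(task: Task, local: Callable[[str], str],
          frontier: Callable[[str], str],
          threshold: float = 0.5) -> tuple[str, str]:
    """Send easy tasks to the local model and escalate hard ones."""
    if difficulty(task) >= threshold:
        return "frontier", frontier(task.prompt)
    return "local", local(task.prompt)


if __name__ == "__main__":
    # Stub backends so the sketch runs without any model or network access.
    local_stub = lambda p: f"[local] {p}"
    frontier_stub = lambda p: f"[frontier] {p}"

    easy = Task("rename a variable")
    hard = Task("refactor the auth module", files_touched=12,
                needs_long_horizon=True)
    print(route(easy, local_stub, frontier_stub))
    print(route(hard, local_stub, frontier_stub))
```

In practice the threshold would be tuned so that the escalated share lands in the 20-40% range the article describes, and a failed local attempt could also trigger escalation.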
Bottom line for a cage-match: a top open-weight model on rented datacenter GPUs is within striking distance of frontier on most tasks, but the model you can fit on your GPU still trails noticeably on the hardest coding — and frontier wins on convenience-per-dollar until your volume is high and steady.