Atlas expedition

Frontier vs Local Model Shootout (June 2026)

Open weights now trail the closed frontier by ~4 months on Epoch's index and ~6 points on Artificial Analysis; here's where local wins, where frontier is still mandatory, and what the hardware costs.

50 sources · ~8 min read · #208 · llm · local-models · frontier-models · benchmarks · self-hosting · cost

Decision. Default to a frontier API (Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro) for the hardest coding, long-horizon agents, and low-volume exploratory work — the gap is small but real and you pay nothing when idle [7][29]. Run a local open-weight model (Qwen3-Coder, DeepSeek V4, Kimi K2.6, GLM-5) when privacy/compliance forces it, when you sustain high volume (~500K+ tokens/day), or when you need full fine-tuning control [41][3][4]. The mature answer is hybrid: route 60-80% of traffic to local, escalate the hardest 20-40% to a frontier API [43].

The headline of mid-2026 is convergence. Epoch AI’s Capabilities Index puts the best open-weight models about four months and ~8 ECI points behind the closed frontier. Notably, that lag widened slightly from ~3 months in late 2025, a caution against assuming the gap closes on its own [27][28]. On aggregate intelligence the gap is now close to benchmark noise; on the hardest coding and long-horizon agentic work, frontier still wins clearly.

The frontier scoreboard

A four-way race. Scores from the Artificial Analysis Intelligence Index (June 2026) and SWE-bench Verified; pricing per million tokens (input/output).

| Model | AA Intelligence | SWE-bench Verified | Price in/out (std, $ per M tokens) | Notes |
|---|---|---|---|---|
| Claude Opus 4.8 | 61.4 [7] | 88.6% [8] | $5 / $25 [10] | Coding co-leader; Fast mode $10/$50 [11] |
| GPT-5.5 | 60.2 [7] | | $5 / $30 [12] | 2× price jump over 5.4, but less verbose [13] |
| Gemini 3.1 Pro | 57 [7] | | $2 / $12 [16] | Cheapest frontier-Pro tier [16] |
| Grok 4.3 | 53 [14] | ~74% [15] | $1.25 / $2.50 [14] | Value + agentic; 1M ctx; trails Opus by ~14 pts [15] |

⚠ Treat SWE-bench Verified cautiously: contamination inflates it. Claude Opus 4.5 scores 80.9% on Verified but only 45.9% on the cleaner SWE-bench Pro, a 35-point gap that suggests models are partly recalling memorized solutions [9]. Pricing for other tiers, for reference: GPT-5.2 $1.75/$14 [40], Claude Sonnet 4.6 $3/$15, Haiku 4.5 $1/$5 [39], GPT-5.5 Pro $30/$180 [12].
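To make the per-million-token prices above concrete, here is a minimal sketch that converts them into a per-request cost. The prices are the standard-tier figures from the scoreboard; the workload (20K tokens in, 2K tokens out) and the names `Price`, `PRICES`, and `request_cost` are illustrative assumptions. It deliberately ignores caching, batch discounts, and the fact that models differ in how many output tokens they emit for the same task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per million tokens (standard tier, from the scoreboard)."""
    input_per_m: float
    output_per_m: float


# List prices copied from the table; no discounts or caching modeled.
PRICES = {
    "Claude Opus 4.8": Price(5.00, 25.00),
    "GPT-5.5": Price(5.00, 30.00),
    "Gemini 3.1 Pro": Price(2.00, 12.00),
    "Grok 4.3": Price(1.25, 2.50),
}


def request_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the given token counts."""
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


# Hypothetical workload: 20K tokens of context in, 2K tokens out.
for name, price in PRICES.items():
    print(f"{name}: ${request_cost(price, 20_000, 2_000):.3f} per request")
```

Because output tokens cost 2-6× more than input tokens at these rates, a less verbose model can come out cheaper per task even at a higher list price, so an equal-token comparison like this one is only a starting point.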

The open-weight contenders

Chinese labs lead the open frontier; sizes are MoE (total / active params).

| Model | Params (total/active) | License | SWE-bench Verified (or coding note) | Local fit |
|---|---|---|---|---|
| DeepSeek V4 Pro | 1.6T / 49B | MIT [18] | 80.6% [18] | Datacenter only; V4-Flash 284B lighter |
| Kimi K2.6 | 1T / 32B | mod. MIT [19] | 80.2% [19] | Datacenter; agentic-coding tuned |
| GLM-5 / 5.1 | 744B / 40-44B | MIT [20] | ~94.6% of Opus 4.6 coding (vendor) [20] | Multi-GPU; trained on Ascend silicon |
| Qwen3-Coder ⭐ 16.6k | 480B/35B & 30B/3B | Apache-2.0 [22] | ≈ Claude Sonnet [22] | 30B-A3B runs on a single 24GB GPU |
| Qwen3.6-27B | 27B dense | Apache-2.0 [23] | beats 397B on coding [23] | Flagship coding on a consumer GPU |
| gpt-oss-120b | 120B / 5.1B | Apache-2.0 [21] | reasoning-focused [21] | Fits 80GB VRAM (MXFP4) [21] |
| Gemma 4 26B | 26B / 3.8B | Apache-2.0 [24] | | Top 24GB-VRAM pick [24] |
| Mistral Small 4 | 119B / ~6B | Apache-2.0 [26] | | Single consumer GPU w/ quant [26] |
| Llama 4 Scout | 109B / 17B | Llama (restrictive) [25] | | 32GB+ unified mem; EU-limited [25] |

Overall open-weight ranking (benchlm.ai, mid-2026): DeepSeek V4 Pro 87, Kimi K2.6 84, GLM-5.1 83, Qwen3.5 79 [17]. Phi-4 14B (MIT, 16GB) and DeepSeek-R1 distills (1.5B–70B) fill the small-model tiers [24].

How wide is the gap?

It depends entirely on which benchmark you look at — narrow on aggregate, wider on hard coding.

| Benchmark | Frontier leader | Best open weight | Gap |
|---|---|---|---|
| Epoch Capabilities Index (time) | closed frontier | ~4 months behind [28] | widening [28] |
| AA Intelligence Index (Apr) | 57 (Opus 4.7 / Gem 3.1 / GPT-5.4) [31] | GLM-5.1 = 51 [31] | 6 pts |
| LMArena Elo | GPT-5.5-high 1506 [32] | GLM-5.1 / DeepSeek-V4-Pro 1467 [32] | 39 Elo |
| SWE-bench Verified | Claude Opus 4.7 87.6% [29] | Nemotron 3 Super 120B 60.5% [29] | ~27 pts |
| Reasoning benches (avg) | frontier | open | 3–8 pts [29] |
| LiveCodeBench (contam-free) | GPT-5.2 / Opus 4.5 >85% [33] | GLM-4.7 Thinking [33] | small [33] |
| Aider polyglot | GPT-5 88.0%, Opus 4.6 82.1% [34] | trails on Go/Rust/Java [34] | language-dependent [34] |

Read it this way: on a chat-style aggregate (AA Index, LMArena) the best open model is within ~6 points / ~39 Elo — close enough that procurement noise dominates [31][32]. The reasoning-benchmark gap has collapsed from 30+ points in 2024 to 3-8 points today [29]. But the open-source agentic-coding leaders cited as flagship-grade (DeepSeek V4, Kimi K2.6, GLM-5) are 1T-class datacenter models — the SWE-bench gap to a model you can actually fit on one box is much larger, and persists most on non-Python languages [34]. Artificial Analysis now tracks this explicitly via its Openness Index [30].

Hardware & cost reality

What “local” actually demands, at the Q4_K_M quant (the consensus sweet spot, ~3-5% quality loss vs FP16) [35]:

| Model size | VRAM (Q4_K_M unless noted) | Runs on | Throughput (Q4) |
|---|---|---|---|
| 8B | ~4-7 GB [36] | any 8GB+ GPU | very fast |
| 32B | ~22-24 GB [36] | single RTX 4090 (24GB) | ~650 tok/s (4090) → ~1,100 (5090), batched [37] |
| 70B | ~45 GB [35] | dual 3090/4090, or Mac | 16-25 tok/s, dual-GPU single-user [35] |
| 70B (FP16) | ~143 GB [35] | multi-GPU server | |

Neither the RTX 4090 (24GB) nor the 5090 (32GB) can host a 70B Q4 — that’s the key consumer ceiling, forcing dual-GPU rigs or Apple unified memory [37]. Apple Silicon trades raw speed for capacity: M3 Ultra ~41 tok/s on a 27B Q4, M4 Max ~70 tok/s on 70B-class via MLX [38].
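The VRAM figures above follow roughly from parameter count times bits per weight. A rough sketch of that arithmetic is below. Two values are assumptions rather than figures from the sources: an effective ~4.8 bits per weight for Q4_K_M (the real number varies by architecture) and a flat 2 GB allowance for KV cache and runtime overhead (which in practice grows with context length). `weights_gb` and `fits` are illustrative helper names.

```python
Q4_K_M_BITS = 4.8   # assumed effective bits per weight; varies by model
FP16_BITS = 16.0


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed flat overhead fit in the given VRAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


for size in (8, 32, 70):
    print(f"{size}B at Q4_K_M: ~{weights_gb(size, Q4_K_M_BITS):.0f} GB of weights")
print(f"70B at FP16: ~{weights_gb(70, FP16_BITS):.0f} GB of weights")

print("70B Q4 on one 32GB card:", fits(70, Q4_K_M_BITS, 32))
print("70B Q4 on two 24GB cards:", fits(70, Q4_K_M_BITS, 48))
```

Under these assumptions a 70B Q4 model needs about 42 GB for weights alone, which is consistent with the ~45 GB in the table and shows why no single 24GB or 32GB consumer card can host it.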

The cost crossover is all about utilization. Naive math says self-hosting beats APIs above ~500K tokens/day sustained: a small team spending $850/mo on API calls recoups $1,500 of hardware in ~1.8 months [41]. One March 2026 batch deployment cut a $6,700/mo cloud bill to $1,280/mo on owned hardware [50]. Below ~70% GPU utilization, however, the cloud wins, because an idle GPU costs the same as a busy one: at 10% utilization the real cost per token is 10× the headline rate [42][2]. Add DevOps labor (10-20 hrs/mo) and electricity (~$190/yr per 5090) and the hobbyist break-even moves further out [42]. MindStudio's estimate of a 5-15M tokens/day break-even captures the same idea at scale [1].
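The break-even reasoning above can be sketched in a few lines. The hardware price, API spend, electricity, and labor hours are the figures quoted in the paragraph; the $75/hr labor rate is a hypothetical placeholder, and `payback_months` and `cost_multiplier` are illustrative names, not anyone's published model.

```python
def payback_months(hardware_usd: float, api_usd_per_month: float,
                   local_opex_usd_per_month: float = 0.0) -> float:
    """Months until owned hardware pays for itself versus API spend."""
    monthly_savings = api_usd_per_month - local_opex_usd_per_month
    if monthly_savings <= 0:
        return float("inf")  # running costs eat the savings; never pays back
    return hardware_usd / monthly_savings


def cost_multiplier(utilization: float) -> float:
    """How much an idle GPU inflates real cost per token over the headline rate."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1 / utilization


naive = payback_months(1_500, 850)      # figures from [41]
electricity = 190 / 12                  # ~$190/yr per 5090, from [42]
labor = 10 * 75.0                       # 10 hrs/mo at a hypothetical $75/hr
loaded = payback_months(1_500, 850, electricity + labor)

print(f"naive payback: {naive:.1f} months")
print(f"with power and labor: {loaded:.1f} months")
print(f"cost multiplier at 10% utilization: {cost_multiplier(0.10):.0f}x")
```

The naive case reproduces the ~1.8 months quoted above. Once labor is priced in at any realistic rate, the payback stretches out by roughly an order of magnitude, which is the sense in which the hobbyist break-even "moves further out".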

When to pick which

| Factor | Local wins when… | Frontier wins when… |
|---|---|---|
| Cost | sustained high volume (≥500K tok/day, ≥70% util) [41][42] | spiky/low volume; idle time is free [2] |
| Privacy/compliance | PHI, GDPR, air-gapped; VPC self-host is the clean path [3][5] | data already permitted to leave; BAA in place |
| Capability ceiling | task is “good enough” tier (most day-to-day) [29] | hardest refactors, deep reasoning, long-horizon agents [44][49] |
| Latency | agentic loops (10-30 calls × round-trips compound) [6] | single-shot; you lack local hardware |
| Customization | need full SFT / LoRA / custom RLHF [4] | prompt-engineering is enough |
| Lock-in risk | want portability; vendors swing latency 42% mid-quarter [6] | you accept managed-vendor dependency [4] |

Where local clearly falls short: local 7B models trail GPT-5.5 by 10-20 points on reasoning and hit only 45-55% HumanEval vs ~90% [44]; and long-horizon planning fails for everyone — research shows chain-of-thought is greedy step-scoring, not planning, so “just use a smarter local model” doesn’t fix multi-step agents [49].

Practitioner verdict

The 2026 consensus has settled on tiered hybrid, not either/or. Route the bulk of agent traffic to a self-hosted Qwen3-Coder or Kimi K2.6, escalate the hardest slice to Claude Opus 4.7 / Gemini 3.1 Pro [43]. A Qwen3-Coder 32B on one high-end GPU is “genuinely productive for day-to-day coding” but “will not match Claude Opus on a hard refactor” [43].
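One way to picture the tiered-hybrid pattern is a small router that tries the local model first and escalates when the task looks hard or the local answer fails a check. The sketch below is an illustration under stated assumptions, not a reference implementation from any of the cited sources: `TieredRouter`, `is_hard`, and `accept` are hypothetical names, and the backends are plain callables standing in for whatever local server and frontier API you actually use.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Backend = Callable[[str], str]  # takes a task prompt, returns an answer


@dataclass
class TieredRouter:
    local: Backend
    frontier: Backend
    is_hard: Callable[[str], bool]       # hypothetical difficulty classifier
    accept: Callable[[str, str], bool]   # hypothetical verifier (tests, lint, ...)

    def run(self, task: str) -> Tuple[str, str]:
        """Return (tier_used, answer)."""
        if not self.is_hard(task):
            answer = self.local(task)
            if self.accept(task, answer):
                return "local", answer
        # Hard task, or the local answer failed verification: escalate.
        return "frontier", self.frontier(task)


# Stub backends and rules, purely for demonstration.
router = TieredRouter(
    local=lambda task: f"local answer to: {task}",
    frontier=lambda task: f"frontier answer to: {task}",
    is_hard=lambda task: "refactor" in task,
    accept=lambda task, answer: bool(answer),
)

print(router.run("rename a variable"))
print(router.run("refactor the auth module"))
```

The verifier matters as much as the classifier: escalating only on failed checks is what keeps the frontier share near the hardest slice of traffic rather than everything the classifier is unsure about.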

  • r/LocalLLaMA / HN consensus: open models are “now good enough” for the broad industry, runnable on a $200 used 16GB GPU [46]. The real 2026 bottleneck shifted from generation speed to verification capacity — bounded, orchestrated workflows beat raw autonomy [45].
  • Highest-ROI local wins: batch/real-time pipelines — content moderation at sub-500ms, ticket summarization, embeddings — where local beats cloud on cost and control [48].
  • Indie hackers are cancelling API subscriptions for local agents; the OllamaClaude pattern (Claude Code as orchestrator, local models doing the coding) claims up to 98.75% API-token reduction [47].

Bottom line for a cage-match: a top open-weight model on rented datacenter GPUs is within striking distance of frontier on most tasks, but the model you can fit on your GPU still trails noticeably on the hardest coding — and frontier wins on convenience-per-dollar until your volume is high and steady.
