
Agent Harness Toolkit Showdown: 2026

The harness is the performance variable — seven design schools, 12 frameworks, 30-point benchmark spreads, and the evaluation discipline to run a fair comparison.


The four sub-topics collectively make one argument: the harness is not a neutral scaffold. The same Claude Opus 4 model scores 64.9% vs 57.6% on an identical benchmark task inside two different orchestration scaffolds [1]; on GAIA, the same model ranges from 74.6% with the right harness down to 44.8% bare, a roughly 30-point spread that no model upgrade delivers cleanly [2]. Framework selection is a performance decision masquerading as an infrastructure one.

Philosophy determines ceiling, not just style. The seven schools are not equivalent stylistic choices. Control-explicit frameworks (LangGraph, Google ADK) make you write the graph and reward you with 94% multi-step accuracy at $0.08/task [3]; role-declarative frameworks (CrewAI) give you a working prototype in three days but add up to 3× token overhead on simple tasks [4] and encode control flow in prompts rather than inspectable state. Model-driven minimalism (Strands) bets that the LLM is smart enough to orchestrate; steering hooks, not prompts, close the gap from 82.5% to 100% on the same tasks [5], but the orchestrator agent becomes a single point of failure with no graph-level guard when routing goes wrong.
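To make "control flow in inspectable state" concrete, here is a minimal plain-Python sketch of the control-explicit idea. It does not use the LangGraph or Google ADK APIs; the `State` fields, node names, and the `run` helper are all hypothetical and chosen only for illustration. The point is that routing is ordinary code with a step budget and an unknown-node guard, and every hop is recorded in a trace.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class State:
    """Inspectable state handed from node to node (hypothetical fields)."""
    query: str
    notes: List[str] = field(default_factory=list)
    answer: Optional[str] = None
    trace: List[str] = field(default_factory=list)


# Each node mutates the state and returns the name of the next node.
Node = Callable[[State], str]


def plan(state: State) -> str:
    state.notes.append(f"plan for: {state.query}")
    return "act"


def act(state: State) -> str:
    # A real node would call a tool or a model here; this one is stubbed.
    state.notes.append("tool result (stubbed)")
    return "answer"


def answer(state: State) -> str:
    state.answer = f"{len(state.notes)} notes collected"
    return "END"


GRAPH: Dict[str, Node] = {"plan": plan, "act": act, "answer": answer}


def run(state: State, start: str = "plan", max_steps: int = 10) -> State:
    """Walk the graph, recording every hop and enforcing graph-level guards."""
    node = start
    steps = 0
    while node != "END":
        if steps >= max_steps:
            raise RuntimeError("step budget exhausted")
        if node not in GRAPH:
            raise ValueError(f"unknown node: {node!r}")
        state.trace.append(node)
        node = GRAPH[node](state)
        steps += 1
    return state


if __name__ == "__main__":
    final = run(State(query="compare two harnesses"))
    print(final.trace, final.answer)
```

In a role-declarative or model-driven design, the equivalent of `GRAPH` and the two guards in `run` lives in prompt text or in the orchestrator model's judgment, which is why a routing mistake there is harder to detect and bound.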

The convergence layer masks where the real differences live. All six major frameworks now ship MCP, streaming, observability, and ReAct loops as table stakes [3]. The divergence is above that layer: checkpointing quality (LangGraph and Mastra are strongest, CrewAI has none), MCP depth (Strands and MS Agent Framework treat it as architecture, LangGraph uses adapters), and cost efficiency (AG2 tops quality at $0.45/task; LangGraph leads on cost at $0.08) [3]. Composite benchmark scores hide all of this, and so does any framework summary that leads with MCP checkmark counts.

The supply-chain risk is structural, not incidental. AutoGen’s October 2025 retirement [6] is the clearest 2026 data point: 58.7k GitHub stars and maintenance-only status in the same quarter. But the deeper lock-in is not code; it is state history. Six months of customer interaction history locked to a platform-native stateful runtime cannot be ported without rebuilding the memory layer from scratch [7]. LLM API calls are 40–60% of total agent operating costs [4]; a framework adding 40% token overhead inflates that largest line item by 40%, which raises total operating cost by roughly 16–24%, silently, until the first production invoice arrives.
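The effect of token overhead on the total bill is simple to work out. The sketch below assumes that LLM spend scales linearly with token count and that all other costs stay fixed; both are simplifying assumptions, not claims from the cited sources. The function name `total_cost_increase` is hypothetical.

```python
def total_cost_increase(llm_share: float, token_overhead: float) -> float:
    """Fractional rise in total operating cost.

    llm_share:      fraction of total cost spent on LLM API calls (0..1)
    token_overhead: fractional extra tokens added by the framework (0.40 = +40%)

    Assumes LLM spend is linear in tokens and non-LLM costs do not change.
    """
    if not 0.0 <= llm_share <= 1.0:
        raise ValueError("llm_share must be between 0 and 1")
    if token_overhead < 0.0:
        raise ValueError("token_overhead must be non-negative")
    return llm_share * token_overhead


if __name__ == "__main__":
    # The 40-60% LLM share range cited above, with a 40% token overhead.
    for share in (0.40, 0.60):
        rise = total_cost_increase(share, 0.40)
        print(f"LLM share {share:.0%}: total cost up {rise:.0%}")
```

Under these assumptions the LLM line item itself grows by the full overhead (1.4× for a 40% overhead, 3× for the 3× overhead quoted for simple tasks), while the total bill grows by the overhead multiplied by the LLM share.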

The evaluation regime must match the claim being made. The CLEAR framework documents a 37% average gap between lab benchmark scores and production deployment performance [4]; 95% of enterprise pilots never reach production regardless of which framework is chosen [4]. A fair comparison session fixes the model, tasks, and environment across all candidates, scores at three levels (end-to-end, trajectory, component) [8], and distinguishes pass@k (can it do this at all?) from pass^k (will it do this every time?) [9]. For the skill-pack layer specifically, SkillTester evidence shows that higher utility consistently correlates with broader permissions and greater security exposure [10] — utility and security must be scored as independent axes, never collapsed into a single number that papers over the tradeoff.
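The difference between pass@k and pass^k is easiest to see numerically. The sketch below uses the combinatorial estimators commonly applied to these metrics: given n recorded trials of a task with c successes, pass@k is the probability that at least one of k sampled trials succeeds, and pass^k is the probability that all k succeed. Whether [9] uses exactly these estimators is an assumption, and the function names are hypothetical.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials succeeds) from n trials, c successes."""
    _check(n, c, k)
    # comb(n - c, k) counts the k-subsets made up entirely of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials succeed) from n trials, c successes."""
    _check(n, c, k)
    # comb(c, k) counts the k-subsets made up entirely of successes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    n, c, k = 10, 7, 3  # illustrative numbers: 7 successes in 10 trials
    print(f"pass@{k} = {pass_at_k(n, c, k):.3f}")
    print(f"pass^{k} = {pass_hat_k(n, c, k):.3f}")
```

With 7 successes in 10 trials and k = 3, pass@3 is 1 − 1/120 ≈ 0.992 while pass^3 is 35/120 ≈ 0.292. The same agent looks nearly perfect on capability and unreliable on consistency, which is why a comparison should state which of the two it reports.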

The A2A protocol (native in Google ADK v2.0, CrewAI v1.14+, and MS Agent Framework v1.0) promises cross-framework agent interoperability. Whether it dissolves the winner-takes-all dynamic or simply adds a portability abstraction on top of incompatible state runtimes is the open question this landscape will answer before the end of 2026.

Sub-topics