+30 Point Spread: identical model · different harness
12 Frameworks: 7 design philosophies
37% Lab → Prod Gap: CLEAR framework average
95% Pilot Failure Rate: never reach production
The 30-Point Spread
GAIA Benchmark · Claude Opus 4 · identical model
Orchestration scaffold contribution · Princeton HAL / Q1 2026
Framework selection is a performance decision masquerading as an infrastructure one.
Benchmark Leaderboard
Q1 Lushbinary · multi-step accuracy · cost per task
Framework · Accuracy · Cost/task · Stars · Signal
#1 · $0.08 · ⭐ 26.9k · efficient
#3 · $0.11 · ⭐ 33.7k · balanced
#4 · $0.12 · ⭐ 52.7k · 3× tokens
control-graph · 96% error recovery
400+ companies · 34.5M PyPI/mo · 40–50% LLM call reduction via stateful caching. Steep 1–2 week learning curve. Production ceiling highest of any framework tested.
[2]
role-declarative · 3× token overhead
10M+ agents/month · prototype in 2–4 hrs · AMP Suite $120K/yr. No checkpointing. Control flow encoded in prompts — not inspectable state.
[24]
managed runtime · onboard in hours
Evolved from Swarm · MCP added v0.7 · session-based checkpoints (2026) · 100+ LLMs. Lock-in risk: 6-month state history tied to OpenAI Stateful Runtime.
[20]
control-graph · A2A v2.0
Reframed as execution framework Feb 2026 · graph-based engine · GitHub/Jira/MongoDB connectors · Vertex AI one-command deploy · Kotlin support. GCP-native teams only.
[1]
type-safe · TypeScript-first
From the Gatsby team · 3,300+ models · 94 providers · time-travel debugging · suspend/resume HITL · first-class MCP. TypeScript-only — no Python.
[27]
code-first · 40-line agent
HuggingFace · action primitive is generated Python code · 40 lines vs 120 for equivalent LangGraph agent · leads in local LLM support. Minimal orchestration surface.
[1]
type-safe · FastAPI-style
Strict types · dependency injection · structured responses. FastAPI ergonomics — immediately familiar to Python web teams. Smaller multi-agent orchestration surface than LangGraph.
[1]
managed · GA Apr 2026
Replaces AutoGen + Semantic Kernel · 5 orchestration patterns · 7 LLM providers · MCP first-class · Python + .NET. Inherits 58.7k-star AutoGen community.
[18]
event-async · $0.45/task
Community fork of AutoGen v0.2 · MCP v0.12 · event-driven async. GroupChat: 20+ LLM calls per interaction — token costs balloon fast in group deliberation scenarios.
[1]
model-driven · AWS-native
From Amazon production systems · steering handlers: 100% vs 82.5% prompt-only accuracy · Bedrock/Anthropic/OpenAI/Ollama. Orchestrator is a single point of failure — bad routing propagates.
[17]
RAG-first · checkpoint + HiL
Data framework for LLM apps · Workflows with checkpointing and human-in-the-loop · RAG-optimized agent design. Best when retrieval is the primary agent action.
[14]
superseded · migrate → MS AF
C# + Azure-native plugin planner. Superseded by MS Agent Framework GA April 2026. One-year v1.x maintenance commitment. Migrate now.
[19]
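Several cards above hinge on checkpointing (the 40–50% LLM call reduction, "No checkpointing", "session-based checkpoints"). The sketch below illustrates the general idea only, in plain Python: state is saved after each node, so a retry after a failure resumes from the last completed node instead of repeating earlier LLM calls. `CheckpointedGraph`, `fake_llm`, and the three node functions are hypothetical names invented for this example; none of this is any framework's actual API.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]


class CheckpointedGraph:
    """Hypothetical helper: runs nodes in order and saves state after each one."""

    def __init__(self, nodes: List[Tuple[str, Callable[[State], State]]]):
        self.nodes = nodes
        self.checkpoints: Dict[str, State] = {}  # node name -> state after that node

    def run(self, state: State) -> State:
        for name, fn in self.nodes:
            if name in self.checkpoints:
                # Node finished in an earlier attempt: reuse its saved state.
                state = dict(self.checkpoints[name])
                continue
            state = fn(dict(state))
            self.checkpoints[name] = dict(state)
        return state


llm_calls = 0


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call (assumption); only counts invocations."""
    global llm_calls
    llm_calls += 1
    return f"answer({prompt})"


review_attempts = 0


def plan(state: State) -> State:
    state["plan"] = fake_llm("plan " + state["task"])
    return state


def draft(state: State) -> State:
    state["draft"] = fake_llm("draft " + state["plan"])
    return state


def review(state: State) -> State:
    global review_attempts
    review_attempts += 1
    if review_attempts == 1:
        # Simulate a transient failure on the first attempt.
        raise RuntimeError("transient tool failure")
    state["review"] = fake_llm("review " + state["draft"])
    return state


graph = CheckpointedGraph([("plan", plan), ("draft", draft), ("review", review)])
try:
    graph.run({"task": "summarize"})
except RuntimeError:
    pass  # the first attempt fails inside "review"

result = graph.run({"task": "summarize"})  # retry resumes after "draft"
print(llm_calls)
```

In this toy run the retry costs one extra model call rather than three, because `plan` and `draft` are restored from their checkpoints. Real savings depend on how often runs fail or branch, which the sketch does not model.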
7 Design Schools
Philosophy determines ceiling, not just style
Control-explicit graph
LangGraph · Google ADK
Write the graph. Highest accuracy ceiling (94%). Rewarded with 40–50% LLM call savings via state checkpoints. 1–2 week ramp.
Role-declarative
CrewAI
Role-based teams; prototype in 2–4 hrs. 3× token overhead vs LangGraph, no checkpointing, control flow lives in prompts — not inspectable code.
Model-driven minimalism
Strands Agents
LLM decides orchestration. Steering handlers close the accuracy gap from 82.5% (prompt-only) to 100%. Orchestrator is a single point of failure with no graph-level guard on bad routing.
Code-first
Smolagents
Action primitive = generated Python code. 40-line agent vs 120 for LangGraph equivalent. Best for local-LLM research workflows. Limited multi-agent orchestration.
Type-safe composition
PydanticAI · Mastra
Strict typed interfaces and DI throughout. FastAPI ergonomics (Python) or Vercel AI SDK patterns (TypeScript). Predictable — type errors surface at compile time, not runtime.
Managed runtime
OpenAI SDK · MS Agent Framework
Platform-native patterns, fastest onboarding. Lock-in risk is structural: 6-month state history cannot migrate without rebuilding the memory layer from scratch.
Event-driven async
AG2
Community fork of AutoGen v0.2, preserving the stable API. Top quality score (91%) at highest cost ($0.45/task). GroupChat = 20+ LLM calls per interaction.
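The model-driven school above flags bad routing as its main failure mode. One way to picture a guard that sits outside the model is sketched below: the model's chosen route is checked against the set of registered handlers, and anything unknown falls back to a safe default. This is an illustration under assumptions, not how Strands steering handlers are implemented; `guarded_route`, `dispatch`, and the handler names are all hypothetical.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def guarded_route(choice: str, handlers: Dict[str, Handler], fallback: str) -> str:
    """Normalise the model's route choice; use the fallback if it is not registered."""
    name = choice.strip().lower()
    return name if name in handlers else fallback


def dispatch(
    task: str,
    choose: Callable[[str], str],
    handlers: Dict[str, Handler],
    fallback: str = "triage",
) -> str:
    """Ask `choose` (a stand-in for the orchestrating model) for a route, then guard it."""
    route = guarded_route(choose(task), handlers, fallback)
    return handlers[route](task)


# Hypothetical handlers for illustration only.
handlers: Dict[str, Handler] = {
    "billing": lambda task: f"billing handled: {task}",
    "search": lambda task: f"search handled: {task}",
    "triage": lambda task: f"triage handled: {task}",
}

# A model that names a route nobody registered.
bad_choice = dispatch("refund order 17", lambda task: "refunds_bot", handlers)
# A model that names a valid route with stray whitespace and casing.
good_choice = dispatch("refund order 17", lambda task: " Billing ", handlers)
print(bad_choice)
print(good_choice)
```

The guard only catches routes that do not exist. A valid but wrong route still goes through, which is why the text treats the orchestrator as a single point of failure.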
⚠ Reality Check
37%: Average lab-to-production performance gap
CLEAR framework across all tested harnesses · [2]
95%: Enterprise pilots that never reach production
Regardless of which framework is chosen · [2]
40–60%: Of total agent operating cost is LLM API calls
A 3× token-overhead framework triples the largest line item, which roughly doubles total operating cost · [2]
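A quick check of what a 3× token overhead does to total cost when LLM calls are 40–60% of it. The sketch assumes LLM spend scales linearly with tokens and every other cost stays fixed; both are simplifications, and `total_cost_multiplier` is a hypothetical helper written for this example.

```python
def total_cost_multiplier(llm_share: float, token_overhead: float) -> float:
    """Multiplier on total cost when only the LLM share scales with token overhead.

    llm_share: fraction of baseline total cost spent on LLM API calls (0..1).
    token_overhead: factor by which token usage grows (3.0 for a 3x framework).
    """
    if not 0.0 <= llm_share <= 1.0:
        raise ValueError("llm_share must be between 0 and 1")
    # The LLM portion scales with tokens; the remainder is assumed unchanged.
    return llm_share * token_overhead + (1.0 - llm_share)


for share in (0.40, 0.50, 0.60):
    multiplier = total_cost_multiplier(share, 3.0)
    print(f"LLM share {share:.0%}: total cost x{multiplier:.1f}")
```

Under these assumptions the LLM line item itself triples, while total operating cost lands between about 1.8× and 2.2× of baseline across the 40–60% range.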
Research Sub-topics
4 surveys · 84 citations total