Head-to-Head: Agent Harness Framework Comparison Matrix

Decision — quick picks by use case

Most control, production-grade: LangGraph ⭐ 26.9k — 94% multi-step accuracy at $0.08/task, LangGraph Studio, first-class checkpointing & interrupt(). Cost: 1–2 week learning curve.^[1]
Fastest prototype: CrewAI ⭐ 52.7k — role-based crews, running in 3–5 days. Hit the branching ceiling → migrate to LangGraph.^[3]
Lowest cognitive overhead / voice: OpenAI Agents SDK ⭐ 33.7k — working agent in 2–3 days, Realtime API for voice; sequential handoffs only.^[6]
TypeScript / web stack: Mastra ⭐ 24.7k — only mature TS-native framework, 3,300+ models, time-travel debug, first-class HiL suspend/resume.^[23]
.NET / Azure / M365 enterprise: MS Agent Framework ⭐ 9.9k — GA April 2, 2026, replaces Semantic Kernel + AutoGen, 7 LLM providers, first-class MCP + A2A.^[5]
AWS-native / minimal ceremony: Strands ⭐ ~6.7k — 3 primitives, IAM/VPC-native; steering handlers beat prompt-only 100% vs 82.5%.^[21]
Multimodal (text+audio+video+image) or cross-org A2A: Google ADK ⭐ 20k — only framework with full in-loop multimodal; A2A protocol native.^[8]
Avoid starting new projects on: AutoGen, which Microsoft placed in maintenance mode in late 2025. Move to AG2 or MS Agent Framework.^[22]

Full Feature Matrix

Framework	School	License	Lang	Checkpoint	HiL	MCP	Parallel	Code Exec	Voice / Multimodal	Multi-LLM	Learning Curve	Studio / Debugger	Observability	Enterprise
LangGraph ⭐ 26.9k^[9]	Graph-explicit	MIT	Py + TS	✓ First-class	✓ `interrupt()`	⚠ Adapter	✓ Reducer-based	⚠ Custom	✗	✓ Any LLM	1–2 weeks	✓ LG Studio	LangSmith	RBAC + audit
Google ADK ⭐ 20k^[13]	Graph-explicit	Apache 2.0	Py	✓ Pluggable backends	✓	✓ Native (v2.0)	✓ Agent tree	⚠	✓ Text+audio+video+image	⚠ Gemini-first	Moderate	⚠ OTEL/MLflow	OTEL + MLflow	A2A native
CrewAI ⭐ 52.7k^[10]	Role-declarative	Apache 2.0	Py	✗	⚠ Callbacks	✓ DSL + adapters	⚠ Limited	✗	✗	✓ Per-agent LLM	3–5 days	✓ AMP editor	AMP dashboard	⚠ Enterprise AMP
AutoGen ⭐ 58.7k	⚠ MAINTENANCE MODE (Microsoft, late 2025): no new feature work. Migrate to AG2 (community fork) or MS Agent Framework.^[22]
AG2 ⭐ 4.6k^[12] (AutoGen fork)	Conv-emergent	Apache 2.0	Py	⚠ Event replay	⚠ GroupChat	✓ v0.12 native	⚠ Limited	✓ Docker/local	✗	✓	Moderate	✗	⚠ Custom	✗
OpenAI Agents SDK ⭐ 33.7k^[11]	Handoff-centric	MIT	Py + TS	⚠ Session (2026)	⚠ Approval callbacks	⚠ v0.7+	⚠ Sequential only	⚠ Sandbox	✓ Realtime API	⚠ OpenAI-first	2–3 days	⚠ Dashboard	Native tracing	⚠ Basic guardrails
Strands Agents ⭐ ~6.7k^[20]	Model-driven	Apache 2.0	Py	⚠ Dev-managed	⚠ Custom tools	✓ First-class	⚠ 4 patterns	✓ AgentCore	✗	✓ Bedrock/Anthropic/OpenAI/Ollama	Low (3 primitives)	⚠ OTEL	AWS CloudTrail + OTEL	AWS IAM/VPC
MS Agent Framework ⭐ 9.9k^[14] v1.0 Apr 2026	Platform-integrated	MIT	Py + .NET	✓ All 5 patterns	✓ pause/resume	✓ First-class	✓ Concurrent fan-out	⚠	✗	✓ 7 providers	Moderate	⚠	⚠ OTEL-based	Azure RBAC + M365
Smolagents ⭐ 27.7k^[15]	Code-as-action	Apache 2.0	Py	⚠ Basic	✗	✓	⚠ Limited	✓ Native (code agent)	✗	✓ Local LLMs	Very low (40 lines)	✗	⚠ Basic	✗
Pydantic AI ⭐ 17.5k^[16]	Type-safe pipeline	MIT	Py	⚠ Basic	⚠ Basic	✓	⚠ Limited	✗	✗	✓ Any provider	Low (FastAPI-style)	✗	⚠ Basic	✗
Mastra ⭐ 24.7k^[17]	TS-native	Apache 2.0	TS only	✓ Time-travel	✓ Suspend/resume	✓ First-class	✓	✗	✗	✓ 3,300+ models	Moderate	✓ Mastra Studio	Built-in evals + tracing	⚠ Growing
LlamaIndex ⭐ 49.9k^[18]	Data-first	MIT	Py	✓ Workflows	✓	⚠	✓	✗	✗	✓ Multiple	Moderate	⚠ LlamaTrace	LlamaCloud + LlamaTrace	⚠ LlamaCloud paid

✓ = supported · ⚠ = partial/limited · ✗ = not supported · HiL = Human-in-the-Loop · Sources: ^[2]^[6]
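To make the matrix easier to query, the sketch below transcribes four of its columns (Checkpoint, HiL, MCP, Parallel) into a small lookup and filters frameworks by minimum support level. The names `Support`, `MATRIX`, and `shortlist` are hypothetical helpers introduced here for illustration; the ratings themselves are copied from the table above, and AutoGen is left out because its row carries no ratings.

```python
from enum import IntEnum


class Support(IntEnum):
    NONE = 0     # ✗ in the matrix
    PARTIAL = 1  # ⚠ in the matrix
    FULL = 2     # ✓ in the matrix


N, P, F = Support.NONE, Support.PARTIAL, Support.FULL
COLUMNS = ("checkpoint", "hil", "mcp", "parallel")

# Ratings transcribed from the feature matrix, in table order.
MATRIX = {
    "LangGraph": (F, F, P, F),
    "Google ADK": (F, F, F, F),
    "CrewAI": (N, P, F, P),
    "AG2": (P, P, F, P),
    "OpenAI Agents SDK": (P, P, P, P),
    "Strands Agents": (P, P, F, P),
    "MS Agent Framework": (F, F, F, F),
    "Smolagents": (P, N, F, P),
    "Pydantic AI": (P, P, F, P),
    "Mastra": (F, F, F, F),
    "LlamaIndex": (F, F, P, F),
}


def shortlist(**minimums):
    """Return frameworks whose rating meets every requested minimum."""
    matches = []
    for name, levels in MATRIX.items():
        row = dict(zip(COLUMNS, levels))
        if all(row[column] >= needed for column, needed in minimums.items()):
            matches.append(name)
    return matches


# Full checkpointing, full HiL, and full MCP support.
strict = shortlist(checkpoint=F, hil=F, mcp=F)
# Same, but an MCP adapter (partial support) is acceptable.
relaxed = shortlist(checkpoint=F, hil=F, mcp=P)
print(strict)
print(relaxed)
```

Relaxing the MCP requirement from full to partial is what brings LangGraph and LlamaIndex back onto the shortlist, which mirrors the trade-off the matrix describes.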

Performance Benchmarks

Framework	Multi-Step Accuracy	Cost / Task	Notes
LangGraph	94%	$0.08	Stateful caching cuts LLM calls 40–50% on repeat workflows^[6]
AG2	91%	$0.45	Highest accuracy on code write+review+debug; highest token cost^[1]
OpenAI Agents SDK	90%	$0.11	^[2]
Strands Agents	89% (→ 100% with steering handlers)	$0.10	Steering handlers vs prompt-only: 100% vs 82.5% on same tasks^[21]
CrewAI	87%	$0.12	Up to 3× token overhead vs LangGraph on simple tasks^[6]

⚠ Harness selection has performance consequences, not just ergonomic ones. The same Claude Opus 4 model scores 64.9% inside one scaffolding and 57.6% inside another on identical benchmark tasks — a 7.3-point spread with no model change, no prompt change.^[7] Benchmark figures above: Lushbinary Agent Benchmark Q1 2026.^[1] Google ADK, MS Agent Framework, Smolagents, Pydantic AI, Mastra, and LlamaIndex do not yet have published apples-to-apples numbers on this benchmark.
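One way to read accuracy and cost together is the expected cost per successful task. The sketch below uses the figures from the benchmark table and assumes, purely for illustration, that a failed task is retried at the same cost with the same independent success rate, so the expected number of attempts is 1/accuracy. That retry model is an assumption of this example, not part of the benchmark, and the Strands row uses its 89% baseline figure.

```python
# (multi-step accuracy, cost per task in USD) from the benchmark table.
BENCHMARKS = {
    "LangGraph": (0.94, 0.08),
    "AG2": (0.91, 0.45),
    "OpenAI Agents SDK": (0.90, 0.11),
    "Strands Agents": (0.89, 0.10),
    "CrewAI": (0.87, 0.12),
}


def cost_per_success(accuracy, cost):
    """Expected spend to obtain one successful task.

    Assumption: failures are retried at the same cost and the same
    independent success rate, so expected attempts = 1 / accuracy.
    """
    return cost / accuracy


ranked = sorted(BENCHMARKS, key=lambda name: cost_per_success(*BENCHMARKS[name]))
for name in ranked:
    value = cost_per_success(*BENCHMARKS[name])
    print(f"{name}: ${value:.3f} per successful task")

# The scaffolding spread quoted in the note: same model, two harnesses.
spread = round(64.9 - 57.6, 1)
print(f"Scaffolding spread: {spread} points")
```

Under this assumption the ordering is the same as ordering by raw cost per task: the accuracy differences between frameworks (87% to 94%) are too small to offset the cost differences.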

Critical Axes: Where Frameworks Diverge Most

Checkpointing & State Persistence

Framework	Tier	Detail
LangGraph	Best-in-class	Checkpointer API: time-travel, rewind, replay, cross-session resume. Most-cited reason for production adoption.^[2]
Mastra	Best-in-class (TS)	Workflow state persistence with time-travel debugging — unique in the TypeScript ecosystem.^[19]
MS Agent Framework	Strong	Streaming + checkpointing + pause/resume built into all 5 orchestration patterns.^[4]
Google ADK	Good	Session state with pluggable backends; unified graph engine adds deterministic checkpointing.^[8]
LlamaIndex	Good	LlamaIndex Workflows include checkpointing and human-in-loop options for high-stakes processes.^[18]
AG2	Partial	Event replay preserves conversation history but is not a true workflow checkpoint.
Strands Agents	Developer-managed	No built-in checkpointing; persisting state is the developer's responsibility, consistent with the framework's minimal-ceremony philosophy.^[1]
CrewAI	Weak	No built-in checkpoint/resume. This is the most-cited production limitation; the typical workaround is to rebuild the flow in LangGraph once checkpointing is needed.^[3]
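For readers new to the terms in this table, the sketch below shows what checkpoint, resume, and rewind mean mechanically. It is a toy in-memory model written for this comparison, not the API of LangGraph, Mastra, or any other framework listed; `ToyCheckpointer`, `run`, and the two step functions are hypothetical names.

```python
import copy


class ToyCheckpointer:
    """Hypothetical in-memory store: one list of snapshots per thread id."""

    def __init__(self):
        self._history = {}

    def save(self, thread_id, step, state):
        snapshot = (step, copy.deepcopy(state))
        self._history.setdefault(thread_id, []).append(snapshot)

    def latest(self, thread_id):
        history = self._history.get(thread_id)
        return history[-1] if history else None

    def rewind(self, thread_id, step):
        """Discard snapshots taken after `step` and return the one kept."""
        kept = [entry for entry in self._history[thread_id] if entry[0] <= step]
        self._history[thread_id] = kept
        return kept[-1]


def run(steps, checkpointer, thread_id):
    """Run steps in order, resuming after the last saved step if any."""
    saved = checkpointer.latest(thread_id)
    if saved is None:
        start, state = 0, {}
    else:
        start, state = saved[0] + 1, copy.deepcopy(saved[1])
    for index in range(start, len(steps)):
        state = steps[index](state)
        checkpointer.save(thread_id, index, state)
    return state


calls = []
pending_failures = [True]  # make the second step fail exactly once


def fetch(state):
    calls.append("fetch")
    return {**state, "doc": "raw text"}


def summarize(state):
    calls.append("summarize")
    if pending_failures:
        pending_failures.pop()
        raise RuntimeError("transient failure")
    return {**state, "summary": state["doc"].upper()}


checkpointer = ToyCheckpointer()
try:
    run([fetch, summarize], checkpointer, "thread-1")
except RuntimeError:
    pass  # the first step's result is already checkpointed

# The second run resumes at `summarize`; `fetch` is not repeated.
final = run([fetch, summarize], checkpointer, "thread-1")
print(calls, final)
```

Frameworks rated "Developer-managed" or "Weak" above leave something like this bookkeeping to the application, along with the durable storage and concurrency handling that the toy version ignores.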

MCP (Model Context Protocol) Support

Framework	Level	Detail
Strands Agents	Architecture-level	Built around MCP; semantic search scales tool inventories to thousands of APIs.^[21]
MS Agent Framework	"Infrastructure, not a checkbox"	Dynamic tool discovery from any MCP server without code changes; MCP server and host in one SDK.^[4]
Mastra	First-class	MCP tool sharing built-in; Zod schemas flow end-to-end through MCP tool calls.^[19]
Google ADK	Native (ADK v2.0)	MCP added natively in ADK v2.0, alongside A2A (agent-to-agent) protocol.^[8]
AG2	Native (v0.12)	Built-in MCP support since v0.12.^[1]
CrewAI	DSL + adapters	First-class in the CrewAI DSL with a growing community adapter ecosystem.^[3]
OpenAI Agents SDK	Added v0.7 (2025)	MCP added mid-2025; not a first-class design primitive.^[2]
LangGraph	Via LangChain adapter	Works, but inherits LangChain adapter layer — not a first-class graph primitive.^[2]
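"Dynamic tool discovery" in the table refers to MCP's JSON-RPC 2.0 exchange: a client asks a server for its tools with `tools/list`, then invokes one by name with `tools/call`. The sketch below imitates that exchange in-process. `toy_server`, `rpc_request`, and the `add` tool are hypothetical stand-ins, and the example omits the initialization handshake, capability negotiation, and any real transport, so treat it as an illustration of the message shapes rather than a working MCP implementation.

```python
from itertools import count

_request_ids = count(1)


def rpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request envelope."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Hypothetical tool registry standing in for a real MCP server.
TOOLS = {
    "add": {
        "description": "Add two integers.",
        "inputSchema": {
            "type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"],
        },
        "handler": lambda args: args["a"] + args["b"],
    }
}


def toy_server(request):
    """Answer tools/list and tools/call; reject anything else."""
    method = request["method"]
    if method == "tools/list":
        tools = [
            {
                "name": name,
                "description": tool["description"],
                "inputSchema": tool["inputSchema"],
            }
            for name, tool in TOOLS.items()
        ]
        result = {"tools": tools}
    elif method == "tools/call":
        params = request["params"]
        value = TOOLS[params["name"]]["handler"](params["arguments"])
        result = {"content": [{"type": "text", "text": str(value)}]}
    else:
        error = {"code": -32601, "message": "Method not found"}
        return {"jsonrpc": "2.0", "id": request["id"], "error": error}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}


# The client learns the tool names at runtime, then calls one by name.
listing = toy_server(rpc_request("tools/list"))
tool_names = [tool["name"] for tool in listing["result"]["tools"]]
reply = toy_server(
    rpc_request("tools/call", {"name": "add", "arguments": {"a": 2, "b": 3}})
)
print(tool_names, reply["result"]["content"][0]["text"])
```

Because the client only depends on the listing, adding a tool on the server side requires no client code change, which is the property the "without code changes" claim in the table points to.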

2026 Consolidation Events

Date	Event	Impact
Late 2025	AutoGen placed in maintenance mode by Microsoft	Do not start new projects. Community forked as AG2 to preserve v0.2 API.^[22]
Feb 2026	Google reframes ADK as an agent execution framework (not a toolkit)	Adds unified graph-based engine, GitHub/Jira/MongoDB connectors, OTEL via MLflow, Kotlin for Android.^[8]
Apr 2, 2026	MS Agent Framework v1.0 GA — merges Semantic Kernel + AutoGen	Single unified SDK for .NET and Python; 5 orchestration patterns, 7 LLM providers, first-class MCP + A2A + AG-UI.^[5]
Apr 2026	OpenAI Agents SDK promoted from Swarm experiment to production toolkit	TypeScript SDK reaches Python parity; native sandboxing added; voice (Realtime) becomes first-class.^[6]
2026 (ongoing)	CrewAI v1.14+ adds A2A protocol support	Enables cross-framework agent interop; Enterprise AMP adds RBAC and real-time monitoring.^[1]

Pick Your Path

If you need…

Maximum control over complex workflows

LangGraph ⭐ 26.9k — directed graph, first-class checkpointing, best observability (LangSmith). Budget 1–2 weeks to climb the learning curve.^[3]

Fastest path to a working prototype

CrewAI ⭐ 52.7k — role-based crews, stakeholder-legible abstractions, running in 3–5 days. When branching logic becomes complex, migrate to LangGraph (~1–2 weeks rebuild).^[3]

Simplest API or voice agents

OpenAI Agents SDK ⭐ 33.7k — working agent in 2–3 days, Realtime API for voice, 100+ LLMs. Sequential handoffs only — plan topology before scaling.^[6]

TypeScript / web-stack agents

Mastra ⭐ 24.7k — only mature TS-native agent framework, 3,300+ models, time-travel debug, full HiL suspend/resume, Mastra Studio.^[23]

AWS-native deployment

Strands Agents ⭐ ~6.7k — IAM/VPC/Bedrock-native, 3 primitives, minimal boilerplate. Steering handlers outperform prompt-only: 100% vs 82.5%.^[21]

.NET / Azure / M365 enterprise

MS Agent Framework ⭐ 9.9k — GA April 2026, replaces Semantic Kernel + AutoGen, C#/.NET depth, 7 LLM providers, first-class MCP + A2A.^[5]

Multimodal or cross-org A2A

Google ADK ⭐ 20k — only framework with text+audio+video+image in-loop; A2A protocol native for cross-organization agent interop.^[8]

Agents over private data / RAG

LlamaIndex ⭐ 49.9k — built ground-up for retrieval; agent capabilities layer on the strongest indexing foundation in the ecosystem.^[18]

Code-gen tasks or local LLMs

Smolagents ⭐ 27.7k — 1,000-line core, action primitive is generated Python code, 40-line ReAct agent vs 120 in LangGraph. No production enterprise features.^[15]

Type-safe structured outputs

Pydantic AI ⭐ 17.5k — FastAPI-style DI, schema-validated outputs with auto self-correction. Not a general-purpose framework — pair with LangGraph or Mastra for orchestration.^[16]
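The "schema-validated outputs with auto self-correction" pattern can be pictured as a validate-and-retry loop: check the model's reply against a schema and, on failure, send the validation error back with the prompt. The sketch below shows that loop with the standard library only. It is not Pydantic AI's API; `SCHEMA`, `validate`, `run_with_correction`, and `fake_model` are hypothetical names, and the replies (including the population figure) are made-up illustration data.

```python
import json

# Hypothetical output schema: field name -> required Python type.
SCHEMA = {"city": str, "population": int}


def validate(raw):
    """Parse `raw` as JSON and check it against SCHEMA.

    Returns (data, None) on success or (None, error_message) on failure.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    for field, expected in SCHEMA.items():
        if field not in data:
            return None, f"missing field '{field}'"
        if not isinstance(data[field], expected):
            return None, f"field '{field}' must be {expected.__name__}"
    return data, None


def run_with_correction(model, prompt, max_retries=2):
    """Call `model` until its reply validates, feeding errors back to it."""
    message = prompt
    error = None
    for _ in range(max_retries + 1):
        data, error = validate(model(message))
        if error is None:
            return data
        message = (
            f"{prompt}\nYour previous answer was rejected: {error}. "
            "Reply with corrected JSON only."
        )
    raise ValueError(f"no valid output after {max_retries + 1} attempts: {error}")


# Stand-in for an LLM call: the first reply has the wrong type, the second is valid.
replies = iter(
    [
        '{"city": "Oslo", "population": "many"}',
        '{"city": "Oslo", "population": 709000}',
    ]
)
seen_messages = []


def fake_model(message):
    seen_messages.append(message)
    return next(replies)


result = run_with_correction(fake_model, "Describe Oslo as JSON.")
print(result)
```

The retry budget matters in practice: each correction round is another model call, so the loop trades token cost for output reliability.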

AutoGen v0.2 API continuity

AG2 ⭐ 4.6k — community fork of AutoGen preserving v0.2 API with event-driven async and MemoryStream pub/sub; 91% accuracy, highest quality for write+review+debug loops. Highest cost at $0.45/task.^[1]

Citations · 23 sources