Atlas survey

Head-to-Head: Agent Harness Framework Comparison Matrix

A 15-dimension matrix across 12 agent harness frameworks: LangGraph leads on production control, CrewAI on time-to-prototype, and Mastra for TypeScript teams. MS Agent Framework (April 2026) replaces Semantic Kernel and AutoGen.

23 sources ~7 min read #183 ai-agents · agent-frameworks · langgraph · crewai · openai-agents · comparison-matrix · 2026
Decision — quick picks by use case
  • Most control, production-grade: LangGraph ⭐ 26.9k — 94% multi-step accuracy at $0.08/task, LangGraph Studio, first-class checkpointing & interrupt(). Cost: 1–2 week learning curve.[1]
  • Fastest prototype: CrewAI ⭐ 52.7k — role-based crews, running in 3–5 days. Hit the branching ceiling → migrate to LangGraph.[3]
  • Lowest cognitive overhead / voice: OpenAI Agents SDK ⭐ 33.7k — working agent in 2–3 days, Realtime API for voice; sequential handoffs only.[6]
  • TypeScript / web stack: Mastra ⭐ 24.7k — only mature TS-native framework, 3,300+ models, time-travel debug, first-class HiL suspend/resume.[23]
  • .NET / Azure / M365 enterprise: MS Agent Framework ⭐ 9.9k — GA April 2, 2026, replaces Semantic Kernel + AutoGen, 7 LLM providers, first-class MCP + A2A.[5]
  • AWS-native / minimal ceremony: Strands Agents ⭐ ~6.7k, with 3 primitives and IAM/VPC-native deployment; steering handlers reach 100% vs 82.5% for prompt-only on the same tasks.[21]
  • Multimodal (text+audio+video+image) or cross-org A2A: Google ADK ⭐ 20k — only framework with full in-loop multimodal; A2A protocol native.[8]
  • Avoid starting new projects on: AutoGen (placed in maintenance mode by Microsoft in late 2025); move to AG2 or MS Agent Framework.[22]

Full Feature Matrix

Framework School License Lang Checkpoint HiL MCP Parallel Code Exec Voice / Multimodal Multi-LLM Learning Curve Studio / Debugger Observability Enterprise
LangGraph ⭐ 26.9k[9] Graph-explicit MIT Py + TS ✓ First-class interrupt() ⚠ Adapter ✓ Reducer-based ⚠ Custom ✓ Any LLM 1–2 weeks ✓ LG Studio LangSmith RBAC + audit
Google ADK ⭐ 20k[13] Graph-explicit Apache 2.0 Py ✓ Pluggable backends ✓ Native (v2.0) ✓ Agent tree ✓ Text+audio+video+image ⚠ Gemini-first Moderate ⚠ OTEL/MLflow OTEL + MLflow A2A native
CrewAI ⭐ 52.7k[10] Role-declarative Apache 2.0 Py ⚠ Callbacks ✓ DSL + adapters ⚠ Limited ✓ Per-agent LLM 3–5 days ✓ AMP editor AMP dashboard ⚠ Enterprise AMP
AutoGen ⭐ 58.7k ⚠ In maintenance mode since late 2025 (Microsoft); no new feature work. Migrate to AG2 (community fork) or MS Agent Framework.[22]
AG2 ⭐ 4.6k[12] (AutoGen fork) Conv-emergent Apache 2.0 Py ⚠ Event replay ⚠ GroupChat ✓ v0.12 native ⚠ Limited ✓ Docker/local Moderate ⚠ Custom
OpenAI Agents SDK ⭐ 33.7k[11] Handoff-centric MIT Py + TS ⚠ Session (2026) ⚠ Approval callbacks ⚠ v0.7+ ⚠ Sequential only ⚠ Sandbox ✓ Realtime API ⚠ OpenAI-first 2–3 days ⚠ Dashboard Native tracing ⚠ Basic guardrails
Strands Agents ⭐ ~6.7k[20] Model-driven Apache 2.0 Py ⚠ Dev-managed ⚠ Custom tools ✓ First-class ⚠ 4 patterns ✓ AgentCore ✓ Bedrock/Anthropic/OpenAI/Ollama Low (3 primitives) ⚠ OTEL AWS CloudTrail + OTEL AWS IAM/VPC
MS Agent Framework ⭐ 9.9k[14] v1.0 Apr 2026 Platform-integrated MIT Py + .NET ✓ All 5 patterns ✓ pause/resume ✓ First-class ✓ Concurrent fan-out ✓ 7 providers Moderate ⚠ OTEL-based Azure RBAC + M365
Smolagents ⭐ 27.7k[15] Code-as-action Apache 2.0 Py ⚠ Basic ⚠ Limited ✓ Native (code agent) ✓ Local LLMs Very low (40 lines) ⚠ Basic
Pydantic AI ⭐ 17.5k[16] Type-safe pipeline MIT Py ⚠ Basic ⚠ Basic ⚠ Limited ✓ Any provider Low (FastAPI-style) ⚠ Basic
Mastra ⭐ 24.7k[17] TS-native Apache 2.0 TS only ✓ Time-travel ✓ Suspend/resume ✓ First-class ✓ 3,300+ models Moderate ✓ Mastra Studio Built-in evals + tracing ⚠ Growing
LlamaIndex ⭐ 49.9k[18] Data-first MIT Py ✓ Workflows ✓ Multiple Moderate ⚠ LlamaTrace LlamaCloud + LlamaTrace ⚠ LlamaCloud paid

⚠ = partial/limited  ·  HiL = Human-in-the-Loop  ·  Sources: [2][6]

Performance Benchmarks

Framework | Multi-Step Accuracy | Cost / Task | Notes
LangGraph | 94% | $0.08 | Stateful caching cuts LLM calls 40–50% on repeat workflows[6]
AG2 | 91% | $0.45 | Highest accuracy on code write+review+debug; highest token cost[1]
OpenAI Agents SDK | 90% | $0.11 | [2]
Strands Agents | 89% (→ 100% with steering handlers) | $0.10 | Steering handlers vs prompt-only: 100% vs 82.5% on same tasks[21]
CrewAI | 87% | $0.12 | Up to 3× token overhead vs LangGraph on simple tasks[6]
⚠ Harness selection has performance consequences, not just ergonomic ones. The same Claude Opus 4 model scores 64.9% inside one scaffolding and 57.6% inside another on identical benchmark tasks — a 7.3-point spread with no model change, no prompt change.[7] Benchmark figures above: Lushbinary Agent Benchmark Q1 2026.[1] Google ADK, MS Agent Framework, Smolagents, Pydantic AI, Mastra, and LlamaIndex do not yet have published apples-to-apples numbers on this benchmark.
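
One way to read the accuracy and cost columns together is expected cost per successfully completed task, that is, cost per task divided by accuracy. The sketch below applies that to the five benchmarked frameworks. The metric is an assumption introduced here for illustration (it treats every failed run as a full-cost retry) and is not part of the cited benchmark; the function names are hypothetical.

```python
# Figures copied from the benchmark table above: (accuracy, cost per task in USD).
BENCHMARKS = {
    "LangGraph": (0.94, 0.08),
    "AG2": (0.91, 0.45),
    "OpenAI Agents SDK": (0.90, 0.11),
    "Strands Agents": (0.89, 0.10),
    "CrewAI": (0.87, 0.12),
}


def cost_per_success(accuracy: float, cost_per_task: float) -> float:
    """Expected spend per successful task, assuming failed runs cost the same
    as successful ones and are simply retried (an illustrative assumption)."""
    return cost_per_task / accuracy


def rank_by_cost_per_success(benchmarks: dict) -> list:
    """Return (framework, cost per success) pairs, cheapest first."""
    scored = [
        (name, cost_per_success(accuracy, cost))
        for name, (accuracy, cost) in benchmarks.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    for name, value in rank_by_cost_per_success(BENCHMARKS):
        print(f"{name}: ${value:.3f} per successful task")
```

Under this assumption the ordering is the same as the raw cost ordering: the accuracy differences between frameworks are too small to offset the cost differences, and AG2 remains several times more expensive per success than LangGraph.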

Critical Axes: Where Frameworks Diverge Most

Checkpointing & State Persistence

Framework | Tier | Detail
LangGraph | Best-in-class | Checkpointer API: time-travel, rewind, replay, cross-session resume. Most-cited reason for production adoption.[2]
Mastra | Best-in-class (TS) | Workflow state persistence with time-travel debugging, unique in the TypeScript ecosystem.[19]
MS Agent Framework | Strong | Streaming + checkpointing + pause/resume built into all 5 orchestration patterns.[4]
Google ADK | Good | Session state with pluggable backends; unified graph engine adds deterministic checkpointing.[8]
LlamaIndex | Good | LlamaIndex Workflows include checkpointing and human-in-loop options for high-stakes processes.[18]
AG2 | Partial | Event replay preserves conversation history but is not a true workflow checkpoint.
Strands Agents | Developer-managed | No built-in checkpointing; developer responsibility to persist state. Consistent with minimal-ceremony philosophy.[1]
CrewAI | Weak | No built-in checkpoint/resume. Most-cited production limitation: typical workaround is rebuilding the flow in LangGraph once checkpointing is needed.[3]
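
For frameworks in the developer-managed tier, checkpoint and resume has to be built by hand. A minimal, framework-agnostic sketch of the idea follows: persist the index of the next step plus the accumulated state after every step, and skip completed steps on restart. The names `run_with_checkpoints`, `load_checkpoint`, `save_checkpoint`, and the JSON file layout are hypothetical and do not correspond to any framework's API; the state is assumed to be JSON-serializable.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List

Step = Callable[[Dict], Dict]  # a step takes the state and returns the updated state


def load_checkpoint(path: str) -> Dict:
    """Return the saved checkpoint, or a fresh one if no file exists yet."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    return {"next_step": 0, "state": {}}


def save_checkpoint(path: str, next_step: int, state: Dict) -> None:
    """Write atomically: dump to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"next_step": next_step, "state": state}, fh)
    os.replace(tmp_path, path)


def run_with_checkpoints(steps: List[Step], path: str) -> Dict:
    """Run steps in order, resuming after the last step that completed."""
    checkpoint = load_checkpoint(path)
    state = checkpoint["state"]
    for index in range(checkpoint["next_step"], len(steps)):
        state = steps[index](state)
        # Persist only after the step succeeds, so a crash re-runs that step.
        save_checkpoint(path, index + 1, state)
    return state
```

If a step raises, calling `run_with_checkpoints` again with the same path re-runs only the failed step and those after it. Time-travel and replay, as offered by the best-in-class tier, would additionally require keeping every checkpoint rather than overwriting the latest one.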

MCP (Model Context Protocol) Support

Framework | Level | Detail
Strands Agents | Architecture-level | Built around MCP; semantic search scales tool inventories to thousands of APIs.[21]
MS Agent Framework | "Infrastructure, not a checkbox" | Dynamic tool discovery from any MCP server without code changes; MCP server and host in one SDK.[4]
Mastra | First-class | MCP tool sharing built-in; Zod schemas flow end-to-end through MCP tool calls.[19]
Google ADK | Native (ADK v2.0) | MCP added natively in ADK v2.0, alongside A2A (agent-to-agent) protocol.[8]
AG2 | Native (v0.12) | Built-in MCP support since v0.12.[1]
CrewAI | DSL + adapters | First-class in the CrewAI DSL with a growing community adapter ecosystem.[3]
OpenAI Agents SDK | Added v0.7 (2025) | MCP added mid-2025; not a first-class design primitive.[2]
LangGraph | Via LangChain adapter | Works, but inherits the LangChain adapter layer; not a first-class graph primitive.[2]
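
Dynamic tool discovery rests on MCP being a JSON-RPC 2.0 protocol: a client lists a server's tools with the `tools/list` method and invokes one with `tools/call`. The sketch below only builds those request payloads. The helper names and the `search_docs` tool are hypothetical, and the transport, the initialization handshake, and error handling are all omitted.

```python
import itertools
import json
from typing import Any, Dict, Optional

_request_ids = itertools.count(1)  # each JSON-RPC request carries a unique id


def jsonrpc_request(method: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request object."""
    message: Dict[str, Any] = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
    }
    if params is not None:
        message["params"] = params
    return message


def list_tools_request() -> Dict[str, Any]:
    """Ask an MCP server which tools it exposes."""
    return jsonrpc_request("tools/list")


def call_tool_request(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Invoke one tool by name with JSON arguments."""
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


if __name__ == "__main__":
    print(json.dumps(list_tools_request()))
    # "search_docs" is a made-up tool name used only for illustration.
    print(json.dumps(call_tool_request("search_docs", {"query": "checkpointing"})))
```

A framework with first-class MCP support issues the equivalent of `list_tools_request()` at runtime and exposes whatever comes back to the model, which is why new tools can appear without code changes on the agent side.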

2026 Consolidation Events

Date | Event | Impact
Late 2025 | AutoGen placed in maintenance mode by Microsoft | Do not start new projects. Community forked as AG2 to preserve v0.2 API.[22]
Feb 2026 | Google reframes ADK as an agent execution framework (not a toolkit) | Adds unified graph-based engine, GitHub/Jira/MongoDB connectors, OTEL via MLflow, Kotlin for Android.[8]
Apr 2, 2026 | MS Agent Framework v1.0 GA, merging Semantic Kernel + AutoGen | Single unified SDK for .NET and Python; 5 orchestration patterns, 7 LLM providers, first-class MCP + A2A + AG-UI.[5]
Apr 2026 | OpenAI Agents SDK promoted from Swarm experiment to production toolkit | TypeScript SDK reaches Python parity; native sandboxing added; voice (Realtime) becomes first-class.[6]
2026 (ongoing) | CrewAI v1.14+ adds A2A protocol support | Enables cross-framework agent interop; Enterprise AMP adds RBAC and real-time monitoring.[1]

Pick Your Path

If you need maximum control over complex workflows:
LangGraph ⭐ 26.9k: directed graph, first-class checkpointing, best observability (LangSmith). Budget 1–2 weeks to climb the learning curve.[3]

If you need the fastest path to a working prototype:
CrewAI ⭐ 52.7k: role-based crews, stakeholder-legible abstractions, running in 3–5 days. When branching logic becomes complex, migrate to LangGraph (~1–2 weeks rebuild).[3]

If you need the simplest API or voice agents:
OpenAI Agents SDK ⭐ 33.7k: working agent in 2–3 days, Realtime API for voice, 100+ LLMs. Sequential handoffs only, so plan topology before scaling.[6]

If you need TypeScript / web-stack agents:
Mastra ⭐ 24.7k: only mature TS-native agent framework, 3,300+ models, time-travel debug, full HiL suspend/resume, Mastra Studio.[23]

If you need AWS-native deployment:
Strands Agents ⭐ ~6.7k: IAM/VPC/Bedrock-native, 3 primitives, minimal boilerplate. Steering handlers outperform prompt-only: 100% vs 82.5%.[21]

If you need .NET / Azure / M365 enterprise:
MS Agent Framework ⭐ 9.9k: GA April 2026, replaces Semantic Kernel + AutoGen, C#/.NET depth, 7 LLM providers, first-class MCP + A2A.[5]

If you need multimodal or cross-org A2A:
Google ADK ⭐ 20k: only framework with text+audio+video+image in-loop; A2A protocol native for cross-organization agent interop.[8]

If you need agents over private data / RAG:
LlamaIndex ⭐ 49.9k: built ground-up for retrieval; agent capabilities layer on the strongest indexing foundation in the ecosystem.[18]

If you need code-gen tasks or local LLMs:
Smolagents ⭐ 27.7k: 1,000-line core, action primitive is generated Python code, 40-line ReAct agent vs 120 in LangGraph. No production enterprise features.[15]

If you need type-safe structured outputs:
Pydantic AI ⭐ 17.5k: FastAPI-style DI, schema-validated outputs with auto self-correction. Not a general-purpose framework; pair with LangGraph or Mastra for orchestration.[16]

If you need AutoGen v0.2 API continuity:
AG2 ⭐ 4.6k: community fork of AutoGen preserving v0.2 API with event-driven async and MemoryStream pub/sub; 91% accuracy, highest quality for write+review+debug loops. Highest cost at $0.45/task.[1]
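
The picks above amount to a lookup from primary requirement to framework. A small sketch of that mapping follows; the requirement keys are hypothetical labels chosen here for illustration, and the values are the survey's picks as listed.

```python
from typing import Optional

# Keys are made-up labels for the needs listed above; values are the survey's picks.
PICKS = {
    "max_control": "LangGraph",
    "fast_prototype": "CrewAI",
    "simple_api_or_voice": "OpenAI Agents SDK",
    "typescript": "Mastra",
    "aws_native": "Strands Agents",
    "dotnet_azure_m365": "MS Agent Framework",
    "multimodal_or_a2a": "Google ADK",
    "rag_private_data": "LlamaIndex",
    "codegen_or_local_llm": "Smolagents",
    "typed_outputs": "Pydantic AI",
    "autogen_v02_continuity": "AG2",
}


def pick_framework(need: str) -> Optional[str]:
    """Return the survey's pick for a need, or None if the need is not covered."""
    return PICKS.get(need)
```

A real selection usually weighs several needs at once, which a single-key lookup does not capture; the sketch only encodes the one-need-one-pick shape of the list.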
