Atlas survey

Agent Harness Showdown: When Each Wins + the Risk

Framework choice alone swings benchmark scores 30 points on identical models — here is when each of the seven major agent harnesses wins and what breaks in production.

18 sources · ~9 min read · #183
Tags: ai-agents · agent-frameworks · langgraph · crewai · openai-agents · mastra · strands-agents · google-adk · microsoft-agent-framework · risk-assessment · 2026
Decision: LangGraph when state durability is non-negotiable; CrewAI to have a working demo by Friday (use Flows, not raw crews in prod); Mastra for TypeScript-first backends; Strands/ADK if already AWS/GCP-native; OpenAI Agents SDK only when GPT-committed. The systemic catch: framework choice alone swings benchmark performance 30 points on identical models,[4] and 95% of enterprise pilots never reach production regardless of which framework is picked.[4]

When Each Wins

| Framework | ✓ Pick when | ✗ Hard pass when |
| --- | --- | --- |
| LangGraph ⭐ 33.7k [9] | Production durability required: durable state, human-approval checkpoints, audit trail for compliance. Regulated industries (finance, healthcare). Graph-topology workflows with conditional branching and retries. 96% error-recovery rate on multi-step tasks.[2] | Simple linear single-agent loops with no branching, where graph overhead is pure waste. Need a working prototype in hours, not days.[1] |
| CrewAI ⭐ 52.7k [10] | Fastest path to working demo (2–4 hours). Workflows that map 1:1 to human specialist roles (researcher, writer, reviewer). First-class MCP integration. 450M monthly workflows in the wild.[1] | Complex conditional branching and explicit state snapshots. Running without Flows in production: raw crews have determinism problems within months of going live.[2] |
| OpenAI Agents SDK ⭐ 26.9k [11] | Linear triage/handoff chains (Triage → Billing → Tech Support). Sandboxed code execution without third-party container wiring. GPT-committed team wanting <100 lines to a working system.[1] | Need model flexibility (Claude, Gemini, open-source). Long-horizon workflows requiring durable checkpoints. TypeScript teams: support is "planned" with no timeline.[3] |
| Mastra ⭐ 24.7k [12] | TypeScript-first backend agents needing memory, evaluation, and HITL in one package. Multi-step workflows where some steps should be deterministic TypeScript and others LLM-driven. Production features (observability, evals) from day one.[8] | Python-first teams. Thin chat layer or single model call, where framework overhead is unwarranted.[18] |
| Strands Agents ⭐ 6.0k [15] | AWS-native teams (used in production by Amazon Q Developer, AWS Glue). Dynamic, unpredictable domains where you can't enumerate all workflows upfront. Model-driven orchestration preferred over hardcoded graphs.[6] | Compliance mandates requiring deterministic execution paths. Systems needing mandatory checkpoints and audit trails.[6] |
| Google ADK ⭐ 20.0k [13] | GCP-native enterprise teams. Polyglot requirements (Python, TypeScript, Go, Java in one orchestration layer). A2A protocol needed for agent interoperability. One-command path to Vertex AI Agent Engine, Cloud Run, or GKE.[17] | Off-GCP deployments where Google Cloud costs are a problem. MCP-native integrations needed: ADK uses adapters, not native MCP.[3] |
| MS Agent Framework ⭐ 11.0k [14] | Azure/M365 ecosystem with Microsoft 365, Graph API, Entra, or Fabric integrations. Native Python + .NET interop in one orchestration layer. Multi-agent deliberation where quality beats latency.[2] | Cost-sensitive applications: GroupChat generates 20+ LLM calls per interaction. Off-Azure footprint.[3] |

Risk Profiles

LangGraph

⭐ 33.7k
  • complexity: Graph state design is irreversible; retrofitting adds costs proportional to the original build.[4]
  • latency: LangChain dependency pulls in a heavy abstraction stack, increasing bundle size and inference overhead.[3]
  • ops: Free tier caps at 100K node executions/month; self-hosting the platform requires an enterprise contract.[5]
  • maintenance: State schema migrations are painful as domain requirements evolve.[2]
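
One common way to soften the schema-migration pain is to version persisted state and upgrade old checkpoints on load. The following is a minimal sketch in plain Python, not LangGraph's API; the field names (`customer`, `approvals`, `schema_version`) and the migration functions are hypothetical.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]

# Hypothetical migrations: each one upgrades a state dict by exactly one version.
def v1_to_v2(state: State) -> State:
    # v2 replaces the plain "customer" string with a structured record.
    upgraded = dict(state)
    upgraded["customer"] = {"name": state["customer"], "tier": "unknown"}
    upgraded["schema_version"] = 2
    return upgraded

def v2_to_v3(state: State) -> State:
    # v3 adds a list that records human approvals.
    upgraded = dict(state)
    upgraded.setdefault("approvals", [])
    upgraded["schema_version"] = 3
    return upgraded

MIGRATIONS: Dict[int, Callable[[State], State]] = {1: v1_to_v2, 2: v2_to_v3}
CURRENT_VERSION = 3

def upgrade(state: State) -> State:
    """Apply migrations in order until the state reaches CURRENT_VERSION."""
    version = state.get("schema_version", 1)
    while version < CURRENT_VERSION:
        state = MIGRATIONS[version](state)
        version = state["schema_version"]
    return state

old_checkpoint = {"schema_version": 1, "customer": "Acme", "step": "review"}
new_state = upgrade(old_checkpoint)
print(new_state)
```

Each migration copies the state instead of mutating it, so a failed upgrade leaves the stored checkpoint untouched.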

CrewAI

⭐ 52.7k
  • cost: Up to 3× higher token consumption than LangGraph on simple single-tool workflows, a silent cost multiplier at scale.[4]
  • debug: Without Flows, control flow is encoded into agent prompts, so debugging is prompt editing, not state inspection.[2]
  • ops: Delegation loops consume tokens with no built-in circuit-breaker; standard Python logging doesn't propagate cleanly inside Task callbacks.[1]
  • price: AMP Suite (enterprise governance) costs $120K/year; open-source lacks SOC 2 and RBAC needed by regulated buyers.[5]
  • ops: Enterprise deployments report 20-minute delays in "Pending Run" status at peak load.[4]
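
Because delegation loops have no built-in circuit-breaker, teams typically wrap the loop in their own budget guard. The sketch below is framework-agnostic Python, not CrewAI's API; `CircuitBreaker`, `run_with_breaker`, and the `delegate` callable (which returns a `(done, tokens_used)` pair) are hypothetical names chosen for illustration.

```python
from typing import Callable, Tuple

class DelegationBudgetExceeded(RuntimeError):
    """Raised when a run exceeds its delegation or token budget."""

class CircuitBreaker:
    def __init__(self, max_delegations: int, max_tokens: int) -> None:
        self.max_delegations = max_delegations
        self.max_tokens = max_tokens
        self.delegations = 0
        self.tokens = 0

    def record(self, tokens_used: int) -> None:
        # Count the delegation first, then check both budgets.
        self.delegations += 1
        self.tokens += tokens_used
        if self.delegations > self.max_delegations:
            raise DelegationBudgetExceeded(
                f"delegation limit {self.max_delegations} exceeded")
        if self.tokens > self.max_tokens:
            raise DelegationBudgetExceeded(
                f"token limit {self.max_tokens} exceeded")

def run_with_breaker(delegate: Callable[[], Tuple[bool, int]],
                     breaker: CircuitBreaker) -> str:
    """Call delegate() until it reports done or the breaker trips."""
    try:
        while True:
            done, tokens_used = delegate()
            breaker.record(tokens_used)
            if done:
                return "completed"
    except DelegationBudgetExceeded as exc:
        return f"aborted: {exc}"

# Simulated runaway loop: the agent never reports completion.
def looping_delegate() -> Tuple[bool, int]:
    return (False, 1200)

breaker = CircuitBreaker(max_delegations=5, max_tokens=50_000)
outcome = run_with_breaker(looping_delegate, breaker)
print(outcome)
```

In this simulation the delegation limit trips before the token limit; real budgets would be tuned per workflow.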

OpenAI Agents SDK

⭐ 26.9k
  • lock-in: Hosted tools (Threads, Vector Stores, Code Interpreter) store data on OpenAI's platform, so migration is data migration, not just code rewrite.[7]
  • durability: No native state persistence; a mid-handoff crash loses all conversation context without custom checkpointing infrastructure.[3]
  • platform: Python-only as of June 2026; TypeScript support has no committed timeline.[3]
  • ops: GPT model deprecations force system rewrites; pricing changes are outside your control.[2]
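
The "custom checkpointing infrastructure" mentioned under durability can start very small. The following sketch persists context to a JSON file after every completed step and resumes from it after a crash. It is plain Python and does not use the OpenAI Agents SDK; `run_chain`, the step functions, and the file layout are assumptions made for illustration, and a production system would more likely use a database.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Step = Callable[[Context], Context]

def run_chain(steps: List[Step], context: Context, path: str) -> Context:
    """Run steps in order, saving the context after each completed step."""
    if os.path.exists(path):
        # Resume: a previous run left a checkpoint behind.
        with open(path) as f:
            context = json.load(f)
    start = int(context.get("next_step", 0))
    for index in range(start, len(steps)):
        context = steps[index](context)
        context["next_step"] = index + 1
        # Write to a temp file, then rename, so a crash never leaves
        # a half-written checkpoint.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(context, f)
        os.replace(tmp_path, path)
    return context

def triage(ctx: Context) -> Context:
    return {**ctx, "route": "billing"}

crashed = {"once": False}

def billing(ctx: Context) -> Context:
    # Fail on the first attempt to simulate a mid-handoff crash.
    if not crashed["once"]:
        crashed["once"] = True
        raise RuntimeError("simulated mid-handoff crash")
    return {**ctx, "invoice": "resent"}

checkpoint = os.path.join(tempfile.mkdtemp(), "chain.json")
steps = [triage, billing]
try:
    run_chain(steps, {"user": "u1"}, checkpoint)
except RuntimeError:
    pass  # the process "dies" here; triage's output is already on disk

result = run_chain(steps, {}, checkpoint)
print(result)
```

The second call starts with an empty context yet still recovers the user and routing decision, because they were checkpointed before the crash.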

Mastra

⭐ 24.7k
  • ecosystem: TypeScript-only; Python-first or data-science teams need a language boundary in their stack.[8]
  • maturity: Younger ecosystem, with fewer copy-paste answers and more first-principles debugging.[18]
  • deps: Peer dependency conflicts with AI SDK versions have caused friction in upgrades.[8]
  • cost: Naive conversation memory grows linearly; context-window overflows silently inflate token costs as sessions lengthen.[18]
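
The memory cost is easy to underestimate: if every call resends the full history, per-call input grows linearly, so the cumulative bill over a session grows roughly quadratically. The sketch below works through that arithmetic in Python (chosen for consistency with the other examples here, although Mastra itself is TypeScript). The turn count, tokens per turn, and window size are illustrative assumptions, not measurements.

```python
from typing import Optional

def cumulative_input_tokens(turns: int, tokens_per_turn: int,
                            window: Optional[int] = None) -> int:
    """Total input tokens over a session when each call resends history.

    window caps how many recent turns are resent; None resends everything.
    """
    total = 0
    for turn in range(1, turns + 1):
        history_turns = turn if window is None else min(turn, window)
        total += history_turns * tokens_per_turn
    return total

naive = cumulative_input_tokens(turns=50, tokens_per_turn=200)
windowed = cumulative_input_tokens(turns=50, tokens_per_turn=200, window=10)
print(naive, windowed)  # 255000 91000
```

Under these assumptions a 10-turn sliding window cuts input tokens by about 64% over a 50-turn session, at the price of forgetting older turns.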

Strands Agents

⭐ 6.0k
  • determinism: Non-deterministic by design, so traditional unit tests are insufficient; continuous LLM-as-judge evaluation and statistical baselines are required.[6]
  • architecture: The orchestrator agent is a single point of failure (SPOF); one bad routing decision propagates through all downstream specialists with no graph-level guard.[16]
  • maturity: Launched May 2025; fewer battle-tested patterns than LangGraph or CrewAI at equivalent scale.[6]
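
A "statistical baseline" for a non-deterministic agent can be as simple as a pass rate with a confidence interval, compared against a stored baseline. The sketch below uses the Wilson score interval. It is one possible approach, not Strands' own evaluation tooling, and the pass counts and the 0.85 baseline are made-up numbers.

```python
import math
from typing import Tuple

def pass_rate_interval(passes: int, trials: int,
                       z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    p = passes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z ** 2 / (4 * trials ** 2))
    return centre - half, centre + half

def regressed(passes: int, trials: int, baseline_rate: float) -> bool:
    """Flag a regression only when the whole interval is below the baseline."""
    _, upper = pass_rate_interval(passes, trials)
    return upper < baseline_rate

# Hypothetical judge results from 100 runs of the same task suite.
print(regressed(passes=70, trials=100, baseline_rate=0.85))  # True
print(regressed(passes=82, trials=100, baseline_rate=0.85))  # False
```

Comparing the interval rather than the raw rate keeps ordinary run-to-run noise (82 of 100 here) from triggering false alarms.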

Google ADK

⭐ 20.0k
  • lock-in: Apache 2.0 license obscures GCP pull; Gemini-optimized internals and an experimental MCP adapter (not native) create friction when running off Google Cloud.[3]
  • maturity: Visual Builder is experimental; the smaller community means fewer battle-tested multi-language patterns.[5]
  • ops: Go/Java agent support adds operational complexity that Python/TS-only shops often underestimate.[17]

MS Agent Framework

⭐ 11.0k
  • cost: GroupChat pattern generates 20+ LLM calls per interaction through accumulated history, so cost balloons in any scenario with extended deliberation.[3]
  • lock-in: Distributed actor model uses Azure Cosmos DB for durability; cross-cloud portability requires rebuilding the state layer.[2]
  • migration: Migration from AutoGen/Semantic Kernel creates technical debt; AutoGen entered maintenance mode October 2025.[5]

Universal Risks (Framework-Agnostic)

  • The 30-point orchestration gap. Framework choice alone moves benchmark performance by up to 30 points on identical models. The cited example is narrower: Claude Opus 4 scored 64.9% vs 57.6% on GAIA across different orchestration scaffolds (Princeton HAL), a 7.3-point spread. A generational model upgrade rarely delivers a 30-point swing.[4]
  • Lab-to-production collapse. The CLEAR framework documents a 37% average gap between lab benchmark scores and production deployment performance. Public leaderboards hide this failure mode entirely — a framework that tops your benchmark suite may behave very differently once real users drive it.[4]
  • The 5% survival rate. MIT analysis of 300+ enterprise implementations found only ~5% move from pilot to production. Root causes: absent observability, missing HITL primitives, no cost discipline from day one — not framework choice itself. Any framework picked without these foundations will fail.[4]
  • State management lock-in. The hardest lock-in to escape is not code — it's state history. An enterprise that builds six months of customer interaction history on a platform-native stateful runtime cannot migrate to a different model provider without rebuilding the memory layer from scratch.[7]
  • Token cost multiplication. LLM API calls are 40–60% of total agent operating costs, so token overhead lands on the largest line item. A framework adding 40% token overhead raises that line item by 40% and total operating cost by roughly 16–24%; a 3× multiplier triples the line item and roughly doubles the total. CrewAI's role-playing overhead (3× vs LangGraph on simple workflows) is the most common surprise in first production invoices.[4]
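
The token-cost arithmetic can be checked directly. The sketch below assumes LLM spend scales linearly with token count and that all other operating costs stay fixed; `total_cost_increase` is a hypothetical helper, and the 40% and 60% shares come from the range quoted above.

```python
def total_cost_increase(llm_share: float, token_multiplier: float) -> float:
    """Fractional rise in total operating cost when token usage is multiplied.

    Assumes LLM spend scales linearly with tokens and other costs are fixed.
    """
    return llm_share * (token_multiplier - 1.0)

scenarios = {"40% token overhead": 1.4, "3x token consumption": 3.0}
for label, multiplier in scenarios.items():
    low = total_cost_increase(0.40, multiplier)
    high = total_cost_increase(0.60, multiplier)
    print(f"{label}: total cost rises {low:.0%} to {high:.0%}")
```

Under these assumptions, 40% token overhead raises total cost by about 16% to 24%, while 3× token consumption raises it by 80% to 120%.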

Migration Patterns

| Path | Trigger | Effort | Key step |
| --- | --- | --- | --- |
| CrewAI → LangGraph | Control-flow ceiling hit in prod; token costs spiking; debugging via prompt editing | 1–2 weeks (moderate complexity) | Map each crew to a node, convert process type to explicit graph edges, move shared context to LangGraph's typed state object[1] |
| OpenAI SDK → LangGraph | Model flexibility needed; long-horizon state; handoff chain becomes a branching graph | 1–3 weeks depending on handoff depth | Replace handoff objects with graph edges; add checkpoint backend (Postgres/Redis); extract OpenAI-specific tool schemas to model-agnostic format[2] |
| AutoGen → MS Agent Framework | AutoGen maintenance-mode announcement (Oct 2025); Azure ecosystem investment | Weeks to months (actor model rewrite) | Microsoft provides a migration guide; GroupChat patterns map to distributed actor conversations; existing Semantic Kernel skills plug in directly[5] |
| Any → Strands/ADK | AWS or GCP commitment; existing framework too rigid for dynamic requirements | Low if on respective cloud; high if off-cloud | Port tools as typed functions; replace graph edges with model-driven routing; plan for eval infrastructure from day one[6] |
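
The key step in the CrewAI → LangGraph path (one node per role, explicit edges, shared context in a typed state object) can be illustrated without either library. The sketch below is plain Python with a toy runner; it does not use LangGraph's or CrewAI's APIs, and the roles, state fields, and `run` function are hypothetical.

```python
from typing import Callable, Dict, Optional, TypedDict

class PipelineState(TypedDict, total=False):
    """Shared context that crew members previously passed implicitly."""
    topic: str
    research: str
    draft: str
    approved: bool

Node = Callable[[PipelineState], PipelineState]

# Each former crew role becomes one node: a function from state to state.
def researcher(state: PipelineState) -> PipelineState:
    return {**state, "research": f"notes on {state['topic']}"}

def writer(state: PipelineState) -> PipelineState:
    return {**state, "draft": f"article from {state['research']}"}

def reviewer(state: PipelineState) -> PipelineState:
    return {**state, "approved": len(state["draft"]) > 0}

NODES: Dict[str, Node] = {
    "researcher": researcher, "writer": writer, "reviewer": reviewer}

# A sequential process becomes explicit edges; None marks the end.
EDGES: Dict[str, Optional[str]] = {
    "researcher": "writer", "writer": "reviewer", "reviewer": None}

def run(entry: str, state: PipelineState) -> PipelineState:
    """Toy runner: follow edges from the entry node until the graph ends."""
    current: Optional[str] = entry
    while current is not None:
        state = NODES[current](state)
        current = EDGES[current]
    return state

final = run("researcher", {"topic": "agent harnesses"})
print(final)
```

Once control flow lives in `EDGES` rather than in prompts, a failing run can be debugged by inspecting the state after each node.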
