Agent Harness Showdown: When Each Wins + the Risk

Decision: LangGraph when state durability is non-negotiable; CrewAI to have a working demo by Friday (use Flows, not raw crews in prod); Mastra for TypeScript-first backends; Strands/ADK if already AWS/GCP-native; OpenAI Agents SDK only when GPT-committed. The systemic catch: framework choice alone swings benchmark performance 30 points on identical models,^[4] and 95% of enterprise pilots never reach production regardless of which framework is picked.^[4]

When Each Wins

Framework	✓ Pick when	✗ Hard pass when
LangGraph ⭐ 33.7k [9]	Production durability required — durable state, human-approval checkpoints, audit trail for compliance. Regulated industries (finance, healthcare). Graph-topology workflows with conditional branching and retries. 96% error-recovery rate on multi-step tasks.^[2]	Simple linear single-agent loops with no branching — graph overhead is pure waste. Need a working prototype in hours, not days.^[1]
CrewAI ⭐ 52.7k [10]	Fastest path to working demo (2–4 hours). Workflows that map 1:1 to human specialist roles (researcher, writer, reviewer). First-class MCP integration. 450M monthly workflows in the wild.^[1]	Complex conditional branching and explicit state snapshots. Running without Flows in production — raw crews have determinism problems within months of going live.^[2]
OpenAI Agents SDK ⭐ 26.9k [11]	Linear triage/handoff chains (Triage → Billing → Tech Support). Sandboxed code execution without third-party container wiring. GPT-committed team wanting <100 lines to a working system.^[1]	Need model flexibility (Claude, Gemini, open-source). Long-horizon workflows requiring durable checkpoints. TypeScript teams — support is "planned" with no timeline.^[3]
Mastra ⭐ 24.7k [12]	TypeScript-first backend agents needing memory, evaluation, and HITL in one package. Multi-step workflows where some steps should be deterministic TypeScript and others LLM-driven. Production features (observability, evals) from day one.^[8]	Python-first teams. Thin chat layer or single model call — framework overhead is unwarranted.^[18]
Strands Agents ⭐ 6.0k [15]	AWS-native teams (used in production by Amazon Q Developer, AWS Glue). Dynamic, unpredictable domains where you can't enumerate all workflows upfront. Model-driven orchestration preferred over hardcoded graphs.^[6]	Compliance mandates requiring deterministic execution paths. Systems needing mandatory checkpoints and audit trails.^[6]
Google ADK ⭐ 20.0k [13]	GCP-native enterprise teams. Polyglot requirements (Python, TypeScript, Go, Java in one orchestration layer). A2A protocol needed for agent interoperability. One-command path to Vertex AI Agent Engine, Cloud Run, or GKE.^[17]	Off-GCP deployments where Google Cloud costs are a problem. MCP-native integrations needed — ADK uses adapters, not native MCP.^[3]
MS Agent Framework ⭐ 11.0k [14]	Azure/M365 ecosystem with Microsoft 365, Graph API, Entra, or Fabric integrations. Native Python + .NET interop in one orchestration layer. Multi-agent deliberation where quality beats latency.^[2]	Cost-sensitive applications — GroupChat generates 20+ LLM calls per interaction. Off-Azure footprint.^[3]

Risk Profiles

LangGraph

⭐ 33.7k

complexity Graph state design is irreversible — retrofitting adds costs proportional to the original build.^[4]
latency LangChain dependency pulls in a heavy abstraction stack, increasing bundle size and inference overhead.^[3]
ops Free tier caps at 100K node executions/month; self-hosting the platform requires an enterprise contract.^[5]
maintenance State schema migrations are painful as domain requirements evolve.^[2]

CrewAI

⭐ 52.7k

cost Up to 3× higher token consumption than LangGraph on simple single-tool workflows — a silent cost multiplier at scale.^[4]
debug Without Flows, control flow is encoded into agent prompts — debugging is prompt editing, not state inspection.^[2]
ops Delegation loops consume tokens with no built-in circuit-breaker; standard Python logging doesn't propagate cleanly inside Task callbacks.^[1]
price AMP Suite (enterprise governance) costs $120K/year; open-source lacks SOC 2 and RBAC needed by regulated buyers.^[5]
ops Enterprise deployments report 20-minute delays in "Pending Run" status at peak load.^[4]

OpenAI Agents SDK

⭐ 26.9k

lock-in Hosted tools (Threads, Vector Stores, Code Interpreter) store data on OpenAI's platform — migration is data migration, not just code rewrite.^[7]
durability No native state persistence — a mid-handoff crash loses all conversation context without custom checkpointing infrastructure.^[3]
platform Python-only as of June 2026; TypeScript support has no committed timeline.^[3]
ops GPT model deprecations force system rewrites; pricing changes are outside your control.^[2]

Mastra

⭐ 24.7k

ecosystem TypeScript-only — Python-first or data-science teams need a language boundary in their stack.^[8]
maturity Younger ecosystem — fewer copy-paste answers, more first-principles debugging.^[18]
deps Peer dependency conflicts with AI SDK versions have caused friction in upgrades.^[8]
cost Naive conversation memory grows linearly — context window overflows silently inflate token costs as sessions lengthen.^[18]

Strands Agents

⭐ 6.0k

determinism Non-deterministic by design — traditional unit tests are insufficient; continuous LLM-as-judge evaluation and statistical baselines are required.^[6]
architecture Orchestrator agent is a SPOF — one bad routing decision propagates through all downstream specialists with no graph-level guard.^[16]
maturity Launched May 2025; fewer battle-tested patterns than LangGraph or CrewAI at equivalent scale.^[6]

Google ADK

⭐ 20.0k

lock-in Apache 2.0 license obscures GCP pull — Gemini-optimized internals and experimental MCP adapter (not native) create friction when running off Google Cloud.^[3]
maturity Visual Builder is experimental; smaller community → fewer battle-tested multi-language patterns.^[5]
ops Go/Java agent support adds operational complexity that Python/TS-only shops often underestimate.^[17]

MS Agent Framework

⭐ 11.0k

cost GroupChat pattern generates 20+ LLM calls per interaction through accumulated history — cost balloons in any scenario with extended deliberation.^[3]
lock-in Distributed actor model uses Azure Cosmos DB for durability — cross-cloud portability requires rebuilding the state layer.^[2]
migration Migration from AutoGen/Semantic Kernel creates technical debt; AutoGen entered maintenance mode October 2025.^[5]

Universal Risks (Framework-Agnostic)

The 30-point orchestration gap. Framework choice alone moves benchmark performance by up to 30 points on identical models — Claude Opus 4 scored 64.9% vs 57.6% on GAIA across different orchestration scaffolds (Princeton HAL). A generational model upgrade rarely delivers this spread.^[4]
Lab-to-production collapse. The CLEAR framework documents a 37% average gap between lab benchmark scores and production deployment performance. Public leaderboards hide this failure mode entirely — a framework that tops your benchmark suite may behave very differently once real users drive it.^[4]
The 5% survival rate. MIT analysis of 300+ enterprise implementations found only ~5% move from pilot to production. Root causes: absent observability, missing HITL primitives, no cost discipline from day one — not framework choice itself. Any framework picked without these foundations will fail.^[4]
State management lock-in. The hardest lock-in to escape is not code — it's state history. An enterprise that builds six months of customer interaction history on a platform-native stateful runtime cannot migrate to a different model provider without rebuilding the memory layer from scratch.^[7]
Token cost multiplication. LLM API calls are 40–60% of total agent operating costs. A framework adding 40% token overhead doesn't increase costs by 40% — it nearly doubles the largest line item. CrewAI's role-playing overhead (3× vs LangGraph on simple workflows) is the most common surprise in first production invoices.^[4]

Migration Patterns

Path	Trigger	Effort	Key step
CrewAI → LangGraph	Control-flow ceiling hit in prod; token costs spiking; debugging via prompt editing	1–2 weeks (moderate complexity)	Map each crew to a node, convert process type to explicit graph edges, move shared context to LangGraph's typed state object^[1]
OpenAI SDK → LangGraph	Model flexibility needed; long-horizon state; handoff chain becomes a branching graph	1–3 weeks depending on handoff depth	Replace handoff objects with graph edges; add checkpoint backend (Postgres/Redis); extract OpenAI-specific tool schemas to model-agnostic format^[2]
AutoGen → MS Agent Framework	AutoGen maintenance-mode announcement (Oct 2025); Azure ecosystem investment	Weeks to months (actor model rewrite)	Microsoft provides a migration guide; GroupChat patterns map to distributed actor conversations; existing Semantic Kernel skills plug in directly^[5]
Any → Strands/ADK	AWS or GCP commitment; existing framework too rigid for dynamic requirements	Low if on respective cloud; high if off-cloud	Port tools as typed functions; replace graph edges with model-driven routing; plan for eval infrastructure from day one^[6]