Decision: LangGraph when state durability is non-negotiable; CrewAI when you need a working demo by Friday (use Flows, not raw crews, in production); Mastra for TypeScript-first backends; Strands or ADK if you are already AWS- or GCP-native; OpenAI Agents SDK only when committed to GPT models. The systemic catch: framework choice alone can swing benchmark performance by up to 30 points on identical models,[4] and roughly 95% of enterprise pilots never reach production regardless of which framework is picked.[4]
When Each Wins
| Framework | ✓ Pick when | ✗ Hard pass when |
|---|---|---|
| LangGraph ⭐ 33.7k [9] | Production durability required — durable state, human-approval checkpoints, audit trail for compliance. Regulated industries (finance, healthcare). Graph-topology workflows with conditional branching and retries. 96% error-recovery rate on multi-step tasks.[2] | Simple linear single-agent loops with no branching — graph overhead is pure waste. Need a working prototype in hours, not days.[1] |
| CrewAI ⭐ 52.7k [10] | Fastest path to working demo (2–4 hours). Workflows that map 1:1 to human specialist roles (researcher, writer, reviewer). First-class MCP integration. 450M monthly workflows in the wild.[1] | Complex conditional branching and explicit state snapshots. Running without Flows in production — raw crews have determinism problems within months of going live.[2] |
| OpenAI Agents SDK ⭐ 26.9k [11] | Linear triage/handoff chains (Triage → Billing → Tech Support). Sandboxed code execution without third-party container wiring. GPT-committed team wanting <100 lines to a working system.[1] | Need model flexibility (Claude, Gemini, open-source). Long-horizon workflows requiring durable checkpoints. TypeScript teams — support is "planned" with no timeline.[3] |
| Mastra ⭐ 24.7k [12] | TypeScript-first backend agents needing memory, evaluation, and HITL in one package. Multi-step workflows where some steps should be deterministic TypeScript and others LLM-driven. Production features (observability, evals) from day one.[8] | Python-first teams. Thin chat layer or single model call — framework overhead is unwarranted.[18] |
| Strands Agents ⭐ 6.0k [15] | AWS-native teams (used in production by Amazon Q Developer, AWS Glue). Dynamic, unpredictable domains where you can't enumerate all workflows upfront. Model-driven orchestration preferred over hardcoded graphs.[6] | Compliance mandates requiring deterministic execution paths. Systems needing mandatory checkpoints and audit trails.[6] |
| Google ADK ⭐ 20.0k [13] | GCP-native enterprise teams. Polyglot requirements (Python, TypeScript, Go, Java in one orchestration layer). A2A protocol needed for agent interoperability. One-command path to Vertex AI Agent Engine, Cloud Run, or GKE.[17] | Off-GCP deployments where Google Cloud costs are a problem. MCP-native integrations needed — ADK uses adapters, not native MCP.[3] |
| MS Agent Framework ⭐ 11.0k [14] | Azure/M365 ecosystem with Microsoft 365, Graph API, Entra, or Fabric integrations. Native Python + .NET interop in one orchestration layer. Multi-agent deliberation where quality beats latency.[2] | Cost-sensitive applications — GroupChat generates 20+ LLM calls per interaction. Off-Azure footprint.[3] |
Risk Profiles
LangGraph
⭐ 33.7k
- complexity: Graph state design is irreversible; retrofitting adds costs proportional to the original build.[4]
- latency: LangChain dependency pulls in a heavy abstraction stack, increasing bundle size and inference overhead.[3]
- ops: Free tier caps at 100K node executions/month; self-hosting the platform requires an enterprise contract.[5]
- maintenance: State schema migrations are painful as domain requirements evolve.[2]
CrewAI
⭐ 52.7k
- cost: Up to 3× the token consumption of LangGraph on simple single-tool workflows, a silent cost multiplier at scale.[4]
- debug: Without Flows, control flow is encoded into agent prompts, so debugging means editing prompts rather than inspecting state.[2]
- ops: Delegation loops consume tokens with no built-in circuit-breaker; standard Python logging doesn't propagate cleanly inside Task callbacks.[1]
- price: AMP Suite (enterprise governance) costs $120K/year; the open-source edition lacks the SOC 2 and RBAC that regulated buyers need.[5]
- ops: Enterprise deployments report 20-minute delays in "Pending Run" status at peak load.[4]
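
The missing circuit-breaker for delegation loops can be approximated outside the framework. The sketch below is framework-agnostic and does not use any CrewAI API: `CircuitBreaker`, `BudgetExceeded`, `run_with_breaker`, and the default limits are all hypothetical names and values chosen for illustration. It assumes you can observe a token count for each delegation step.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a run exceeds its delegation or token budget."""


@dataclass
class CircuitBreaker:
    # Hypothetical limits; tune them to your own workloads.
    max_delegations: int = 10
    max_tokens: int = 50_000
    delegations: int = 0
    tokens: int = 0

    def record(self, tokens_used: int) -> None:
        """Call once per delegation step; raises when either budget is exceeded."""
        self.delegations += 1
        self.tokens += tokens_used
        if self.delegations > self.max_delegations:
            raise BudgetExceeded(f"delegation limit {self.max_delegations} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} exceeded")


def run_with_breaker(steps, breaker: CircuitBreaker) -> int:
    """Consume an iterable of per-step token counts; return the steps completed."""
    completed = 0
    try:
        for tokens_used in steps:
            breaker.record(tokens_used)
            completed += 1
    except BudgetExceeded:
        pass  # In a real system: log, alert, and return a partial result.
    return completed
```

Counting both delegations and tokens matters because a loop of cheap steps trips the first limit while a handful of very large steps trips the second.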
OpenAI Agents SDK
⭐ 26.9k
- lock-in: Hosted tools (Threads, Vector Stores, Code Interpreter) store data on OpenAI's platform, so migrating means moving data, not just rewriting code.[7]
- durability: No native state persistence: a mid-handoff crash loses all conversation context unless you build custom checkpointing infrastructure.[3]
- platform: Python-only as of June 2026; TypeScript support has no committed timeline.[3]
- ops: GPT model deprecations force system rewrites; pricing changes are outside your control.[2]
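
To make "custom checkpointing infrastructure" concrete, here is a minimal sketch using only Python's standard library. It is not part of the OpenAI Agents SDK: the table layout and the `open_store`, `save_checkpoint`, and `load_checkpoint` helpers are hypothetical, and it assumes the state you need to survive a crash is JSON-serializable.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a checkpoint store; use a file path for real durability."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "run_id TEXT PRIMARY KEY, step INTEGER, state TEXT)"
    )
    return conn


def save_checkpoint(conn: sqlite3.Connection, run_id: str, step: int, state: dict) -> None:
    """Persist the latest state for a run; older checkpoints are overwritten."""
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints (run_id, step, state) VALUES (?, ?, ?)",
        (run_id, step, json.dumps(state)),
    )
    conn.commit()


def load_checkpoint(conn: sqlite3.Connection, run_id: str):
    """Return (step, state) for a run, or None if nothing was saved."""
    row = conn.execute(
        "SELECT step, state FROM checkpoints WHERE run_id = ?", (run_id,)
    ).fetchone()
    if row is None:
        return None
    return row[0], json.loads(row[1])
```

Saving after each handoff and calling `load_checkpoint` on restart would let a run resume from the last completed step instead of starting over.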
Mastra
⭐ 24.7k
- ecosystem: TypeScript-only; Python-first or data-science teams must introduce a language boundary in their stack.[8]
- maturity: Younger ecosystem: fewer copy-paste answers, more first-principles debugging.[18]
- deps: Peer dependency conflicts with AI SDK versions have caused friction during upgrades.[8]
- cost: Naive conversation memory grows linearly; context-window overflows silently inflate token costs as sessions lengthen.[18]
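
One common mitigation for linearly growing memory is to trim history to a token budget before each model call. The sketch below is language-independent in spirit and written in Python for consistency with the other examples here; it does not use Mastra's API. `estimate_tokens` (a crude four-characters-per-token guess) and `trim_history` are hypothetical helpers, and the message shape is an assumption.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token. Replace with a real tokenizer.
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep a leading system message plus the newest messages that fit the budget."""
    if not messages:
        return []
    head = messages[:1] if messages[0]["role"] == "system" else []
    tail = messages[len(head):]
    used = sum(estimate_tokens(m["content"]) for m in head)
    kept = []
    # Walk backwards so the most recent turns are preserved first.
    for message in reversed(tail):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return head + kept[::-1]
```

A sliding window like this caps per-call cost at the price of forgetting older turns; summarizing dropped turns is a natural extension when long-range recall matters.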
Strands Agents
⭐ 6.0k
- determinism: Non-deterministic by design; traditional unit tests are insufficient, so continuous LLM-as-judge evaluation and statistical baselines are required.[6]
- architecture: The orchestrator agent is a single point of failure (SPOF): one bad routing decision propagates through all downstream specialists with no graph-level guard.[16]
- maturity: Launched May 2025; fewer battle-tested patterns than LangGraph or CrewAI at equivalent scale.[6]
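
A statistical baseline for a non-deterministic agent can be as simple as a pass rate with a confidence interval. The sketch below assumes each trial has already been graded pass/fail (for instance by an LLM judge, which is outside this example) and uses the Wilson score interval; `pass_rate_interval` and `regressed` are hypothetical helper names, not part of Strands.

```python
import math


def pass_rate_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


def regressed(passes: int, trials: int, baseline: float) -> bool:
    """Flag a regression only when the whole interval sits below the baseline."""
    _, upper = pass_rate_interval(passes, trials)
    return upper < baseline
```

Comparing the interval rather than the raw rate avoids failing a build because of ordinary run-to-run variance.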
Google ADK
⭐ 20.0k
- lock-in: The Apache 2.0 license obscures the pull toward GCP: Gemini-optimized internals and an experimental (non-native) MCP adapter create friction when running off Google Cloud.[3]
- maturity: Visual Builder is experimental; a smaller community means fewer battle-tested multi-language patterns.[5]
- ops: Go/Java agent support adds operational complexity that Python/TS-only shops often underestimate.[17]
MS Agent Framework
⭐ 11.0k
- cost: The GroupChat pattern generates 20+ LLM calls per interaction through accumulated history; cost balloons in any scenario with extended deliberation.[3]
- lock-in: The distributed actor model uses Azure Cosmos DB for durability, so cross-cloud portability requires rebuilding the state layer.[2]
- migration: Migration from AutoGen/Semantic Kernel creates technical debt; AutoGen entered maintenance mode October 2025.[5]
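
Why accumulated history makes group-chat cost balloon can be seen with a small model. This is an illustrative assumption, not a measurement of the MS Agent Framework: it supposes every call re-sends a fixed prompt plus all messages produced so far, and the function name and token sizes are hypothetical.

```python
def groupchat_input_tokens(calls: int, prompt_tokens: int, message_tokens: int) -> int:
    """Total input tokens when every call re-sends the prompt plus all prior messages."""
    total = 0
    for k in range(calls):
        # Call k (0-based) sees the prompt and the k messages produced before it.
        total += prompt_tokens + k * message_tokens
    return total


# Hypothetical sizes: a 500-token prompt and 200-token messages.
single = groupchat_input_tokens(1, 500, 200)    # 500
twenty = groupchat_input_tokens(20, 500, 200)   # 48000
```

Under these assumptions, 20 calls cost 96 times the input tokens of a single call, because the history term grows quadratically with the number of turns.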
Universal Risks (Framework-Agnostic)
- The 30-point orchestration gap. Framework choice alone moves benchmark performance by up to 30 points on identical models. In one example, Claude Opus 4 scored 64.9% vs 57.6% on GAIA across different orchestration scaffolds (Princeton HAL), a 7.3-point spread from scaffolding alone. A generational model upgrade rarely delivers a swing as large as 30 points.[4]
- Lab-to-production collapse. The CLEAR framework documents a 37% average gap between lab benchmark scores and production deployment performance. Public leaderboards hide this failure mode entirely — a framework that tops your benchmark suite may behave very differently once real users drive it.[4]
- The 5% survival rate. MIT analysis of 300+ enterprise implementations found only ~5% move from pilot to production. Root causes: absent observability, missing HITL primitives, no cost discipline from day one — not framework choice itself. Any framework picked without these foundations will fail.[4]
- State management lock-in. The hardest lock-in to escape is not code — it's state history. An enterprise that builds six months of customer interaction history on a platform-native stateful runtime cannot migrate to a different model provider without rebuilding the memory layer from scratch.[7]
- Token cost multiplication. LLM API calls are 40–60% of total agent operating costs. A framework adding 40% token overhead raises that largest line item by 40%, which lifts total operating costs by roughly 16–24%. At CrewAI's reported 3× role-playing overhead vs LangGraph on simple workflows, the LLM line item triples and total costs roughly double, the most common surprise in first production invoices.[4]
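
The interaction between the LLM share of spend and a framework's token overhead is easy to check with a short calculation. The sketch assumes only the LLM line item scales with token usage and everything else stays fixed; `total_cost_multiplier` is a hypothetical helper.

```python
def total_cost_multiplier(llm_share: float, token_multiplier: float) -> float:
    """Total cost relative to baseline when only the LLM line item scales."""
    if not 0.0 <= llm_share <= 1.0:
        raise ValueError("llm_share must be between 0 and 1")
    return (1.0 - llm_share) + llm_share * token_multiplier


# 40% token overhead at a 40-60% LLM share: about 1.16x to 1.24x total cost.
low = total_cost_multiplier(0.4, 1.4)
high = total_cost_multiplier(0.6, 1.4)

# 3x token consumption at the same shares: about 1.8x to 2.2x total cost.
low_3x = total_cost_multiplier(0.4, 3.0)
high_3x = total_cost_multiplier(0.6, 3.0)
```

The same function can be fed your own invoice split to estimate how much headroom a framework's overhead leaves.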
Migration Patterns
| Path | Trigger | Effort | Key step |
|---|---|---|---|
| CrewAI → LangGraph | Control-flow ceiling hit in prod; token costs spiking; debugging via prompt editing | 1–2 weeks (moderate complexity) | Map each crew to a node, convert process type to explicit graph edges, move shared context to LangGraph's typed state object[1] |
| OpenAI SDK → LangGraph | Model flexibility needed; long-horizon state; handoff chain becomes a branching graph | 1–3 weeks depending on handoff depth | Replace handoff objects with graph edges; add checkpoint backend (Postgres/Redis); extract OpenAI-specific tool schemas to model-agnostic format[2] |
| AutoGen → MS Agent Framework | AutoGen maintenance-mode announcement (Oct 2025); Azure ecosystem investment | Weeks to months (actor model rewrite) | Microsoft provides a migration guide; GroupChat patterns map to distributed actor conversations; existing Semantic Kernel skills plug in directly[5] |
| Any → Strands/ADK | AWS or GCP commitment; existing framework too rigid for dynamic requirements | Low if on respective cloud; high if off-cloud | Port tools as typed functions; replace graph edges with model-driven routing; plan for eval infrastructure from day one[6] |
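
The key step in the CrewAI → LangGraph row (crews become nodes, the process type becomes explicit edges, shared context becomes typed state) can be illustrated without either framework. The sketch below is plain Python and deliberately does not use LangGraph's or CrewAI's APIs; `PipelineState`, the node functions, `NODES`, `EDGES`, and `run` are hypothetical stand-ins for the shape of the result.

```python
from typing import Callable, Optional, TypedDict


class PipelineState(TypedDict):
    # Shared context that used to live implicitly between crew members.
    topic: str
    notes: str
    draft: str


def researcher(state: PipelineState) -> PipelineState:
    # Stand-in for a former crew; a real node would call an LLM here.
    return {**state, "notes": f"notes on {state['topic']}"}


def writer(state: PipelineState) -> PipelineState:
    return {**state, "draft": f"draft from {state['notes']}"}


NODES: dict[str, Callable[[PipelineState], PipelineState]] = {
    "researcher": researcher,
    "writer": writer,
}
# A sequential process becomes explicit edges; None marks the end of the graph.
EDGES: dict[str, Optional[str]] = {"researcher": "writer", "writer": None}


def run(start: str, state: PipelineState) -> PipelineState:
    """Walk the graph from `start`, passing the typed state through each node."""
    current: Optional[str] = start
    while current is not None:
        state = NODES[current](state)
        current = EDGES[current]
    return state
```

Once control flow lives in `EDGES` rather than in prompts, a failed run can be debugged by inspecting the state after each node.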