Survey · 23 cites · expedition · April 2026

Deep-research agents in 2026 — what to steal for Scout.

"Deep research" is now table stakes across ChatGPT, Claude, Gemini, Perplexity and Grok[1]. Differentiation is speed, depth and citation accuracy — not whether the feature exists. Scout's interesting comparison class is OSS agents and Claude-Code-native skills.

FILED 2026-04-21 · FORMAT MD · DEPTH EXPEDITION · 23 CITATIONS · 15 TOOLS

TL;DR

Closed commercial deep-research products (OpenAI, Perplexity, Gemini, Claude, Grok) are poor fits for a Scout-shaped tool — no on-disk artifact, no steering hints. The interesting comparison class is OSS agents and Claude-Code-native skills. Top targets to port from: 199-biotech's claude-deep-research-skill (disk-persisted citations, multi-persona critique), GPT-Researcher (planner / executor / publisher split), open_deep_research (MCP-native backend), STORM (perspective-guided Q&A).

199-biotech / claude-deep-research-skill

github.com/199-biotechnologies
⭐ 509
Claude Skill MIT
8-phase pipeline: scope → plan → retrieve → triangulate → outline → synthesize → critique/refine → package. Disk-persisted citations survive context compaction.[4]
Depth mechanism
8 phases · critique loop-back · auto-continue past 18k words
Citation rigor
Disk-persisted · DOI/URL hallucination check
VERY HIGH Same substrate as Scout — directly portable.
Steal: Disk-persisted citations + multi-persona critique + validate → fix → retry (max 3 cycles).
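
A minimal sketch of that bounded validate → fix → retry loop, assuming citations live in a small ledger; `Citation`, `validate`, and the `fix` callback are illustrative names, not the skill's actual API:

```python
import re
from dataclasses import dataclass

@dataclass
class Citation:
    claim: str
    url: str
    doi: str | None = None

DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

def validate(c: Citation) -> list[str]:
    """Cheap structural checks; a real pass would also fetch the URL."""
    problems = []
    if not c.url.startswith(("http://", "https://")):
        problems.append("malformed url")
    if c.doi and not DOI_RE.match(c.doi):
        problems.append("malformed doi")
    return problems

def validate_fix_retry(citations: list[Citation], fix, max_cycles: int = 3) -> list[Citation]:
    for _ in range(max_cycles):
        broken = [c for c in citations if validate(c)]
        if not broken:
            return citations
        for c in broken:
            fix(c, validate(c))  # e.g. re-search the claim, rewrite the url/doi
    # past max_cycles, drop anything still failing rather than ship a bad cite
    return [c for c in citations if not validate(c)]
```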

GPT-Researcher

github.com/assafelovic
⭐ 26.6k
OSS Apache-2.0
The closest conceptual sibling to Scout. Three-role split: planner / executors / publisher. 2026 additions: tree-shaped Deep Research mode, ~5 min runs at ~$0.40 on o3-mini.[2]
Depth mechanism
Planner+executor+publisher · tree DR mode · ~5 min
Citation rigor
Inline · 20+ sources per run
HIGH Closest analogue — production-ready since May 2023.
Steal: Three-role separation. Final writer sees only the evidence bundle — not the noisy search trajectory.

open_deep_research

github.com/langchain-ai
⭐ 11.2k
OSS MIT
MCP-native, model-agnostic via init_chat_model(). LangGraph-based. #6 on Deep Research Bench with 0.4943 RACE on GPT-5.[3]
Depth mechanism
LangGraph · any MCP server pluggable
Bench score
RACE 0.4943 · #6 on Deep Research Bench
HIGH Most "configurable" of the OSS options — closest to a reference impl.
Steal: MCP-server-as-tool. Swap Tavily / SearXNG / Exa as config, not code.

Weizhena / Deep-Research-skills

github.com/Weizhena
⭐ 483
Claude Skill MIT
Two phases: outline generation (user can expand it), then deep investigation per item in parallel. HITL checkpoints — approve outline before spending tokens.[18]
Depth mechanism
Outline → parallel deep investigations
HITL
Approve outline before token spend
HIGH HITL pattern translates to Scout's "self-review the outline first."
Steal: Approve-outline-before-investigation. Scout has expedition plans — tighten the gate.
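
A sketch of the approve-before-spend gate, assuming Scout exposes separate plan and investigate steps; all four callables are hypothetical stand-ins:

```python
def run(topic, generate_outline, revise_outline, investigate, review):
    outline = generate_outline(topic)   # cheap: one planning call
    feedback = review(outline)          # human approval, or Scout's self-review
    while feedback is not None:         # loop until the reviewer signs off
        outline = revise_outline(outline, feedback)
        feedback = review(outline)
    # the token-heavy parallel phase starts only once the outline is approved
    return [investigate(item) for item in outline]
```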

Claude Managed Agents

platform.claude.com · 2026-04-01
BETA
API commercial
Hosted agent harness behind the managed-agents-2026-04-01 header — sandbox, Bash, file ops, web search/fetch, MCP servers. The Environment / Session / Events model maps cleanly onto "one research run = one Session."[12]
Depth mechanism
Harness-defined · Environment/Session/Events
Citation rigor
Depends on system prompt
HIGH Hosting target if Scout outgrows GitHub Actions.
Steal: Migration path — Environment / Session / Events maps 1:1 onto research-per-run.

STORM / Co-STORM

github.com/stanford-oval
⭐ 28.1k
OSS MIT
Different shape: simulates conversations between writer personas with different perspectives and a topic-expert LLM grounded in web sources, then builds the outline from the transcript. +10% absolute coverage and +25% organization vs outline-then-RAG.[14]
Depth mechanism
Persona-guided Q&A · Co-STORM HITL turns
Coverage gain
+10% absolute · +25% organization
MEDIUM Long-form only · authors warn: not publication-ready.
Steal: Persona-guided Q&A — basis for a future scout-researcher-perspectives specialist.

Perplexity Sonar Deep Research

perplexity.ai · API
94.3% cite
SaaS · API commercial
The only one of the "big five" with a production API. Fastest commercial option (2–3 min runs). ~$0.41 per typical query.[8]
Pricing
$2/$8 per M tok + $5/1k searches
Citation rigor
94.3% Sonar Pro vs ~87% GPT-5.2 DR[6]
MEDIUM API-driven — but opaque. No on-disk artifact.
Steal: Optional drop-in; delegate the search+synth step from Scout if cost permits.

local-deep-researcher

github.com/langchain-ai
offline
OSS MIT
Fully local: any Ollama- or LMStudio-hosted model, SearXNG search, nothing leaves the machine. Loop: query → search → summarize → reflect for gaps → next query, for N cycles.[17]
Depth mechanism
Reflect-and-requery loop · user-set cycles
Citation rigor
Inline markdown sources
MEDIUM Reference for Scout's offline mode if that's ever on the table.
Steal: Reflect-and-requery loop. Cleaner than today's ad-hoc "did I miss anything" check.

smolagents · Open Deep Research

github.com/huggingface
⭐ 26.8k
OSS Apache-2.0
Agents emit Python code instead of JSON tool calls — ~30% fewer steps. 55.15% on GAIA vs OpenAI DR's 67.36%. Known context-window blow-ups; demo unstable.[16]
Depth mechanism
Code-agent · multimodal state handling
GAIA score
55.15% · −12pt vs OpenAI DR
LOW Proof-of-concept. Production gap is real — browser tooling and vision.
Steal: Code-emission idea, not the agent itself. Less JSON ceremony.
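
The pattern in miniature (not smolagents' actual implementation): expose tools as plain Python functions and execute the model's emitted code, so one step can chain calls that JSON tool-calling would split across turns. `web_search` here is a stub:

```python
def web_search(q: str) -> list[str]:
    return [f"result for {q!r}"]    # stub; a real tool hits a search backend

def run_step(emitted_code: str):
    namespace = {"web_search": web_search, "results": None}
    exec(emitted_code, namespace)   # sandbox this in anything real
    return namespace["results"]

# one emitted step, two chained searches, no JSON round-trip between them
step = """
results = [r for q in ("deep research agents", "citation accuracy")
           for r in web_search(q)][:10]
"""
print(run_step(step))
```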

OpenAI Deep Research

openai.com · ChatGPT Pro
26.6% HLE
SaaS commercial
o3-tuned agent. Longest runs (15–25 min) and the most essay-like reports. 26.6% on Humanity's Last Exam at launch — frontier in its cohort.[7]
Run length
15–25 min[6]
Citation rigor
87% cited accuracy
LOW Closed · no on-disk artifact · no steering hints.
Steal: Nothing portable. Reference frontier numbers only.

Claude Research / Advanced Research

anthropic.com · Claude Pro/Max
45 min
SaaS commercial
"Advanced Research" runs up to 45 min across hundreds of sources autonomously. Anthropic renamed the Claude Code SDK to Claude Agent SDK — internal use was dominated by research, video, note-taking, not just coding.[11][21]
Run length
Up to 45 min · hundreds of sources
Citation rigor
Closed metric
LOW Not programmable as artifact. But: Claude-Code-as-research-platform is the sanctioned pattern.
Steal: Validation that Scout's substrate choice is right. No code to lift.

Gemini Deep Research

google.com · AI Pro
$19.99/mo
SaaS commercial
Differentiates on Workspace: pulls from Gmail, Drive, and the public web simultaneously and drops multi-page reports back into Docs.[10]
Pricing
Google AI Pro $19.99/mo[9]
Differentiator
Gmail/Drive/Docs round-trip
LOW Workspace-locked. Useless for a Scout-style standalone artifact.
Steal: Nothing portable.

Grok DeepSearch

x.ai · X Premium
X-native
SaaS commercial
Real-time X / web synthesis. Direct X-timeline access is the one thing competitors can't match — only interesting when the topic is breaking news or social sentiment.[13]
Differentiator
Real-time X/Twitter timeline access
API
None relevant to Scout
LOW X-centric · breaking-news only.
Steal: Nothing portable.

Elicit

elicit.com · academic
99.4% extr.
SaaS commercial
Best-in-class for academic literature. 138M papers + 545K clinical trials. 80% time saved on abstract screening, with quote-level rationale per decision. Systematic Review reports cap at 80 papers.[19]
Coverage
PubMed · ClinicalTrials.gov · 138M papers
Cap
SR reports max 80 papers
LOW Academic-literature-only — not Scout's brief.
Steal: Per-claim rationale + source quote. Graded quality scoring instead of binary in/out.

FutureHouse · Crow / Falcon / Owl / Phoenix

futurehouse.org · science
science
SaaS · API commercial
Scientific-discovery platform built on Claude.[23] Task-specialized: Crow extracts genes/markers, Falcon does background research, Owl checks whether a hypothesis was already investigated, Phoenix designs chemistry.[20]
Pattern
One agent per epistemic move
Caveat
Phoenix not as deeply benchmarked
LOW Science-only. Pattern is transferable; product isn't.
Steal: One agent per epistemic move with named roles — Scout already does this implicitly via Explore sub-agents.

// BENCHMARKS — what "good" means in 2026

DR Bench · top score
0.5613 RACE
Cellcog Max (proprietary, Mar 2026)[3]
DR Bench · top OSS
0.5492 RACE
TrajectoryKit on GPT-OSS / GPT-5.4 (MIT)[3]
GAIA · OpenAI DR
67.36%
Frontier · vs smolagents 55.15%[16]
HLE · OpenAI DR launch
26.6%
Highest in cohort at launch[7]

// PRODUCTION-READY vs EXPERIMENTAL

Production-ready today

  • OpenAI Deep Research, Perplexity Sonar DR, Gemini DR, Claude Research, Grok DeepSearch[1]
  • GPT-Researcher[2], open_deep_research[3], local-deep-researcher[17]
  • Elicit Systematic Review[19], FutureHouse Crow / Falcon / Owl[20]
  • 199-biotech skill[4], Weizhena skill[18]

Experimental / proof-of-concept

  • smolagents Open Deep Research — context-window blow-ups, unstable demo[16]
  • STORM — "cannot produce publication-ready articles" per authors[5]
  • FutureHouse Phoenix — "not as deeply benchmarked, may make more mistakes"[20]
  • Claude Managed Agents — public beta, April 2026[12]

// IDEAS TO STEAL — RANKED

Cross-cutting takeaways across all 15 tools surveyed
FROM · 199-biotech

Disk-persisted citations

Survive context compaction. Single biggest fragility in a long Scout run — fix it once, lift the ceiling on every research depth.[4]
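
A sketch of the ledger itself, assuming one append-only JSONL file per run; the path and record shape are illustrative, not the skill's actual format:

```python
import json
from pathlib import Path

LEDGER = Path("run/citations.jsonl")

def record(claim: str, url: str, quote: str) -> None:
    LEDGER.parent.mkdir(parents=True, exist_ok=True)
    with LEDGER.open("a") as f:   # append-only, so it survives crashes too
        f.write(json.dumps({"claim": claim, "url": url, "quote": quote}) + "\n")

def reload() -> list[dict]:
    # after compaction, re-read the ledger instead of trusting whatever
    # citations survived in the truncated transcript
    if not LEDGER.exists():
        return []
    return [json.loads(line) for line in LEDGER.read_text().splitlines() if line]
```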

FROM · local-deep-researcher

Reflect-and-requery

After initial synthesis, explicitly list knowledge gaps and fire a second round of searches at them. Cleaner than today's ad-hoc check.[17]
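
As a sketch, with `search`, `summarize`, and `list_gaps` standing in for LLM-backed calls Scout already has equivalents of:

```python
def research(topic: str, search, summarize, list_gaps, cycles: int = 3) -> str:
    summary, query = "", topic
    for _ in range(cycles):
        summary = summarize(summary, search(query))  # fold new evidence in
        gaps = list_gaps(topic, summary)             # explicit gap list, not vibes
        if not gaps:
            break
        query = gaps[0]                              # aim the next round at a gap
    return summary
```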

FROM · 199-biotech

Multi-persona critique

Skeptical Practitioner / Adversarial Reviewer / Implementation Engineer pass before final write. Pairs with the existing "every claim has ≥1 URL" check.[4]
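
A sketch of the pass; the persona names follow the skill's framing, while `critique` is a hypothetical one-LLM-call helper that returns a list of issues:

```python
PERSONAS = {
    "Skeptical Practitioner": "Would this survive contact with a real workload?",
    "Adversarial Reviewer": "Which claims overreach their source or lack a URL?",
    "Implementation Engineer": "What is missing to actually build this?",
}

def critique_pass(draft: str, critique) -> list[str]:
    issues: list[str] = []
    for persona, stance in PERSONAS.items():
        issues += critique(draft, persona=persona, stance=stance)
    return issues   # feed into a revise step before the final write
```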

FROM · STORM

Perspective-guided outline

Before planning sub-questions, enumerate stakeholder perspectives implied by the topic and ensure the outline covers each.[14]
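
A sketch of the coverage check, loosely after STORM; `enumerate_perspectives` and `covers` stand in for LLM calls:

```python
def perspective_gaps(topic, outline, enumerate_perspectives, covers):
    perspectives = enumerate_perspectives(topic)   # e.g. user, operator, auditor
    return [p for p in perspectives
            if not any(covers(section, p) for section in outline)]

# a non-empty result means: extend the outline before deep investigation starts
```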

FROM · Elicit

Per-source credibility tag

Not just "cite the URL" but a one-token tag — official-docs / peer-reviewed / vendor-blog / forum-consensus. Formalize the existing labels.[19]

FROM · GPT-Researcher

Publisher role split

Final writer sees the evidence bundle, not the search trajectory. In Scout terms: synthesize into a citation ledger first, write the markdown second.[2]
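
In sketch form, reusing the JSONL ledger shape above; `write_report` stands in for Scout's final synthesis call:

```python
def publish(ledger: list[dict], write_report) -> str:
    evidence = [{"claim": e["claim"], "url": e["url"], "quote": e["quote"]}
                for e in ledger]   # the distilled evidence bundle, nothing else
    # the noisy trajectory (queries, dead ends, tool errors) never reaches
    # the writer's context, so it cannot leak into the prose
    return write_report(evidence)
```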

FROM · 199-biotech

Auto-continuation past 18k words

Recursive sub-agents keep going past Scout's expedition-mode context wall. Lifts the depth ceiling.[4]
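
A sketch, assuming Scout gains a `spawn_subagent` primitive; each round gets a fresh context holding only the outline and the tail of the draft:

```python
def write_long(outline, spawn_subagent, max_rounds: int = 8) -> str:
    draft = ""
    for _ in range(max_rounds):                 # hard stop, belt and braces
        tail = " ".join(draft.split()[-500:])   # only the tail crosses over
        chunk = spawn_subagent(outline, tail)   # fresh context each round
        draft += chunk
        if chunk.endswith("<END>"):             # the writer signals completion
            return draft.removesuffix("<END>")
    return draft
```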

FROM · 199-biotech

Fetch today's date first

Trivial. Prevents stale-year queries when the runtime clock disagrees with training data.[4]
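
The whole trick, as a sketch: resolve the date at runtime and pin it in the system prompt:

```python
from datetime import datetime, timezone

today = datetime.now(timezone.utc).date().isoformat()
system_prompt = f"Today's date is {today}. Scope recency-sensitive queries to it."
```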

FROM · GPT-Researcher + open_deep_research

MCP-first search backend

Both standardized on MCP as the pluggable search interface. The choice of Tavily / SearXNG / Exa becomes config, not code.[2][3]
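
A sketch of backend-as-config; the registry shape is illustrative (not open_deep_research's), and the launch commands are placeholders for whichever MCP search servers actually get deployed:

```python
import os

SEARCH_BACKENDS = {               # name -> MCP server launch command
    "tavily":  ["npx", "-y", "tavily-mcp"],       # placeholder commands;
    "searxng": ["uvx", "mcp-searxng"],            # pin the real package names
    "exa":     ["npx", "-y", "exa-mcp-server"],   # in deployment config
}

def backend_command() -> list[str]:
    # swapping Tavily / SearXNG / Exa is an env-var change, not a code change
    return SEARCH_BACKENDS[os.environ.get("SCOUT_SEARCH_BACKEND", "tavily")]
```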

FROM · Claude Managed Agents

Future hosting target

If Scout outgrows GitHub Actions, Environment / Session / Events maps 1:1 onto research-per-run.[12]
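
A sketch of the Scout-side mapping only; these dataclasses are stand-ins for how a run could be modelled, not the Managed Agents API:

```python
from dataclasses import dataclass, field

@dataclass
class Environment:       # tool surface: sandbox, search, file ops
    tools: list[str]

@dataclass
class Session:           # one research run, start to packaged report
    topic: str
    env: Environment
    events: list[dict] = field(default_factory=list)   # searches, writes, critiques

def log(session: Session, kind: str, **payload) -> None:
    session.events.append({"kind": kind, **payload})    # replayable audit trail
```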