Update — 2026-04-29. Eight days later, most of the gaps this survey named are closed in Scout ⭐ 0 (Apr 2026) [24]: disk-persisted `citations.jsonl`, reflect-and-requery, perspective-guided outline (`deep` tier’s “breadth heuristic”), per-source `source_type` taxonomy, GitHub-stars on every link, planner/researcher/writer split for `deep`, post-write reviewer, date-injection before querying, format auto-selection, topic sharpening, multi-angle decomposition, and bespoke per-page HTML “views” via the `scout-view-author` skill [24][25]. A real head-to-head also ran on issue #7 [26] — Scout vs 199-biotech vs Weizhena on the same topic — and the verdict is honest: Scout wins on density, named picks, paste-ready code, unattended single-shot, and publication; loses on corpus breadth, epistemic honesty (no per-claim confidence, no `[uncertain]` markers), and re-renderability. See the “Ideas to steal for Scout (status, Apr 2026)” and “Head-to-head verified” sections below.
TL;DR for a Scout-shaped tool
“Deep research” has become table stakes: ChatGPT, Claude, Gemini, Perplexity, and Grok all ship an agentic research feature at the $20/month tier; differentiation is now speed, depth, and citation accuracy, not whether the feature exists [1]. If Scout’s job is a cited, reproducible, Jekyll-published artifact tied to a custom source rubric, the commercial products are poor fits (closed, no on-disk artifact, no steering hints). The interesting comparison class is OSS agents + Claude-Code-native skills.
Scout-relevant picks to steal from:
- GPT-Researcher — planner/executor/publisher split, parallel crawler, tree-shaped Deep Research mode [2].
- langchain-ai/open_deep_research — MCP-native, model-agnostic, #6 on Deep Research Bench [3].
- 199-biotechnologies/claude-deep-research-skill — 8-phase pipeline with disk-persisted citations, multi-persona critique, auto-continuation past context windows [4].
- STORM — perspective-guided question generation; Co-STORM’s human-in-the-loop turn policies [5].
The landscape at a glance
| Tool | Kind | License / access | Depth mechanism | Citation rigor | Scout-fit |
|---|---|---|---|---|---|
| OpenAI Deep Research | SaaS | ChatGPT Pro/Plus | o3-tuned agent, 15–25 min runs | 87% cited accuracy [6], 26.6% HLE [7] | low — closed, no artifact, no steering |
| Perplexity Sonar Deep Research | API + SaaS | Commercial, $2/$8 per M tok + $5/1K searches [8] | Short loops, 2–3 min [6] | 94.3% Sonar Pro [6] | medium — API-driven, but opaque |
| Gemini Deep Research | SaaS | AI Pro $19.99/mo [9] | Gmail/Drive + web, multi-page reports [10] | closed metric | low — Workspace-locked |
| Claude Research / Advanced | SaaS | Claude Pro/Max | Up to 45 min, hundreds of sources [11] | closed metric | low — not programmable as artifact |
| Claude Managed Agents | API (beta, 2026-04-01) | Commercial | Harness: sandbox, tools, web [12] | depends on system prompt | high — could host Scout |
| Grok DeepSearch | SaaS | X Premium | Real-time X/web synthesis [13] | closed metric | low — X-centric |
| GPT-Researcher | OSS | Apache-2.0, 26.6k stars (Apr 2026) [2] | Planner+execution+publisher; tree Deep Research [2] | inline, 20+ sources | high — closest analogue |
| open_deep_research | OSS | MIT, 11.2k stars (Apr 2026) [3] | LangGraph; MCP; #6 DR Bench 0.4943 [3] | depends on search tool | high — swap backend |
| STORM / Co-STORM | OSS | MIT, 28.1k stars (Apr 2026) [5] | Perspective-guided Q&A [14] | Wikipedia-style, polished | medium — long-form only |
| smolagents Open Deep Research | OSS | Apache-2.0, 26.8k stars (Apr 2026) [15] | Code-agent; 55.15 GAIA vs OpenAI 67.36 [16] | proof-of-concept | low — experimental |
| local-deep-researcher | OSS | MIT | Ollama/LMStudio; iterative reflect-and-requery [17] | inline markdown sources | medium — offline mode |
| 199-biotech claude-deep-research-skill | Claude Skill | MIT, 509 stars [4] | 8-phase; critique loop-back; auto-continue [4] | disk-persisted citations | very high — same substrate |
| Weizhena Deep-Research-skills | Claude Skill | MIT, 483 stars [18] | Two-phase: outline + deep [18] | inline | high — HITL checkpoints |
| Elicit | SaaS (academic) | Commercial | 138M papers + 545K trials, abstract screening [19] | 99.4% field-extraction accuracy [19] | low — PubMed-only |
| FutureHouse (Crow/Falcon/Owl/Phoenix) | SaaS + API (science) | Commercial | Modular agents per task [20] | paper-grounded | low — science-only |
Commercial “Deep Research” products
OpenAI Deep Research
Launched February 2025 in ChatGPT Pro; powered by an o3 variant tuned for browsing and analysis; 26.6% on Humanity’s Last Exam at launch — highest in its cohort [7]. Runs are the longest (15–25 min) and the reports the most essay-like [6]. The sibling OSS reproduction on smolagents hits 55.15% on GAIA against OpenAI’s 67.36% [16] — so the production gap is still real, mostly in browser tooling and vision.
Perplexity Sonar Deep Research
The only one of the “big five” with a production API, at $2/$8 input/output per million tokens on the base rate; Deep Research stacks $2/1M citation tokens, $3/1M reasoning tokens, and $5 per 1K searches — a full query typically lands around $0.41 [8]. Fastest of the commercial options (2–3 min) and claims 94.3% citation accuracy on Sonar Pro vs ~87% for GPT-5.2 Deep Research [6]. The API is the obvious drop-in if you want to delegate the search+synth step from Scout.
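For budgeting that delegate-to-Sonar option, the published rates compose straightforwardly. A back-of-the-envelope sketch in Python; the token and search counts are illustrative assumptions, not measurements, chosen only to show how a run lands near the quoted ~$0.41:

```python
# Back-of-the-envelope cost model for one Sonar Deep Research query, using
# the published rates above. Token and search counts are illustrative
# assumptions, not measurements from a real run.

RATES = {
    "input_per_mtok": 2.00,      # $2 per 1M input tokens
    "output_per_mtok": 8.00,     # $8 per 1M output tokens
    "citation_per_mtok": 2.00,   # $2 per 1M citation tokens
    "reasoning_per_mtok": 3.00,  # $3 per 1M reasoning tokens
    "per_1k_searches": 5.00,     # $5 per 1,000 searches
}

def deep_research_cost(input_tok: int, output_tok: int, citation_tok: int,
                       reasoning_tok: int, searches: int) -> float:
    """Estimated USD cost of one Deep Research query."""
    return (
        input_tok / 1e6 * RATES["input_per_mtok"]
        + output_tok / 1e6 * RATES["output_per_mtok"]
        + citation_tok / 1e6 * RATES["citation_per_mtok"]
        + reasoning_tok / 1e6 * RATES["reasoning_per_mtok"]
        + searches / 1000 * RATES["per_1k_searches"]
    )

# A hypothetical query shape (10K in, 5K out, 20K citation, 80K reasoning
# tokens, 15 searches) lands in the $0.41-0.42 range quoted above.
print(round(deep_research_cost(10_000, 5_000, 20_000, 80_000, 15), 2))
```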
Gemini Deep Research
Lives inside Google AI Pro at $19.99/mo and differentiates on Workspace: it pulls from Gmail, Drive, and the public web simultaneously and drops multi-page reports back into Docs [9][10]. Useless for a Scout-style standalone artifact unless you live in Workspace.
Claude Research / Advanced Research
Anthropic calls it “Research” and “Advanced Research” — the latter runs up to 45 min across hundreds of sources autonomously [11]. The important thing for Scout is that Anthropic renamed the “Claude Code SDK” to “Claude Agent SDK” precisely because internal use was dominated by research, video, and note-taking — not just coding [21]. Claude-Code-as-research-platform is the officially sanctioned pattern.
Claude Managed Agents (beta, Apr 2026)
Hosted agent harness released in public beta behind the managed-agents-2026-04-01 header; it provides the sandbox, Bash, file ops, web search/fetch, and MCP servers that Scout currently gets from the local runner [12]. If Scout ever wants cloud execution without the GitHub Actions runner, this is the migration path — and the Environment/Session/Events model maps cleanly onto “one research run = one Session.”
Grok DeepSearch
Only interesting when the topic is breaking news or social sentiment — direct X-timeline access is the one thing competitors can’t match [13]. No API relevant to Scout.
Open-source agents
GPT-Researcher
The closest conceptual sibling to Scout, and the older one — May 2023, Apache-2.0, 26.6k stars (Apr 2026) [2]. Architecture is a clean three-role split: planner generates research questions from the query, execution agents crawl 20+ web sources in parallel, publisher aggregates into the final report with inline citations [2]. 2026 additions: a recursive Deep Research mode with tree-shaped exploration and configurable depth/breadth (~5 min, ~$0.40 per run on o3-mini), AI-generated inline illustrations via Gemini, LangSmith tracing, MCP integration, and local-document research [2]. Production-ready.
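The three-role split is easy to picture as code. A minimal sketch of the pattern, not GPT-Researcher's actual implementation; `llm` and `search` are stand-in async callables for a chat-model call and a web-search tool:

```python
import asyncio

# Minimal sketch of the planner / executor / publisher split. `llm` and
# `search` are stand-in async callables; this is not GPT-Researcher's code.

async def plan(llm, topic: str) -> list[str]:
    # Planner: turn the topic into a handful of focused research questions.
    raw = await llm(f"List 5 focused research questions for: {topic}")
    return [line.lstrip("-0123456789. ").strip() for line in raw.splitlines() if line.strip()]

async def execute(llm, search, question: str) -> str:
    # Execution agent: search the web, then summarize findings with source URLs.
    hits = await search(question)
    sources = "\n".join(f"{h['url']}: {h['snippet']}" for h in hits)
    return await llm(
        f"Summarize what these sources say about '{question}', citing URLs inline:\n{sources}"
    )

async def publish(llm, topic: str, findings: list[str]) -> str:
    # Publisher: aggregate per-question summaries into one cited report.
    return await llm(
        f"Write a report on '{topic}' from these findings, keeping inline citations:\n\n"
        + "\n\n".join(findings)
    )

async def research(llm, search, topic: str) -> str:
    questions = await plan(llm, topic)
    findings = await asyncio.gather(*(execute(llm, search, q) for q in questions))
    return await publish(llm, topic, list(findings))
```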
LangChain open_deep_research
MIT, 11.2k stars, built on LangGraph, model-agnostic via init_chat_model(), Tavily by default but supports native Anthropic/OpenAI web search and any MCP server [3]. Ranked #6 on DeepResearch Bench with a 0.4943 RACE score using GPT-5 [3]. Deployable via LangGraph Platform or Open Agent Platform. The most “configurable” of the OSS options — closest to a reference implementation.
Stanford STORM / Co-STORM
MIT, 28.1k stars [5]. Different shape: instead of one planner, STORM simulates conversations between writers with different perspectives and a topic-expert LLM grounded in web sources, then uses that transcript to build the outline [14]. Measurably broader coverage (+10% absolute) and better organization (+25%) vs outline-then-RAG baselines [14]. Co-STORM adds human-in-the-loop turn policies. Outputs are Wikipedia-style; Stanford explicitly warns they’re not publication-ready [5].
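The perspective-guided mechanism is worth sketching because it differs from a planner loop: questions come from simulated personas, answers stay grounded in retrieved sources, and the outline is built from the transcript. A schematic sketch with generic `llm`/`search` stand-ins and example personas, not STORM's actual API:

```python
# Schematic of perspective-guided question generation; `llm` and `search`
# are generic stand-ins and the personas are examples, not STORM's prompts.

PERSPECTIVES = ["historian", "practitioner", "skeptical reviewer"]

def perspective_interviews(llm, search, topic: str, turns: int = 3) -> list[tuple[str, str]]:
    """Simulate persona-driven interviews with a source-grounded expert."""
    transcript: list[tuple[str, str]] = []
    for persona in PERSPECTIVES:
        for _ in range(turns):
            asked = "\n".join(q for q, _ in transcript)
            question = llm(f"As a {persona}, ask one question about {topic} "
                           f"that is not already covered by:\n{asked}")
            grounding = search(question)
            answer = llm(f"Answer '{question}' using only these sources, citing them:\n{grounding}")
            transcript.append((question, answer))
    return transcript

def build_outline(llm, topic: str, transcript: list[tuple[str, str]]) -> str:
    """Turn the interview transcript into a section outline."""
    qa = "\n".join(f"Q: {q}\nA: {a}" for q, a in transcript)
    return llm(f"Draft a section outline for an article on {topic} from this interview:\n{qa}")
```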
Hugging Face smolagents + Open Deep Research
Apache-2.0, 26.8k stars [15]. Core bet: agents emit Python code instead of JSON tool calls, yielding ~30% fewer steps and better multimodal state handling [16]. The Open Deep Research example hit 55.15% on GAIA — a 22-point jump over JSON-based agent baselines, but still 12 points behind OpenAI Deep Research’s 67.36% [16]. Still a proof-of-concept; known context-window blow-ups and demo instability [16].
LangChain local-deep-researcher
MIT. Fully local: any Ollama- or LMStudio-hosted model, SearXNG for search, nothing leaves the machine [17]. Loop is explicit — generate query → search → summarize → reflect for knowledge gaps → generate next query, for a user-specified number of cycles, ending in a markdown summary with sources [17]. Good reference for Scout’s offline mode if that’s ever on the table.
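The same loop, as a schematic sketch; `llm` and `search` are generic stand-ins, not the project's actual code:

```python
# Schematic version of the loop described above; `llm` and `search` are
# generic stand-ins, not local-deep-researcher's actual API.

def iterative_research(llm, search, topic: str, cycles: int = 3) -> str:
    summary, sources = "", []
    query = llm(f"Write one web search query for: {topic}")            # generate query
    for _ in range(cycles):
        results = search(query)                                         # search
        sources += [r["url"] for r in results]
        snippets = "\n".join(f"{r['url']}: {r['snippet']}" for r in results)
        summary = llm(                                                   # summarize
            "Update this running summary with the new results.\n"
            f"Summary so far:\n{summary}\n\nNew results:\n{snippets}"
        )
        gap = llm(f"Name the biggest knowledge gap in this summary:\n{summary}")    # reflect
        query = llm(f"Write one web search query that would fill this gap: {gap}")  # requery
    return summary + "\n\nSources:\n" + "\n".join(dict.fromkeys(sources))
```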
Claude-Code-native research skills
These matter most for Scout because they share the substrate: a .md skill file + Markdown-only artifacts + no-lock-in.
199-biotechnologies/claude-deep-research-skill
MIT, 509 stars [4]. 8 phases: Scope → Plan → Retrieve (parallel search + sub-agents) → Triangulate → Outline Refinement → Synthesize → Critique → Refine → Package [4]. Ideas worth porting:
- Disk-persisted citations that survive context compaction — this is the single biggest fragility in a long Scout run [4].
- Multi-persona critique (Skeptical Practitioner, Adversarial Reviewer, Implementation Engineer) before the final write [4].
- Validation loop: validate → fix → retry, max 3 cycles with 9 structural checks plus DOI/URL hallucination detection [4] (a minimal sketch follows this list).
- Auto-continuation via recursive sub-agents for reports >18K words [4].
- Date fetch before searching — so the model doesn’t use stale year assumptions in queries [4].
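A minimal sketch of the validate → fix → retry shape, capped at 3 cycles. The two checks shown (every [[n]] marker has a ledger entry; every cited URL looks like a URL) are placeholders for the skill's 9 structural checks and DOI/URL hallucination detection, not its actual rules:

```python
import re

# Sketch of a validate → fix → retry loop capped at 3 cycles. The checks are
# placeholders, not the skill's actual 9 structural checks.

def validate(report: str, ledger: dict[int, str]) -> list[str]:
    problems = []
    for n in sorted({int(m) for m in re.findall(r"\[\[(\d+)\]\]", report)}):
        if n not in ledger:
            problems.append(f"citation [[{n}]] has no ledger entry")
    for n, url in ledger.items():
        if not url.startswith(("http://", "https://")):
            problems.append(f"ledger entry {n} has a suspicious URL: {url!r}")
    return problems

def validate_fix_retry(llm, report: str, ledger: dict[int, str], max_cycles: int = 3) -> str:
    for _ in range(max_cycles):
        problems = validate(report, ledger)
        if not problems:
            break
        report = llm("Fix these problems in the report without adding new claims:\n"
                     + "\n".join(problems) + "\n\n" + report)
    return report
```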
Weizhena/Deep-Research-skills
MIT, 483 stars [18]. Two phases: outline generation (user can expand it) then deep investigation per item in parallel [18]. The HITL checkpoints are the lesson — approve outline before spending tokens on investigation. Scout is autonomous, so this maps to “self-review the outline before committing,” not a literal user prompt.
Three Ways to Build Deep Research with Claude (paddo.dev)
Taxonomy for the design space itself [22]:
- DIY recursive spawning — ~20 lines of shell, parallel researcher instances, unlimited depth; the cost is tokens and zero visibility. This is Scout’s current shape.
- MCP plug-and-play — drop in DuckDuckGo, Semantic Scholar, etc.; low ceiling, big upfront context tax.
- Production research product — multi-source claim verification, progress streaming, cost tracking; high engineering + maintenance.
Scout sits between (1) and (3). The direction to move is to lift specific ideas from (3), not to become (3).
Domain-specific agents
Elicit
Best-in-class for academic literature, not general research. 138M papers + 545K clinical trials, PubMed/ClinicalTrials.gov search, 80% time saved on abstract screening with every screening decision backed by quote-level rationale, 99.4% data-extraction accuracy on a real German policy review [19]. Systematic Review reports cap at 80 papers. Ideas worth stealing: per-claim rationale + source quote, and quality scoring of each source against the screening criteria, not a binary in/out.
FutureHouse (Crow, Falcon, Owl, Phoenix)
Scientific-discovery platform built on Claude [23]; agents are task-specialized — Crow extracts genes/markers from papers, Falcon does background, Owl checks whether a hypothesis has already been investigated, Phoenix designs chemistry [20]. The transferable pattern is one agent per epistemic move rather than one generalist — Scout already does this implicitly via Explore sub-agents, but not with distinct named roles.
Benchmarks: what “good” means in 2026
DeepResearch Bench: 100 PhD-level tasks across 22 fields, scored on RACE (report quality, reference-based, adaptive criteria) and FACT (retrieval + citation trustworthiness) [3]. Current top: Cellcog Max (proprietary, 56.13 overall, Mar 2026) and TrajectoryKit on GPT-OSS/GPT-5.4 (MIT, 54.92) [3]. GAIA (general assistant benchmark) is where smolagents tops out at 55.15% against OpenAI DR’s 67.36% [16]. Humanity’s Last Exam is the hardest: OpenAI Deep Research scored 26.6% at launch, which was the frontier at the time [7].
None of these benchmarks directly evaluate the thing Scout does — citation-inline markdown with per-topic source rubrics. The closest proxy is FACT’s citation-trustworthiness axis.
Ideas to steal for Scout (status, Apr 2026)
Restructured 2026-04-29: status markers reflect what Scout has actually shipped vs what’s still open. Open items first.
Still open
- ✗ Multi-persona critique pass before writing — 199-biotech runs Skeptical Practitioner + Adversarial Reviewer + Implementation Engineer over the draft and surfaces unsupported claims [4]. Scout’s `deep` tier has a single `scout-reviewer` sub-agent [25] — same shape, one persona; splitting into two or three named personas is cheap and worth doing.
- ✗ Per-claim confidence labels (High / Med / Low) and an explicit Claims-Evidence Table — 199-biotech appends a table mapping load-bearing claims → sources → confidence rating [4]. Scout’s “no claim without URL” rule is weaker because it doesn’t surface which claims are thinly supported. Cheapest big win flagged in the verified head-to-head below.
- ✗ `[uncertain]` markers on unverifiable fields (Weizhena pattern) [18] — and a tail list of the exact uncertain field paths. Honest epistemics, prompt-level change.
- ✗ Editable analytic schema before research (`fields.yaml` style) — Weizhena lets the user define ~50 analytic axes per item before `/research-deep` fans out [18]. Scout has no equivalent lever; depth/format flags don’t shape the analytic axes. Architectural lift but worth piloting on survey-shaped topics.
- ✗ Research/render split — Weizhena emits per-item JSON (`results/*.json`) and a `generate_report.py` that synthesizes the markdown; you can re-render against a different template without re-researching [18]. Scout’s `index.md` IS the artifact. The new `scout-view-author` skill is partial coverage (you can author alternate visual treatments without re-running) but the data shape is still narrative-only, not structured.
- ✗ MCP-first search backend — both open_deep_research ⭐ 11.2k (Apr 2026) [3] and GPT-Researcher ⭐ 26.6k (Apr 2026) [2] standardized on MCP as the pluggable search interface — swapping Tavily / SearXNG / Exa becomes config. Scout still uses `WebSearch` + `WebFetch` + `playwright` fallback hardcoded in the skill [24].
- ✗ Source-type breakdown in the artifact — derived from `citations.jsonl` (Anthropic / GH / security firms / blogs etc.). Scout already tags `source_type` per ledger entry [24]; surfacing the histogram in the artifact is a one-prompt change (a sketch follows this list).
- ✗ Methodology metadata block — 199-biotech appends a footer naming phases, source counts, triangulation rule, persona names, validation status [4]. Cheap auditability win.
- ✗ Claude Managed Agents as a hosting target — if Scout outgrows the self-hosted GitHub Actions runner [12]. Not a current pain point but the migration path is open.
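The source-type histogram really is a small amount of code against the existing ledger. A sketch, assuming `citations.jsonl` holds one JSON object per line with the `source_type` field described in the next list:

```python
import json
from collections import Counter

# Sketch of the source-type histogram, assuming citations.jsonl holds one
# JSON object per line with the source_type field described below.

def source_type_histogram(path: str = "citations.jsonl") -> str:
    with open(path, encoding="utf-8") as f:
        counts = Counter(json.loads(line)["source_type"] for line in f if line.strip())
    total = sum(counts.values()) or 1
    rows = [f"| {src} | {n} | {n / total:.0%} |" for src, n in counts.most_common()]
    return "\n".join(["| Source type | Count | Share |", "|---|---|---|", *rows])

# Appending the returned markdown table to index.md surfaces the breakdown
# in the published artifact.
```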
Shipped or partial
- ✓ Disk-persisted `{claim, url}` store — Scout writes `citations.jsonl` for `standard` and `deep`, with `n` matching every `[[n]]` in the body [24]. Schema includes `source_type`, `quote`, `github_stars`. Hard rule: every `[[n]]` has a ledger entry; every entry has a non-empty `url`.
- ✓ Reflect-and-requery loop — `standard` runs one explicit gap-listing + targeted requery round; `deep` does the same per researcher sub-agent plus one parent remediation round [24][25].
- ✓ Perspective-guided outline (STORM-shaped) — `deep.md`’s breadth heuristic enumerates the chooser / maintainer / skeptic / operator / competitor / history angles before dispatching researchers [25].
- ✓ Per-source credibility taxonomy — `official` / `peer-reviewed` / `vendor-blog` / `forum` / `news` / `wiki` tags on every ledger entry [24].
- ✓ Planner / researcher / writer split — `deep` tier: parent plans + writes, researcher sub-agents fan out (max 6 concurrent), each owns its own `citations.a<N>.jsonl`, `merge_ledgers.sh` dedupes into `citations.jsonl`, parent writes from compressed researcher summaries [25]. Each researcher gets its own 200K context so the parent never sees raw search trajectories (a sketch of the merge follows this list).
- ✓ Fetch today’s date before querying — `DATE` is injected by `run.sh`; `WebSearch` queries include the literal year [24].
- ◑ Auto-continuation for long reports — sidestepped rather than solved. The researcher sub-agent split keeps the parent context small, so the 18K-word ceiling that drove 199-biotech’s recursive auto-continuation [4] hasn’t been hit. Open if Scout ever wants single-page reports >18K words.
- ◑ Re-renderability — the new `scout-view-author` skill produces bespoke HTML “views” of an existing canonical (`<canonical>/views/<view_name>.html`) without re-running research [24]. This covers “render the same data with a different visual register” but not “edit the analytic schema and re-emit the matrix” — the canonical is still narrative, not a queryable JSON.
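For the ledger merge named above, a Python rendering of the idea; the real `merge_ledgers.sh` is a shell script and its exact dedupe key is not documented here, so deduping on `url` is an assumption made for illustration:

```python
import glob
import json

# Python rendering of the per-agent ledger merge; the real merge_ledgers.sh
# is a shell script, and deduping on "url" is an assumption.

def merge_ledgers(pattern: str = "citations.a*.jsonl",
                  out_path: str = "citations.jsonl") -> int:
    seen_urls: set[str] = set()
    merged: list[dict] = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                entry = json.loads(line)
                if entry["url"] in seen_urls:
                    continue  # several researchers found the same source
                seen_urls.add(entry["url"])
                merged.append(entry)
    with open(out_path, "w", encoding="utf-8") as f:
        for entry in merged:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return len(merged)
```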
Adjacent capabilities Scout shipped that weren’t in the original “ideas to steal” list
- Topic sharpening — pre-research LLM step proposes a tightened topic with optional decomposition for multi-angle expeditions [24]. User edits in place via GitHub Issue, then ticks Start.
- Decomposition with synthesis — multi-angle topics fan out into one sub-research per angle plus a parent overview that names contradictions and dependencies between children [25].
- Format auto-selection — `auto` heuristic picks `.md` for narrative analyses and `.html` for comparison-heavy / visual topics [24]. Neither commercial nor OSS competitor varies format by topic shape.
- Cost / duration in frontmatter — `cost_usd` and `duration_sec` injected by `run.sh` after the run, surfaced on the Atlas card. Neither 199-biotech nor Weizhena exposes cost.
- Profile-driven localization — `profile.yml` with `location` / `languages` / `currency` / `interests` localizes sharpening (e.g., “best ramen” → “best ramen in Ghent, EUR”) [24].
Head-to-head verified — Scout vs 199-biotech vs Weizhena (Apr 2026)
Same topic (Yolo Claude Code in Docker, issue #7) [26], three tools, three artifacts, all read end-to-end. Numbers below are from the verified comparison run.
Run footprint
| Metric | Scout (deep) | 199-biotech ⭐ 509 [4] | Weizhena ⭐ 483 [18] |
|---|---|---|---|
| Lines | 423 | 621 | 8,901 |
| Words | 4,915 | 9,502 | 232,197 |
| Distinct citations | 77 | 34 (self-reported) | ≥150 across 75 items |
| Unique domains | 38 | not tallied | not tallied |
| Wall-clock | 18.5 min | ~25 min | ~7.5 hr (interactive, multi-phase gates) |
| API cost | $10.70 declared | not disclosed | not disclosed |
| Subscription footprint | “blip on the subscription” | comparable to Scout | exceeded Max-$200 6-hour cap on this single topic — required pause/resume on a fresh window |
| Artifacts | `index.md` + `outline.md` + `citations.jsonl` + 8 per-agent ledgers | `.md` + `.html` (advertised JSONL/PDF not produced) | `report.md` + `outline.yaml` + `fields.yaml` + `generate_report.py` + 75 `results/*.json` |
Where Scout actually loses
- Corpus breadth. Weizhena’s 75-tool sweep names tools Scout doesn’t (Ona Veto, NVIDIA OpenShell, Cloudflare Outbound Workers, kubernetes-sigs/agent-sandbox, Spritz, sbox, claucker, VibePod, rivet-dev, …). On a survey-style topic, Weizhena is the right tool.
- Incident / CVE breadth. Weizhena cites ~9 distinct CVE/incident classes (CVE-2025-59536, CVE-2026-21852, CVE-2025-66032, CVE-2026-25725, CVE-2026-39861, Claudy Day, Mar 2026 source leak, axios RAT, LiteLLM, Ona disclosure, Adversa deny-rule bypass). Scout covers ~5; 199-biotech ~3 (and missed CVE-2025-59536 entirely).
- Epistemic honesty. No `[uncertain]` markers, no per-claim confidence labels, no source-type histogram in the artifact. Both competitors do better here in different ways.
- Re-renderability. Once a Scout run is done, the template can’t change without re-researching. Weizhena lets you edit `generate_report.py` and re-render from `results/*.json` (sketched after this list).
- Auditability of major claims. 199-biotech’s Claims-Evidence Table maps 10 load-bearing claims → sources → High/Med/Low confidence; Scout’s “no claim without URL” treats every URL-cited claim as equally weighted.
- Methodology transparency. 199-biotech names its 8 phases, persona names, triangulation rule, source-type breakdown. Scout has citation count + per-agent ledgers but no methodology block in the artifact.
- Track record. Scout ⭐ 0 (Apr 2026), 4 outputs total. 199-biotech (⭐ 509) and Weizhena (⭐ 483) have months of use. Failure modes that show up at scale haven’t been observed yet for Scout.
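For contrast, the data/render split that Scout lacks is small in code terms. A sketch of the Weizhena-style pattern; the per-item field names (`name`, `score`, `summary`) are assumptions for illustration, not Weizhena's actual schema:

```python
import glob
import json

# Sketch of the data/render split: research results live as per-item JSON and
# the report is a template over them. Field names are illustrative only.

def render_report(results_glob: str = "results/*.json",
                  out_path: str = "report.md") -> None:
    items = []
    for path in sorted(glob.glob(results_glob)):
        with open(path, encoding="utf-8") as f:
            items.append(json.load(f))
    items.sort(key=lambda it: it.get("score", 0), reverse=True)
    lines = ["# Survey", "", "| Tool | Score | Summary |", "|---|---|---|"]
    lines += [f"| {it['name']} | {it.get('score', '-')} | {it['summary']} |" for it in items]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

# Editing the template above and re-running re-renders the whole report
# without touching the research data.
```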
Where Scout actually wins
- Density for a narrative reader. 4.9k words gets to a decision faster than 9.5k or 232k.
- Picks named with context. “Fastest start: Docker Sandboxes. Specific toolbelt: roll your own. Multi-project: claudebox.” That’s what an expert who’s read enough actually wants. 199-biotech recommends an architecture without comparing alternatives. Weizhena scores all 75 tools 0–5 and lets the reader choose — more honest, less direct.
- Paste-ready code in one place. Complete Dockerfile + `init-firewall.sh` + `docker run` + `compose.yaml` in one read. 199-biotech splits across sections; Weizhena has snippets per-item.
- Currency on 2026 CVEs. Scout names CVE-2025-59536, Cyata, MCP-STDIO, Trail of Bits, Invariant. (Still behind Weizhena’s depth here, but ahead of 199-biotech.)
- Format auto-selection. Scout’s `auto` heuristic picked `.md` for this topic; neither competitor varies format by topic shape.
- Unattended single-shot. 18.5 min one-shot. 199-biotech ~25 min, also single-shot. Weizhena needs three human gates (`/research`, `/research-deep`, `/research-report`) — incompatible with overnight or scheduled runs.
- Publication pipeline. Scout lands on a public Atlas URL via GitHub Pages. The others sit on disk.
- Cost transparency. `cost_usd: 10.70`, `duration_sec: 1110` in frontmatter. Neither competitor exposes cost.
- Subscription footprint. Weizhena exceeded the Max-$200 6-hour cap on this single topic; Scout was a “blip.” (User-observed, not measured rigorously.)
Situational picks
- “What should I do tonight?” with a developer who’ll act on it → Scout. Density + paste-ready code + named picks beats both alternatives for this shape.
- “Map the entire landscape, score everything, let me decide” → Weizhena on output quality. Caveat: the Max-$200 6-hour cap. Scout’s `deep` (expedition) tier exists and is shape-similar but hasn’t been benchmarked head-to-head.
- “Polished narrative essay with audit trail” → 199-biotech. Claims-Evidence Table + 8-phase methodology block read like a vendor white paper.
- “Reproducibility — re-render with a new template later” → Weizhena (only one with the data/render split).
- “Unattended overnight runs” → Scout or 199-biotech. Weizhena’s interactive gates rule it out.
Bottom line
Scout is good for the use case it’s designed for — terse decision-oriented narrative with paste-ready code, single-shot, published. It is not “best overall” by any objective measure on this topic. Weizhena beats it on breadth, epistemics, and re-renderability. 199-biotech beats it on auditability and methodology transparency. The right benchmark is “for what shape of question?”, not “which one wins?”
The cheapest large wins from this comparison: multi-persona critique (split today’s single reviewer into 2-3 personas), Claims-Evidence Table (load-bearing claims → sources → High/Med/Low confidence appended after the body), [uncertain] markers (prompt-level), source-type histogram (one-prompt change against existing citations.jsonl).
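Of those, the Claims-Evidence Table can be generated from data Scout already persists, once the writer or reviewer pass emits the claim-to-citation mapping and the confidence call. A sketch, assuming a claims list of `{claim, citations, confidence}` dicts produced by that pass and the `citations.jsonl` schema described earlier:

```python
import json

# Sketch of a Claims-Evidence Table appended after the body, in the spirit of
# the 199-biotech pattern. URLs come from citations.jsonl; the claims list and
# the High/Med/Low call would come from a writer or reviewer pass, so the
# {"claim", "citations", "confidence"} shape is an assumption.

def load_ledger(path: str = "citations.jsonl") -> dict[int, str]:
    ledger = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                entry = json.loads(line)
                ledger[entry["n"]] = entry["url"]
    return ledger

def claims_evidence_table(claims: list[dict], ledger: dict[int, str]) -> str:
    rows = []
    for c in claims:
        urls = ", ".join(ledger.get(n, "[uncertain]") for n in c["citations"]) or "[uncertain]"
        rows.append(f"| {c['claim']} | {urls} | {c['confidence']} |")
    return "\n".join(["| Claim | Evidence | Confidence |", "|---|---|---|", *rows])
```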
Production-ready vs experimental
| Production-ready today | Experimental / proof-of-concept |
|---|---|
| OpenAI Deep Research, Perplexity Sonar Deep Research, Gemini DR, Claude Research, Grok DeepSearch [1] | smolagents Open Deep Research — context-window blow-ups, unstable demo [16] |
| GPT-Researcher [2], open_deep_research [3], local-deep-researcher [17] | STORM — “cannot produce publication-ready articles” per authors [5] |
| Elicit Systematic Review [19], FutureHouse Crow/Falcon/Owl [20] | FutureHouse Phoenix — “not as deeply benchmarked, may make more mistakes” [20] |
| 199-biotech Claude skill [4], Weizhena skill [18] | Claude Managed Agents — public beta, April 2026 [12] |