The single most important finding across all six angles is the 4× evidence gap between independent and vendor benchmarks. The largest independent review benchmark, SWRBench (1,000 manually verified PRs), puts the best system at F1 ≈ 19.4% [1], and a separate in-the-wild study of 19,450 real PRs found that 12 of 13 agents averaged below 60% signal, with 60% of agent-only PRs in the 0–30% signal band [2]. Vendor benchmarks claim 50–82% [3][4][5]. The evaluation rubrics angle explains why the gap need not imply dishonesty: incompatible “caught” definitions, LLM-seeded synthetic bug sets, and the systematic absence of false-positive reporting make the numbers mutually incomparable. The one transferable metric is precision on your own PRs, and acceptance rate is the honest proxy for it.
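The metrics named above can be computed from a hand-labeled sample of your own review comments. The following is a minimal sketch, not any benchmark's actual scoring code: the `ReviewTally` schema and its field names are hypothetical, and it assumes each comment has already been labeled as a real defect or a non-issue.

```python
from dataclasses import dataclass


@dataclass
class ReviewTally:
    """Counts from a hand-labeled sample of AI review comments (hypothetical schema)."""
    true_positives: int   # comments that flagged a real defect
    false_positives: int  # comments that flagged a non-issue
    false_negatives: int  # real defects the reviewer missed
    accepted: int         # comments the PR author acted on
    total_comments: int   # all comments the tool posted


def precision(t: ReviewTally) -> float:
    """Share of posted flags that pointed at a real defect."""
    flagged = t.true_positives + t.false_positives
    return t.true_positives / flagged if flagged else 0.0


def recall(t: ReviewTally) -> float:
    """Share of real defects that the reviewer flagged."""
    actual = t.true_positives + t.false_negatives
    return t.true_positives / actual if actual else 0.0


def f1(t: ReviewTally) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(t), recall(t)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def acceptance_rate(t: ReviewTally) -> float:
    """Share of comments the author acted on; a proxy for precision, not a substitute."""
    return t.accepted / t.total_comments if t.total_comments else 0.0
```

Acceptance rate needs no ground-truth labeling, which is why it works as a proxy. It still undercounts valid comments that authors ignore and overcounts trivial ones they accept.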
Agentic retrieval beyond the diff is the real architectural differentiator, not model choice. “When the retrieval layer is ‘the diff plus 100 lines around it,’ every AI reviewer regresses to the same ceiling” [6]. Tools that build whole-codebase dependency graphs (Greptile, Qodo’s Context Engine, Claude Code Review’s multi-agent aggregator) structurally outperform single-pass reviewers on cross-file and architectural defects. The same architecture introduces an unpatched verification gap: when one model plans, acts, and grades its own output, self-review bias compounds. Multi-agent fan-out only helps when subtasks are genuinely independent — over-decomposing is a documented production failure mode.
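To make the contrast with "the diff plus 100 lines" concrete, here is a minimal sketch of graph-based context selection. It is an illustration under stated assumptions, not how Greptile, Qodo, or Claude Code Review implement retrieval: the `imports` mapping (module name to the modules it imports) is assumed to be precomputed, and `review_context` and `max_hops` are hypothetical names.

```python
from collections import deque


def reverse_edges(imports: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert an import map so each module lists the modules that depend on it."""
    dependents: dict[str, set[str]] = {}
    for module, deps in imports.items():
        dependents.setdefault(module, set())
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)
    return dependents


def review_context(
    changed: set[str],
    imports: dict[str, set[str]],
    max_hops: int = 2,
) -> set[str]:
    """Breadth-first walk over imports and dependents, up to max_hops from the diff."""
    dependents = reverse_edges(imports)
    seen = set(changed)
    queue = deque((module, 0) for module in changed)
    while queue:
        module, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand past the hop budget
        neighbors = imports.get(module, set()) | dependents.get(module, set())
        for neighbor in neighbors:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return seen
```

With `max_hops=0` this degenerates to the changed files alone, which is the diff-only ceiling described above. Raising the hop budget pulls in callers and callees at the cost of a larger prompt.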
The security attack surface is live and structurally underreported. The April 2026 Comment-and-Control disclosure used instructions hidden in HTML comments (invisible in rendered Markdown) to hijack Anthropic, Google, and GitHub review agents into exfiltrating API keys — the attack fires automatically on pull_request events with no attacker interaction required [7]. CVE-2025-59145 (CamoLeak, CVSS 9.6) exfiltrated source code one character at a time through GitHub’s own image proxy [8]. These are orthogonal to the SAST/SCA/secrets-scanning foundation covered in the security tooling angle — Semgrep, Snyk, and GitGuardian don’t scan for AI prompt injection. The two AppSec layers address different threat surfaces and must both be deployed; neither substitutes for the other.
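The HTML-comment vector described above can be surfaced with a simple pre-filter on PR text before it reaches a review agent. This is a minimal sketch covering that one vector only, not a complete prompt-injection defense and not the mitigation any of the named vendors shipped; `split_hidden_comments` is a hypothetical helper.

```python
import re

# Matches a full HTML comment, or an unterminated "<!--" running to the end of the text.
HTML_COMMENT = re.compile(r"<!--.*?(?:-->|\Z)", re.DOTALL)


def split_hidden_comments(markdown: str) -> tuple[str, list[str]]:
    """Return (text with HTML comments removed, list of the removed comments).

    Comments do not appear in rendered Markdown, so a human reviewer never
    sees them, but a model reading the raw text does.
    """
    hidden = HTML_COMMENT.findall(markdown)
    visible = HTML_COMMENT.sub("", markdown)
    return visible, hidden
```

A caller might pass only `visible` to the model and log or block the PR when `hidden` is non-empty. The regex also strips comment-like text inside code fences, so a production filter would need a real Markdown parser.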
Developer trust has inverted despite high adoption. 84% of developers now use AI tools, but trust in their accuracy has fallen to 29% (down from 40%), with 46% actively distrusting the output [9]. The METR randomized controlled trial (16 experienced developers, 246 tasks) measured a 19% slowdown while participants believed they were 20% faster [10]. DORA 2025 quantifies the organizational version: PR review time up 91%, PR size up 154%, and delivery throughput flat, a verification tax in which time saved writing is reabsorbed by checking [11]. Alert fatigue is the mechanism that converts high recall into zero value: well-tuned tools run at 5–15% false-positive rates [12], which scales to ~13 spurious critical flags per reviewer per week, and teams learn to dismiss the channel entirely within roughly 13 weeks [13]. The strategic split between tools on precision versus recall (CodeRabbit and Qodo optimize for recall; the now-deprecated Graphite Diamond optimized for precision) maps directly onto this failure mode.
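The alert-fatigue arithmetic can be checked against your own volumes. The sketch below assumes "false-positive rate" means the share of posted flags that are spurious; the sources cited above do not state the flag volume behind the ~13 per week figure, so the second function only back-solves what volume that figure would imply. Both function names are hypothetical.

```python
def spurious_flags_per_week(flags_per_week: float, false_positive_rate: float) -> float:
    """Expected spurious flags, given weekly flag volume and the share that are false."""
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    return flags_per_week * false_positive_rate


def implied_flag_volume(spurious_per_week: float, false_positive_rate: float) -> float:
    """Weekly flag volume implied by a spurious-flag count at a given rate."""
    if not 0.0 < false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be in (0, 1]")
    return spurious_per_week / false_positive_rate
```

Under that reading, 13 spurious flags a week at rates of 5% and 15% would correspond to roughly 260 and 87 flags per reviewer per week respectively, which is an inference from the quoted numbers rather than a figure reported in the sources.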
The market has consolidated around 19 commercial vendors in four segments, with a self-hostable OSS core: PR-Agent (11.5k GitHub stars) is the engine under Qodo Merge, and Sourcery (1.8k GitHub stars) remains the strongest open option for Python/JS refactoring. The EU AI Act becomes applicable on 2 August 2026 [14], adding a compliance dimension to vendor selection that most pricing pages have not yet addressed.
The open question the benchmark community hasn’t answered: at what measurable precision/recall threshold does AI code review justify autonomous merge gating — and can the field agree on a single “caught” definition before the question becomes moot?