- Want a self-hosted reviewer → PR-Agent, or OpenHands ⭐ 76k whose agent both fixes issues and posts PR reviews[19].
- Building your own review pipeline → LangGraph ⭐ 34k is the production default; CrewAI for quick prototypes[24][25].
- The evidence is sobering. The largest independent benchmark (200k+ real PRs) clusters top tools near ~52% precision / ~51% recall[50]; field studies find only 0.9–19.2% of AI comments get acted on vs ~60% for humans[55]. AI review is a useful-but-noisy assistant, not an autopilot[52].
This is the open-source companion to the survey's commercial market map. The commercial vendors (CodeRabbit, Graphite, Qodo, Greptile) are covered separately; here the focus is what you can read, fork, self-host, and cite — the OSS review tools, the autonomous agents, the frameworks that build them, the open-weight models underneath, and the benchmark/empirical research that grounds it all.
The defining 2026 event was Qodo donating PR-Agent to a fully community-owned GitHub org and restoring the Apache-2.0 license[1]. The community repo ships six tools — Describe, Review, Improve, Ask, Help Docs, Update CHANGELOG — across GitHub, GitLab, Bitbucket, Azure DevOps and Gitea, deployable as a GitHub App/Action, Docker container, pip CLI, or webhook, with bring-your-own LLM keys (OpenAI, Claude, Gemini, local Ollama)[3][10]. It is explicitly the community legacy foundation, distinct from the commercial Qodo Merge, which keeps the proprietary RAG context engine, embedding models and SOC 2 compliance behind the paywall[3]. Independent testing on a 450K-file monorepo rated PR-Agent the most capable OSS reviewer but flagged heavy self-host configuration overhead[10].
The critical caveat: "open-source AI code review" is mostly open-core. Sourcery's MIT repo contains only IDE plugins and GitHub-App glue — the actual AI engine is a proprietary cloud backend that cannot be self-hosted[4][5]. CodeRabbit is fully proprietary SaaS (free only for public repos). The genuinely fork-and-run options are smaller:
| Tool | Stars | License | Kind | What it is |
|---|---|---|---|---|
| PR-Agent (OSS) | ⭐ 12k | Apache-2.0 | AI, self-host | Community-owned flagship; 6 tools, 5 forges, BYO-LLM[1][2] |
| ChatGPT-CodeReview (OSS) | ⭐ 4.4k | ISC | AI, self-host | Minimal GitHub-App PR reviewer, BYO OpenAI key[7] |
| Danger / JS (OSS) | ⭐ 5.7k / 5.5k | MIT | rules, non-AI | PR-convention enforcement (changelog, size, assignees)[6] |
| shippie (OSS) | ⭐ 2.4k | MIT | AI, self-host | ex–code-review-gpt; CI-native LLM reviewer[8] |
| Sourcery (open-core) | ⭐ 1.8k | MIT (shell) | engine proprietary | Repo is IDE plugins only; AI runs in Sourcery's cloud[4] |
| ai-codereviewer (dormant) | ⭐ 1.0k | MIT | AI, self-host | Widely-forked minimal Action; no push since 2024[9] |
Stars verified via GitHub API where reachable (PR-Agent, Sourcery, Danger) on 2026-06-09; others from researcher capture earlier in the run.
Open-source coding agents split into two roles: agents built to write code (changes become reviewable as a side effect) and agents that review diffs directly. The academic anchor is SWE-agent ⭐ 19k (Princeton/Stanford) — a minimal agent-computer interface for autonomously resolving GitHub issues[12]. Its production sibling OpenHands ⭐ 76k (ex-OpenDevin) is the standout that actually reviews: a dedicated PR-review workflow fires on a review-this label or by requesting openhands-agent as reviewer, posting code-quality / security / best-practice feedback in 2–3 minutes[19], plus a GitHub Resolver Action that auto-fixes labeled issues and opens PRs[20]. On SWE-bench Verified with Opus 4.5, OpenHands scores 77.6% vs SWE-agent's 72.0% — "write a paper, use SWE-agent; ship a product, use OpenHands"[11].
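The two triggers described above (a review-this label, or a review request addressed to openhands-agent) can be sketched as a small predicate over GitHub `pull_request` webhook payloads. This is an illustrative sketch, not OpenHands source code: the function name `should_trigger_review` and the constants are hypothetical, and only the payload fields `action`, `label.name` and `requested_reviewer.login` are taken from GitHub's webhook format.

```python
# Hypothetical sketch of label- or reviewer-based triggering.
# Not taken from OpenHands; only the webhook field names come from GitHub.

REVIEW_LABEL = "review-this"        # label named in the text above
REVIEWER_LOGIN = "openhands-agent"  # reviewer account named in the text above


def should_trigger_review(event: dict) -> bool:
    """Return True if a pull_request webhook payload should start a review."""
    action = event.get("action")
    if action == "labeled":
        # "labeled" events carry the label that was just added.
        return (event.get("label") or {}).get("name") == REVIEW_LABEL
    if action == "review_requested":
        # Absent when a team (not a user) was requested, hence the fallback.
        reviewer = event.get("requested_reviewer") or {}
        return reviewer.get("login") == REVIEWER_LOGIN
    return False
```

Keeping the trigger explicit (a label or a named reviewer) means the reviewer runs only when a human asks for it, rather than on every push.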
| Agent | Stars | Role | Reviews diffs? | Notes |
|---|---|---|---|---|
| OpenHands | ⭐ 76k | write + review | ✓ dedicated PR-review workflow | Bash/edit/browser/Jupyter tools; GitHub Resolver[13][19] |
| Cline | ⭐ 63k | write (IDE) | ⚠ diff previews, per-step approval | IDE-native; mandatory human approval each step[15] |
| gpt-engineer (dormant) | ⭐ 55k | write | ✗ | Whole-project codegen; Lovable precursor[17] |
| Aider | ⭐ 46k | write (terminal) | ⚠ git-diff + architect mode | Auto-commits each change; review mode still a feature request[14][21] |
| SWE-agent | ⭐ 19k | write (research) | ✗ (issue-resolution scaffold) | Canonical academic reference; v1 ~43.2% on Verified[12][23] |
| Devika (dormant) | ⭐ 20k | write | ✗ | First OSS "Devin clone"; now stalled[18] |
| gptme | ⭐ 4.3k | write (terminal/CI) | ✗ | Provider-agnostic; runs headless in CI[16] |
Aider also maintains the Polyglot Leaderboard as a multi-language counterweight to SWE-bench, which is 100% Python, with Django supplying nearly half its cases[22]. Frontier SWE-bench Verified scores sat in the 80s in 2026: one leaderboard reading puts GPT-5.5 at ~88.7% (May 2026)[23], while other tallies land nearer ~82%[37], a spread that underscores how much harness and date move the number. The same open scaffold drops to ~43% on a weaker model, so scaffold and model both matter[23].
If you build your own planner → diff-analysis → comment-synthesis pipeline, LangGraph ⭐ 34k is the 2026 production default — repeatedly ranked #1, with the explicit state graph, checkpointing, streaming and human-in-the-loop primitives a multi-stage review bot needs (e.g. security-agent → performance-agent → report-agent)[24][25]. CrewAI ⭐ 53k dominates fast role-based prototyping but teams migrate to LangGraph for production state management[30]. Pydantic AI ⭐ 18k emerged as the type-safe, schema-defined alternative with OpenTelemetry instrumentation and agent-delegation[31][32].
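To make the staged shape concrete, here is a dependency-free sketch of a security-agent → performance-agent → report-agent pipeline with shared state and per-stage checkpoints. It deliberately does not use LangGraph's API: the `ReviewState` schema, the stage functions and `run_pipeline` are all hypothetical, and the string checks stand in for what would be LLM calls in a real review bot.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ReviewState:
    """Shared state handed from stage to stage (hypothetical schema)."""
    diff: str
    findings: list = field(default_factory=list)
    report: str = ""


Stage = Callable[[ReviewState], ReviewState]


def security_agent(state: ReviewState) -> ReviewState:
    # Stand-in for an LLM call: flag one obviously risky pattern.
    if "eval(" in state.diff:
        state.findings.append("security: eval() on untrusted input")
    return state


def performance_agent(state: ReviewState) -> ReviewState:
    # Stand-in for an LLM call: flag a blocking sleep.
    if "time.sleep(" in state.diff:
        state.findings.append("performance: blocking sleep in request path")
    return state


def report_agent(state: ReviewState) -> ReviewState:
    # Synthesize one comment from everything the earlier stages found.
    state.report = "\n".join(f"- {f}" for f in state.findings) or "No findings."
    return state


def run_pipeline(state: ReviewState, stages: list, checkpoints: list) -> ReviewState:
    """Run stages in order, snapshotting state after each one."""
    for stage in stages:
        state = stage(state)
        # A deep copy lets a failed or rejected run resume from this point.
        checkpoints.append((stage.__name__, copy.deepcopy(state)))
    return state


if __name__ == "__main__":
    snapshots = []
    start = ReviewState(diff="+ result = eval(user_input)\n+ time.sleep(5)")
    final = run_pipeline(
        start, [security_agent, performance_agent, report_agent], snapshots
    )
    print(final.report)
```

The per-stage snapshots are the part a framework normally supplies: they are what make it possible to pause for human approval between stages or resume after a failure without re-running earlier agents.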
The big governance story is on Microsoft's side: AutoGen ⭐ 59k (still highest star count) and Semantic Kernel both entered maintenance mode, converging into the new Microsoft Agent Framework ⭐ 11k (public preview Oct 2025, 1.0 GA targeted end-Q1 2026)[26][27][28]. Meanwhile AutoGen's original authors forked it into AG2 ⭐ 4.7k — Apache-2.0, community-governed, retaining the legacy GroupChat API and the original PyPI packages[29].
| Framework | Stars | Fit for review pipelines | Status |
|---|---|---|---|
| AutoGen | ⭐ 59k | Conversational multi-agent (research-style) | maintenance → Agent Framework[26] |
| CrewAI | ⭐ 53k | Role-based crews; fast prototyping | active[30] |
| LangGraph | ⭐ 34k | State graph + HITL — #1 for production review bots | active[24] |
| Pydantic AI | ⭐ 18k | Type-safe, schema-defined; agent delegation | active[31] |
| MS Agent Framework | ⭐ 11k | AutoGen + Semantic Kernel convergence | preview → GA Q1 2026[28] |
| AG2 | ⭐ 4.7k | Legacy AutoGen GroupChat, community-led | active fork[29] |
For evaluation, OpenHands ships a published, reproducible SWE-bench harness (OpenHands/benchmarks) that has become de-facto infrastructure for measuring coding/review agents[33][34]. Harness choice alone can swing scores 15–20 points across Aider, OpenHands, SWE-agent and Plandex[35].
The evaluation landscape splits cleanly into issue-resolution benchmarks (does the patch pass tests?) and review-comment-quality benchmarks (is the comment useful?). Conflating them is the most common analytical error in the space.
| Benchmark | Measures | Scale | Headline numbers |
|---|---|---|---|
| SWE-bench Verified | Issue resolution (test-passing patch) | Real GitHub issues | ~82–88% frontier 2026 (harness/date-dependent)[36][37] |
| SWE-bench Multimodal | Visual/UI issue resolution | 619 tasks, 17 JS repos | collapses to ~12% resolve[38][39] |
| CodeReviewer | Quality est. / comment gen / refinement | ~150k train, 9 langs | Foundational MSFT dataset (2022)[40][41] |
| CRScore (NAACL'25) | Reference-free review-comment quality | ~2.9k annotated scores | 0.54 Spearman w/ humans — best OSS metric[42][43] |
| CodeReviewQA (MSR'25) | Review comprehension (3 reasoning stages) | 900 curated, 9 langs | Decontaminated comprehension probe[44] |
| Martian Code Review Bench | Offline (known issues) + online (accept/reject) | 200k+ PRs, 1.2M+ changes, daily | Vendor-neutral; the 2026 reference[45][46] |
| RepoBench (ICLR'24) | Repo-level completion (retrieval/completion/pipeline) | Python + Java | Context-length stratified[49] |
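For readers unfamiliar with the "Spearman w/ humans" figure quoted for CRScore: it is a rank correlation between the metric's scores and human ratings of the same comments. The sketch below computes it from scratch; the five score pairs are invented for illustration and are not data from any benchmark in the table.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: automatic metric scores vs. human 1-5 ratings.
metric_scores = [0.9, 0.2, 0.5, 0.7, 0.1]
human_ratings = [5, 1, 2, 4, 3]
rho = spearman(metric_scores, human_ratings)
```

Because only the ordering matters, a metric can reach a high rho while being on a completely different numeric scale from the human ratings.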
Beware vendor "#1" claims. They cite different configs and time windows on the same Martian benchmark: Qodo claims #1 at F1 64.3% (its Extended research-preview config; Standard production is F1 47.9%, #4)[47], while CodeRabbit claims #1 at F1 51.2% on the online benchmark over ~300k PRs[48]. Both can be "true" because they rank on different slices.
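When comparing such F1 claims, it helps to recall that F1 is the harmonic mean of precision and recall, so one headline number can hide very different noise profiles. A minimal sketch, using the ~52% precision / ~51% recall cluster cited earlier in this survey; the second precision/recall pair is invented purely for contrast.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Cluster reported for top tools on the independent benchmark.
balanced = f1(0.52, 0.51)

# Invented contrast: a much quieter but less thorough reviewer.
quiet = f1(0.90, 0.36)

print(f"balanced: {balanced:.3f}, quiet: {quiet:.3f}")
```

Both calls land near 0.51, yet the second tool would post far fewer false alarms while missing more real issues, which is why precision and recall are worth reading alongside any F1 ranking.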
Empirical work from 2025–2026 converges on one picture: useful-but-noisy, never trustworthy enough for full automation.
- Verbosity & low adoption. Across 300 OSS projects (278,790 conversations), AI comments were ~7× more verbose than humans (29.6 vs 4.1 tokens/line), >95% concentrated on defect/improvement, and adopted only 16.6% of the time vs 56.5% for humans[56].
- Noise kills trust. A two-phase field study warns that low-value AI comments make reviewers stop reading — and then miss the real issues[51].
- Hallucination is a supply-chain risk. "Slopsquatting": 19.7% of sampled recommended packages were non-existent, 58% repeating across queries; standard AI review carries reported 5–15% false-positive rates[57].
- Security framing is exploitable. "Bug-free" framing degrades vulnerability detection (confirmation bias); an iterative refinement attack hit 100% success across 17 CVEs in 10 projects[54].
- The clear win is narrow. LLMs used to filter static-analysis alarms cut a 76%+ false-positive baseline by 94–98% at Tencent — review as a triage layer, not an oracle[53].
The throughline: hunk-level and manually-triggered tools outperform whole-PR auto-runs (12.8% vs 6.8% addressing rate), and human-in-the-loop is the consistent recommendation[55][52].
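The triage result above can be turned into a rough back-of-envelope figure. The sketch below assumes the 94–98% reduction applies to the count of false alarms and, as a strong simplifying assumption not stated in the source, that the filter discards no true alarms (`true_alarm_retention=1.0`). The function name and parameters are hypothetical.

```python
def residual_false_share(
    baseline_fp_rate: float,
    reduction: float,
    true_alarm_retention: float = 1.0,
) -> float:
    """Share of surviving alarms that are still false after filtering.

    baseline_fp_rate: fraction of raw alarms that are false (e.g. 0.76).
    reduction: fraction of false alarms the filter removes (e.g. 0.94).
    true_alarm_retention: fraction of true alarms kept (assumed 1.0 here).
    """
    false_left = baseline_fp_rate * (1 - reduction)
    true_left = (1 - baseline_fp_rate) * true_alarm_retention
    return false_left / (false_left + true_left)


low = residual_false_share(0.76, 0.94)   # weaker end of the reported range
high = residual_false_share(0.76, 0.98)  # stronger end of the reported range
print(f"false share after filtering: {high:.1%} to {low:.1%}")
```

Under these assumptions, a queue that started at 76% false alarms ends up at roughly 6–16% false, which is the sense in which the model works as a triage layer in front of human reviewers.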
The lineage runs through three architectural eras, then an agentic surge.
- 2020: CodeBERT. First bimodal pre-trained Transformer over code + NL; SOTA code search/documentation. Encoder era[58].
- 2021: CodeT5. Identifier-aware encoder-decoder on 8.35M functions; the seq2seq backbone[59].
- 2022: CodeReviewer. Initialized from CodeT5, pre-trained on real PR data in 9 languages; the pivotal review-specific node[40].
- 2024: Open-weight code LLMs reach parity. DeepSeek-Coder beats CodeLlama-34B and closed Codex/GPT-3.5[60]; StarCoder2 trains on The Stack v2 incl. GitHub PRs[61]; Qwen2.5-Coder-7B beats DeepSeek-Coder-33B, making small self-hostable review viable[62]; DeepSeek-Coder-V2 brings open MoE[63].
- 2025: Agentic surge. Coding-agent adoption hits 22–29% across 128,018 GitHub projects in H1 (Cursor, Claude Code, Codex)[64].
- 2026: Open weights chase the frontier. DeepSeek V3 (late 2024) proved an open-weights lab could match OpenAI/Anthropic on reasoning while releasing weights for free; V4-class MoE models build on that and push self-hosted agentic review toward economic viability[65].
Why open-source review surged now: open-weight code LLMs closed the capability gap (a 7B model in 2024 matched a 33B from months earlier[62]), agent scaffolds became reusable infrastructure[13], and the PR-Agent donation gave the ecosystem an Apache-2.0 flagship to rally around[1]. The constraint is no longer the model — it's that, as the evidence section shows, the review-comment-quality problem (precision, noise, trust) remains stubbornly unsolved[50].