50 citations · expedition · vendor claims vs. independent F1 scores
| Tool / Study | Score | Metric & Definition | Dataset Size | Who Built It | Cross-Vendor Reality Check |
|---|---|---|---|---|---|
| **Vendor Self-Reports:** every vendor designed their own dataset and ranked #1 | | | | | |
| | 82% | Catch rate: bug flagged in a line-level comment with impact explained | 50 PRs · 5 langs · 5 repos | Greptile [11] | 82% self vs 45% on Propel's board (−37 pp) [21] |
| | 69.5% | Coverage: % of known truth-set issues detected | 65 known issues · 5 langs | Bito [17] | Not independently tested |
| | 64% | F-score: proprietary dataset and harness | proprietary | Propel [21] | Not independently tested |
| | 60.1% | F1 · recall 56.7%: LLM-injected bugs, LLM-judged hits | 100 PRs · 580 issues · 7 repos | Qodo [15] | Not independently tested |
| | 70%+ | Resolution rate (52% at launch): AI confirms author fixed flag before merge | BugBench · human-annotated real diffs | Cursor [14] | 70%+ self vs 49% F-score on Propel's board (−21 pp) [21] |
| | 51.2% | F1 (#1 of 10) · recall 53.5%; online: dev acts on comment = TP | ~300k PRs · Martian Bench | Martian (promoted by CodeRabbit) [12] | Dataset & harness vendor-controlled; online signal gameable [13] |
| | 51.7% | F1 (#3 of 17): online + 50-PR offline gold set | 200k+ PRs · Martian Bench | Martian (promoted by CodeAnt) [13] | Same harness issues as CodeRabbit above |
| **Independent Peer-Reviewed Studies:** external dataset, not built by the evaluated vendors | | | | | |
| | 19.4% | F1 (best: PR-Review + Gemini-2.5-Pro); range across all tested: 4.87–19.38% | 1,000 PRs · 12 OSS Python repos | Academic researchers [2] | ~90% LLM-human judge agreement · multi-review +43.67% F1 [1] |
| | <60% | Signal ratio (12 of 13 agents); Copilot worst: 19.79% signal | 3,109 real PRs | Academic researchers [9] | CRA-only merge rate: 45% vs 68% human-only review |
| | 0.54 | Spearman correlation with humans: conciseness · comprehensiveness · relevance | 2,900 human-scored reviews | NAACL 2025 [4] | BLEU fails: valid model review scores as low as 0.046 [4] |
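
Two rows report both F1 and recall, so the precision they leave unstated can be recovered from the definition F1 = 2PR / (P + R). The sketch below is a reading aid, not part of any vendor's or study's harness. The helper names `implied_precision` and `pp_gap` are hypothetical. The inputs are the rounded figures from the table, so the results are approximate.

```python
def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, with F1 and R given as fractions."""
    denom = 2 * recall - f1
    if denom <= 0:
        # F1 can never reach 2R, so such a pair is inconsistent.
        raise ValueError("inconsistent inputs: F1 must be below 2 * recall")
    return f1 * recall / denom


def pp_gap(self_reported: float, external: float) -> float:
    """Percentage-point difference between an external and a self-reported score."""
    return external - self_reported


# (F1, recall) pairs copied from the table rows that report both (rounded values).
reported = {
    "Qodo [15]": (0.601, 0.567),
    "Martian Bench [12]": (0.512, 0.535),
}

for name, (f1, recall) in reported.items():
    print(f"{name}: implied precision = {implied_precision(f1, recall):.1%}")

# Cross-vendor gaps from the last column, in percentage points.
print("Greptile gap:", pp_gap(82, 45))
print("Cursor gap:", pp_gap(70, 49))
```

With the table's figures this gives an implied precision of roughly 64% for the Qodo row and roughly 49% for the Martian row. The Cursor gap compares a resolution rate with an F-score, so it should be read as indicative and not as a like-for-like difference.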