AI Code Review · Benchmarks & Evaluation · 2026

Evaluation Rubrics and Benchmarks for AI Code Review

50 citations  ·  expedition  ·  vendor claims vs. independent F1 scores

INDEPENDENT BEST: F1 19.4%
VENDOR CEILING: 82% (self-reported)
4× GAP: same tools, different referees
⚠ 0 of 5 vendor benchmarks independently reproduced
AIDev (3,109 PRs): 12 of 13 agents <60% signal
Ground truth: ~36% noise in CodeReviewer labels
Greptile self: 82%  |  Propel's board: 45%  →  same tool, −37pp
⬤ Independent Research
19%
F1 19.38% — best system on SWRBench[2]
1,000 manually verified PRs · 12 OSS Python repos
Independent researchers · LLM judge agrees with human judges ~90% of the time

AIDev in-the-wild (3,109 PRs):
12 of 13 agents <60% signal[9]
Copilot worst: 19.79% signal ratio
PRs reviewed only by a code review agent (CRA) merge at 45% vs 68% for human-only review
⬤ Vendor Self-Reported
82%
82% catch rate — Greptile's own benchmark[11]
50 bug-fix PRs · dataset designed by winning team
Built, run, and evaluated by Greptile

Others: Qodo 60.1%, Propel 64%, Bito 69.5%,
CodeRabbit 51.2% F1, CodeAnt 51.7% F1
All built their own dataset · all ranked #1
Greptile's published benchmark — they designed the dataset and ranked #1 [11]
01
All Benchmarks Side by Side
Tool / Study | Score | Metric & Definition | Dataset Size | Who Built It | Cross-Vendor Reality Check

Vendor Self-Reports: every vendor designed their own dataset and ranked #1
Greptile (self-rated) | 82% | Catch rate: bug flagged in line-level comment with impact explained | 50 PRs · 5 langs · 5 repos | Greptile [11] | 82% self vs 45% on Propel's board, −37pp [21]
Bito (self-rated) | 69.5% | Coverage: % of known truth-set issues detected | 65 known issues · 5 langs | Bito [17] | Not independently tested
Propel (self-rated) | 64% | F-score: proprietary dataset and harness | proprietary | Propel [21] | Not independently tested
Qodo 2.0 (self-rated) | 60.1% | F1 · recall 56.7%: LLM-injected bugs, LLM-judged hits | 100 PRs · 580 issues · 7 repos | Qodo [15] | Not independently tested
Cursor BugBot (self-rated) | 70%+ | Resolution rate (→52% at launch): AI confirms author fixed flag before merge | BugBench · human-annotated real diffs | Cursor [14] | 70%+ self vs 49% F-score on Propel's board, −21pp [21]
CodeRabbit (vendor-promoted) | 51.2% | F1 (#1 of 10) · recall 53.5%; online: dev acts on comment = TP | ~300k PRs · Martian Bench | Martian (promoted by CodeRabbit) [12] | Dataset & harness vendor-controlled; online signal gameable [13]
CodeAnt (vendor-promoted) | 51.7% | F1 (#3 of 17); online + 50-PR offline gold set | 200k+ PRs · Martian Bench | Martian (promoted by CodeAnt) [13] | Same harness issues as CodeRabbit above

Independent Peer-Reviewed Studies: external dataset, not built by the evaluated vendors
SWRBench (independent) | 19.4% | F1 (best: PR-Review + Gemini-2.5-Pro); range across all tested: 4.87–19.38% | 1,000 PRs · 12 OSS Python repos | Academic researchers [2] | ~90% LLM-human judge agreement · multi-review +43.67% F1 [1]
AIDev In-the-Wild (independent) | <60% | Signal ratio (12 of 13 agents); Copilot worst: 19.79% signal | 3,109 real PRs | Academic researchers [9] | CRA-only merge rate: 45% vs 68% human-only review
CRScore (independent) | 0.54 | Spearman correlation with humans; conciseness · comprehensiveness · relevance | 2,900 human-scored reviews | NAACL 2025 [4] | BLEU fails: valid model review scores as low as 0.046 [4]
CodeRabbit promoting the Martian benchmark where it ranked #1 — Martian's dataset and harness remain vendor-controlled [12]
Cursor BugBot: reports a 70%+ resolution rate on its own BugBench; scores a 49% F-score on Propel's board, which is itself vendor-run [14]
The contamination problem in one number: Models score 80%+ on SWE-bench Verified, then drop to 46–57% on the contamination-resistant SWE-bench Pro — because public PRs leak into training data.[38] Models identify the buggy file from issue text alone up to 76% of the time without repo access.[42] 32.67% of o3 "successes" involved solution text that leaked from the issue description.[43] Every code-review benchmark built on public PRs inherits this risk.
02
Why the Numbers Don't Reconcile — 4 Methodology Fouls
🚩
Tiny synthetic gold sets
Qodo injects bugs with an LLM ("corrupting the diff while preserving functionality"), then grades hits with another LLM.[15] Greptile's set: 50 PRs, one planted bug each.[11] Construct validity — does catching a seeded bug predict real-world usefulness? — is never argued.
🚩
Four incompatible metric definitions
"Catch rate" (≈ recall, Greptile), "F1" (Martian vendors), "resolution rate" (Cursor), "coverage" (Bito) — four different quantities stacked in one ranking.[20] Greptile self-reports 82%; on Propel's board the same tool scores 45%.[21] Same tool, ~half the number, different referee.
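To make the incompatibility concrete, here is a minimal Python sketch of how catch rate (recall), precision, and F1 relate. The function names are illustrative and not taken from any vendor's harness. The 41-of-50 figure follows from Greptile's reported 82% on 50 single-bug PRs; the false-positive counts are hypothetical, since a catch-rate headline does not report them. The last helper backs out the precision implied by a published F1 and recall pair, such as Qodo's 60.1% and 56.7%.

```python
def precision(tp: int, fp: int) -> float:
    """Share of flags that point at a real issue."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of real issues that were flagged ("catch rate")."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def implied_precision(f1_score: float, rec: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given a reported F1 and recall."""
    denom = 2 * rec - f1_score
    if denom <= 0:
        raise ValueError("no valid precision exists for this F1/recall pair")
    return f1_score * rec / denom


# 82% catch rate on 50 single-bug PRs means 41 caught, 9 missed.
catch_rate = recall(tp=41, fn=9)

# Hypothetical false-positive counts: the same catch rate, very different F1.
f1_quiet = f1(precision(41, 9), catch_rate)     # few spurious flags
f1_noisy = f1(precision(41, 100), catch_rate)   # many spurious flags

# Precision implied by a reported F1 of 60.1% with recall of 56.7%.
qodo_precision = implied_precision(0.601, 0.567)

print(f"catch rate {catch_rate:.2f}, F1 quiet {f1_quiet:.2f}, F1 noisy {f1_noisy:.2f}")
print(f"implied precision {qodo_precision:.3f}")
```

A catch rate alone says nothing about how many flags were spurious, which is why it cannot be ranked against an F1 without knowing the false-positive count.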
🚩
Benchmark contamination
Public PR histories leak into training data. On SWE-bench, models identify the buggy file from issue text alone up to 76% of the time, without repo access.[42] Verbatim solution memorization: 11.7–31.6% across models.[42] Any benchmark built on public PRs may be measuring memorization rather than reasoning.
🚩
Near-zero reproducibility
85 LLM-centric ICSE/ASE 2024 papers: 18 shared artifacts, 5 executable, 0 fully reproduced.[44] Vendor benchmarks are self-run, never re-runnable by outsiders.[20] Ground truth is ~36% noise in CodeReviewer data.[8] A 2026 survey of 99 CR-benchmark papers: benchmarks and metrics remain immature.[7]
03
Metric Hierarchy — What to Measure on Your Team
1
Acceptance / Resolved Rate
Did the comment cause a code change? Microsoft's 2015 operational definition of a useful review comment[26] — the "online" signal Martian later industrialized.[13] ~⅓ of 1.5M Microsoft comments were non-useful. Measure on your last 50 merged PRs.
Trusted
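One way to compute this on your own PRs is sketched below in Python. It assumes you have already exported, for each PR, the review comments and the lines touched by commits pushed after each comment; the `ReviewComment` type, the input shapes, and the 3-line `window` are hypothetical choices for illustration, not part of the Microsoft definition or any tool's API.

```python
from dataclasses import dataclass


@dataclass
class ReviewComment:
    pr_number: int
    path: str   # file the comment is attached to
    line: int   # line the comment is attached to


def triggered_change(comment, later_changes, window=3):
    """True if a later commit touched a line near the comment.

    later_changes maps file path -> set of line numbers changed after
    the comment was posted. `window` is an assumed tolerance.
    """
    touched = later_changes.get(comment.path, set())
    return any(abs(comment.line - n) <= window for n in touched)


def acceptance_rate(comments, later_changes_by_pr, window=3):
    """Fraction of comments followed by a nearby code change."""
    if not comments:
        return 0.0
    hits = sum(
        triggered_change(c, later_changes_by_pr.get(c.pr_number, {}), window)
        for c in comments
    )
    return hits / len(comments)


# Hypothetical sample: three comments, one followed by a nearby edit.
comments = [
    ReviewComment(101, "app/db.py", 40),
    ReviewComment(101, "app/db.py", 200),
    ReviewComment(102, "api/views.py", 12),
]
later_changes_by_pr = {
    101: {"app/db.py": {41, 42}},
    102: {"api/views.py": {90}},
}
rate = acceptance_rate(comments, later_changes_by_pr)
print(f"acceptance rate: {rate:.0%}")
```

The proximity window is the main judgment call: too wide and unrelated edits count as acceptance, too narrow and fixes made a few lines away are missed.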
2
Precision / Signal-to-Noise Ratio
What fraction of flags are worth reading? At 15% FPR, ~13 of 15 "critical" weekly flags are wrong — engineers learn to dismiss the channel.[25] 46% of developers already distrust AI accuracy; 66% cite "almost right but not quite" as their top frustration.[23] Typical FPR 5–15%; well-tuned tools 5–8%.[22]
Contextual
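The "13 of 15" figure is a base-rate effect: when truly critical issues are rare, even a modest false positive rate produces mostly wrong flags. The source does not state the inputs behind that number, so the Python sketch below uses one assumed set of values (90 reviewed changes per week, 2 of them truly critical, perfect recall) that happens to reproduce it. Treat these inputs as illustrative only.

```python
def weekly_flags(n_changes, n_true_critical, recall, fpr):
    """Expected true and false 'critical' flags in one week.

    fpr is applied to the changes that are NOT truly critical,
    recall to the ones that are.
    """
    true_flags = recall * n_true_critical
    false_flags = fpr * (n_changes - n_true_critical)
    return true_flags, false_flags


# Assumed inputs, chosen only to illustrate the arithmetic.
true_flags, false_flags = weekly_flags(
    n_changes=90, n_true_critical=2, recall=1.0, fpr=0.15
)
total_flags = true_flags + false_flags
wrong_share = false_flags / total_flags

print(f"{false_flags:.1f} of {total_flags:.1f} flags are wrong ({wrong_share:.0%})")
```

Under these assumptions the tool never misses a critical issue, yet roughly 87% of its critical flags are false alarms. Lowering the false positive rate toward the 5–8% quoted for well-tuned tools reduces the false-flag count proportionally.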
3
F1 / Recall (headline numbers)
What most vendor benchmarks report. Recall failures are invisible — a missed bug looks like a clean PR. High recall with excess output causes developer fatigue and tool abandonment.[45] Recall-led vendors (Qodo, CodeRabbit) advertise finding more; precision-led vendors (Korbit[19], the deprecated Graphite Diamond[18]) advertise fewer false flags. Treat any external F1 as a prior, not a result.
Marketing
04
What a Credible Evaluation Actually Looks Like
05
Established Rubrics — Use These Today
CRScore: conciseness · comprehensiveness · relevance
Reference-free. Grounds evaluation in detected code claims and smells. Krippendorff α 0.85–0.89. Spearman correlation of 0.54 with human judgments. NAACL 2025.[4] Exposes the BLEU failure: a valid review can score as low as 0.046.
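To check how well any automated score tracks your own reviewers, you can compute the same kind of rank correlation on a small sample. Below is a minimal pure-Python sketch of Spearman's coefficient (Pearson correlation on average ranks). The sample scores are hypothetical and are not CRScore data.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson on the average ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical sample: human usefulness ratings vs. an automated metric.
human_scores = [1, 2, 3, 4, 5]
metric_scores = [0.2, 0.1, 0.5, 0.4, 0.9]
rho = spearman(human_scores, metric_scores)
print(f"Spearman rho: {rho:.2f}")
```

Five items are far too few for a stable estimate; the sample is only large enough to show the mechanics.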
useful ↔ triggers nearby code change
Behavioral, binary. 1.5M comments at Microsoft; ~⅓ non-useful. Gold standard for acceptance-rate tracking. Directly usable as your in-house evaluation signal.
Code · Text · Voice · Jargon features
Predicts usefulness via classifier (F1/AUC/MCC). >1 in 3 comments lack utility. Extends the Microsoft definition with feature decomposition for model training.
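Of the three classifier metrics named here, MCC is the least familiar. A minimal Python sketch follows, computed from a confusion matrix of useful vs. non-useful comment predictions. The counts are hypothetical, and returning 0.0 when a row or column of the matrix is empty is a common convention rather than something the source specifies.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than
    chance) to +1 (perfect prediction). Returns 0.0 when undefined.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical usefulness classifier evaluated on 200 comments.
score = mcc(tp=50, fp=10, fn=20, tn=120)
print(f"MCC: {score:.3f}")
```

Unlike F1, MCC also credits correct predictions of the negative class, which matters when more than a third of comments are non-useful.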
correctness · completeness · relevance
Comprehensiveness-aware. Uses SNR as proxy for developer trust — high recall with excess output causes fatigue.[45] LLM-as-judge with cross-provider judge recommended.[35]
06
Series: AI-Assisted Code Review in 2026