AI Code Review · Benchmarks & Evaluation · 2026

Evaluation Rubrics and Benchmarks for AI Code Review

50 citations · expedition · vendor claims vs. independent F1 scores

INDEPENDENT BEST: F1 19.4% · VENDOR CEILING: 82% (self-reported) · 4× GAP: same tools, different referees
⚠ 0 of 5 vendor benchmarks independently reproduced
AIDev (3,109 PRs): 12 of 13 agents <60% signal
Ground truth: ~36% noise in CodeReviewer labels
Greptile self: 82% | Propel's board: 45% → same tool, −37pp

⬤ Independent Research

19%

F1 19.38% — best system on SWRBench^[2]
1,000 manually verified PRs · 12 OSS Python repos
Independent researchers · ~90% human-judge agreement

AIDev in-the-wild (3,109 PRs):
12 of 13 agents <60% signal^[9]
Copilot worst: 19.79% signal ratio
PRs reviewed only by a code review agent (CRA) merge at 45% vs 68% for human-only review

4× gap (82% vendor ceiling vs 19.4% independent best)

⬤ Vendor Self-Reported

82%

82% catch rate — Greptile's own benchmark^[11]
50 bug-fix PRs · dataset designed by winning team
Built, run, and evaluated by Greptile

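Others: Qodo 60.1%, Propel 64%, Bito 69.5%,
CodeRabbit 51.2% F1, CodeAnt 51.7% F1
Qodo, Propel, Bito: each built its own dataset and ranked #1 on it
CodeRabbit, CodeAnt: Martian's dataset (CodeRabbit #1 of 10, CodeAnt #3 of 17)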

Figure: Greptile's self-published benchmark graphic showing an 82% catch rate. Greptile designed the dataset and ranked #1 ^[11]

All Benchmarks Side by Side

Tool / Study	Score	Metric: Definition	Dataset Size	Who Built It	Cross-Vendor Reality Check
Vendor Self-Reports and Vendor-Promoted Results: self-rated vendors designed their own dataset and ranked #1 on it
Greptile (self-rated)	82%	Catch rate: bug flagged in a line-level comment with impact explained	50 PRs · 5 langs · 5 repos	Greptile ^[11]	82% self vs 45% on Propel's board (−37pp)^[21]
Bito (self-rated)	69.5%	Coverage: % of known truth-set issues detected	65 known issues · 5 langs	Bito ^[17]	Not independently tested
Propel (self-rated)	64%	F-score: proprietary dataset and harness	proprietary	Propel ^[21]	Not independently tested
Qodo 2.0 (self-rated)	60.1%	F1 (recall 56.7%): LLM-injected bugs, LLM-judged hits	100 PRs · 580 issues · 7 repos	Qodo ^[15]	Not independently tested
Cursor BugBot (self-rated)	70%+	Resolution rate (up from 52% at launch): AI confirms the author fixed the flagged issue before merge	BugBench · human-annotated real diffs	Cursor ^[14]	70%+ self vs 49% F-score on Propel's board (−21pp)^[21]
CodeRabbit (vendor-promoted)	51.2%	F1 (#1 of 10), recall 53.5%: online signal, a developer acting on a comment counts as a TP	~300k PRs · Martian Bench	Martian (promoted by CodeRabbit) ^[12]	Dataset & harness vendor-controlled; online signal gameable^[13]
CodeAnt (vendor-promoted)	51.7%	F1 (#3 of 17): online signal plus a 50-PR offline gold set	200k+ PRs · Martian Bench	Martian (promoted by CodeAnt) ^[13]	Same harness issues as CodeRabbit above
Independent Peer-Reviewed Studies: external dataset, not built by the evaluated vendors
SWRBench (independent)	19.4%	F1 (best: PR-Review + Gemini-2.5-Pro): range across all tested systems 4.87–19.38%	1,000 PRs · 12 OSS Python repos	Academic researchers ^[2]	~90% LLM-human judge agreement · multi-review +43.67% F1^[1]
AIDev In-the-Wild (independent)	<60%	Signal ratio (12 of 13 agents): Copilot worst at 19.79% signal	3,109 real PRs	Academic researchers ^[9]	CRA-only merge rate: 45% vs 68% human-only review
CRScore (independent)	0.54	Spearman correlation with humans: conciseness · comprehensiveness · relevance	2,900 human-scored reviews	NAACL 2025 ^[4]	BLEU fails: valid model review scores as low as 0.046^[4]

Figure: CodeRabbit blog post header promoting the Martian benchmark, where it ranked #1. Martian's dataset and harness remain vendor-controlled ^[12]

Cursor BugBot: reports a 70%+ resolution rate on its own BugBench; scores 49% on the board run by Propel, a competing vendor ^[14]

The contamination problem in one number: Models score 80%+ on SWE-bench Verified, then drop to 46–57% on the contamination-resistant SWE-bench Pro — because public PRs leak into training data.^[38] Models identify the buggy file from issue text alone up to 76% of the time without repo access.^[42] 32.67% of o3 "successes" involved solution text that leaked from the issue description.^[43] Every code-review benchmark built on public PRs inherits this risk.

Why the Numbers Don't Reconcile — 4 Methodology Fouls

🚩

Tiny synthetic gold sets

Qodo injects bugs with an LLM ("corrupting the diff while preserving functionality"), then grades hits with another LLM.^[15] Greptile's set: 50 PRs, one planted bug each.^[11] Construct validity — does catching a seeded bug predict real-world usefulness? — is never argued.

🚩

Four incompatible metric definitions

"Catch rate" (≈ recall, Greptile), "F1" (Martian vendors), "resolution rate" (Cursor), "coverage" (Bito) — four different quantities stacked in one ranking.^[20] Greptile self-reports 82%; on Propel's board the same tool scores 45%.^[21] Same tool, ~half the number, different referee.

🚩

Benchmark contamination

Public PR histories leak into training. On SWE-bench, models hit 76% accuracy identifying the buggy file from issue text alone, without repo access.^[42] Verbatim solution memorization: 11.7–31.6% across models.^[42] Any benchmark built on public PRs risks measuring memorization, not reasoning.
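One common mitigation is a time split: evaluate only on PRs merged after the model's training cutoff. The sketch below is a minimal illustration, not any benchmark's actual harness. `PullRequest`, `time_split`, the PR numbers, the cutoff date, and the 30-day safety buffer are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical record: just enough fields to do a time split."""
    number: int
    merged_on: date


def time_split(prs, training_cutoff, buffer_days=30):
    """Keep PRs merged strictly after the cutoff plus a safety buffer.

    The buffer is an assumption: published cutoffs are approximate, so a
    margin reduces the chance of keeping PRs the model may have seen.
    """
    threshold = training_cutoff + timedelta(days=buffer_days)
    return [pr for pr in prs if pr.merged_on > threshold]


# Invented PRs and an invented cutoff date, for illustration only.
prs = [
    PullRequest(101, date(2024, 11, 2)),
    PullRequest(102, date(2025, 3, 15)),
    PullRequest(103, date(2025, 7, 1)),
]
fresh = time_split(prs, training_cutoff=date(2025, 3, 1))
print([pr.number for pr in fresh])
```

A merge date after the cutoff does not guarantee freshness (the underlying issue may have been discussed publicly earlier), so a filter like this narrows the contamination risk rather than removing it.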

🚩

Near-zero reproducibility

85 LLM-centric ICSE/ASE 2024 papers: 18 shared artifacts, 5 executable, 0 fully reproduced.^[44] Vendor benchmarks are self-run, never re-runnable by outsiders.^[20] Ground truth is ~36% noise in CodeReviewer data.^[8] A 2026 survey of 99 CR-benchmark papers: benchmarks and metrics remain immature.^[7]

Metric Hierarchy — What to Measure on Your Team

Acceptance / Resolved Rate (tier: Trusted)

Did the comment cause a code change? This is Microsoft's 2015 operational definition of a useful review comment,^[26] the "online" signal Martian later industrialized.^[13] ~⅓ of 1.5M Microsoft comments were non-useful. Measure it on your last 50 merged PRs.
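A minimal sketch of that in-house measurement follows. It assumes a human has already labeled each AI comment as having led to a code change or not. `ReviewComment` and the sample labels are hypothetical, and the Wilson score interval is an addition of this sketch (not part of the Microsoft definition) to show how wide the uncertainty is on a small sample.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class ReviewComment:
    """Hypothetical record of one AI review comment after human adjudication."""
    pr_number: int
    led_to_change: bool  # did a related code change follow the comment?


def acceptance_rate(comments):
    """Fraction of comments that led to a code change."""
    if not comments:
        raise ValueError("no comments to score")
    return sum(c.led_to_change for c in comments) / len(comments)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Invented labels for illustration only.
comments = [
    ReviewComment(101, True),
    ReviewComment(101, False),
    ReviewComment(102, True),
    ReviewComment(103, True),
]
rate = acceptance_rate(comments)
low, high = wilson_interval(sum(c.led_to_change for c in comments), len(comments))
print(f"acceptance rate: {rate:.0%} (approx. 95% interval {low:.0%} to {high:.0%})")
```

On a handful of comments the interval is very wide, which is a reason to report it alongside the rate when comparing tools on a 50-PR sample.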

Precision / Signal-to-Noise Ratio (tier: Contextual)

What fraction of flags are worth reading? At a 15% false-positive rate (FPR), ~13 of 15 "critical" weekly flags are wrong, which implies that genuinely critical issues are rare relative to the volume of clean code scanned; engineers learn to dismiss the channel.^[25] 46% of developers already distrust AI accuracy; 66% cite "almost right but not quite" as their top frustration.^[23] Typical FPR 5–15%; well-tuned tools 5–8%.^[22]
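The base-rate arithmetic behind a claim like this can be sketched as follows. The inputs (100 reviewed changes per week, 2% of them containing a real critical issue, perfect recall) are assumptions chosen for illustration, not the cited source's figures, and `flag_breakdown` is a hypothetical helper.

```python
def flag_breakdown(n_changes, prevalence, recall, false_positive_rate):
    """Return (true flags, false flags, precision) for one review period.

    prevalence: share of changes that contain a real critical issue.
    recall: share of real issues the tool flags.
    false_positive_rate: share of clean changes the tool flags anyway.
    """
    real = n_changes * prevalence
    clean = n_changes - real
    true_flags = real * recall
    false_flags = clean * false_positive_rate
    total = true_flags + false_flags
    if total == 0:
        raise ValueError("the tool raised no flags")
    return true_flags, false_flags, true_flags / total


# Assumed inputs, for illustration only.
true_flags, false_flags, precision = flag_breakdown(
    n_changes=100, prevalence=0.02, recall=1.0, false_positive_rate=0.15
)
print(f"true flags: {true_flags:.1f}, false flags: {false_flags:.1f}, "
      f"precision: {precision:.0%}")
```

Under these assumed inputs roughly 2 flags are real and about 15 are false, so precision lands near 12% even though the tool misses nothing. That is the same order of magnitude as the cited "13 of 15", though the source's own inputs are not stated here.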

F1 / Recall, the headline numbers (tier: Marketing)

What most vendor benchmarks report. Recall failures are invisible: a missed bug looks like a clean PR. High recall with excess output causes developer fatigue and tool abandonment.^[45] Recall-led vendors (Qodo, CodeRabbit) advertise finding more; precision-led vendors (Korbit^[19], the deprecated Graphite Diamond^[18]) advertise fewer false flags. Treat any external F1 as a prior, not a result.
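To see why a recall-style "catch rate" and an F1 score cannot be ranked side by side, the sketch below computes both from one set of confusion counts. The counts are invented for illustration (they are not any vendor's results), and `ReviewCounts` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewCounts:
    """Confusion counts for one tool on one labeled set (made-up numbers)."""
    true_positives: int   # real issues the tool flagged
    false_positives: int  # flags that matched no real issue
    false_negatives: int  # real issues the tool missed


def recall(c):
    """Share of real issues that were flagged (what 'catch rate' approximates)."""
    real = c.true_positives + c.false_negatives
    return c.true_positives / real if real else 0.0


def precision(c):
    """Share of flags that pointed at a real issue."""
    flagged = c.true_positives + c.false_positives
    return c.true_positives / flagged if flagged else 0.0


def f1(c):
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Invented counts: 50 known issues, 41 caught, 120 noisy flags.
counts = ReviewCounts(true_positives=41, false_positives=120, false_negatives=9)
print(f"catch rate (recall): {recall(counts):.1%}")
print(f"precision:           {precision(counts):.1%}")
print(f"F1:                  {f1(counts):.1%}")
```

With these made-up counts the tool catches 82% of the known issues, yet its F1 is below 40% because most of its flags are noise. A headline built on recall alone and a headline built on F1 can therefore describe the same run very differently.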

What a Credible Evaluation Actually Looks Like

✓
Fresh, time-split data. Use PRs post-dating the model's training cutoff (the SWE-rebench approach^[43]) so that you measure reasoning, not memorization. The Verified→Pro drop (80%+ → 46–57%) shows how much this matters.^[38]
✓
A shared, versioned, re-runnable harness. The opposite of the self-run vendor norm. Of 85 ICSE/ASE 2024 LLM papers: 18 shared artifacts, 5 executable, 0 fully reproduced.^[44]
✓
Blind human adjudication with reported inter-rater agreement. Score CRScore-style dimensions (conciseness · comprehensiveness · relevance)^[3] and report Krippendorff's α, which handles missing data, any number of raters, and nominal, ordinal, or interval scales.^[50] Best-Worst Scaling reaches equivalent reliability with ~30% of the labeling effort of Likert scales (ρ=0.98 vs 0.95, p<.001).^[48] High inter-annotator agreement (IAA) confirms consistency, not construct validity.^[50]
✓
Signal-to-noise / acceptance rate as the headline metric. Trust, not coverage, is what determines whether the tool survives contact with a team.^[45] A single F1 number hides the only thing that matters operationally: what fraction of what the bot says is worth reading.^[26]
✓
Evaluated on your own recent repos. The only benchmark that fully transfers is your last 50 merged PRs, with you adjudicating whether each AI comment would have helped. Every external number is a prior, not a result. Nearly every SE benchmark has at least one construct-validity weakness — the score doesn't support the claim made about real capability.^[47]
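For the inter-rater agreement item in the checklist above, the sketch below computes Krippendorff's α for nominal labels using the standard coincidence-matrix formulation. It is a minimal illustration under stated assumptions: labels are nominal (for example "useful" vs "not"), missing ratings are `None`, and the sample ratings are invented. Ordinal or interval data need a different distance function, which this sketch does not cover.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: one list per rated item, holding each rater's label
    (None marks a missing rating).
    """
    coincidences = Counter()
    for ratings in units:
        values = [v for v in ratings if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item rated by fewer than two raters cannot be paired
        # Every ordered pair of ratings on the same item counts 1 / (m - 1).
        for a, b in permutations(range(m), 2):
            coincidences[(values[a], values[b])] += 1 / (m - 1)

    totals = Counter()
    for (label, _other), weight in coincidences.items():
        totals[label] += weight
    n = sum(totals.values())

    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    cross = sum(totals[c] * totals[k] for c in totals for k in totals if c != k)
    if n <= 1 or cross == 0:
        raise ValueError("alpha is undefined: too few pairs or only one label used")
    # alpha = 1 - D_o / D_e, with D_o = observed / n and D_e = cross / (n * (n - 1))
    return 1 - observed * (n - 1) / cross


# Invented ratings from three hypothetical raters, for illustration only.
ratings = [
    ["useful", "useful", "useful"],
    ["useful", "not", None],
    ["not", "not", "not"],
    ["not", "not", None],
    ["useful", "useful", "not"],
]
print(f"alpha = {krippendorff_alpha_nominal(ratings):.3f}")
```

As the checklist notes, a high α only shows that raters agree with each other; it says nothing about whether the rubric measures what matters.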

Established Rubrics — Use These Today

CRScore ^[4]

conciseness · comprehensiveness · relevance

Reference-free. Grounds evaluation in detected code claims and smells. Krippendorff α 0.85–0.89. Spearman 0.54 with humans. NAACL 2025. Exposes the BLEU failure: a valid review can score as low as 0.046.
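If you want to check how well any automated score tracks your own reviewers, the same statistic (Spearman rank correlation) can be computed in a few lines. The sketch below is a generic, dependency-free implementation with average ranks for ties. It is not CRScore's code, and the scores in the example are invented.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores: an automated metric vs. human ratings for six reviews.
metric_scores = [0.91, 0.40, 0.75, 0.10, 0.66, 0.52]
human_ratings = [5, 2, 3, 1, 4, 4]
print(f"Spearman rho = {spearman(metric_scores, human_ratings):.2f}")
```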

Microsoft Usefulness ^[26]

useful ↔ triggers nearby code change

Behavioral, binary. 1.5M comments at Microsoft; ~⅓ non-useful. Gold standard for acceptance-rate tracking. Directly usable as your in-house evaluation signal.

Hold On! (2025) ^[27]

Code · Text · Voice · Jargon features

Predicts usefulness via classifier (F1/AUC/MCC). >1 in 3 comments lack utility. Extends the Microsoft definition with feature decomposition for model training.

CodeFuse-CR-Bench ^[10]

correctness · completeness · relevance

Comprehensiveness-aware. Uses SNR as proxy for developer trust — high recall with excess output causes fatigue.^[45] LLM-as-judge with cross-provider judge recommended.^[35]
