
Which Model Reviews Code Best?

By Factory Research, Nizar Alrifai · April 29, 2026 · 4 minute read



We benchmarked 13 models to find the best price-performance tradeoff for AI code review.

Every code review Droid produces is backed by a model. But which model gives the best results for the cost? The answer matters: it's the difference between catching a null pointer dereference in production and missing it entirely, and between spending $0.15 per PR and $5.63.

We built a benchmark to find out. We tested 13 models across 50 real pull requests from five major open-source projects: Sentry, Grafana, Keycloak, Discourse, and Cal.com. Each model reviewed every PR at least three times, using the same prompts, the same methodology, and the same reasoning effort level ("high" across the board). A human-curated golden set of known bugs served as ground truth, and an LLM judge scored each review against it. One question: which model finds the most real bugs per dollar?

Try it yourself: Run /install-code-review in Droid to configure PR reviews in GitHub or GitLab. /review is also available as a standalone skill.

The Money Chart

The best model isn't the most expensive one. Not even close.

F1 Score vs. Cost per PR

Higher is better on Y-axis, lower is better on X-axis. The best models live in the top-left.

[Scatter chart: each of the 13 models plotted by Factory cost per PR (x-axis, $0-$6) against mean F1 score (y-axis, 30-60%), colored by provider (OpenAI, Anthropic, Google, OSS), with the Pareto frontier drawn through the best price-performance tradeoffs.]

The scatter plot tells the whole story. The top-left quadrant is where you want to be: high F1, low cost. The Pareto frontier (the dashed orange line connecting the best price-performance tradeoffs) runs from MiniMax M2.7 at $0.15/PR up through Gemini 3 Flash, Kimi K2.5, GLM-5.1, and Sonnet 4.6, and finally to GPT-5.2 at $1.25/PR.

But the real story is the cluster in the bottom-left. Models like MiniMax M2.7, Kimi K2.5, and Gemini 3 Flash sit at $0.15-$0.41/PR and score 46-52% F1. That's 10-30x cheaper than frontier models for 75-86% of their quality. At those prices, you can afford to run multiple review passes with broader coverage and different angles, and still pay less than a single run of the expensive models.
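
If you want to recompute the frontier yourself, here is a minimal Python sketch built on the per-PR cost and mean F1 figures from the tables later in this post. The helper is illustrative only; it is not the code that produced the chart.

```python
# Illustrative sketch: recover the Pareto frontier from (cost per PR, mean F1) pairs.
# Numbers are the per-model figures reported in the tables below.
models = {
    "MiniMax M2.7": (0.15, 45.6),
    "Gemini 3 Flash": (0.34, 49.5),
    "Kimi K2.5": (0.41, 51.9),
    "GPT-5.4 Mini": (0.68, 51.5),
    "GLM-5.1": (1.06, 55.8),
    "Sonnet 4.6": (1.15, 57.4),
    "GPT-5.2": (1.25, 60.5),
    "GPT-5.3 Codex": (1.69, 55.7),
    "GPT-5.4": (2.01, 47.5),
    "Gemini 3.1 Pro": (2.04, 52.1),
    "Opus 4.6": (3.11, 59.8),
    "Opus 4.7": (4.18, 55.9),
    "GPT-5.5": (5.63, 47.9),
}

def pareto_frontier(points):
    """A model is on the frontier if no cheaper model matches or beats its F1."""
    frontier, best_f1 = [], float("-inf")
    for name, (cost, f1) in sorted(points.items(), key=lambda kv: kv[1][0]):
        if f1 > best_f1:
            frontier.append(name)
            best_f1 = f1
    return frontier

print(pareto_frontier(models))
# ['MiniMax M2.7', 'Gemini 3 Flash', 'Kimi K2.5', 'GLM-5.1', 'Sonnet 4.6', 'GPT-5.2']
```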

Key Findings

  • GPT-5.2 and Claude Opus 4.6 lead the pack at ~60% F1, but GPT-5.2 does it at $1.25/PR vs $3.11 for Opus.
  • Newer doesn't always mean better. GPT-5.4 (47.5% F1) is too conservative -- high precision (59.6%) but low recall (41.8%), missing bugs it should catch. GPT-5.5 (47.9% F1) has the opposite problem: it comments at the right rate but half are false positives (47.5% precision). Both trail GPT-5.2 significantly.
  • Open-source models punch above their weight. Kimi K2.5 (51.9% F1 at $0.41/PR) and GLM-5.1 (55.8% at $1.06/PR) compete with frontier models at a fraction of the price. For teams that want broader review coverage with multiple passes, these models make intensive workflows economically viable.
  • Cost explains only ~21% of quality variance. Model architecture and training matter far more than token budget.

Our Picks

Best Overall

GPT-5.2

$1.25/PR · 60.5% F1

Top-tier quality at less than half the cost of Opus 4.6.

Best Value

Kimi K2.5

$0.41/PR · 51.9% F1

85%+ of top-tier quality for a fraction of the price.

Budget Pick

MiniMax M2.7

$0.15/PR · 45.6% F1

Run eight review passes for less than one GPT-5.2 run.

Full Rankings

Model Rankings

| # | Model | Mean F1 | Stdev (F1 pts) | Precision | Recall |
|---|-------|---------|----------------|-----------|--------|
| 1 | GPT-5.2 | 60.5% | ±3.0 | 65.0% | 57.6% |
| 2 | Opus 4.6 | 59.8% | ±2.1 | 58.1% | 61.8% |
| 3 | Sonnet 4.6 | 57.4% | ±4.9 | 62.6% | 47.3% |
| 4 | Opus 4.7 | 55.9% | ±3.2 | 62.1% | 54.2% |
| 5 | GLM-5.1 | 55.8% | ±2.8 | 63.5% | 50.7% |
| 6 | GPT-5.3 Codex | 55.7% | ±3.1 | 62.7% | 50.8% |
| 7 | Gemini 3.1 Pro | 52.1% | ±2.4 | 55.4% | 49.4% |
| 8 | Kimi K2.5 | 51.9% | ±1.6 | 71.5% | 40.7% |
| 9 | GPT-5.4 Mini | 51.5% | ±1.7 | 56.6% | 48.1% |
| 10 | Gemini 3 Flash | 49.5% | ±2.2 | 60.1% | 42.8% |
| 11 | GPT-5.5 | 47.9% | ±1.9 | 47.5% | 48.4% |
| 12 | GPT-5.4 | 47.5% | ±1.0 | 59.6% | 41.8% |
| 13 | MiniMax M2.7 | 45.6% | ±4.3 | 59.1% | 43.7% |

Most models are remarkably consistent across runs, with standard deviations under 5 points. GPT-5.4 Mini and Kimi K2.5 stand out with stdev of just 1.7 and 1.6 respectively. If you need predictable quality, consistency matters as much as peak performance.

A note on GPT-5.4 and GPT-5.5: both underperform GPT-5.2, but for opposite reasons. GPT-5.4 is too conservative -- it comments sparingly (2.5 comments/PR against the golden-set average of 3.2) with solid precision (59.6%) but misses too many real bugs (41.8% recall). GPT-5.5 swings the other way: it comments at the right rate (3.5/PR) but roughly half of those comments are false positives (47.5% precision). In our experiments, GPT-5.4 needs explicit severity filters and strict constraints to stay focused, while GPT-5.5 needs tighter validation to filter out noise. Models like GPT-5.2 are more naturally calibrated with general-purpose prompts, which is why they outperform both in a standardized benchmark.
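
To make that concrete, here is a hedged sketch of the kind of severity/confidence post-filter that helps a noisier model. The Finding fields, severity scale, and thresholds are assumptions for illustration, not Droid's review schema or prompts.

```python
# Hypothetical post-filter for a noisy reviewer: keep only findings the model
# rates as high-severity and high-confidence. Field names, the severity scale,
# and the thresholds are illustrative, not Droid's actual review schema.
from dataclasses import dataclass

@dataclass
class Finding:
    file: str
    line: int
    severity: str      # "critical" | "major" | "minor" | "nit"
    confidence: float  # model-reported confidence, 0.0-1.0
    comment: str

SEVERITY_RANK = {"nit": 0, "minor": 1, "major": 2, "critical": 3}

def filter_findings(findings, min_severity="major", min_confidence=0.7):
    """Drop low-severity and low-confidence comments before posting a review."""
    return [
        f for f in findings
        if SEVERITY_RANK[f.severity] >= SEVERITY_RANK[min_severity]
        and f.confidence >= min_confidence
    ]

review = [
    Finding("api/auth.py", 42, "critical", 0.93, "Session token is never revoked on logout."),
    Finding("web/form.tsx", 10, "nit", 0.55, "Prefer a const here."),
]
print(filter_findings(review))  # only the critical, high-confidence finding survives
```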

The Cost Story

Spending more doesn't reliably get you better reviews. Let's look at the numbers.

Cost Efficiency

| Model | Mean F1 | Cost/PR | $ / F1 point | Tokens/PR |
|-------|---------|---------|--------------|-----------|
| MiniMax M2.7 | 45.6% | $0.15 | $0.003 | 56K |
| Gemini 3 Flash | 49.5% | $0.34 | $0.007 | 124K |
| Kimi K2.5 | 51.9% | $0.41 | $0.008 | 152K |
| GPT-5.4 Mini | 51.5% | $0.68 | $0.013 | 252K |
| GLM-5.1 | 55.8% | $1.06 | $0.019 | 2.6M |
| Sonnet 4.6 | 57.4% | $1.15 | $0.020 | 427K |
| GPT-5.2 | 60.5% | $1.25 | $0.021 | 462K |
| GPT-5.3 Codex | 55.7% | $1.69 | $0.030 | 626K |
| GPT-5.4 | 47.5% | $2.01 | $0.042 | 744K |
| Gemini 3.1 Pro | 52.1% | $2.04 | $0.039 | 755K |
| Opus 4.6 | 59.8% | $3.11 | $0.052 | 1.2M |
| Opus 4.7 | 55.9% | $4.18 | $0.075 | 3.1M |
| GPT-5.5 | 47.9% | $5.63 | $0.118 | 4.2M |

The cost-per-F1-point column is revealing. GPT-5.2 delivers top-tier quality at $0.021 per F1 point, while Opus 4.6 costs $0.052 for comparable performance. The most efficient models (GPT-5.4 Mini at $0.013/F1 point, Kimi K2.5 at $0.008/F1 point) deliver serious value at a fraction of the price.
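
For clarity, the $ / F1 point column is simply cost per PR divided by mean F1 measured in percentage points. A quick sanity check against the table:

```python
# "$ / F1 point" = cost per PR / mean F1 (in percentage points).
print(f"GPT-5.2:   ${1.25 / 60.5:.3f} per F1 point")  # ~$0.021
print(f"Opus 4.6:  ${3.11 / 59.8:.3f} per F1 point")  # ~$0.052
print(f"Kimi K2.5: ${0.41 / 51.9:.3f} per F1 point")  # ~$0.008
```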

We use multiplier-based token pricing at Factory, so cost comparisons across models are apples-to-apples, with no hidden subsidies or batch discounts distorting the picture.

Token consumption varies dramatically between models. GPT-5.5 uses 4.2 million tokens per PR, while MiniMax M2.7 uses just 56K. Most of those extra tokens aren't making the review better.

Methodology

Every benchmark run follows the same protocol, with identical prompts and an identical reasoning effort setting ("high") across all 13 models:

  1. Test set: 50 real PRs from five open-source repositories (Sentry, Grafana, Keycloak, Discourse, Cal.com), selected for having non-trivial code changes.
  2. Golden set: Human-curated set of known bugs and issues in each PR, reviewed and validated by engineers.
  3. Model evaluation: Each model reviews every PR independently using the same prompt and configuration. We extract structured findings from each review.
  4. LLM judge: An LLM compares each model's findings against the golden set, scoring matches on a semantic basis (not string matching).
  5. Cross-judge validation: We swapped the judge model to check for self-favoring bias. Impact was ≤2 percentage points. No model got a meaningful home-court advantage.
  6. F1 calculation: For each run, we compute precision (what fraction of the model's findings are real bugs) and recall (what fraction of real bugs the model found). F1 is the harmonic mean of the two; a minimal sketch of the arithmetic follows this list.
  7. Multiple runs: Every model is evaluated 3 times. We report the mean and standard deviation across runs.
  8. Outlier exclusion: Runs where a model clearly malfunctioned (e.g., refused to review, returned empty results) are excluded. We report only legitimate runs.
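
To make step 6 concrete, here is a minimal sketch of the per-run scoring arithmetic, assuming the judge has already decided which model findings match golden-set bugs. The function is illustrative; it is not the harness in review-droid-benchmark.

```python
# Minimal sketch of per-run scoring, given judge-matched findings.
def score_run(num_findings: int, num_golden: int, num_matched: int):
    """num_matched = findings the judge accepted as real golden-set bugs."""
    precision = num_matched / num_findings if num_findings else 0.0
    recall = num_matched / num_golden if num_golden else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Example: 5 comments on a PR with 4 known bugs, 3 of the comments hit real bugs.
p, r, f1 = score_run(num_findings=5, num_golden=4, num_matched=3)
print(f"precision={p:.0%} recall={r:.0%} F1={f1:.0%}")  # precision=60% recall=75% F1=67%
```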

The entire benchmark is open source. The golden set, 50 PRs with human-curated bugs, lives across five repos in the droid-code-review-evals GitHub org: droid-sentry, droid-grafana, droid-keycloak, droid-discourse, and droid-cal_dot_com. The evaluation scripts, raw results, and scoring logic are in review-droid-benchmark.

What's Next

This benchmark powers how we select models for Droid's code review. We re-run it as new models launch and as our evaluation methodology improves. The golden set will grow, partly informed by the very gaps our models helped us discover.

We're also exploring more expensive review strategies with cheaper models. At $0.15/PR, you can run MiniMax M2.7 eight times for less than a single GPT-5.2 run. That opens up approaches that would be prohibitively expensive with frontier models: multi-pass reviews with different prompting strategies, ensemble voting across several runs, or targeted deep-dive passes on high-risk files. Early results suggest these compound strategies can close the gap with top-tier models while keeping costs low.
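
As a rough sketch of what ensemble voting across cheap passes could look like (the location-bucketing dedup key and vote threshold below are assumptions for illustration, not a shipped feature):

```python
# Hypothetical ensemble vote over several cheap review passes: keep a finding only
# if at least `min_votes` independent runs flagged the same file at roughly the
# same location. The location bucketing is a simplification; a real implementation
# would match findings semantically, the way our LLM judge does.
from collections import Counter

def ensemble(runs, min_votes=2, bucket=5):
    votes = Counter()
    for findings in runs:                  # each run: list of (file, line) findings
        seen = set()
        for file, line in findings:
            key = (file, line // bucket)   # coarse location bucket
            if key not in seen:            # at most one vote per run per bucket
                seen.add(key)
                votes[key] += 1
    return [key for key, count in votes.items() if count >= min_votes]

# Example: three MiniMax M2.7 passes over the same (hypothetical) PR.
runs = [
    [("api/auth.py", 42), ("web/form.tsx", 10)],
    [("api/auth.py", 43)],
    [("api/auth.py", 44), ("db/migrate.sql", 7)],
]
print(ensemble(runs))  # [('api/auth.py', 8)] -- the repeated finding survives, one-offs drop out
```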

We also open-sourced Droid Action, a GitHub Action that runs AI code review on every PR. It supports all 13 models benchmarked here. Drop it into any repo and start getting reviews immediately.

Automated code review is a first line of defense, but building software autonomously requires the same rigor at every step: planning, implementation, testing, and deployment. That's what Factory is building: a complete software development system where every stage is automated, verified, and continuously improving. To try this on your own repo, run /install-code-review in Droid to configure PR reviews in GitHub or GitLab.
