Which Model Reviews Code Best?
By Factory Research, Nizar Alrifai - April 29, 2026 - 4 minute read
We benchmarked 13 models to find the best price-performance tradeoff for AI code review.
Every code review Droid produces is backed by a model. But which model gives the best results for the cost? The answer matters: it's the difference between catching a null pointer dereference in production and missing it entirely, and between spending $0.15 per PR and $5.63.
We built a benchmark to find out. We tested 13 models across 50 real pull requests from five major open-source projects: Sentry, Grafana, Keycloak, Discourse, and Cal.com. Each model reviewed every PR at least three times, using the same prompts, the same methodology, and the same reasoning effort level ("high" across the board). A human-curated golden set of known bugs served as ground truth, and an LLM judge scored each review against it. One question: which model finds the most real bugs per dollar?
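To make the scoring concrete, here is a minimal sketch of how a single review could be scored against the golden set. It is illustrative rather than the actual harness: judge_match stands in for the LLM judge, and the data shapes are assumptions.

```python
def score_review(review_comments, golden_bugs, judge_match):
    """Score one review of one PR against its golden set of known bugs.

    judge_match(comment, bug) -> bool stands in for the LLM judge that
    decides whether a review comment identifies a given golden bug.
    """
    matched = set()
    for comment in review_comments:
        # Credit each golden bug at most once, even if several comments flag it.
        hit = next((bug for bug in golden_bugs
                    if bug not in matched and judge_match(comment, bug)), None)
        if hit is not None:
            matched.add(hit)

    tp = len(matched)
    precision = tp / len(review_comments) if review_comments else 0.0
    recall = tp / len(golden_bugs) if golden_bugs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```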
Try it yourself: Run /install-code-review in Droid to configure PR reviews in GitHub or GitLab. /review is also available as a standalone skill.
The best model isn't the most expensive one. Not even close.
Higher is better on the Y-axis (F1), lower is better on the X-axis (cost per PR). The best models live in the top-left.
The scatter plot tells the whole story. The top-left quadrant is where you want to be: high F1, low cost. The Pareto frontier (the dashed orange line connecting the best price-performance tradeoffs) runs from MiniMax M2.7 at $0.15/PR up through Kimi K2.5, and finally to GPT-5.2 at $1.25/PR.
But the real story is the cluster in the bottom-left. Models like MiniMax M2.7, Kimi K2.5, and Gemini 3 Flash sit at $0.15-$0.41/PR and score 46-52% F1. That's 10-30x cheaper than frontier models for 75-86% of their quality. At those prices, you can afford to run multiple review passes with broader coverage and different angles, and still pay less than a single run of the expensive models.
- GPT-5.2: Top-tier quality at half the cost of Opus 4.6.
- Kimi K2.5: 85%+ of top-tier quality for a fraction of the price.
- MiniMax M2.7: Run eight review passes for less than one GPT-5.2 run.
| # | Model | Mean F1 | Stdev | Precision | Recall |
|---|---|---|---|---|---|
| 1 | GPT-5.2 | 60.5% | ±3.0 | 65.0% | 57.6% |
| 2 | Opus 4.6 | 59.8% | ±2.1 | 58.1% | 61.8% |
| 3 | Sonnet 4.6 | 57.4% | ±4.9 | 62.6% | 47.3% |
| 4 | Opus 4.7 | 55.9% | ±3.2 | 62.1% | 54.2% |
| 5 | GLM-5.1 | 55.8% | ±2.8 | 63.5% | 50.7% |
| 6 | GPT-5.3 Codex | 55.7% | ±3.1 | 62.7% | 50.8% |
| 7 | Gemini 3.1 Pro | 52.1% | ±2.4 | 55.4% | 49.4% |
| 8 | Kimi K2.5 | 51.9% | ±1.6 | 71.5% | 40.7% |
| 9 | GPT-5.4 Mini | 51.5% | ±1.7 | 56.6% | 48.1% |
| 10 | Gemini 3 Flash | 49.5% | ±2.2 | 60.1% | 42.8% |
| 11 | GPT-5.5 | 47.9% | ±1.9 | 47.5% | 48.4% |
| 12 | GPT-5.4 | 47.5% | ±1.0 | 59.6% | 41.8% |
| 13 | MiniMax M2.7 | 45.6% | ±4.3 | 59.1% | 43.7% |
Most models are remarkably consistent across runs, with standard deviations under 5 points. GPT-5.4 Mini and Kimi K2.5 stand out with stdev of just 1.7 and 1.6 respectively. If you need predictable quality, consistency matters as much as peak performance.
A note on GPT-5.4 and GPT-5.5: both underperform GPT-5.2, but for opposite reasons. GPT-5.4 is too conservative: it comments sparingly (2.5 comments/PR vs the golden average of 3.2) with decent precision (59.6%) but misses too many real bugs (41.8% recall). GPT-5.5 swings the other way: it comments at the right rate (3.5/PR), but nearly half of its comments are false positives (47.5% precision). In our experiments, GPT-5.4 needs explicit severity filters and strict constraints to stay focused, while GPT-5.5 needs tighter validation to filter out noise. Models like GPT-5.2 are more naturally calibrated with general-purpose prompts, which is why they outperform both in a standardized benchmark.
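For illustration, the kind of model-specific steering described above might look like the sketch below in a production setup (the benchmark itself used identical prompts for every model). The model keys and the addendum text are hypothetical, not Factory's actual prompts.

```python
# Hypothetical per-model prompt addenda; not Factory's actual configuration.
MODEL_PROMPT_ADDENDA = {
    # Too conservative: push it to surface findings, then filter by severity.
    "gpt-5.4": (
        "Report every plausible defect, then keep only findings of severity "
        "MEDIUM or higher. Do not drop a finding just because you are unsure."
    ),
    # Too noisy: force a validation step before a comment is emitted.
    "gpt-5.5": (
        "Before posting a comment, re-read the diff and confirm the issue is "
        "reachable from the changed code. Discard any comment you cannot tie "
        "to a concrete line."
    ),
}

def build_review_prompt(base_prompt: str, model: str) -> str:
    """Append model-specific calibration guidance when one is defined."""
    addendum = MODEL_PROMPT_ADDENDA.get(model)
    return f"{base_prompt}\n\n{addendum}" if addendum else base_prompt
```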
Spending more doesn't reliably get you better reviews. Let's look at the numbers.
| Model | Mean F1 | Cost/PR | $/F1 Point | Tokens/PR |
|---|---|---|---|---|
| MiniMax M2.7 | 45.6% | $0.15 | $0.003 | 56K |
| Gemini 3 Flash | 49.5% | $0.34 | $0.007 | 124K |
| Kimi K2.5 | 51.9% | $0.41 | $0.008 | 152K |
| GPT-5.4 Mini | 51.5% | $0.68 | $0.013 | 252K |
| GLM-5.1 | 55.8% | $1.06 | $0.019 | 2.6M |
| Sonnet 4.6 | 57.4% | $1.15 | $0.020 | 427K |
| GPT-5.2 | 60.5% | $1.25 | $0.021 | 462K |
| GPT-5.3 Codex | 55.7% | $1.69 | $0.030 | 626K |
| GPT-5.4 | 47.5% | $2.01 | $0.042 | 744K |
| Gemini 3.1 Pro | 52.1% | $2.04 | $0.039 | 755K |
| Opus 4.6 | 59.8% | $3.11 | $0.052 | 1.2M |
| Opus 4.7 | 55.9% | $4.18 | $0.075 | 3.1M |
| GPT-5.5 | 47.9% | $5.63 | $0.118 | 4.2M |
The cost-per-F1-point column is revealing. GPT-5.2 delivers top-tier quality at $0.021 per F1 point, while Opus 4.6 costs $0.052 for comparable performance. The most efficient models (GPT-5.4 Mini at $0.013/F1 point, Kimi K2.5 at $0.008/F1 point) deliver serious value at a fraction of the price.
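For reference, dollars per F1 point is simply cost per PR divided by mean F1 (in percentage points); a quick check against two rows of the table:

```python
# Cost efficiency: dollars spent per F1 point, computed from the table above.
rows = {
    "GPT-5.2": {"cost_per_pr": 1.25, "mean_f1": 60.5},
    "Kimi K2.5": {"cost_per_pr": 0.41, "mean_f1": 51.9},
}
for name, r in rows.items():
    print(f"{name}: ${r['cost_per_pr'] / r['mean_f1']:.3f} per F1 point")
# GPT-5.2: $0.021 per F1 point
# Kimi K2.5: $0.008 per F1 point
```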
We use multiplier-based token pricing at Factory, so cost comparisons across models are apples-to-apples, with no hidden subsidies or batch discounts distorting the picture.
Token consumption varies dramatically between models. GPT-5.5 uses 4.2 million tokens per PR, while MiniMax M2.7 uses just 56K. Most of those extra tokens aren't making the review better.
Every benchmark run follows the same protocol: identical prompts, identical methodology, and identical reasoning effort ("high") across all 13 models.
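In rough pseudo-Python, the protocol looks like the sketch below. run_review and score_review are placeholders for the real harness, and the aggregation details are assumptions; the actual scripts live in the benchmark repo linked next.

```python
from statistics import mean, stdev

RUNS_PER_PR = 3            # each model reviews every PR at least three times
REASONING_EFFORT = "high"  # held constant across all 13 models
SHARED_PROMPT = "<the same review prompt, used verbatim for every model>"

def benchmark(models, pull_requests, run_review, score_review):
    """Rough shape of the benchmark loop; not the actual evaluation scripts."""
    results = {}
    for model in models:
        f1_scores = []
        for pr in pull_requests:
            for _ in range(RUNS_PER_PR):
                review = run_review(model=model, pr=pr, prompt=SHARED_PROMPT,
                                    reasoning_effort=REASONING_EFFORT)
                f1_scores.append(score_review(review, pr))
        results[model] = {"mean_f1": mean(f1_scores), "stdev": stdev(f1_scores)}
    return results
```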
The entire benchmark is open source. The golden set, 50 PRs with human-curated bugs, lives across five repos in the droid-code-review-evals GitHub org: droid-sentry, droid-grafana, droid-keycloak, droid-discourse, and droid-cal_dot_com. The evaluation scripts, raw results, and scoring logic are in review-droid-benchmark.
This benchmark powers how we select models for Droid's code review. We re-run it as new models launch and as our evaluation methodology improves. The golden set will grow, partly informed by the very gaps our models helped us discover.
We're also exploring more expensive review strategies with cheaper models. At $0.15/PR, you can run MiniMax M2.7 eight times for less than a single GPT-5.2 run. That opens up approaches that would be prohibitively expensive with frontier models: multi-pass reviews with different prompting strategies, ensemble voting across several runs, or targeted deep-dive passes on high-risk files. Early results suggest these compound strategies can close the gap with top-tier models while keeping costs low.
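A minimal sketch of the ensemble idea, assuming a run_review call that returns a list of findings and a same_finding helper that decides whether two findings describe the same bug; none of these names are Droid's actual API, and the vote threshold is arbitrary:

```python
def ensemble_review(pr, run_review, same_finding, passes=8, min_votes=3):
    """Keep only findings that at least `min_votes` of `passes` independent runs agree on."""
    votes = []  # list of (representative finding, vote count) pairs
    for _ in range(passes):
        for finding in run_review(model="minimax-m2.7", pr=pr):
            # Merge this finding into an existing vote bucket if one matches.
            for i, (rep, count) in enumerate(votes):
                if same_finding(rep, finding):
                    votes[i] = (rep, count + 1)
                    break
            else:
                votes.append((finding, 1))
    return [rep for rep, count in votes if count >= min_votes]
```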
We also open-sourced Droid Action, a GitHub Action that runs AI code review on every PR. It supports all 13 models benchmarked here. Drop it into any repo and start getting reviews immediately.
Automated code review is a first line of defense, but building software autonomously requires the same rigor at every step: planning, implementation, testing, and deployment. That's what Factory is building: a complete software development system where every stage is automated, verified, and continuously improving. To try this on your own repo, run /install-code-review in Droid to configure PR reviews in GitHub or GitLab.