Every major AI code review tool works the same way: it reads a PR, runs one pass through a language model, and posts comments. If it misses something, you never know. If it hallucinates, you have no way to cross-check. The entire industry is built on the assumption that a single model run is reliable enough.
It isn't. LLMs are non-deterministic: sampling is random by design, and even at low temperatures the same model given the same prompt will produce different outputs on different runs. Everyone who uses AI code review knows this in theory. Nobody measures what it actually means for the bugs that get caught -- or missed.
## What we tested
We picked 5 real open-source pull requests, each with a known bug or vulnerability. We ran Claude Sonnet on each PR four times: once as a baseline (single pass), then three independent runs with no shared context. Same model, same system prompt, same code. The only difference: the inherent randomness of the model.
## The results
Each cell shows the number of issues that run reported.

| Pull Request | Baseline | Run 1 | Run 2 | Run 3 | Drift? |
|---|---|---|---|---|---|
| langchain-ai/langchain #36200 (path traversal vulnerability) | 1 | 0 | 0 | 1 | Yes |
| oven-sh/bun #26717 (use-after-free) | 1 | 0 | 2 | 1 | Yes |
| facebook/react #14182 (stream error handling) | 0 | 0 | 0 | 0 | No |
| vercel/next.js #67211 (TypeScript plugin) | 0 | 0 | 0 | 0 | No |
| supabase/supabase #43370 (race condition) | 0 | 0 | 0 | 0 | No |
2 out of 5 PRs produced different findings depending on which run you happened to look at. The model didn't get worse or better -- it just rolled differently each time.
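The drift check itself is mechanical: a PR drifts whenever its independent runs disagree on how many issues exist. A minimal sketch, with the issue counts transcribed from the table above (the dictionary and function names are illustrative):

```python
# Issue counts per PR: (baseline, run 1, run 2, run 3), from the results table.
RESULTS = {
    "langchain-ai/langchain #36200": (1, 0, 0, 1),
    "oven-sh/bun #26717": (1, 0, 2, 1),
    "facebook/react #14182": (0, 0, 0, 0),
    "vercel/next.js #67211": (0, 0, 0, 0),
    "supabase/supabase #43370": (0, 0, 0, 0),
}

def drifted(counts: tuple) -> bool:
    """A PR drifts when independent runs disagree on the number of issues."""
    return len(set(counts)) > 1

drifting = [pr for pr, counts in RESULTS.items() if drifted(counts)]
print(f"{len(drifting)} of {len(RESULTS)} PRs drifted: {drifting}")
```

Note that this only compares issue *counts*; two runs reporting one issue each could still be flagging different problems, so count-level drift is a lower bound on disagreement.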
### The Bun result: 0 to 2 issues on the same code
If you ran a single review on this PR, you had a 1-in-3 chance of getting zero findings on a PR with a real use-after-free vulnerability. Run 2 found twice as many issues as Run 3. This isn't a model quality problem -- it's a sampling problem.
### The LangChain result: the baseline caught what the ensemble mostly missed
This is the scariest result. The baseline review -- a single pass -- caught a real path traversal vulnerability. But if you ran the same review three more times, two of those runs would have told you the code was clean. The finding wasn't consistent enough to survive a single retry.
## What this means
If you're relying on a single AI review pass, you're getting a random subset of what the model could find. It's not that AI code review doesn't work -- it's that running it once is rolling the dice on which findings you get. Some runs are thorough. Some runs miss obvious vulnerabilities. You can't tell the difference from the output alone.
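The dice-rolling intuition can be made concrete. If a single run catches a given bug with probability p, then the chance that at least one of k independent runs catches it is 1 - (1 - p)^k. A sketch (the p = 0.5 figure is illustrative, roughly matching the LangChain result, where 2 of 4 runs caught the vulnerability):

```python
def catch_probability(p: float, k: int) -> float:
    """Probability that at least one of k independent runs catches a bug
    that any single run catches with probability p."""
    return 1 - (1 - p) ** k

# Illustrative: a bug each run catches only half the time.
# Catch rates for k = 1, 2, 3 runs: 0.5, 0.75, 0.875.
for k in (1, 2, 3):
    print(f"{k} run(s): {catch_probability(0.5, k):.1%} chance of catching it")
```

The flip side: this only improves recall if you surface the union of findings; a strict consensus threshold trades some of that recall back for precision, which is exactly the tension the LangChain result exposes.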
Running 3 times and using consensus changes the math. Findings that appear across multiple independent runs are more likely to be real bugs and less likely to be hallucinated. Findings that only appear in one run are either false positives or things the model inconsistently notices -- either way, they need different handling than findings the model consistently flags.
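A consensus filter over run outputs might look like the following sketch. The finding strings, the 2-of-3 threshold, and the function name are assumptions for illustration; a real implementation would need fuzzy matching, since independent runs rarely phrase the same bug identically:

```python
from collections import Counter

def consensus(runs, min_runs=2):
    """Split findings into those seen in >= min_runs independent runs
    (likely real) and those seen in fewer (needing separate triage)."""
    counts = Counter(f for run in runs for f in run)
    agreed = {f for f, c in counts.items() if c >= min_runs}
    outliers = {f for f, c in counts.items() if c < min_runs}
    return agreed, outliers

# Hypothetical findings from three independent reviews of the same PR.
runs = [
    {"path traversal in load_prompt", "missing null check"},
    {"path traversal in load_prompt"},
    {"path traversal in load_prompt", "unused import"},
]
agreed, outliers = consensus(runs)
print("consensus:", agreed)   # appears in all three runs
print("outliers:", outliers)  # single-run findings: lower confidence, triage separately
```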
## Methodology
We're publishing the full setup so anyone can reproduce this.
PRs tested:
- langchain-ai/langchain #36200 -- path traversal in prompt.save/load_prompt
- oven-sh/bun #26717 -- use-after-free in Bun runtime
- facebook/react #14182 -- stream error handling
- vercel/next.js #67211 -- TypeScript plugin issue
- supabase/supabase #43370 -- race condition
## We built Ensemble to solve this
It runs 3 independent reviews and only surfaces what holds up across runs. Fewer false positives, more real bugs caught.