Every major AI code review tool works the same way: it reads a PR, runs one pass through a language model, and posts comments. If it misses something, you never know. If it hallucinates, you have no way to cross-check. The entire industry is built on the assumption that a single model run is reliable enough.
It isn't. LLMs are non-deterministic: sampling is random by design, and even at low temperatures the same model given the same prompt will produce different outputs on different runs. Everyone who uses AI code review knows this in theory. Nobody measures what it actually means for the bugs that get caught -- or missed.
## What we tested
We picked 5 real open-source pull requests, each with a known bug or vulnerability. We ran Claude Sonnet on each PR four times: once as a baseline (single pass), then three independent runs with no shared context. Same model, same system prompt, same code. The only difference: the inherent randomness of the model.
## The results
Each cell shows the number of issues that run reported.

| Pull Request | Baseline | Run 1 | Run 2 | Run 3 | Drift? |
|---|---|---|---|---|---|
| langchain-ai/langchain #36200 (path traversal vulnerability) | 1 | 0 | 0 | 1 | Yes |
| oven-sh/bun #26717 (use-after-free) | 1 | 0 | 2 | 1 | Yes |
| facebook/react #14182 (stream error handling) | 0 | 0 | 0 | 0 | No |
| vercel/next.js #67211 (TypeScript plugin) | 0 | 0 | 0 | 0 | No |
| supabase/supabase #43370 (race condition) | 0 | 0 | 0 | 0 | No |
2 out of 5 PRs produced different findings depending on which run you happened to look at. The model didn't get worse or better -- it just rolled differently each time.
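The drift check itself is mechanical: a PR drifts whenever its independent runs disagree on how many issues exist. A minimal sketch, with the issue counts transcribed from the table above (the dictionary and function names are illustrative):

```python
# Issue counts per PR: (baseline, run 1, run 2, run 3), from the results table.
RESULTS = {
    "langchain-ai/langchain #36200": (1, 0, 0, 1),
    "oven-sh/bun #26717": (1, 0, 2, 1),
    "facebook/react #14182": (0, 0, 0, 0),
    "vercel/next.js #67211": (0, 0, 0, 0),
    "supabase/supabase #43370": (0, 0, 0, 0),
}

def drifted(counts: tuple) -> bool:
    """A PR drifts when independent runs disagree on the number of issues."""
    return len(set(counts)) > 1

drifting = [pr for pr, counts in RESULTS.items() if drifted(counts)]
print(f"{len(drifting)} of {len(RESULTS)} PRs drifted: {drifting}")
```

Note that this only compares issue *counts*; two runs reporting one issue each could still be flagging different problems, so count-level drift is a lower bound on disagreement.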
### The Bun result: 0 to 2 issues on the same code
If you ran a single review on this PR, you had a 1-in-3 chance of getting zero findings on a PR with a real use-after-free vulnerability. Run 2 found twice as many issues as Run 3. This isn't a model quality problem -- it's a sampling problem.
### The LangChain result: the baseline caught what the ensemble mostly missed
This is the scariest result. The baseline review -- a single pass -- caught a real path traversal vulnerability. But if you ran the same review three more times, two of those runs would have told you the code was clean. The finding wasn't consistent enough to survive a single retry.
## What this means
If you're relying on a single AI review pass, you're getting a random subset of what the model could find. It's not that AI code review doesn't work -- it's that running it once is rolling the dice on which findings you get. Some runs are thorough. Some runs miss obvious vulnerabilities. You can't tell the difference from the output alone.
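The dice-rolling intuition can be made concrete. If a single run catches a given bug with probability p, then the chance that at least one of k independent runs catches it is 1 - (1 - p)^k. A sketch (the p = 0.5 figure is illustrative, roughly matching the LangChain result, where 2 of 4 runs caught the vulnerability):

```python
def catch_probability(p: float, k: int) -> float:
    """Probability that at least one of k independent runs catches a bug
    that any single run catches with probability p."""
    return 1 - (1 - p) ** k

# Illustrative: a bug each run catches only half the time.
# Catch rates for k = 1, 2, 3 runs: 0.5, 0.75, 0.875.
for k in (1, 2, 3):
    print(f"{k} run(s): {catch_probability(0.5, k):.1%} chance of catching it")
```

The flip side: this only improves recall if you surface the union of findings; a strict consensus threshold trades some of that recall back for precision, which is exactly the tension the LangChain result exposes.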
Running 3 times and using consensus changes the math. Findings that appear across multiple independent runs are more likely to be real bugs and less likely to be hallucinated. Findings that only appear in one run are either false positives or things the model inconsistently notices -- either way, they need different handling than findings the model consistently flags.
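A consensus filter over run outputs might look like the following sketch. The finding strings, the 2-of-3 threshold, and the function name are assumptions for illustration; a real implementation would need fuzzy matching, since independent runs rarely phrase the same bug identically:

```python
from collections import Counter

def consensus(runs, min_runs=2):
    """Split findings into those seen in >= min_runs independent runs
    (likely real) and those seen in fewer (needing separate triage)."""
    counts = Counter(f for run in runs for f in run)
    agreed = {f for f, c in counts.items() if c >= min_runs}
    outliers = {f for f, c in counts.items() if c < min_runs}
    return agreed, outliers

# Hypothetical findings from three independent reviews of the same PR.
runs = [
    {"path traversal in load_prompt", "missing null check"},
    {"path traversal in load_prompt"},
    {"path traversal in load_prompt", "unused import"},
]
agreed, outliers = consensus(runs)
print("consensus:", agreed)   # appears in all three runs
print("outliers:", outliers)  # single-run findings: lower confidence, triage separately
```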
## Methodology
We're publishing the full setup so anyone can reproduce this.
PRs tested:
- langchain-ai/langchain #36200 -- path traversal in prompt.save/load_prompt
- oven-sh/bun #26717 -- use-after-free in Bun runtime
- facebook/react #14182 -- stream error handling
- vercel/next.js #67211 -- TypeScript plugin issue
- supabase/supabase #43370 -- race condition
## We built Ensemble to solve this
It runs 3 independent reviews and only surfaces what holds up across runs. Fewer false positives, more real bugs caught.