One AI caught a use-after-free in Bun. Another said the code was clean. Same model, same prompt.

We ran Claude Sonnet on Bun PR #26717 fifteen times. Six runs caught the real bug. Three said "no issues." The difference between "ship it" and "this will crash on Windows" was literally which random sample you got.

Bun PR #26717 was a Claude Code-generated fix for a use-after-free bug in Bun's Zig runtime. It was merged. It shipped. And it introduced a new use-after-free in dictionary string fields that caused 500+ crash reports on Windows.

The PR was reverted the same day. A proper fix took two more weeks.

We wanted to know: could AI code review have caught this before it shipped? Not whether AI is "good enough" in theory, but whether a single review pass would actually flag the specific bug that made it to production.


The timeline

PR #26717 -- merged
Claude Code-generated fix for a use-after-free. Addresses the original bug but misses an identical pattern in emitConvertDictionaryFunction. The new code leaves a dangling pointer in dictionary string fields.
PR #26742 -- reverted same day
500+ crash reports on Windows. The new use-after-free triggers when dictionary strings are freed while still referenced. Full revert.
PR #27324 -- proper fix, 2 weeks later
Addresses both the original use-after-free AND the dictionary string field pattern that #26717 missed. Merged successfully.

The experiment

We ran Claude Sonnet against PR #26717 fifteen times. Same model. Same prompt. Same diff. No shared context between runs. The only variable: the inherent non-determinism of the model.

For each run, we recorded two things: total issues flagged, and whether the run identified the specific dictionary string field gap -- the bug that actually caused the crashes.
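
For reference, here is a minimal sketch of what this kind of harness looks like, using the Anthropic Python SDK. The model ID, prompt, and function names are placeholders, not our exact setup:

```python
# Minimal sketch of a repeated-review harness, using the Anthropic Python SDK.
# MODEL and REVIEW_PROMPT are placeholders, not the exact setup used here.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODEL = "claude-sonnet-4-5"  # placeholder model ID
REVIEW_PROMPT = "Review this diff for bugs, especially memory safety:\n\n"


def review_once(diff: str) -> str:
    """One review pass: a fresh request with no shared context."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{"role": "user", "content": REVIEW_PROMPT + diff}],
    )
    return response.content[0].text


def review_n_times(diff: str, n: int = 15) -> list[str]:
    """n fully independent reviews of the same diff."""
    return [review_once(diff) for _ in range(n)]
```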

Every result, in order

| Run | Config   | Issues | Dictionary bug? | Key finding |
|-----|----------|--------|-----------------|-------------|
| 1   | baseline | 2      | Caught          | "Same bug in emitConvertDictionaryFunction" |
| 2   | 1x       | 1      | Missed          | Found unrelated issue only |
| 3   | baseline | 0      | No issues       | "No critical security vulnerabilities" |
| 4   | 2x       | 3      | Missed          | 3 issues, all tangential |
| 5   | 2x       | 2      | Partial         | Mentioned dictionary area, no specifics |
| 6   | baseline | 2      | Caught          | "Uninitialized variable in dictionary" |
| 7   | 3x       | 1      | Caught          | "Could still trigger UAF for defaults" |
| 8   | 3x       | 0      | No issues       | "Code looks correct" |
| 9   | 3x       | 1      | Partial         | Vague concern about lifetime |
| 10  | baseline | 3      | Caught          | "Identical pattern repeats, systematic oversight" |
| 11  | 5x       | 3      | Caught          | "Introduces new undefined behavior" |
| 12  | 5x       | 0      | No issues       | "No issues found" |
| 13  | 5x       | 2      | Caught          | "Duplicate uninitialized variable issue" |
| 14  | 5x       | 2      | Missed          | Found 2 issues, neither was the real bug |
| 15  | 5x       | 2      | Partial         | Close but didn't name the root cause |

The numbers

3/15 said "no issues" -- completely blind
6/15 caught the real bug -- the dictionary gap
9/15 missed the real bug -- including the "no issues" runs

20% of runs saw nothing wrong with code that caused 500+ crash reports. These weren't hedged responses. Run 3 said: "No critical security vulnerabilities introduced. This is a memory safety improvement." It was confidently, completely wrong.


The convergence math

If a single run has a 40% chance of catching the bug, what happens when you run multiple times and take the union of findings?

Probability of at least one run catching the dictionary bug, based on the 6/15 (40%) observed detection rate and assuming independent runs:

1 run: 40%
3 runs: 78%
5 runs: 92%

One run is a coin flip. Three runs get you to nearly 80%. Five runs put you above 90%. The math is simple: more samples, fewer blind spots.
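
The same calculation in code, treating each run as an independent draw at the observed 40% rate:

```python
# Probability that at least one of n independent runs catches the bug,
# given the observed per-run detection rate of 6/15.
p = 6 / 15

for n in (1, 3, 5):
    print(f"{n} run(s): {1 - (1 - p) ** n:.0%}")

# 1 run(s): 40%
# 3 run(s): 78%
# 5 run(s): 92%
```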


The finding vs. the miss

Here's what Run 1 actually said -- the run that caught the bug:

Run 1 -- caught it: "Same bug in emitConvertDictionaryFunction (line ~688). The fix addresses emitConvertComptime but the identical use-after-free pattern repeats in the dictionary emission path. After the string is freed, the pointer is still used for field comparison."

And here's Run 3 -- reviewing the exact same diff, with the exact same model and prompt:

Run 3 -- missed everything: "No critical security vulnerabilities introduced. This is a memory safety improvement that correctly addresses the use-after-free by ensuring string data is properly retained before pointer access."

Run 1 identified the exact function name, the approximate line number, and the mechanism of the bug. Run 3 not only missed it -- it described the code as a "memory safety improvement" in the same area where a new memory safety bug was being introduced.

The distance between these two outputs is not a gap in model quality. Same model, same weights, same prompt, same code. The only difference is which path the sampling took through the probability distribution. Run 3 isn't a "bad model." It's a bad sample from a good model.

What this means

Running AI code review once is gambling. You might get Run 1, which identifies the exact function and line where a use-after-free will ship. Or you might get Run 3, which tells you the code is a memory safety improvement. You cannot tell from the output which one you got.

The fix is not a better model. Sonnet clearly can find this bug -- it did so 40% of the time. The fix is running the same review multiple times and surfacing findings that hold up across independent samples. One run found "systematic oversight." Another found "new undefined behavior." A third found "duplicate uninitialized variable." Each described the same underlying bug from a different angle.

A single review told you nothing was wrong. Multiple reviews, taken together, converge on the truth.
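
One way to make "converge on the truth" concrete: cluster findings by the location and failure mode they point at, and keep only the ones that recur across independent runs. A minimal sketch; the `Finding` shape and the 2-run threshold are illustrative assumptions, not Ensemble's actual implementation:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    function: str   # e.g. "emitConvertDictionaryFunction"
    category: str   # e.g. "use-after-free"


def consensus(runs: list[list[Finding]], min_runs: int = 2) -> list[Finding]:
    """Keep findings flagged by at least min_runs independent reviews.

    Each finding counts at most once per run, so one verbose review
    can't vote a finding into the consensus on its own.
    """
    votes = Counter(f for run in runs for f in set(run))
    return [f for f, n in votes.items() if n >= min_runs]


# Three independent runs describe the bug from different angles; two of
# them point at the same function and failure mode, so it survives.
uaf = Finding("emitConvertDictionaryFunction", "use-after-free")
runs = [
    [uaf, Finding("parseHeader", "style nit")],
    [uaf],
    [Finding("emitConvertComptime", "naming")],
]
print(consensus(runs))  # keeps only the use-after-free finding
```

The dedup key is the hard part: too coarse and unrelated nitpicks merge into false consensus; too fine and Run 10's "systematic oversight" and Run 11's "new undefined behavior" never get counted as the same underlying bug.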


Methodology

Full setup for reproducibility.

Model: Claude Sonnet (Anthropic)
Runs: 15 total (4 baseline, 1 at 1x, 2 at 2x, 3 at 3x, 5 at 5x)
Shared context: None -- fully independent
Prompt: Standard code review prompt

PRs referenced: #26717 (the original fix), #26742 (the same-day revert), #27324 (the proper fix)


Ensemble runs every review 3+ times

One run is a coin flip. Ensemble runs multiple independent reviews and only surfaces what holds up. Fewer blind spots, more real bugs caught.
