Bun PR #26717 was a Claude Code-generated fix for a use-after-free bug in Bun's Zig runtime. It was merged. It shipped. And it introduced a new use-after-free in dictionary string fields that caused 500+ crash reports on Windows.
The PR was reverted the same day. A proper fix took two more weeks.
We wanted to know: could AI code review have caught this before it shipped? Not whether AI is "good enough" in theory, but whether a single review pass would actually flag the specific bug that made it to production.
The timeline
- PR #26717, a Claude Code-generated fix for a use-after-free, is merged and ships in Bun.
- 500+ crash reports come in on Windows: the new code leaves a dangling pointer in dictionary string fields, in emitConvertDictionaryFunction.
- The PR is reverted the same day (#26742).
- A proper fix lands two weeks later (#27324).
The experiment
We ran Claude Sonnet against PR #26717 fifteen times. Same model. Same prompt. Same diff. No shared context between runs. The only variable: the inherent non-determinism of the model.
For each run, we recorded two things: total issues flagged, and whether the run identified the specific dictionary string field gap -- the bug that actually caused the crashes.
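The recording loop can be sketched as below. `run_review` is a hypothetical stand-in for one independent model call (stubbed here so the sketch runs); the real harness sent the identical prompt and diff to Claude Sonnet with no shared context. A run counts as a catch if any flagged issue names the dictionary string field gap -- the matching criterion is an illustrative assumption:

```python
import re

# Heuristic matcher for the specific bug: a finding counts as a catch
# if it names the dictionary emission path (illustrative criterion).
TARGET = re.compile(r"emitConvertDictionaryFunction|dictionary string", re.IGNORECASE)

def run_review(diff: str) -> list[str]:
    # Stub for one independent model call over the PR diff.
    # The real harness calls the model with an identical prompt each time.
    return ["Same bug in emitConvertDictionaryFunction (line ~688)"]

def record_runs(diff: str, n_runs: int = 15) -> list[dict]:
    results = []
    for run in range(1, n_runs + 1):
        issues = run_review(diff)  # fresh call, no shared context between runs
        results.append({
            "run": run,
            "issues": len(issues),
            "caught": any(TARGET.search(i) for i in issues),
        })
    return results
```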
Every result, in order
| Run | Config | Issues | Dictionary bug? | Key finding |
|---|---|---|---|---|
| 1 | baseline | 2 | Caught | "Same bug in emitConvertDictionaryFunction" |
| 2 | 1x | 1 | Missed | Found unrelated issue only |
| 3 | baseline | 0 | No issues | "No critical security vulnerabilities" |
| 4 | 2x | 3 | Missed | 3 issues, all tangential |
| 5 | 2x | 2 | Partial | Mentioned dictionary area, no specifics |
| 6 | baseline | 2 | Caught | "Uninitialized variable in dictionary" |
| 7 | 3x | 1 | Caught | "Could still trigger UAF for defaults" |
| 8 | 3x | 0 | No issues | "Code looks correct" |
| 9 | 3x | 1 | Partial | Vague concern about lifetime |
| 10 | baseline | 3 | Caught | "Identical pattern repeats, systematic oversight" |
| 11 | 5x | 3 | Caught | "Introduces new undefined behavior" |
| 12 | 5x | 0 | No issues | "No issues found" |
| 13 | 5x | 2 | Caught | "Duplicate uninitialized variable issue" |
| 14 | 5x | 2 | Missed | Found 2 issues, neither was the real bug |
| 15 | 5x | 2 | Partial | Close but didn't name the root cause |
The numbers
- 6 of 15 runs (40%) caught the dictionary gap
- 3 of 15 runs (20%) were completely blind, reporting no issues at all
- 3 runs were partial; 3 missed the bug while flagging only unrelated issues
- Issues flagged per run ranged from 0 to 3, including outright "no issues" verdicts
20% of runs saw nothing wrong with code that caused 500+ crash reports. These weren't hedged responses. Run 3 said: "No critical security vulnerabilities introduced. This is a memory safety improvement." It was confidently, completely wrong.
The convergence math
If a single run has a 40% chance of catching the bug, what happens when you run multiple times and take the union of findings?
One run catches it 40% of the time. Take the union of two independent runs and you're at 64%; three runs, about 78%; five runs, about 92%. The math is simple: more samples, fewer blind spots.
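The union probability is one line, assuming run outcomes are independent (which is how the non-determinism behaved here):

```python
def union_catch_rate(p_single: float, n_runs: int) -> float:
    # P(at least one of n independent runs catches the bug)
    return 1 - (1 - p_single) ** n_runs

# With the 40% per-run rate observed in this experiment:
# 1 run -> 0.40, 2 -> 0.64, 3 -> ~0.78, 5 -> ~0.92
```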
The finding vs. the miss
Here's what Run 1 actually said -- the run that caught the bug:
Run 1 -- caught it:

> "Same bug in emitConvertDictionaryFunction (line ~688). The fix addresses emitConvertComptime but the identical use-after-free pattern repeats in the dictionary emission path. After the string is freed, the pointer is still used for field comparison."
And here's Run 3 -- reviewing the exact same diff, with the exact same model and prompt:
Run 3 -- missed everything:

> "No critical security vulnerabilities introduced. This is a memory safety improvement that correctly addresses the use-after-free by ensuring string data is properly retained before pointer access."
Run 1 identified the exact function name, the approximate line number, and the mechanism of the bug. Run 3 not only missed it -- it described the code as a "memory safety improvement" in the same area where a new memory safety bug was being introduced.
What this means
Running AI code review once is gambling. You might get Run 1, which identifies the exact function and line where a use-after-free will ship. Or you might get Run 3, which tells you the code is a memory safety improvement. You cannot tell from the output which one you got.
The fix is not a better model. Sonnet clearly can find this bug -- it did so 40% of the time. The fix is running the same review multiple times and surfacing findings that hold up across independent samples. One run found "systematic oversight." Another found "new undefined behavior." A third found "duplicate uninitialized variable." Each described the same underlying bug from a different angle.
A single review told you nothing was wrong. Multiple reviews, taken together, converge on the truth.
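That aggregation can be sketched as a simple vote across runs. `review_fn` is a hypothetical callable wrapping one model call and returning normalized finding strings; only findings that recur across independent samples survive:

```python
from collections import Counter

def ensemble_review(diff: str, review_fn, n_runs: int = 5, min_votes: int = 2) -> list[str]:
    counts = Counter()
    for _ in range(n_runs):
        # set() dedupes within a run, so one run contributes one vote per finding
        counts.update(set(review_fn(diff)))
    # Surface only what holds up across independent samples
    return [finding for finding, votes in counts.items() if votes >= min_votes]
```

In practice the vote needs semantic matching rather than exact string equality: the runs above described the same bug as "systematic oversight," "new undefined behavior," and "duplicate uninitialized variable," so a naive string vote would undercount agreement.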
Methodology
Full setup for reproducibility.
PRs referenced:
- oven-sh/bun #26717 -- original Claude Code-generated fix (introduced new UAF)
- oven-sh/bun #26742 -- revert (same day, 500+ crash reports)
- oven-sh/bun #27324 -- proper fix (2 weeks later)
Ensemble runs every review 3+ times
One run is a coin flip. Ensemble runs multiple independent reviews and only surfaces what holds up. Fewer blind spots, more real bugs caught.