Act 1: The embarrassment
Our landing page said "multi-model AI review." Today I actually tested it for the first time -- tried to run GPT-4o through our own system. It didn't work.
Not "it gave bad results." It literally could not run. We found 4 bugs in Ensemble itself:
- Requests for gpt-4o were being silently mapped to gpt-4o-mini. Users thought they were running the flagship model. They weren't.
- The openai package was missing from the production deploy. The import failed silently.

We fixed all four. Then we ran the real experiment.
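For what it's worth, the silent-downgrade bug is the kind of thing a one-line guard catches. A minimal sketch using the OpenAI Python SDK -- illustrative only, not our actual fix:

```python
# Guard against silent model downgrades -- illustrative sketch, not Ensemble's actual fix.
from openai import OpenAI  # fails loudly right here if the package is missing from the deploy

REQUESTED = "gpt-4o"

client = OpenAI()  # raises if no API key is configured
resp = client.chat.completions.create(
    model=REQUESTED,
    messages=[{"role": "user", "content": "Review this diff for memory-safety issues."}],
)

# The response echoes the model that actually served the request (e.g. "gpt-4o-2024-08-06").
served = resp.model
if not served.startswith(REQUESTED) or "mini" in served:
    raise RuntimeError(f"Requested {REQUESTED} but {served} answered -- refusing a silent downgrade")
```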
Act 2: The experiment
We ran 3 models against Bun PR #26717 -- a real use-after-free that was merged, caused 500+ crashes, and got reverted the same day.
Same PR. Same prompt. 35 total runs.
The 3-model comparison
| Model | Tier | Runs | Caught the bug | Completion rate | Cost |
|---|---|---|---|---|---|
| Claude Sonnet | Flagship | 15 | 6 (40%) | 100% | ~$0.60 |
| GPT-4o | Flagship | 10 | 4 (40% of all runs; 57% of completed runs) | 70% | $0.15 |
| GPT-4o-mini | Cheap | 10 | 0 (0%) | 18% | $0.03 |
Both flagship models found the bug. The cheap model found nothing -- and mostly couldn't even complete the task.
Act 3: Three surprising findings
GPT-4o-mini wasn't "a little worse." It was catastrophically broken:
- Only 18% of its runs managed to complete (vs 100% for Sonnet)
- 0 runs reported any issue at all
- Typical GPT-4o-mini output: "BLOCKED: No files found in the repository to analyze"
The cheap model couldn't even use tools properly. It wasn't finding different bugs or fewer bugs. It was failing to read the code at all. For security-sensitive review, "cheap" doesn't mean "less accurate." It means "broken."
If Claude and GPT-4o made independent errors, combining them would be powerful. The math says: pair a 40% Sonnet run with a 57% GPT-4o run and, assuming independence, only 0.60 × 0.43 ≈ 26% of bugs slip past both -- roughly a 75% catch rate, and higher still with 3 runs.
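Plugging the table's numbers into that independence assumption (a back-of-envelope sketch, not a claim about how the actual ensemble behaved):

```python
# Independence back-of-envelope using the per-run catch rates from the table above.
p_sonnet = 0.40  # Claude Sonnet: 6/15 runs caught the bug
p_gpt4o = 0.57   # GPT-4o: 4/7 completed runs caught the bug
                 # (per attempted GPT-4o run it's 4/10 = 40%, since 30% of runs never completed)

# Chance that at least one run in a Sonnet + GPT-4o pair catches the bug,
# *if* the two models' errors were independent.
pair = 1 - (1 - p_sonnet) * (1 - p_gpt4o)
print(f"1 Sonnet + 1 GPT-4o: {pair:.0%}")   # ~74%

# Adding a third run (another Sonnet pass) under the same assumption.
trio = 1 - (1 - p_sonnet) ** 2 * (1 - p_gpt4o)
print(f"2 Sonnet + 1 GPT-4o: {trio:.0%}")   # ~85%
```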
But they miss similar patterns. Both models tend to overlook the same subtle lifetime issues. This suggests training data overlap creates shared blind spots -- the models aren't as independent as you'd hope.
Bad news: "use multiple models" is not the silver bullet for coverage.
Good news: "run the same model multiple times" works just as well, and it's simpler.
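Why repetition works, at least on paper: with k independent runs at per-run catch rate p, the ensemble catches 1 - (1 - p)^k of bugs. A minimal sketch using Sonnet's observed 40% (real runs aren't fully independent, so treat these as upper bounds):

```python
# Self-ensemble sketch: repeated runs of one model, assuming runs are independent.
p = 0.40  # Claude Sonnet's observed per-run catch rate (6/15)

for k in (1, 2, 3, 5):
    catch = 1 - (1 - p) ** k
    print(f"{k} runs: {catch:.0%}")
# 1 run: 40%, 2 runs: 64%, 3 runs: 78%, 5 runs: 92%
```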
This is the key result. Here's the ensemble catch rate from our actual runs:
The honest conclusion
Methodology
Full setup for reproducibility.
Runs went through our own /api/experiment/drift endpoint (we dogfood); an illustrative request sketch follows the PR list below.

Target PRs:
- oven-sh/bun #26717 -- introduced a new use-after-free in dictionary fields
- oven-sh/bun #26742 -- reverted same day (500+ crashes)
- oven-sh/bun #27324 -- proper fix (2 weeks later)
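If you want to poke at the same setup, here's a hypothetical request sketch. The endpoint path above is real, but the base URL and every field name below are assumptions, not Ensemble's documented API:

```python
# Hypothetical reproduction sketch -- the real /api/experiment/drift request shape isn't
# documented here, so every field below is an assumption, not Ensemble's actual API.
import requests

payload = {
    "repo": "oven-sh/bun",                                  # assumed field name
    "pr": 26717,                                            # the PR that introduced the use-after-free
    "models": ["claude-sonnet", "gpt-4o", "gpt-4o-mini"],   # assumed field name
    "runs_per_model": 10,                                   # assumed field name
}

resp = requests.post("https://example.com/api/experiment/drift", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json())
```

Swap in your own deployment URL and whatever parameters the endpoint actually accepts.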
Want this on your PRs?
Ensemble runs reviews 3+ times and only surfaces what holds up. Fewer blind spots, more real bugs caught.