I marketed multi-model AI review. Turned out my own product didn't support it.

3 models. 35 runs. $0.78 total. We tested Claude Sonnet, GPT-4o, and GPT-4o-mini on the same real use-after-free bug. The results changed our product roadmap.

Act 1: The embarrassment

Our landing page said "multi-model AI review." Today I actually tested that claim for the first time and tried to run GPT-4o through our own system. It didn't work.

Not "it gave bad results." It literally could not run. We found 4 bugs in Ensemble itself:

Bug 1: The free-tier downgrade silently converted gpt-4o to gpt-4o-mini. Users thought they were running the flagship model. They weren't.
Bug 2: CLI mode couldn't run OpenAI models at all. The code path only handled Anthropic.
Bug 3: The openai package was missing from the production deploy, so the import failed silently.
Bug 4: Mode selection only checked for Anthropic API keys. If you only had an OpenAI key, the system said "no models available." (A sketch of the kind of fix this needed is below.)
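
For Bug 4, here's a minimal sketch of the kind of provider detection that fixes it, assuming environment-variable key lookup. The names and structure are illustrative, not Ensemble's actual code.

```python
import os

# Hypothetical sketch: if provider detection only looks for an Anthropic key,
# OpenAI-only users get "no models available" (Bug 4). Checking every
# supported provider's key avoids that. The env var names are assumptions.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}

def available_providers() -> list[str]:
    """Return every provider whose API key is set in the environment."""
    return [name for name, env_var in PROVIDER_KEYS.items() if os.environ.get(env_var)]

if __name__ == "__main__":
    providers = available_providers()
    if not providers:
        print("no models available")
    else:
        print("available providers:", ", ".join(providers))
```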

We fixed all four. Then we ran the real experiment.


Act 2: The experiment

We ran 3 models against Bun PR #26717 -- a real use-after-free that was merged, caused 500+ crashes, and got reverted the same day.

Same PR. Same prompt. 35 total runs.

3 models tested · 35 total runs · $0.78 total cost

The 3-model comparison

Model         | Tier     | Runs | Caught the bug       | Completion rate | Cost
Claude Sonnet | Flagship | 15   | 6 (40%)              | 100%            | ~$0.60
GPT-4o        | Flagship | 10   | 4 (57% of completed) | 70%             | $0.15
GPT-4o-mini   | Cheap    | 10   | 0 (0%)               | 18%             | $0.03

Both flagship models found the bug. The cheap model found nothing -- and mostly couldn't even complete the task.
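
To make the mixed denominators explicit: Sonnet's 40% is measured over all of its runs, while GPT-4o's 57% is measured over the runs that completed. A quick check, where the 7 completed GPT-4o runs are inferred from its 70% completion rate rather than stated directly:

```python
# Counts from the comparison table; the 7 completed GPT-4o runs are inferred
# from its 70% completion rate over 10 runs.
sonnet_caught, sonnet_runs = 6, 15
gpt4o_caught, gpt4o_completed, gpt4o_runs = 4, 7, 10

print(f"Sonnet catch rate over all runs:       {sonnet_caught / sonnet_runs:.0%}")     # 40%
print(f"GPT-4o catch rate over completed runs: {gpt4o_caught / gpt4o_completed:.0%}")  # 57%
print(f"GPT-4o completion rate:                {gpt4o_completed / gpt4o_runs:.0%}")    # 70%
```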


Act 3: Three surprising findings

Finding 1
The cheap model didn't just do worse. It fell off a cliff.

GPT-4o-mini wasn't "a little worse." It was catastrophically broken:

9 of 11 runs failed to complete. 0 runs caught any issue. Completion rate: 18% (vs. 100% for Sonnet).

GPT-4o-mini, actual output: "BLOCKED: No files found in the repository to analyze"

The cheap model couldn't even use tools properly. It wasn't finding different bugs or fewer bugs. It was failing to read the code at all. For security-sensitive review, "cheap" doesn't mean "less accurate." It means "broken."

Finding 2
Flagship models have correlated errors.

If Claude and GPT-4o made independent errors, combining them would be powerful. The math says: with per-run catch rates of 40% (Sonnet) and 57% (GPT-4o), an independent ensemble would miss a bug only 0.60 × 0.43 ≈ 26% of the time with one run of each -- roughly a 75% catch rate from just two runs, and over 80% with a third.
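
For reference, here is that independence calculation as a quick sketch (per-run rates taken from the table above; the real runs turn out not to be independent, which is the point of this finding):

```python
# Ensemble catch rate under the independence assumption: the ensemble misses
# only when every run misses, so catch = 1 - product of per-run miss rates.
def ensemble_catch_rate(per_run_rates: list[float]) -> float:
    miss = 1.0
    for p in per_run_rates:
        miss *= 1.0 - p
    return 1.0 - miss

print(f"1 Sonnet + 1 GPT-4o: {ensemble_catch_rate([0.40, 0.57]):.0%}")        # ~74%
print(f"2 Sonnet + 1 GPT-4o: {ensemble_catch_rate([0.40, 0.40, 0.57]):.0%}")  # ~85%
print(f"3 Sonnet runs:       {ensemble_catch_rate([0.40] * 3):.0%}")          # ~78%
```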

But they miss similar patterns. Both models tend to overlook the same subtle lifetime issues. This suggests training data overlap creates shared blind spots -- the models aren't as independent as you'd hope.

Bad news: "use multiple models" is not the silver bullet for coverage.
Good news: "run the same model multiple times" works just as well, and it's simpler.

Finding 3
Running more times beats using different models.

This is the key result. Here's the ensemble catch rate from our actual runs:

Catch rate by number of runs

Runs | Claude Sonnet | GPT-4o
1    | 40%           | 57%
2    | 66%           | 86%
3    | 82%           | 97%
5    | 96%           | 100%
10   | 100%          | 100%
Running Sonnet 5 times beats running Sonnet + GPT-4o + Gemini once each. Multi-run is more effective, cheaper, and simpler than multi-model.
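
The per-run table above is consistent with a simple interpretation: each value is the probability that at least one of n runs, drawn without replacement from the observed runs (6 catches in 15 Sonnet runs; 4 catches in the 7 GPT-4o runs that completed), caught the bug. A minimal sketch of that calculation -- the without-replacement reading is an assumption, not a statement of exactly how the figures were produced:

```python
from math import comb

def catch_rate(catches: int, total: int, n_runs: int) -> float:
    """P(at least one catch when drawing n_runs of the observed runs without replacement)."""
    n = min(n_runs, total)            # can't draw more runs than were recorded
    misses = total - catches
    return 1.0 - comb(misses, n) / comb(total, n)

for n in (1, 2, 3, 5, 10):
    sonnet = catch_rate(catches=6, total=15, n_runs=n)   # 6 of 15 Sonnet runs caught the bug
    gpt4o = catch_rate(catches=4, total=7, n_runs=n)     # 4 of 7 completed GPT-4o runs caught it
    print(f"{n:2d} runs   Sonnet: {sonnet:4.0%}   GPT-4o: {gpt4o:4.0%}")
```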

The honest conclusion

1. For cheap models: Don't use GPT-4o-mini for security-sensitive review. It fails more than it helps. 82% of runs couldn't even complete the task.
2. For multi-model: It's interesting, but not the differentiator I thought it was. Correlated errors mean you don't get the independence boost you'd expect.
3. For multi-run: This is the real value. Running ANY flagship model multiple times dramatically improves catch rates. 5 runs of Sonnet = 96%. 3 runs of GPT-4o = 97%.

Our product roadmap shifted today. We're going to focus on making same-model consensus rock-solid before building elaborate multi-model UI. The data says repetition beats variety.

Methodology

Full setup for reproducibility.

Models: Claude Sonnet, GPT-4o, GPT-4o-mini
Runs: 35 total (15 Sonnet, 10 GPT-4o, 10 mini)
Total cost: $0.78 across all 35 runs
Endpoint: /api/experiment/drift (we dogfood)

Target PR: Bun PR #26717 (the use-after-free described above)

Want this on your PRs?

Ensemble runs reviews 3+ times and only surfaces what holds up. Fewer blind spots, more real bugs caught.

Read-only access · Code deleted after review · Free for public repos