Act 1: The embarrassment
Our landing page said "multi-model AI review." Today I actually tested it for the first time -- tried to run GPT-4o through our own system. It didn't work.
Not "it gave bad results." It literally could not run. We found 4 bugs in Ensemble itself:
- Requests for gpt-4o were being silently mapped to gpt-4o-mini. Users thought they were running the flagship model. They weren't.
- The openai package was missing from the production deploy. The import failed silently.

We fixed all four. Then we ran the real experiment.
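For what it's worth, the silent-downgrade bug is the kind of thing a one-line guard catches. A minimal sketch using the OpenAI Python SDK -- illustrative only, not our actual fix:

```python
# Guard against silent model downgrades -- illustrative sketch, not Ensemble's actual fix.
from openai import OpenAI  # fails loudly right here if the package is missing from the deploy

REQUESTED = "gpt-4o"

client = OpenAI()  # raises if no API key is configured
resp = client.chat.completions.create(
    model=REQUESTED,
    messages=[{"role": "user", "content": "Review this diff for memory-safety issues."}],
)

# The response echoes the model that actually served the request (e.g. "gpt-4o-2024-08-06").
served = resp.model
if not served.startswith(REQUESTED) or "mini" in served:
    raise RuntimeError(f"Requested {REQUESTED} but {served} answered -- refusing a silent downgrade")
```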
Act 2: The experiment
We ran 3 models against Bun PR #26717 -- a real use-after-free that was merged, caused 500+ crashes, and got reverted the same day.
Same PR. Same prompt. 35 total runs.
The 3-model comparison
| Model | Tier | Runs | Caught the bug | Completion rate | Cost |
|---|---|---|---|---|---|
| Claude Sonnet | Flagship | 15 | 6 (40%) | 100% | ~$0.60 |
| GPT-4o | Flagship | 10 | 4 (40% of all runs; 57% of completed runs) | 70% | $0.15 |
| GPT-4o-mini | Cheap | 10 | 0 (0%) | 18% | $0.03 |
Both flagship models found the bug. The cheap model found nothing -- and mostly couldn't even complete the task.
Act 3: Three surprising findings
GPT-4o-mini wasn't "a little worse." It was catastrophically broken:
- Only 18% of its runs managed to complete (vs 100% for Sonnet)
- 0 runs reported any issue at all
- Typical GPT-4o-mini output: "BLOCKED: No files found in the repository to analyze"
The cheap model couldn't even use tools properly. It wasn't finding different bugs or fewer bugs. It was failing to read the code at all. For security-sensitive review, "cheap" doesn't mean "less accurate." It means "broken."
If Claude and GPT-4o made independent errors, combining them would be powerful. The math says: pair a 40% Sonnet run with a 57% GPT-4o run and, assuming independence, only 0.60 × 0.43 ≈ 26% of bugs slip past both -- roughly a 75% catch rate, and higher still with 3 runs.
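Plugging the table's numbers into that independence assumption (a back-of-envelope sketch, not a claim about how the actual ensemble behaved):

```python
# Independence back-of-envelope using the per-run catch rates from the table above.
p_sonnet = 0.40  # Claude Sonnet: 6/15 runs caught the bug
p_gpt4o = 0.57   # GPT-4o: 4/7 completed runs caught the bug
                 # (per attempted GPT-4o run it's 4/10 = 40%, since 30% of runs never completed)

# Chance that at least one run in a Sonnet + GPT-4o pair catches the bug,
# *if* the two models' errors were independent.
pair = 1 - (1 - p_sonnet) * (1 - p_gpt4o)
print(f"1 Sonnet + 1 GPT-4o: {pair:.0%}")   # ~74%

# Adding a third run (another Sonnet pass) under the same assumption.
trio = 1 - (1 - p_sonnet) ** 2 * (1 - p_gpt4o)
print(f"2 Sonnet + 1 GPT-4o: {trio:.0%}")   # ~85%
```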
But they miss similar patterns. Both models tend to overlook the same subtle lifetime issues. This suggests training data overlap creates shared blind spots -- the models aren't as independent as you'd hope.
Bad news: "use multiple models" is not the silver bullet for coverage.
Good news: "run the same model multiple times" works just as well, and it's simpler.
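Why repetition works, at least on paper: with k independent runs at per-run catch rate p, the ensemble catches 1 - (1 - p)^k of bugs. A minimal sketch using Sonnet's observed 40% (real runs aren't fully independent, so treat these as upper bounds):

```python
# Self-ensemble sketch: repeated runs of one model, assuming runs are independent.
p = 0.40  # Claude Sonnet's observed per-run catch rate (6/15)

for k in (1, 2, 3, 5):
    catch = 1 - (1 - p) ** k
    print(f"{k} runs: {catch:.0%}")
# 1 run: 40%, 2 runs: 64%, 3 runs: 78%, 5 runs: 92%
```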
This is the key result. Here's the ensemble catch rate from our actual runs:
The honest conclusion
Methodology
Full setup for reproducibility.
Runs went through our own /api/experiment/drift endpoint (we dogfood); an illustrative request sketch follows the PR list below.

Target PRs:
- oven-sh/bun #26717 -- introduced a new use-after-free in dictionary fields
- oven-sh/bun #26742 -- reverted same day (500+ crashes)
- oven-sh/bun #27324 -- proper fix (2 weeks later)
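If you want to poke at the same setup, here's a hypothetical request sketch. The endpoint path above is real, but the base URL and every field name below are assumptions, not Ensemble's documented API:

```python
# Hypothetical reproduction sketch -- the real /api/experiment/drift request shape isn't
# documented here, so every field below is an assumption, not Ensemble's actual API.
import requests

payload = {
    "repo": "oven-sh/bun",                                  # assumed field name
    "pr": 26717,                                            # the PR that introduced the use-after-free
    "models": ["claude-sonnet", "gpt-4o", "gpt-4o-mini"],   # assumed field name
    "runs_per_model": 10,                                   # assumed field name
}

resp = requests.post("https://example.com/api/experiment/drift", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json())
```

Swap in your own deployment URL and whatever parameters the endpoint actually accepts.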
Want this on your PRs?
Ensemble runs reviews 3+ times and only surfaces what holds up. Fewer blind spots, more real bugs caught.