We flagged the Next.js OOM bug before it shipped. It shipped anyway.

All 3 of Ensemble's independent reviewers flagged unbounded-allocation risk in Next.js PR #91729. With no merge gate in place, the PR merged anyway, hit production, and was reverted 3 days later.

The bug

Next.js PR #91729 -- "simplify session dependent tasks and add TTL support" -- landed in one of the most-used open source projects on GitHub (139k stars). 25 files changed. +703/-288 lines. Routine refactor with a new TTL mechanism for session-dependent cache invalidation.

Three days later, applications with persistent cache started crashing. Out-of-memory. The root cause: unbounded allocation from invalidation cascading during session restore. When a session was restored, the invalidation logic spawned timers that held references to invalidators. Multiple fetch re-executions before the TTL deadline meant multiple timers stacking up, each holding memory that could never be reclaimed. Under persistent cache workloads, this grew without bound until the process ran out of memory.
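The mechanism is easy to reproduce in miniature. The sketch below is illustrative only -- these are not Next.js's actual internals -- but it shows the shape of the leak: each re-execution schedules a timer whose callback closes over the invalidator, and nothing cancels the timers from earlier executions.

```typescript
// Illustrative leak pattern; names here are NOT Next.js internals.
type Invalidator = { payload: Uint8Array };

const pendingTimers: Array<ReturnType<typeof setTimeout>> = [];

function scheduleInvalidation(invalidator: Invalidator, ttlMs: number): void {
  // Each call spawns a timer whose callback closes over `invalidator`,
  // pinning its memory until the TTL deadline -- and nothing cancels
  // timers left over from earlier executions of the same fetch.
  const timer = setTimeout(() => {
    /* invalidate(invalidator) would run here */
  }, ttlMs);
  pendingTimers.push(timer);
}

// If the fetch re-executes N times before the deadline, N timers
// (and N invalidator references) accumulate:
for (let i = 0; i < 1000; i++) {
  scheduleInvalidation({ payload: new Uint8Array(1024) }, 60_000);
}
console.log(pendingTimers.length); // 1000
pendingTimers.forEach(clearTimeout); // cleanup so this sketch can exit
```

Under a persistent cache workload there is no natural ceiling on N, which is exactly the "grew without bound" behavior described above.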

The fix was a revert: PR #92320, merged 3 days after the original.

3 days in production before revert · 25 files changed · +703/-288 lines · 3/3 Ensemble reviewers flagged the risk

This is a normal story. Big project, reasonable PR, subtle bug that only manifests under specific conditions. What's unusual is that we have a record of what Ensemble's AI reviewers said about it before it merged.


What Ensemble found

Ensemble ran 3 independent reviewers against this PR: a Senior Code Reviewer, a Security Engineer, and an Architecture Reviewer. Each reviews the diff independently, with different system prompts and different concerns. Here's what they said.

Senior Code Reviewer (SR) -- flagged resource exhaustion:
"Spawned timer task holds a reference to the invalidator. If fetch is re-executed multiple times before the deadline, multiple timers will be spawned... Could cause resource exhaustion under certain conditions (frequent fetch re-executions before TTL expiry)"

Architecture Reviewer (AR) -- flagged race condition + unbounded timers:
"This is racy: No guarantee timer fires before strongly-consistent read"
"Fetch timer race condition: Timer spawning has acknowledged race conditions on session restore and potential unbounded timer accumulation"

Security Engineer (SE) -- flagged overflow risk:
"Integer overflow risk in TTL deadline math"

Three independent reviewers. All three flagged something related to the same area of the code -- the timer/invalidation/TTL logic. The Senior Reviewer and Architecture Reviewer both specifically called out unbounded timer accumulation. The Security Engineer flagged a related numeric issue in the same TTL code path.

The actual production bug? Unbounded allocation on session restore. That's almost word-for-word what the reviewers flagged.

To be clear: None of the reviewers said "this will cause an OOM crash." They identified the conditions -- unbounded timer spawning, references held during session restore, race conditions in the invalidation path -- that, combined, produced the crash. They described the mechanism. They didn't predict the exact failure mode.

The gap

So the system flagged it. All three reviewers raised concerns about the exact code path that broke. Why did it ship?

Because Ensemble marked it PASS.

Right now, Ensemble has no Trust Score threshold. No merge gate. Reviews run, findings get generated, and the system outputs a pass/fail verdict based on... nothing, really. There's no numeric threshold that says "if 3 out of 3 reviewers flag resource exhaustion, block the merge." The concerns were surfaced as comments. Comments that live alongside dozens of other AI-generated observations. Comments that look like every other code review suggestion.

This is the core problem: findings without a gate are just more noise.

Developers already ignore most automated comments. If you've used any AI code review tool, you know the pattern -- the bot posts 15 comments, 13 are style nits, 1 is mildly useful, and maybe 1 matters. After a few PRs, you train yourself to scroll past all of them. The signal-to-noise ratio makes it rational to ignore the output.

Reviews are opinions. Gates are infrastructure. An opinion that says "this might leak memory" competes with 50 other opinions. A gate that says "this PR cannot merge until the Trust Score exceeds 72" is a different thing entirely. One requires a human to read, evaluate, and act. The other requires a human to explicitly override.

We had the right findings and the wrong delivery mechanism. The reviewers did their job. The system didn't do its job -- because the system's job is to turn findings into decisions, and we hadn't built that part yet.


What we're building

The lesson from this PR is straightforward: detection without enforcement is a suggestion. We need to close the loop.

Trust Score (0-100)

Every PR gets a numeric score derived from the ensemble output. Not a vibes-based "looks good" -- a computed score that accounts for:

  • How many independent reviewers flagged the same concern (consensus weight)
  • Severity classification of each finding (resource exhaustion > style nit)
  • Whether findings cluster around the same code path (correlated risk)
  • Whether the PR touches high-risk patterns (concurrency, memory management, auth)

In this case, 3/3 reviewers flagging resource-related issues in the same timer/invalidation code path would have produced a low Trust Score. Not because any single finding was a showstopper, but because the convergence across independent reviewers indicates real risk.
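A scoring function along these lines might look like the sketch below. The weights, field names, and formula are our illustrative assumptions -- Ensemble's real scoring is still being designed -- but it shows how consensus on a shared code path drives the score down faster than isolated findings would.

```typescript
// Hypothetical Trust Score sketch; weights and names are assumptions.
type Finding = { reviewer: string; severity: number; codePath: string };

function trustScore(findings: Finding[], reviewerCount: number): number {
  // Group findings by code path to measure consensus and correlated risk.
  const byPath = new Map<string, Finding[]>();
  for (const f of findings) {
    const group = byPath.get(f.codePath) ?? [];
    group.push(f);
    byPath.set(f.codePath, group);
  }

  let penalty = 0;
  for (const group of byPath.values()) {
    const distinctReviewers = new Set(group.map((f) => f.reviewer)).size;
    const consensus = distinctReviewers / reviewerCount; // 3/3 -> 1.0
    // Consensus multiplies severity: independent agreement on the same
    // code path is weighted far more than a lone finding.
    for (const f of group) {
      penalty += f.severity * (1 + 2 * consensus);
    }
  }
  return Math.max(0, Math.round(100 - penalty));
}

// The PR in this post: three reviewers, all on the TTL/timer path.
// Severities (1-10) are illustrative guesses.
const score = trustScore(
  [
    { reviewer: "SR", severity: 8, codePath: "ttl-timer" },
    { reviewer: "AR", severity: 8, codePath: "ttl-timer" },
    { reviewer: "SE", severity: 6, codePath: "ttl-timer" },
  ],
  3,
);
console.log(score); // 34 -- well below a threshold of 65
```

The same three findings spread across three unrelated code paths would score much higher, which is the point: convergence is the signal.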

GitHub Check Run

The Trust Score gets reported as a GitHub Check Run -- the same mechanism used by CI, linters, and test suites. If the score falls below the repo's configured threshold, the check fails and the PR cannot merge (assuming branch protection requires passing checks).

This is the difference between "we posted a comment" and "we blocked the merge." The developer can still override it -- dismiss the check, lower the threshold, merge with admin privileges. But the default path changes from "ignore and merge" to "acknowledge the risk and proceed."
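A minimal sketch of the gating step, assuming a GitHub App with `checks:write` permission: the payload shape follows GitHub's Checks API (`POST /repos/{owner}/{repo}/check-runs`, e.g. via `octokit.rest.checks.create`), while the threshold logic and check name are illustrative.

```typescript
// Map a Trust Score to a Check Run conclusion (illustrative logic).
function gateConclusion(score: number, threshold: number): "success" | "failure" {
  return score >= threshold ? "success" : "failure";
}

const score = 38;
const threshold = 65;

// Payload for POST /repos/{owner}/{repo}/check-runs.
// With branch protection requiring this check, a "failure"
// conclusion blocks the merge button.
const checkRun = {
  name: "Ensemble Trust Score",
  head_sha: "<commit sha>",
  status: "completed" as const,
  conclusion: gateConclusion(score, threshold),
  output: {
    title: `Trust Score ${score} (threshold ${threshold})`,
    summary:
      "3/3 reviewers flagged unbounded resource allocation in the session restore path.",
  },
};
console.log(checkRun.conclusion); // "failure"
```

Branch protection does the actual enforcement; the check run just reports a conclusion, which is what makes this the same mechanism CI and linters already use.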

What this would have looked like
PR #91729 with a Trust Score gate

The 3 reviewers flag resource exhaustion, race conditions, and overflow risk in the same code path. The Trust Score computes to, say, 38/100. The repo threshold is set to 65. The GitHub Check fails. The PR shows a red X.

The author sees: "Ensemble: Trust Score 38 -- 3/3 reviewers flagged unbounded resource allocation in session restore path." They can still merge. But they have to make a conscious decision to override the gate, which is a fundamentally different interaction than scrolling past a comment.


The honest assessment

1. The reviewers worked. Three independent AI reviewers flagged the mechanism that caused the production crash. The multi-reviewer approach produced signal. That part of the system did what it was supposed to do.
2. The system failed. Good findings with no enforcement mechanism are indistinguishable from no findings at all. We generated the right warnings and delivered them in a way that made them easy to ignore.
3. The fix is not "better AI." The AI was already good enough to catch this. The fix is infrastructure -- a score, a threshold, a check run. Turn findings into a gate. That's what we're building next.

We're not claiming we would have prevented this bug. A gate changes the default, but a determined developer can override any gate. What we are claiming is simpler: the information was there and the system didn't use it. That's a solvable engineering problem.


References

Original PR: Next.js #91729 (reverted by #92320)
Reviewers: Senior Code Reviewer, Security Engineer, Architecture Reviewer
Verdict: PASS (no gate configured)

Catch the bugs your team scrolls past

Ensemble runs multiple independent AI reviewers on every PR. Trust Score and merge gates are coming soon. Install now and the gate will be there when it ships.

Read-only access · Code deleted after review · Free for public repos