The bug
Next.js PR #91729 -- "simplify session dependent tasks and add TTL support" -- landed in one of the most-used open source projects on GitHub (139k stars). The diff touched 25 files, +703/-288 lines: a routine refactor that added a TTL mechanism for session-dependent cache invalidation.
Three days later, applications with persistent cache started crashing. Out-of-memory. The root cause: unbounded allocation from invalidation cascading during session restore. When a session was restored, the invalidation logic spawned timers that held references to invalidators. Multiple fetch re-executions before the TTL deadline meant multiple timers stacking up, each holding memory that could never be reclaimed. Under persistent cache workloads, this grew without bound until the process ran out of memory.
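The failure mode is easier to see in code. Here is a minimal sketch of the leak pattern -- illustrative only, not the actual Next.js implementation; all names are invented for the example:

```typescript
// Illustrative sketch of the leak pattern, not the actual Next.js code.
// Every fetch re-execution before the TTL deadline schedules a fresh timer
// whose closure keeps an invalidator alive; nothing cancels the earlier ones.
type Invalidator = { sessionId: string; payload: Uint8Array };

const pendingTimers: ReturnType<typeof setTimeout>[] = [];

function invalidate(inv: Invalidator): void {
  // Drop cache entries derived from this session (details elided).
}

function scheduleInvalidation(inv: Invalidator, ttlMs: number): void {
  // BUG: earlier timers are never cleared, so each call stacks another
  // closure holding `inv` in memory until its own deadline fires.
  pendingTimers.push(setTimeout(() => invalidate(inv), ttlMs));
}
```

Each restore-and-re-execute cycle adds one more timer to the pile; under a persistent cache workload the pile never drains faster than it grows.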
The fix was a revert: PR #92320, merged 3 days after the original.
This is a normal story. Big project, reasonable PR, subtle bug that only manifests under specific conditions. What's unusual is that we have a record of what Ensemble's AI reviewers said about it before it merged.
What Ensemble found
Ensemble ran 3 independent reviewers against this PR: a Senior Code Reviewer, a Security Engineer, and an Architecture Reviewer. Each reviews the diff independently, with different system prompts and different concerns. Here's what they said.
Three independent reviewers. All three flagged something related to the same area of the code -- the timer/invalidation/TTL logic. The Senior Reviewer and Architecture Reviewer both specifically called out unbounded timer accumulation. The Security Engineer flagged a related numeric issue in the same TTL code path.
The actual production bug? Unbounded allocation on session restore. That's almost word-for-word what the reviewers flagged.
The gap
So the system flagged it. All three reviewers raised concerns about the exact code path that broke. Why did it ship?
Because Ensemble marked it PASS.
Right now, Ensemble has no Trust Score threshold. No merge gate. Reviews run, findings get generated, and the system outputs a pass/fail verdict based on... nothing, really. There's no numeric threshold that says "if 3 out of 3 reviewers flag resource exhaustion, block the merge." The concerns were surfaced as comments. Comments that live alongside dozens of other AI-generated observations. Comments that look like every other code review suggestion.
This is the core problem: findings without a gate are just more noise.
Developers already ignore most automated comments. If you've used any AI code review tool, you know the pattern -- the bot posts 15 comments, 13 are style nits, 1 is mildly useful, and maybe 1 matters. After a few PRs, you train yourself to scroll past all of them. The signal-to-noise ratio makes it rational to ignore the output.
We had the right findings and the wrong delivery mechanism. The reviewers did their job. The system didn't do its job -- because the system's job is to turn findings into decisions, and we hadn't built that part yet.
What we're building
The lesson from this PR is straightforward: detection without enforcement is a suggestion. We need to close the loop.
Trust Score (0-100)
Every PR gets a numeric score derived from the ensemble output. Not a vibes-based "looks good" -- a computed score that accounts for:
- How many independent reviewers flagged the same concern (consensus weight)
- Severity classification of each finding (resource exhaustion > style nit)
- Whether findings cluster around the same code path (correlated risk)
- Whether the PR touches high-risk patterns (concurrency, memory management, auth)
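The factors above can be sketched as a scoring function. This is a hypothetical shape -- the weights, field names, and severity scale are illustrative assumptions, not Ensemble's actual formula:

```typescript
// Hypothetical Trust Score sketch. Weights and thresholds are illustrative
// assumptions, not Ensemble's actual formula.
interface Finding {
  reviewer: string;         // which independent reviewer raised it
  severity: number;         // 1 = style nit ... 3 = resource exhaustion
  codePath: string;         // file/function the finding clusters around
  highRiskPattern: boolean; // concurrency, memory management, auth, ...
}

function trustScore(findings: Finding[]): number {
  // Cluster findings by code path so correlated risk is scored together.
  const byPath = new Map<string, Finding[]>();
  for (const f of findings) {
    const group = byPath.get(f.codePath) ?? [];
    group.push(f);
    byPath.set(f.codePath, group);
  }

  let penalty = 0;
  for (const group of byPath.values()) {
    const reviewers = new Set(group.map((f) => f.reviewer)).size;
    const maxSeverity = Math.max(...group.map((f) => f.severity));
    let p = maxSeverity * 4 * reviewers;   // severity x consensus weight
    if (reviewers >= 2) p *= 1.5;          // independent convergence on one path
    if (group.some((f) => f.highRiskPattern)) p += 10;
    penalty += p;
  }
  return Math.max(0, Math.round(100 - penalty));
}
```

With the three findings from this PR (all on the TTL/invalidation path, severities 2-3, high-risk patterns) this sketch lands in the 30s, while a lone style nit stays in the 90s -- the convergence term is what does the work.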
In this case, 3/3 reviewers flagging resource-related issues in the same timer/invalidation code path would have produced a low Trust Score. Not because any single finding was a showstopper, but because the convergence across independent reviewers indicates real risk.
GitHub Check Run
The Trust Score gets reported as a GitHub Check Run -- the same mechanism used by CI, linters, and test suites. If the score falls below the repo's configured threshold, the check fails and the PR cannot merge (assuming branch protection requires passing checks).
This is the difference between "we posted a comment" and "we blocked the merge." The developer can still override it -- dismiss the check, lower the threshold, merge with admin privileges. But the default path changes from "ignore and merge" to "acknowledge the risk and proceed."
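Downstream of the score, the gate itself reduces to a small decision function. A sketch -- the `conclusion` values match what the GitHub Checks API accepts, but the function and summary text are illustrative:

```typescript
// Sketch of the merge-gate decision that would be reported as a GitHub
// Check Run. "success" and "failure" are valid Checks API conclusion
// values; everything else here is an illustrative assumption.
interface CheckVerdict {
  conclusion: "success" | "failure";
  summary: string;
}

function gate(trustScore: number, threshold: number): CheckVerdict {
  const conclusion = trustScore >= threshold ? "success" : "failure";
  return {
    conclusion,
    summary: `Ensemble: Trust Score ${trustScore} (repo threshold ${threshold})`,
  };
}
```

With branch protection requiring this check, a `failure` conclusion is what turns a comment into a blocked merge.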
Replay this PR with the gate in place: the 3 reviewers flag resource exhaustion, race conditions, and overflow risk in the same code path. The Trust Score computes to, say, 38/100. The repo threshold is set to 65. The GitHub Check fails. The PR shows a red X.
The author sees: "Ensemble: Trust Score 38 -- 3/3 reviewers flagged unbounded resource allocation in session restore path." They can still merge. But they have to make a conscious decision to override the gate, which is a fundamentally different interaction than scrolling past a comment.
The honest assessment
We're not claiming we would have prevented this bug. A gate changes the default, but a determined developer can override any gate. What we are claiming is simpler: the information was there and the system didn't use it. That's a solvable engineering problem.
Catch the bugs your team scrolls past
Ensemble runs multiple independent AI reviewers on every PR. Trust Score and merge gates are coming soon. Install now and the gate will be there when it ships.