The miss
Deno PR #32959 -- "fix(ext/node): support numeric FDs in child_process stdio array" -- looked like a reasonable compatibility fix. Deno's child_process wasn't handling numeric file descriptors in the stdio array the way Node does. The PR fixed that.
We ran it through Ensemble. Three independent AI reviewers analyzed the diff. All three said PASS. Zero issues found.
The PR introduced a serious bug. It changed how numeric values in the stdio array were interpreted -- from Rid(ResourceId) (Deno's internal resource table ID) to Fd(i32) (raw OS file descriptor). Same integer type, completely different meaning. This broke every tool that relied on Deno's resource ID semantics for child process stdio, including Claude Code. It was reverted two days later.
Here's the core of the change:
```diff
 // ext/node/ops/child_process.rs
 // How numeric stdio values were deserialized:

-StdioOrRid::Rid(ResourceId)  // Deno resource table ID
+StdioOrRid::Fd(i32)          // Raw OS file descriptor

 // The integer 3 used to mean "Deno resource #3".
 // Now it means "OS file descriptor 3".
 // These are completely different things.
```
```diff
 fn as_raw_fd(&self, state: &mut OpState) -> Result<i32> {
     match self {
-        StdioOrRid::Rid(rid) =>
-            Ok(FileResource::get_fd(state, *rid)?),
+        StdioOrRid::Fd(fd) =>
+            Ok(*fd), // skips the resource lookup entirely
     }
 }

 // dup() is then called on this fd.
 // If the value was a resource ID, not a real fd,
 // dup() either fails or clones the wrong descriptor.
```
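The failure mode generalizes beyond Deno: when two meanings share one representation, the compiler can't tell them apart, but newtype wrappers can force the confusion into an explicit, reviewable cast. A simplified sketch of the idea — the `Rid`/`Fd` structs and the resource table here are our illustration, not Deno's actual types:

```rust
// Two newtypes with the same underlying representation but different meanings.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rid(u32); // index into an internal resource table

#[derive(Debug, Clone, Copy, PartialEq)]
struct Fd(i32); // raw OS file descriptor

// Hypothetical resource table: resource ID -> underlying fd.
fn fd_for_rid(table: &[Fd], rid: Rid) -> Option<Fd> {
    table.get(rid.0 as usize).copied()
}

fn main() {
    // Resource #0 happens to wrap OS fd 7.
    let table = vec![Fd(7)];
    let rid = Rid(0);

    // Correct: translate through the resource table first.
    assert_eq!(fd_for_rid(&table, rid), Some(Fd(7)));

    // The buggy PR effectively reinterpreted the same integer
    // under new semantics -- the equivalent of this cast:
    let confused = Fd(rid.0 as i32);
    assert_ne!(confused, Fd(7)); // fd 0 is stdin, not resource #0's fd

    // With distinct newtypes, passing a Rid where an Fd is expected
    // does not compile; the cast above must be written out explicitly.
    println!("{:?} -> {:?}, not {:?}", rid, fd_for_rid(&table, rid), confused);
}
```

The point of the sketch: a plain `i32`-to-`i32` rename is invisible to the type checker, so catching it falls entirely on review — which is exactly where our prompts had a gap.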
Our reviewers saw the diff. They saw the rename. They said it looked fine. They were wrong.
Why we missed it
We investigated the root causes and found three specific prompt-level failures.
Nobody checked the callers. Whether a value was a Rid or an Fd depended on what upstream and downstream code passed in, but the rename looked like cosmetic cleanup because the models never looked beyond the diff at the call sites.
ResourceId and i32 are both integers. The type signature didn't change in a way that trips a compiler error. But the semantic contract -- what that integer represents -- changed completely. No prompt told the reviewers to look for this.
The common thread: all three failures were prompt gaps, not model capability gaps. The models could have caught this. We didn't tell them to look for it.
The fix: 6 prompt changes, same day
We made six changes to the review pipeline within hours of the investigation.
- Contract change detection added to the Senior Code Reviewer: "When a type changes what values mean, verify all callers handle the new semantics"
- Assumptions check: "When a flag is flipped or a guard removed, ask why it was that way before"
- Type confusion added to the Security Engineer: "When a value's semantic meaning changes while its representation stays the same, flag it"
- Softened diff-only constraint: Changed "look ONLY at the diff" to "start with the diff, but investigate callers when you see a contract change"
- New Skeptic Reviewer persona whose job is to argue the previous code was correct
- Language-aware model floor: skip the cheap model for Rust, C, and Go files -- these languages have enough subtlety that the cheapest models miss critical context
Then we re-ran the same PR through the upgraded pipeline.
The catch
The contract change detection -- root cause #2 in the investigation -- worked exactly as designed. The Senior Reviewer explicitly named the before/after semantics. The Security Engineer classified it as type confusion. The Architecture Reviewer flagged the namespace collision.
What still doesn't work
We're not writing a victory lap.
We're not claiming 100% catch rate. That's not the point. The point is: when we miss, we investigate, we improve, and we re-run. The pipeline gets better because the failures are specific and fixable.
The thesis
Most AI code review tools run the same prompt on every PR, forever. If the prompt misses a bug class, it will miss it every time. The prompts become a fixed ceiling on what the tool can catch.
We think AI code review should work more like a team of engineers: when something gets through, you do a post-mortem, you update your process, and you verify the fix against the original failure. Not once. On every miss, forever.
This Deno PR was a miss. The investigation took hours, not days. The fix was six prompt changes. The verification was re-running the same PR and seeing 18 issues flagged where there had been zero. That cycle -- miss, investigate, fix, verify -- is the product.
AI review that learns from its mistakes
Ensemble runs multiple independent AI reviewers on every PR. When it misses, we investigate and upgrade the pipeline. Install it and the reviews get better over time.