The miss
Deno PR #32959 -- "fix(ext/node): support numeric FDs in child_process stdio array" -- looked like a reasonable compatibility fix. Deno's child_process wasn't handling numeric file descriptors in the stdio array the way Node does. The PR fixed that.
We ran it through Ensemble. Three independent AI reviewers analyzed the diff. All three said PASS. Zero issues found.
The PR introduced a serious bug. It changed how numeric values in the stdio array were interpreted -- from Rid(ResourceId) (Deno's internal resource table ID) to Fd(i32) (raw OS file descriptor). Same integer type, completely different meaning. This broke every tool that relied on Deno's resource ID semantics for child process stdio, including Claude Code. It was reverted two days later.
Here's the core of the change:
```diff
 // ext/node/ops/child_process.rs
 // How numeric stdio values were deserialized:

-StdioOrRid::Rid(ResourceId)  // Deno resource table ID
+StdioOrRid::Fd(i32)          // Raw OS file descriptor

 // The integer 3 used to mean "Deno resource #3".
 // Now it means "OS file descriptor 3".
 // These are completely different things.
```
```diff
 fn as_raw_fd(&self, state: &mut OpState) -> Result<i32> {
     match self {
-        StdioOrRid::Rid(rid) =>
-            Ok(FileResource::get_fd(state, *rid)?),
+        StdioOrRid::Fd(fd) =>
+            Ok(*fd), // skips the resource lookup entirely
     }
 }

 // dup() is then called on this fd.
 // If the value was a resource ID, not a real fd,
 // dup() either fails or clones the wrong descriptor.
```
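The failure mode generalizes beyond Deno: when two meanings share one representation, the compiler can't tell them apart, but newtype wrappers can force the confusion into an explicit, reviewable cast. A simplified sketch of the idea — the `Rid`/`Fd` structs and the resource table here are our illustration, not Deno's actual types:

```rust
// Two newtypes with the same underlying representation but different meanings.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rid(u32); // index into an internal resource table

#[derive(Debug, Clone, Copy, PartialEq)]
struct Fd(i32); // raw OS file descriptor

// Hypothetical resource table: resource ID -> underlying fd.
fn fd_for_rid(table: &[Fd], rid: Rid) -> Option<Fd> {
    table.get(rid.0 as usize).copied()
}

fn main() {
    // Resource #0 happens to wrap OS fd 7.
    let table = vec![Fd(7)];
    let rid = Rid(0);

    // Correct: translate through the resource table first.
    assert_eq!(fd_for_rid(&table, rid), Some(Fd(7)));

    // The buggy PR effectively reinterpreted the same integer
    // under new semantics -- the equivalent of this cast:
    let confused = Fd(rid.0 as i32);
    assert_ne!(confused, Fd(7)); // fd 0 is stdin, not resource #0's fd

    // With distinct newtypes, passing a Rid where an Fd is expected
    // does not compile; the cast above must be written out explicitly.
    println!("{:?} -> {:?}, not {:?}", rid, fd_for_rid(&table, rid), confused);
}
```

The point of the sketch: a plain `i32`-to-`i32` rename is invisible to the type checker, so catching it falls entirely on review — which is exactly where our prompts had a gap.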
Our reviewers saw the diff. They saw the rename. They said it looked fine. They were wrong.
Why we missed it
We investigated the root causes and found three specific prompt-level failures.
Nobody checked the callers. Whether a value was a Rid or an Fd depended on what upstream and downstream code passed in, but the rename looked like cosmetic cleanup because the models never looked beyond the diff at the call sites.
ResourceId and i32 are both integers. The type signature didn't change in a way that trips a compiler error. But the semantic contract -- what that integer represents -- changed completely. No prompt told the reviewers to look for this.
The common thread: all three failures were prompt gaps, not model capability gaps. The models could have caught this. We didn't tell them to look for it.
The fix: 6 prompt changes, same day
We made six changes to the review pipeline within hours of the investigation.
- Contract change detection added to the Senior Code Reviewer: "When a type changes what values mean, verify all callers handle the new semantics"
- Assumptions check: "When a flag is flipped or a guard removed, ask why it was that way before"
- Type confusion added to the Security Engineer: "When a value's semantic meaning changes while its representation stays the same, flag it"
- Softened diff-only constraint: Changed "look ONLY at the diff" to "start with the diff, but investigate callers when you see a contract change"
- New Skeptic Reviewer persona whose job is to argue the previous code was correct
- Language-aware model floor: skip the cheap model for Rust, C, and Go files -- these languages have enough subtlety that the cheapest models miss critical context
Then we re-ran the same PR through the upgraded pipeline.
The catch
The contract change detection -- root cause #2 in the investigation -- worked exactly as designed. The Senior Reviewer explicitly named the before/after semantics. The Security Engineer classified it as type confusion. The Architecture Reviewer flagged the namespace collision.
What still doesn't work
We're not writing a victory lap.
We're not claiming 100% catch rate. That's not the point. The point is: when we miss, we investigate, we improve, and we re-run. The pipeline gets better because the failures are specific and fixable.
The thesis
Most AI code review tools run the same prompt on every PR, forever. If the prompt misses a bug class, it will miss it every time. The prompts become a fixed ceiling on what the tool can catch.
We think AI code review should work more like a team of engineers: when something gets through, you do a post-mortem, you update your process, and you verify the fix against the original failure. Not once. On every miss, forever.
This Deno PR was a miss. The investigation took hours, not days. The fix was six prompt changes. The verification was re-running the same PR and seeing 18 issues flagged where there had been zero. That cycle -- miss, investigate, fix, verify -- is the product.
AI review that learns from its mistakes
Ensemble runs multiple independent AI reviewers on every PR. When it misses, we investigate and upgrade the pipeline. Install it and the reviews get better over time.