The simple model for LLM hallucination in an AI coding agent pipeline is a Bernoulli trial. Each field the Planner emits into a manifest is an independent roll of a weighted die. At twenty fields and a 0.95 per-field success rate, the joint probability of a fully correct manifest is 0.95 to the twentieth, roughly 36 percent. (An earlier post originally quoted this as about 60 percent, which is the math for ten fields; 36 percent is the correct number for twenty. That post is being corrected alongside this one.) That earlier post used this framing to argue that two in three tickets failing in an agent pipeline is a schema problem, not a model problem. I went back through the run archive to see how close the model came to reality.
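The arithmetic is worth making concrete. A minimal sketch of the independence model, using only the numbers above (toy arithmetic, not pipeline code):

```ts
// Independence model: P(fully correct manifest) = p^n for n independent fields.
const perFieldAccuracy = 0.95;

const jointProbability = (fields: number): number =>
  Math.pow(perFieldAccuracy, fields);

console.log(jointProbability(10).toFixed(2)); // "0.60": the earlier post's ten-field figure
console.log(jointProbability(20).toFixed(2)); // "0.36": the corrected twenty-field figure
console.log(jointProbability(15).toFixed(2)); // "0.46": twenty fields with five machine-extracted (used later in the post)
```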
Across 248 tickets run through the pipeline between April 1 and April 14, 2026, the first-pass success rate was 21 percent. The prediction missed by 15 points.
The gap has two plausible explanations, and both are probably true. Either per-field accuracy is lower than 0.95 in practice, or the fields are not independent: one wrong fact cascades into multiple broken acceptance criteria, so a single hallucination in the manifest takes several downstream checks with it. Together the two set a ceiling on how far per-field accuracy work alone can carry the pipeline.
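One way to size the first explanation is to invert the model: under independence, what per-field accuracy would a 21 percent first-pass rate imply? A quick back-solve with the post’s numbers (again toy arithmetic, not pipeline code):

```ts
// Invert the independence model: solve p^n = observedRate for p.
const impliedPerFieldAccuracy = (observedRate: number, fields: number): number =>
  Math.pow(observedRate, 1 / fields);

// 0.21^(1/20) is about 0.925: roughly three points below the assumed 0.95.
// Under independence, that drop alone accounts for the whole 15-point miss,
// before any dependence between fields enters the picture.
console.log(impliedPerFieldAccuracy(0.21, 20).toFixed(3)); // "0.925"
```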
What the 248 runs actually show
The data covers several architectural iterations of the pipeline, so it is a historical aggregate, not a controlled A/B. The tickets come heavily from a small fixture family, which means the absolute rate is fixture-contingent. A project with a deeper import graph could push it higher; a simpler one could push it lower. The shape of the result, meaning which classes of failure dominate, does not depend on the fixture. Field-level hallucination is a property of how the Planner interacts with the manifest schema, not of what any particular ticket asks for.
The breakdown:
- 21 percent (53 of 248) completed without the Debugger running at all. The manifest was accurate enough that the Test Writer, the Coder, and the Reviewer produced passing, reviewed code on first pass.
- A further 3 percent recovered after the Debugger diagnosed the failure and the Planner re-planned. Total end-to-end completion: 24 percent.
- 27 percent reached the Debugger and failed anyway. The Debugger named a cause, but no valid manifest emerged within the attempt budget.
- The remaining 49 percent stalled before the Coder stage. Planner timeouts, pre-scope selection failures, iteration pauses. Not all are builder failures, but none produced working output.
Why per-field fixes have a ceiling
Of the 67 tickets where the Debugger diagnosed a failure in detail, about two-thirds were field-level hallucinations. The Planner emitted a wrong filepath, a missing export, a bad function signature, or a type field that did not exist on the object the test was written against. The Coder followed the bad instruction faithfully into a wall. This is exactly the class of failure machine extraction is designed to eliminate, and on the fields it has been applied to, it does. Every field moved from “the Planner emits it” to “the registry computes it” is a dice roll that never has to be re-rolled.
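Concretely, moving a field looks something like the sketch below. The registry interface and every name in it (`SymbolRegistry`, `resolveExport`, `populateSymbolFields`) are hypothetical, for illustration only; the shape of the change is the point.

```ts
// Hypothetical registry interface: an illustration, not the pipeline's actual API.
interface SymbolRegistry {
  // Returns the file that actually exports the symbol, or null if none does.
  resolveExport(symbolName: string): { filePath: string; signature: string } | null;
}

// Before: the Planner emits `filePath` and `signature` as free text, two dice rolls.
// After: the Synthesizer computes both from the registry, zero dice rolls.
function populateSymbolFields(registry: SymbolRegistry, symbolName: string) {
  const resolved = registry.resolveExport(symbolName);
  if (resolved === null) {
    // A symbol the registry cannot resolve is a planning error surfaced
    // immediately, not a hallucinated path handed to the Coder.
    throw new Error(`No export named ${symbolName} found in the registry`);
  }
  return { filePath: resolved.filePath, signature: resolved.signature };
}
```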
But not every failure is a bad dice roll.
The other third is where the ceiling comes from.
Some failures are instruction ambiguity. The Planner’s scope field leaves room for the Coder to interpret “implement this new function” as “replace the file with this new function.” One such incident took out five working exports from a single module and cascaded into 44 failing tests across three suites. That is not a field-accuracy problem. Rewriting the filepath field does not stop it.
Others are structurally unsatisfiable criteria. An earlier post used the vi.mock same-module-binding example: a test criterion that no builder can satisfy regardless of which field values are plugged in. Machine extraction of the mock target does not fix that either.
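For readers who have not hit it, this is the shape of that class of failure, as a minimal reconstruction (the earlier post’s exact code may differ):

```ts
// math.ts
export function double(n: number) {
  return n * 2;
}
export function quadruple(n: number) {
  return double(double(n)); // calls the module-local binding directly
}

// math.test.ts
import { vi, test, expect } from 'vitest';
import { double, quadruple } from './math';

vi.mock('./math', async (importOriginal) => {
  const mod = await importOriginal<typeof import('./math')>();
  return { ...mod, double: vi.fn(() => 100) };
});

// Unsatisfiable criterion: the mock replaces the exported `double` binding,
// but `quadruple` closes over the module-local one, so the mock is never
// invoked. No choice of manifest field values can make this test pass.
test('quadruple uses the mocked double', () => {
  quadruple(2);
  expect(vi.mocked(double)).toHaveBeenCalled(); // fails: zero calls recorded
});
```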
At twenty fields and 0.95 per-field accuracy, moving five fields from LLM emission to machine extraction lifts the joint probability from 36 percent to 46 percent under the model. That is a real gain. But the observed first-pass rate already sits below the 36-percent prediction, and the failures that will remain after per-field work are not on the same curve. Another round of per-field fixes will not close the gap. The remaining third needs structural validators, not tighter prompts.
The Lego Instructions
A Lego instruction set has three properties:
- Every piece is in the box. You never need to find a piece. The instruction tells you exactly which bag it’s in, what color, what shape, and where it connects. The builder never hunts.
- The instructions are unambiguous. Step 14 never says “attach something blue-ish somewhere near the top.” It shows the exact piece, exact position, exact orientation. The builder never decides.
- Impossible assemblies don’t appear. No instruction asks you to connect two pieces that physically can’t click together. The instruction designer verified fit before publishing.
Mapped to an agent pipeline:
- Every piece in the box: file paths, symbols, signatures, and dependency resolutions are machine-extracted and populated into the manifest by a Synthesizer, not requested from the Planner.
- Unambiguous steps: each manifest field has a single source of truth. The Test Writer and the Coder never choose between two plausible interpretations of the same field.
- No impossible assemblies: a feasibility gate resolves each acceptance criterion against the registry and the language server before any builder fires, and rejects criteria that no test could satisfy on the current source topology (a sketch follows this list).
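A sketch of what that gate could look like. Every name here (`Criterion`, `checkFeasibility`, the registry and language-server handles) is hypothetical; the contract that matters is that an infeasible criterion is rejected with a reason at manifest-validate time, before any builder spends tokens.

```ts
// Hypothetical types: an illustration of the gate's contract, not pipeline code.
interface Criterion {
  id: string;
  targetSymbol: string;
}
interface Registry {
  hasExport(symbol: string): boolean;
}
interface LanguageServer {
  signatureOf(symbol: string): string | null;
}

type GateResult = { feasible: true } | { feasible: false; reason: string };

function checkFeasibility(
  criterion: Criterion,
  registry: Registry,
  lsp: LanguageServer,
): GateResult {
  // Reject criteria whose target does not exist on the current source topology.
  if (!registry.hasExport(criterion.targetSymbol)) {
    return { feasible: false, reason: `${criterion.targetSymbol} is not exported anywhere` };
  }
  // Reject criteria the language server cannot type; no test could compile against them.
  if (lsp.signatureOf(criterion.targetSymbol) === null) {
    return { feasible: false, reason: `${criterion.targetSymbol} has no resolvable signature` };
  }
  return { feasible: true };
}
```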
The principle that follows is that the Planner’s job is to make the builders’ job trivial. Whenever the Coder is searching for a file, locating a symbol, or picking between two plausible interpretations, that is brain work leaking downstream. The manifest was incomplete. The fix is to push the work back up to the stage that has the context to answer the question deterministically, not to prompt the builder harder. The architectural reasoning behind this principle is developed in The Lego Instructions.
What comes next
The design currently on paper splits the manifest into two disjoint surfaces. One surface holds the Planner’s actual decisions: what to build, which symbols are in scope, what the acceptance criteria are. The other surface holds the facts: file paths, signatures, dependency resolutions, mock targets. The Planner only writes to the first. A Synthesizer populates the second from the registry, the language server (LSP), and Tree-Sitter before any builder sees the manifest. A feasibility gate runs in parallel, resolving each acceptance criterion against the same sources and rejecting unsatisfiable ones at manifest-validate time, with a structured alternative, before a single token is spent on the Test Writer.
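As a type sketch, under the assumption that the field names below are illustrative rather than the real schema, the split might look like this; the constraint that matters is that the two surfaces are disjoint and only the Synthesizer writes the second.

```ts
// Surface one: the Planner's decisions. Written by the LLM, validated by schema.
interface PlannerSurface {
  goal: string;                  // what to build
  symbolsInScope: string[];      // which symbols the change may touch
  acceptanceCriteria: string[];  // what "done" means
}

// Surface two: machine facts. Written only by the Synthesizer, never by the LLM.
interface SynthesizedSurface {
  filePaths: Record<string, string>;   // symbol -> file, from the registry
  signatures: Record<string, string>;  // symbol -> signature, from the language server
  mockTargets: Record<string, string>; // criterion -> resolved mock target
}

// Builders see the union; the write paths stay disjoint.
type Manifest = PlannerSurface & SynthesizedSurface;
```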
The gate is on paper. The registry, the language server, and the call-graph primitives are already live and already used to populate individual fields. The thing being built next is the gate itself and the split-surface manifest that feeds it. Whether it holds up on the next 248 runs is the measurement worth the follow-up post.
The pipeline runs on real tickets against a TypeScript fixture and a ~100k line TypeScript monorepo. The 248 runs are drawn from the pipeline’s run archive. The feasibility gate described above has since shipped; the measured results belong in a follow-up post. Still R&D.