Stop Asking the Model What the Code Already Knows

Non-determinism in an LLM-based agent pipeline is usually framed as a model problem. Better model, better prompt, better temperature setting. In practice, most of the non-determinism I see in a code-generation pipeline is a schema problem: the pipeline asks the model to emit fields the codebase already knows, and every one of those fields is a chance for the model to hallucinate.

Every field an agent pipeline asks an LLM to emit is a Bernoulli trial. At ninety-five percent per-field success and twenty such fields per manifest, the joint probability of a fully correct manifest is about 36 percent. (Originally published as 60 percent, which is the correct figure at ten fields, not twenty. Corrected April 2026 alongside a follow-up post measuring 248 runs against the model.) That matches what it feels like when two tickets in three fail for no obvious reason.
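
The arithmetic, for anyone who wants to re-run the correction:

```typescript
// Joint probability of a fully correct manifest, treating each
// emitted field as an independent Bernoulli trial at 95% success.
const perField = 0.95;
console.log(perField ** 10); // ≈ 0.599 — the originally published figure
console.log(perField ** 20); // ≈ 0.358 — the corrected one
```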

The fix is not a better model or a tighter prompt. It is to stop asking the model for fields the codebase already knows, and to supply those fields directly from a symbol registry, a Tree-Sitter parse, and a language server (LSP) instead.

Facts versus decisions

The rule for deciding which fields a Planner should emit is simple.

Every fact a model copies is a fact it can hallucinate. The Planner decides what to build. The pipeline supplies what exists. When the two conflict, machine-extracted data wins over model-generated constraints, which win over model narrative.

Most of the fields a Planner currently emits are facts, not decisions: the path a symbol lives at, the signature of the function it calls, the module a dependency resolves through. The Planner is making judgment calls about decomposition and intent. In the same output it is also being asked to remember the repo’s file layout from the prompt it was given. It does the first job well and the second job probabilistically. The probabilistic part is where the failing manifests come from.

The infrastructure to make those fields deterministic is already in most coding agents’ stacks. It is just not wired to replace the model’s emission of them. Tree-Sitter gives an abstract syntax tree (AST) with exact byte ranges. A symbol registry built on top of it gives O(1) symbol-to-file lookups. A language server gives cross-file semantics: definitions, references, incoming calls, type resolution across barrel re-exports. Between those three primitives, most of the fields a Planner hallucinates are already sitting in process, addressable by a lookup.
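
A minimal sketch of the registry half of that wiring; the names here are illustrative, not the pipeline’s actual API:

```typescript
// Illustrative sketch, not the pipeline's real API: a symbol registry
// built on top of a Tree-Sitter parse, giving O(1) symbol-to-file lookups.
interface SymbolEntry {
  file: string;                 // defining file, taken from the parse
  byteRange: [number, number];  // exact byte range of the definition
  signature: string;            // extracted signature text
}

class SymbolRegistry {
  private index = new Map<string, SymbolEntry>();

  register(name: string, entry: SymbolEntry): void {
    this.index.set(name, entry);
  }

  lookup(name: string): SymbolEntry | undefined {
    return this.index.get(name); // the O(1) lookup the post relies on
  }
}
```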

[Diagram: the Planner’s output schema split into two columns. Left, the fields the model still decides: title, summary, sub-ticket decomposition, the symbols that change, and acceptance criteria. Right, the fields the pipeline now supplies from the registry and the language server: symbol paths, dependency signatures, blast-radius callers, mock targets, the files-to-read list, and test layer. A full-width arrow below both columns points left, labelled “never emitted, never hallucinated”.]

The case where machine extraction worked: blast-radius callers

The first field removed from the Planner’s output was the list of blast-radius callers.

Before the change, a Planner reading a ticket like “change how getAllPosts returns data” had to guess which files called getAllPosts. Sometimes it got it right from context. Sometimes it missed search.ts because the ticket did not mention search, and the Coder shipped a return-type change that broke the type checker eleven places over. The Debugger then spent thirty thousand tokens recovering from a question the language server can answer in one request.

After the change, the registry holds the symbol, the language server holds the callers, and the Planner is never asked to emit the list. The pipeline reads it.
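
In sketch form, with `send` standing in for whatever JSON-RPC client the pipeline uses (the two method names are standard LSP; everything else here is illustrative):

```typescript
// `textDocument/prepareCallHierarchy` and `callHierarchy/incomingCalls`
// are standard LSP methods; `send` is a stand-in JSON-RPC client.
declare function send(method: string, params: unknown): Promise<any>;

async function blastRadiusCallers(
  uri: string,
  line: number,
  character: number,
): Promise<string[]> {
  // Resolve the symbol at its definition site into a call-hierarchy item...
  const [item] = await send("textDocument/prepareCallHierarchy", {
    textDocument: { uri },
    position: { line, character },
  });
  // ...then ask the server for everything that calls it. search.ts shows
  // up in this list whether or not the ticket mentioned search.
  const incoming = await send("callHierarchy/incomingCalls", { item });
  return incoming.map((call: any) => call.from.uri);
}
```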

The run that proved it was a pagination ticket on a TypeScript blog fixture. Pre-LSP: eleven typecheck errors in search.ts, the caller the Planner had missed, a Debugger cycle, 92,618 tokens, 122 seconds. Post-LSP: zero search.ts errors, a clean typecheck, 95,170 tokens, 97 seconds. The blast-radius overhead, thirty-three thousand extra input tokens and seventy-four extra seconds of Debugger recovery, disappeared from the post-LSP run. The Debugger still fired on the second run, but on a genuine logic error: a validation check placed after a clamping function so it could never fire. That is the kind of bug a Debugger is for. Type errors from a missed caller are not. An earlier post (“What Calls This Function?”) walks through the full before-and-after for readers who want the stage-by-stage breakdown.

The pattern generalises. Any time the Planner is emitting a fact that Tree-Sitter, the registry, or the language server can compute directly, the field can be removed from the prompt’s output schema entirely. The model does not emit it. The pipeline materialises it. There is no version where the model gets it wrong, because the model is not the one answering the question.
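
Concretely, removing a field means the output type shrinks and a deterministic materialisation step grows. A sketch under assumed names; `lspCallers` and `registryFiles` are hypothetical helpers, not the pipeline’s real ones:

```typescript
// What the Planner used to emit: decisions mixed with facts.
type PlannerBefore = {
  title: string;
  symbolsToModify: string[];    // decision — stays with the model
  blastRadiusCallers: string[]; // fact — the language server knows this
  filesToRead: string[];        // fact — the registry knows this
};

// What the Planner emits now: decisions only.
type PlannerAfter = Pick<PlannerBefore, "title" | "symbolsToModify">;

// Hypothetical helpers — stand-ins for the registry and LSP calls.
declare function lspCallers(symbol: string): string[];
declare function registryFiles(symbols: string[]): string[];

// The pipeline materialises the facts; the model never sees the fields.
function materialise(m: PlannerAfter): PlannerBefore {
  return {
    ...m,
    blastRadiusCallers: m.symbolsToModify.flatMap(lspCallers),
    filesToRead: registryFiles(m.symbolsToModify),
  };
}
```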

The case still teaching me where the line is: same-module lexical bindings

Not every field is as clean to mechanise as blast-radius callers.

A recent run produced a manifest that declared a mock target at the module path vi.mock("@api/pipeline/index", ...). That module is a barrel re-export of ./processor. The test then imported the real processItem from the source module and asserted on mock call counts for its siblings summarizeItem and buildOutput. The pipeline shipped a correct Coder implementation. Two tests still went red: processItem’s internal call to summarizeItem is a same-module lexical binding, and vi.mock cannot intercept those regardless of which module it is pointed at.
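
Reduced to a skeleton (file contents simplified; only the topology matters), the trap looks like this:

```typescript
// processor.ts — the source module behind the @api/pipeline barrel.
export function summarizeItem(item: string): string {
  return item.trim();
}

export function processItem(item: string): string {
  // This call resolves through the module's own lexical scope, not
  // through the module record that vi.mock swaps out. No mock target,
  // barrel or source, can intercept it.
  return summarizeItem(item).toUpperCase();
}
```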

The cost of discovering this at test-run time rather than at manifest-validate time was a full Test Writer cycle, a full Coder cycle, and two retry attempts on tests that could never go green on this source topology. Roughly forty thousand tokens and several minutes of wall clock, per incident, burned on work that was doomed from the moment the criterion was written. Multiply that by how often the class fires across a real project and the case for catching it pre-builder is straightforward.

The obvious-looking mechanisation, “let the pipeline derive the mock target from the registry,” does not fix this. Pointing the mock at the source module either leaves the binding problem intact, or, if it also replaces processItem with a spy, silently breaks the test’s real import. Both paths are structurally unsatisfiable.

The actual defect is not in the mock-target field. It is in the shape of the criterion. An assertion that a function “does not call summarizeItem” is unachievable on this source topology. The fix is to reject the criterion at manifest-validate time with a structured alternative. Rewrite as an HTTP status assertion. Rewrite as a database state assertion. Drop as out of scope. The only way to reject that deterministically is to resolve the subject and object of the criterion against the registry and check whether they share a source file. When they do, no mock target saves the test.
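
A minimal sketch of that check, under the caveat the next paragraph makes explicit, that the gate is a design on paper rather than shipped code:

```typescript
// Sketch of the feasibility gate — a design, not a shipped subsystem.
// A "subject does not call object" criterion is rejected when both
// symbols resolve to the same source file, because the call is a
// same-module lexical binding no mock can intercept.
interface Registry {
  lookup(name: string): { file: string } | undefined;
}

type Verdict =
  | { ok: true }
  | { ok: false; reason: string; alternatives: string[] };

function checkDoesNotCall(
  subject: string,
  object: string,
  registry: Registry,
): Verdict {
  const s = registry.lookup(subject);
  const o = registry.lookup(object);
  if (!s || !o) {
    return { ok: false, reason: "unresolved symbol", alternatives: [] };
  }
  if (s.file === o.file) {
    return {
      ok: false,
      reason: `${subject} and ${object} share ${s.file}`,
      alternatives: [
        "rewrite as an HTTP status assertion",
        "rewrite as a database state assertion",
        "drop as out of scope",
      ],
    };
  }
  return { ok: true };
}
```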

That is still machine extraction. It just operates on the criterion’s semantics, not the manifest’s surface syntax. And it is the piece I am currently designing rather than running; that is an honest statement of where this sits at the moment of writing. The registry, the reach data, and the language server are live. The feasibility gate that consumes them to reject unsatisfiable criteria is on paper. The scoped plan I was going to start from (mechanise one field: the mock target) turned out not to close this failure class. Working through why produced the design I actually want to build.

How machine extraction differs from output validation

Machine extraction is not the same as “add a validator.” Validators run after the model has emitted a value and ask whether it was right. A good validator is a backstop. A great pipeline does not need it most of the time. Machine extraction means the model never emits the value. The field is not in the prompt’s output schema. There is no value to validate, because there was no choice to make.

That shift, from “check the model’s work” to “do not give the model the work,” is where the determinism comes from.

A validator at ninety-five percent recall still lets the remaining five percent through. A field removed from the output surface lets zero through, because the surface is smaller.

A concrete example of the replacement. The pipeline used to emit a list of “files the Coder needs to read” from the Planner, and a validator checked that every called symbol’s defining file was in the list. Missed entries meant a rejection and a repeat Planner call. The validator caught most of them. It also loved to argue with the Planner over files the Planner had correctly judged irrelevant but the registry flagged anyway. The field no longer exists in the Planner’s output. The list is computed directly from the registry’s call graph, starting at the symbols the Planner decided to modify. The validator went with it. Nothing to argue about, because nothing was emitted.
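
The replacement computation, sketched; `callees` is an assumed field name on the registry entry, not the pipeline’s real one:

```typescript
// Sketch: files-to-read derived from the registry's call graph by
// walking outward from the symbols the Planner decided to modify.
interface CallGraphEntry {
  file: string;
  callees: string[]; // symbols this one calls — assumed shape
}

function filesToRead(
  roots: string[],
  lookup: (symbol: string) => CallGraphEntry | undefined,
): Set<string> {
  const files = new Set<string>();
  const seen = new Set<string>();
  const queue = [...roots];
  while (queue.length > 0) {
    const symbol = queue.shift()!;
    if (seen.has(symbol)) continue;
    seen.add(symbol);
    const entry = lookup(symbol);
    if (!entry) continue;
    files.add(entry.file);
    queue.push(...entry.callees);
  }
  return files; // deterministic: same inputs, same list, every run
}
```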

Every LLM-emitted field also carries a maintenance cost. When the schema changes, the prompt changes, the validator changes, the retry logic changes, the regression tests change. A field removed from the output surface is one fewer moving part in every one of those places. That is a separate win from the determinism one, and it compounds in the same direction.

The risks are real and worth naming. A field can look mechanical but turn out to carry a legitimate decision, and removing it strips judgment from a place that needed it. A field can be pre-populated correctly on the fixture it was tested against and wrong on the first polyglot project. Both have happened here. The answer is the same as with any other architectural constraint. Name what the primitive covers. Name what it does not. Write the incident down. Decide next time with the catalogue in hand.

The direction, though, is one-way. Every field moved from “the model emits it” to “the registry, the language server, or the syntax tree supplies it” is a dice roll that never has to be re-rolled. The next two fields on the list are the mock target, which moves once the feasibility gate has a first version running, and the dependency signatures, which the registry already stores and the Planner currently rewrites from the context it was given. Over a long enough pipeline, the compounding is the whole game. The architectural principle behind that direction is The Lego Instructions.


The pipeline runs inside Docker on real tickets against a TypeScript fixture and a ~100k line TypeScript monorepo. Numbers and incidents are from actual runs. The feasibility-gate piece is an active design, not a shipped subsystem, and is called out as such where it matters. Still R&D.