# Don't Code This At Home > Technical blog by Erik Perttu on autonomous engineering pipelines and AI systems. Summary index: [llms.txt](https://dontcodethisathome.com/llms.txt) ## About - [About Erik Perttu](https://dontcodethisathome.com/about): Swedish engineer based in Ho Chi Minh City. A decade shipping production software to a million users. Building autonomous engineering pipelines. ================================================================================ # No Stage Runs Forever: Retry Budgets and Escalation in an Agent Pipeline URL: https://dontcodethisathome.com/no-stage-runs-forever-retry-budgets-and-escalation-in-an-agent-pipeline Date: 2026-04-30 Description: How per-stage retry budgets, wall-clock timeouts, and a global token cap keep any stage from running indefinitely, with the Debugger as the most complex case. Tags: AI Engineering, Autonomous Engineering Pipeline, Agent Safety Dependencies: AI Agent Pipelines, LLM APIs, TypeScript Most write-ups of agent pipelines stop at the happy path. The Planner produced a manifest, the Coder wrote code, the tests passed. What happens when a stage fails, and what stops the pipeline from burning the entire token budget on a stuck retry loop, is the part the public literature on agentic pipelines mostly skips. The honest version of the design lives in those failure boundaries, not in the diagram of the happy path. The unwritten rule of any pipeline that runs without a human in the loop is that no stage runs forever. Every stage has a retry budget. Every retry has a wall-clock cap. The whole run sits inside a global token cap. When any of those is exhausted, the pipeline halts and the failure surfaces, rather than quietly spending more money on attempts that are not converging. This post walks through how that boundary is configured, stage by stage, on the pipeline I am building. The Debugger is the showcase, because it has the most failure complexity, but the same shape applies everywhere. ## The contract: bounded at every layer Three layers of bound run on every ticket: The outermost is the per-run token cap. The pipeline counts input tokens minus cache reads, plus output tokens, against that ceiling on every model call. When the running total crosses it, the call that would have exceeded the budget raises and the run aborts. The cap exists so a runaway stage cannot drain a wallet on its own. The next layer is the per-stage wall-clock timeout. Every stage has a timeout, in seconds, and the pipeline refuses to start if any stage is missing it. There is no hidden default. A missing wall-clock cap is the kind of thing that looks fine in a happy-path run and [silently lets a stuck stage hang for half an hour](https://dontcodethisathome.com/why-a-warning-is-worse-than-a-hard-stop) the first time something goes wrong. The innermost is the per-stage retry budget. The Coder gets three attempts by default. The Reviewer's redo cycles are capped. The Planner gets one re-plan when the Debugger routes a manifest issue back to it. Each one is a different field, but the shape is the same: a bounded number of attempts, a defined exit when the budget is gone. ![Concentric rings diagram showing three nested failure boundaries on every ticket. Outer ring labelled PER-RUN TOKEN CAP. Middle ring labelled PER-STAGE WALL-CLOCK TIMEOUT. Inner ring labelled PER-STAGE RETRY BUDGET. 
Three dashed arrows originate from the centre and extend outward, labelled HALT, SURFACE TO HUMAN, and ROUTE UPSTREAM.](https://dontcodethisathome.com/images/retry-budgets-bounded-at-every-layer.svg)

## Per-stage configuration is not optional

Every stage in the pipeline has its own configuration entry. A stage entry pins the vendor surface (which CLI or SDK the call goes through), the specific model within that vendor, the wall-clock timeout, the retry budget where applicable, the rung the stage moves to on retry, and what the pipeline does when the budget is exhausted. Stages can be on different vendors and different models in the same run. The Planner could run on Gemini Pro while the Debugger runs on Claude Sonnet, if cost and capability per stage suggest that split.

The shape, in pseudocode (illustrative, not a 1:1 with the live config):

```yaml
planner:
  vendor: anthropic
  model: opus               # the most capable model in the run
  timeout: 300s
  retry_budget: 1           # one re-plan on a routed manifest issue
  on_exhaust: surface_to_human
coder:
  vendor: anthropic
  model: sonnet
  timeout: 600s
  retry_budget: 3
  on_retry:
    compute: increased      # more compute on the same model
  on_exhaust: halt
debugger:
  vendor: anthropic
  model: sonnet
  timeout: 300s
  attaches_to: any_stage    # invoked on demand wherever runtime evidence exists
  on_retry:
    compute: increased
  can_route_to: any_stage   # mini-brain: routes the fix to whichever upstream stage owns it
reviewer:
  vendor: google
  model: gemini-flash       # cheap-and-fast for a scanning role
  timeout: 300s
  retry_budget: 1
```

The point of the snippet is the axes, not the syntax. Four stages shown, the same shape repeated, different values per stage. The vendor and model are independent of each other and independent of the other stages. The retry budget is bounded per stage. The retry rung names what changes on the next attempt. The exhaust behaviour names what happens when the budget is gone, and the answer is never "keep going."

Configuring stages independently is not cosmetic. The right model, the right cost-per-call, and the right retry budget depend on the work, not on a uniform setting. A previous post on [pipeline bugs that only surface at lower model effort](https://dontcodethisathome.com/four-pipeline-bugs-that-only-surface-at-lower-model-effort) showed what goes wrong when there is one global knob: lowering it exposes structural bugs the higher setting was silently compensating for. The fix in that case was structural. The per-stage config is the surface that makes the fix expressible. The Reviewer can run on a cheaper, lower-effort setting and the Coder can run with a higher escalation rung, in the same run, against the same ticket, because the failure profiles of those stages are not the same.

## The Debugger: most failure complexity in one stage

The Debugger is the most interesting stage to look at, because it is the one that runs precisely because something else has already failed. It does not fail on its own, and it does not sit at a fixed step in the pipeline. The Debugger is invoked on demand from any stage that has hit a failure with runtime evidence to inspect, and it returns a diagnosis the calling stage acts on. The most common attachment point is the Coder's retry loop, where the flow is: the Coder writes code, the tests run, the tests fail, the Debugger diagnoses, the Coder retries on the next attempt. But the same call shape applies anywhere a stage produces output that fails a runtime check.
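
In rough TypeScript, that attachment inside the Coder loop looks something like the sketch below. Every name, type, and return value is illustrative, not the live pipeline's API; the point is the shape of the bounds: a hard attempt ceiling, a wall-clock cap per attempt, a global token check before every call, and an early exit when the Debugger routes the fix upstream.

```typescript
// Minimal sketch of a bounded Coder loop with the Debugger attached on failure.
// All names and shapes are illustrative, not the live pipeline's API.
type Rung = "base" | "escalated";

interface Diagnosis {
  owner: "coder" | "planner"; // which upstream stage owns the fix
  brief: string;              // corrective brief folded into the next attempt
}

interface StageBudget {
  retries: number;            // hard ceiling on attempts
  timeoutMs: number;          // wall-clock cap per attempt
}

interface CoderResult { testsPassed: boolean; evidence: string }

async function runCoderLoop(
  ticket: string,
  budget: StageBudget,
  coder: (ticket: string, rung: Rung, brief?: string) => Promise<CoderResult>,
  diagnose: (evidence: string, rung: Rung) => Promise<Diagnosis>, // the Debugger call
  tokensRemaining: () => number,                                  // global per-run token cap
): Promise<"done" | "halt" | "route_to_planner"> {
  let brief: string | undefined;
  for (let attempt = 1; attempt <= budget.retries; attempt++) {
    if (tokensRemaining() <= 0) return "halt";                // token cap exhausted: stop, never keep going
    const rung: Rung = attempt === 1 ? "base" : "escalated";  // retries run on the more expensive rung
    const result = await withTimeout(coder(ticket, rung, brief), budget.timeoutMs);
    if (result.testsPassed) return "done";
    // Failure with runtime evidence: attach the Debugger on demand.
    const diagnosis = await diagnose(result.evidence, rung);
    if (diagnosis.owner === "planner") return "route_to_planner"; // manifest issue: exit early, do not burn attempts
    brief = diagnosis.brief;                                      // Coder-owned: fold the brief into the next attempt
  }
  return "halt"; // retry budget exhausted: surface the failure, never extend
}

function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    work,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`stage timed out after ${ms}ms`)), ms),
    ),
  ]);
}
```

Every exit path terminates; none of them loops back into another attempt.
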
The Debugger acts as a mini-brain across stages, not as a sub-stage of any one of them. It is the only stage with access to runtime evidence: the actual test output, the type-checker errors, the code on disk, the failure shape from whichever stage called it. It uses that evidence to decide which upstream stage owns the fix. A diagnosis can route to the Coder ("the implementation is wrong, here is a corrective brief and what to avoid on the next attempt"), or to the Planner ("the manifest is broken, no Coder retry will fix it"). The same diagnostic call routes the work to whichever upstream stage is the actual source of the problem. That routing role, plus the on-demand attachment, is the difference between a retry helper and a mini-brain. It is also why the Debugger can rescue runs that no single stage retrying on its own could. So "the Debugger fails" can mean three different things, and each one is configured separately. The first kind of failure is that the diagnosis was right, the Coder followed it, and the tests still failed. A sharper diagnosis on the next attempt might rescue the run. That is what the escalation rung is for. The first Debugger call inside a stuck Coder loop runs at the base configuration; the retry runs at a more expensive one, reserved for the runs that did not yield to the cheaper diagnosis. The expensive rung might be more compute on the same model, or a stronger model entirely. The schema does not care which. The second kind is that the diagnosis was wrong, and the Coder is now off track. The retry budget contains the damage. Three attempts is a hard ceiling. The pipeline does not retry forever, and it does not extend the budget when the Debugger thinks it is close. The third kind is the most interesting. The Debugger is correct that the manifest itself is broken, and no amount of Coder retries will fix it. The earlier post on [why the Debugger never inherits the Coder's reasoning](https://dontcodethisathome.com/why-the-debugger-never-inherits-the-coders-reasoning) covers the input contract that lets the Debugger see this clearly. It also covers the failure class the Debugger cannot handle at all: missing callers outside its authorized scope surface as pipeline halts, not re-plans. What routes back to the Planner is different, a case where the manifest content is wrong but the failure is inside the Debugger's scope. When that is the diagnosis, the pipeline exits the Coder loop early, sending the finding upstream to the Planner for a re-plan instead of burning the remaining attempts. That last path is the difference between a retry budget and an escalation strategy. A retry budget caps cost. An escalation strategy decides where to spend the next attempt. Inside the Debugger alone, both are configured: a more expensive call when the cheaper diagnosis did not land, and an early exit from the calling stage's loop when staying inside it is provably wasted. ## The same shape applies to every other stage The Coder has three attempts, with a more expensive configuration on attempts two and three. When the third fails without a manifest-issue diagnosis, the run halts. The Reviewer has a cap on Coder redo loops triggered by review verdicts. The Reviewer can request changes; it cannot request changes indefinitely. When the cap is reached the run is closed, with the latest reviewed state preserved. The Planner has a one-shot re-plan when the Debugger routes a manifest issue back to it. The re-plan gets the Debugger's diagnosis as input. 
After that, if the rebuilt manifest still does not produce a passing run, the pipeline surfaces the failure to a human. The Planner does not get a third try. None of these caps is the most expensive part of running the pipeline. They are usually idle. They exist for the runs where something has already gone wrong and the next call is not going to fix it. ## Escalation in practice: bad upstream context, not a capability ceiling The temptation when a stage gets stuck is to throw a stronger model at it. The retry rung exists for the cases where that instinct is right. The observation from running this pipeline against real tickets is that the instinct is usually wrong. When a stage fails repeatedly, the cause is almost always bad context coming in from upstream, not a capability ceiling on the stage itself. A Coder that cannot make tests pass after two attempts has not usually run out of model capability. It has been handed a manifest that is missing a symbol, or constraints that contradict the type system, or a test that asserts behaviour the data model cannot produce. Swapping the Coder to a more expensive model on attempt three lets the model burn more compute trying to bridge the gap, but the gap is upstream. A stronger model on the same bad context produces a more confident wrong answer slightly more often than a weaker model on the same bad context. This is the reason the Debugger's most useful exit, in practice, is not a more expensive retry inside the Coder loop. It is the early route back to the Planner when the manifest is the actual problem. That route does not escalate the model. It moves the fix upstream, where the bad input was generated. Most of the time, that is what unsticks the run. It does not mean the escalation rungs are useless. There are real cases where the cheaper call gave a vague diagnosis, the more expensive call gave a precise one, and the next attempt landed. Those cases are real and the rung pays for itself when they happen. They are also a smaller share of the stuck runs than the design instinct suggests. The honest reading is that the schema lets you escalate the model as a fallback. The more useful design move, by a wide margin in observed runs, is the upstream route. ## The final hard stop When every retry budget has been spent and the failure has not resolved, the pipeline halts and resets. There is no fourth Coder attempt, no second Planner re-plan, and no automatic switch to a different model. The run fails, the failure surfaces, and the ticket is flagged for human review. The reset is part of the halt, not an afterthought. The workspace is restored to the state it was in before the run began. Partially-applied edits are reverted, half-written tests are dropped, orphan files are cleaned up. A failed run leaves nothing behind for the next ticket to inherit. The next ticket runs against the same starting point as if the failed run had never happened, which is the only way a ticket queue can survive an unattended failure without compounding the damage across runs. This is intentional, and the thing it is preventing is the headline. A pipeline that retries until it succeeds, with no exhaustion path, is a pipeline that loops on its own bill until something external trips it. The retry budgets, the wall-clock timeouts, and the global token cap exist so the loop ends. When the budget is gone, the run stops, every time, regardless of how close the next attempt looks. 
The escalation rungs inside those budgets are tradeoffs about where to spend the next attempt, not promises that some attempt will eventually work. > The contract is not there to make stuck runs fast. It is there to stop them from becoming endless. The reason this is worth describing in detail is that "what happens when each stage fails" is the part of an agent pipeline that actually has to work in production. The happy path is a presentation. Those pieces, the retry budgets, timeouts, escalation rungs, and hard stops, live in one config file because they are the same kind of decision applied to different stages, not because the file was easier to organise that way. --- *The pipeline runs inside Docker on real tickets against a ~100k line TypeScript monorepo. Still R&D. The design favours fixing bad upstream context over throwing a stronger model at a stuck stage; the schema supports both, and the upstream route has been more reliable.* ================================================================================ # Prompt Rules Are Advisory; Validators Are Binding URL: https://dontcodethisathome.com/prompt-rules-are-advisory-validators-are-binding Date: 2026-04-27 Description: When the model has a strong prior, naming the failure mode in the prompt doesn't prevent it. Prompt rules are advisory; validators are binding. Tags: AI Engineering, Autonomous Engineering Pipeline, Agent Safety Dependencies: AI Agent Pipelines, Prompt Engineering, Specification Gaming Three pieces of text from a single Planner call this morning, in order. The user's request, as raw prose in a feature ticket fed to an AI coding agent: > Some items in my queue are already pre-approved and don't need the standard processing step. I'd like a way to move them from Pending to Active without running the full workflow. **I don't need a button for this.** A new rule in the Planner's prompt, added after the previous run shipped a button despite the user's exclusion: > **Honor exclusions by category, not by literal mention.** Failure modes this rejects: > > - *"I don't need a button"* → keyboard shortcut. (Excluded class: new UI; pivoted to a different UI mechanism.) > - *"Don't change the database"* → JSON file alongside. (Excluded class: persistence change; pivoted to a different storage form.) The Planner's response, on the very next run, verbatim from the rationale field: > The user explicitly requested no new button, so the feature is exposed as a keyboard shortcut in the existing hotkey system. ![A three-panel transcript card. Panel 1, labelled "User prose", shows the user's ticket ending with "I don't need a button for this." Panel 2, labelled "Planner prompt", shows the anti-pattern rule: "Honor exclusions by category, not by literal mention" with the keyboard shortcut example marked as an excluded pivot. Panel 3, labelled "Planner response", shows the rationale field: "The user explicitly requested no new button, so the feature is exposed as a keyboard shortcut in the existing hotkey system." A red badge reads "Read the rule. Did it anyway."](https://dontcodethisathome.com/images/prompt-rules-are-advisory-validators-are-binding.png) The Planner read the user's exclusion. It read the rule that named "no button to keyboard shortcut" as a forbidden pivot. It then pivoted to a keyboard shortcut, citing the user's exclusion of buttons as justification. This is not a comprehension failure. It is a binding failure. > Prompt rules are advisory; validators are binding. 

For the class of failure where the model has a strong prior (here, "complete features ship a way to use them"), the advisory loses to the prior every time, even when the advisory explicitly names the exact failure. This is what specification gaming looks like in production LLM agent pipelines: the model reads the constraint, finds an interpretation that lets it do the thing it was already going to do, and explains its reasoning back to you in the rationale field. The text is helpful. The behaviour is not.

## Why the Planner has a strong prior here

The model's prior is "features need user-facing surface." It came from training on codebases where "add feature X" almost always involved user-facing surface, because most of the code it read had user-facing surface. Asking it to skip the user-facing surface runs counter to a pattern baked into the weights. A prompt hint runs counter to a learned pattern. The learned pattern wins.

That is not a failure of the prompt phrasing. I tried tighter phrasings across three runs:

1. The user's bare prose ("I don't need a button"). The Planner shipped a button on `QueueItemPanel.tsx`.
2. The user's prose plus a Decision-principles rule that taught the model to interpret exclusions broadly. The Planner shipped a keyboard shortcut on `QueueItemPanel.tsx`. Different file, same UI scope.
3. The user's prose plus a sharper rule that explicitly named "no button to keyboard shortcut" as the forbidden pivot. The Planner read the rule and shipped a keyboard shortcut anyway, this time on a different file (`useKeyboardShortcuts.ts`), and explained in the rationale that the user had asked for no buttons.

The phrasing got tighter with each attempt, and the prior won every time.

## Why prose-patching the user's input is also wrong

The first instinct after run 1 was to ask the user to reword their request. "Add 'no UI changes' instead of 'no button.'" This is the wrong end to fix. The user is non-technical. They named the most concrete example ("button") of the class they did not want. Asking them to enumerate the class (button, keyboard shortcut, menu item, hotkey, command palette, drag-and-drop, deep link, context menu) defeats the purpose of accepting natural-language tickets at all. They cannot anticipate the model's pivot space; the model's pivot space is much larger than any individual user's vocabulary.

Worse, every prose patch is one rephrasing the model can reinterpret. "No UI changes" reads to the model as a hint to find a non-UI mechanism that produces the same user-visible result. "Backend only" reads as a hint that the user wants a backend implementation but is open to a thin UI shim. There is no prose patch that survives a determined pivot.

## What worked

I stopped patching the user's prose. I stopped patching the Planner's prompt. I extracted the constraint from prose into structured data and put a validator on the back end.

**A pre-extraction stage.** A separate, cheap LLM call reads the user's prose and emits a structured field listing the categories the user has excluded, drawn from a fixed vocabulary of change categories: `ui`, `api`, `persistence`, `config`, `build`. This stage's only job is to interpret prose. Its output is a fact, not a suggestion.

**A structured grounding block in the Planner's prompt.** The Planner no longer sees raw user prose with an instruction to interpret it.
It sees a structured section near the top of its inputs:

```text
## USER-EXCLUDED CATEGORIES
The user's user_intent excludes the following categories of changes
(extracted by the pre-extraction stage). These are FACTS, not your
interpretation, not negotiable.

### Excluded categories
- ui

### Excluded file patterns (validator-enforced)
- **/components/**
- **/pages/**
- **/*.tsx
- **/client/**
...
```

The Planner does not interpret prose. The interpretation already happened. The Planner's job is to emit a ticket that respects the structured fact.

**A post-emit validator that hard-rejects.** When the Planner emits a draft, a validator walks every file path in the output and matches it against the excluded patterns. Any match returns a structured error pointing at the specific field, the offending path, and the matched pattern:

```text
excluded_file_in_ticket at $.sub_tickets[0].files[2].path:
File 'src/client/components/Foo.tsx' matches excluded pattern '**/client/**'.
The user_intent excluded this category. Drop this file from the sub-ticket.
Do NOT pivot to a different file in the same class as a way to claim compliance.
```

The error folds into the Planner's retry context. The Planner sees the violation, sees the canonical pattern it matched, and re-emits.

Three pieces, but the architectural move is the validator. The pre-extraction stage and the structured grounding make the validator's job easier (the Planner more often gets it right on attempt 1). The validator is what binds the constraint when the prior wins anyway.

## The next run, with structural enforcement

Same user prose. Same Planner. Same model. Now there is a structured `## USER-EXCLUDED CATEGORIES` block in the prompt and a validator on the back end.

The Planner emitted a backend-only ticket on attempt 1. No UI files. No button. No keyboard shortcut. The rationale field even acknowledges the constraint: *"No UI change is needed; the flag is passed through the existing HTTP API."* The validator never fired. Once the constraint was in the prompt as a structured fact instead of prose for the Planner to interpret, the Planner respected it.

The validator is still there. It will fire on the run where the Planner finds a new pivot the structured grounding does not cover. That is the point: the binding mechanism is the validator, not the grounding. The grounding reduces retry cost. The validator carries the load.

## Generalising the pattern

This is the same architectural template I have applied to four different problems in the same pipeline. The shape is always:

1. Identify a fact the Planner keeps getting wrong. The same recurring failure across three or more runs is a fact, not a fluke.
2. Find a deterministic source of truth: Tree-Sitter walk, [registry lookup](https://dontcodethisathome.com/stop-asking-the-model-what-the-code-already-knows), prior LLM stage with structured output, file scan.
3. Extract the fact and present it as data in the Planner's prompt: compact format, headers the Planner can scan, no JSON dumps when a tab-separated row would do.
4. Validate the Planner's output against the fact. Hard-reject with a specific error. Errors fold into the retry context.

The four so far: capture types in test fixtures, user-stated exclusions, canonical table names from the registry, fixture import paths. Same pattern, different fact axis. Each one of these started with a prompt rule. Each one of them ended with a validator. The prompt rules never worked alone for any of them.
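
To make "hard-reject" concrete, here is a minimal sketch of what such a post-emit validator can look like. The draft shape, the error fields, and the tiny glob matcher are all assumptions for illustration; the live schema and patterns may differ.

```typescript
// Sketch of a post-emit validator: every file path in the Planner's draft is
// matched against the user-excluded patterns; any match is a hard reject whose
// message folds into the retry context. All names here are illustrative.
interface SubTicket { files: { path: string }[] }
interface PlannerDraft { sub_tickets: SubTicket[] }

interface ValidationError {
  code: "excluded_file_in_ticket";
  pointer: string;   // JSONPath-style pointer to the offending field
  path: string;      // the file the draft tried to touch
  pattern: string;   // the excluded pattern it matched
  message: string;   // folded into the Planner's retry context
}

// Tiny glob matcher: "**" crosses path segments, "*" stays within one segment.
function globMatch(pattern: string, filePath: string): boolean {
  const regex = new RegExp(
    "^" +
      pattern
        .replace(/[.+^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
        .replace(/\*\*/g, "\u0000")           // placeholder for "**"
        .replace(/\*/g, "[^/]*")              // "*" matches within a segment
        .replace(/\u0000/g, ".*") +           // "**" matches across segments
      "$",
  );
  return regex.test(filePath);
}

function validateDraft(draft: PlannerDraft, excluded: string[]): ValidationError[] {
  const errors: ValidationError[] = [];
  draft.sub_tickets.forEach((ticket, t) =>
    ticket.files.forEach((file, f) => {
      for (const pattern of excluded) {
        if (globMatch(pattern, file.path)) {
          errors.push({
            code: "excluded_file_in_ticket",
            pointer: `$.sub_tickets[${t}].files[${f}].path`,
            path: file.path,
            pattern,
            message:
              `File '${file.path}' matches excluded pattern '${pattern}'. ` +
              `Drop this file from the sub-ticket. Do NOT pivot to a different ` +
              `file in the same class as a way to claim compliance.`,
          });
        }
      }
    }),
  );
  return errors;
}
```

An empty array means the draft passes; anything else is appended to the retry prompt verbatim.
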
([An earlier post on this blog](https://dontcodethisathome.com/correct-code-wrong-file-how-the-write-gate-contains-scope-creep) covers the same structural enforcement applied to the Coder's scope: a list the Coder cannot argue its way past, checked before anything hits disk.) ## When soft instruction is enough Not every Planner mistake calls for a validator. Soft instruction works when the model has no strong prior pulling against the constraint, the cost of a single violation is low (downstream review catches it), and the class of mistakes is narrow enough that a few examples in the prompt are exhaustive. For these cases, prompt-tune away. Most prompt engineering is genuinely about phrasing. For the cases where the model has a strong prior, and you will know because the same mistake keeps recurring even after you have named it explicitly in the prompt, soft instruction is wasted text. You need a validator. ## The checklist For each LLM-authored field in your output schema, ask: 1. **Can the pipeline derive this?** Tree-Sitter walks, AST queries, registry lookups, deterministic file scans. If yes, derive it. The Planner does not author what the pipeline can compute. 2. **If the Planner has to author it, is the input structured or prose?** If prose, consider a pre-extraction stage that turns prose into structured input. The Planner reads facts, not interpretations. 3. **Can a deterministic check verify the output?** If yes, validate. The Planner self-corrects on attempt 2. 4. **Only after exhausting 1 to 3:** prompt-tune. Most teams skip 1 to 3 and go straight to 4. They patch prose for months while [the same failure keeps coming back](https://dontcodethisathome.com/per-field-hallucination-fixes-hit-a-ceiling-248-runs-on-an-ai-coding-agent). The keyboard-shortcut transcript is what that looks like at the limit: the Planner reads the anti-pattern rule and does it anyway, in the same response, citing the user's request. The fix is to take the constraint out of prose and put it into structure. Once it is structure, the Planner cannot lawyer it. This is still R&D. The pre-extraction stage is one LLM call against a fixed vocabulary; the vocabulary has five categories today and will need per-project tuning when the next non-default project hits the pipeline. The validator's pattern set is JS/TS-shaped today and will need per-language extensions. The architectural seam is in place; the implementation is single-stack. Both gaps are tracked and triggered by the next consumer that hits them. ================================================================================ # How Filename Lookups Flood an AI Coding Agent's Context Window URL: https://dontcodethisathome.com/how-filename-lookups-flood-an-ai-coding-agents-context-window Date: 2026-04-21 Description: Looking up symbols by filename instead of full path pulls every `index.ts` in the project into the agent's context. One line changed. 20 results down to 1. Tags: AI Engineering, Autonomous Engineering Pipeline, Agent Harness Dependencies: Symbol Registry, TypeScript, Monorepo The Coder's context included twenty barrel re-exports when it needed one. ## The problem The pipeline enriches a task manifest with full symbol details from a symbol registry. For each file in the manifest's [`files_to_modify` list](https://dontcodethisathome.com/correct-code-wrong-file-how-the-write-gate-contains-scope-creep), the enrichment step queries the registry with the path fragment from the manifest to get line ranges, types, and metadata. 
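
In sketch form, the lookup at the centre of this post comes down to one call; the registry interface and manifest shape below are assumptions for illustration, and the two variants are the before and after described in the rest of the post.

```typescript
// Sketch of the enrichment lookup, before and after the one-line fix.
// The registry interface and manifest path are illustrative, not the real API.
import * as path from "node:path";

interface SymbolEntry { file: string; symbol: string; startLine: number; endLine: number }

interface SymbolRegistry {
  // Returns every entry whose file path contains the search string.
  lookup(pathFragment: string): SymbolEntry[];
}

// Before: discards the specificity already present in the manifest.
function enrichBefore(manifestPath: string, registry: SymbolRegistry): SymbolEntry[] {
  return registry.lookup(path.basename(manifestPath)); // "index.ts" -> every barrel in the repo
}

// After: passes the full path straight through; the substring match hits one file.
function enrichAfter(manifestPath: string, registry: SymbolRegistry): SymbolEntry[] {
  return registry.lookup(manifestPath); // "packages/api/src/server/pipeline/index.ts" -> one file
}
```
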
The Coder uses this to know exactly where to make changes. The manifest said `"path": "packages/api/src/server/pipeline/index.ts"`. The enrichment extracted the filename: `index.ts`. The registry returned every symbol in every file named `index.ts`. The target codebase is a seven-package TypeScript monorepo with package-root barrels everywhere: - `packages/shared/src/index.ts` - `packages/shared/src/parsers/index.ts` - `packages/shared/src/integrations/index.ts` - `packages/api/src/server/index.ts` - `packages/api/src/server/routes/index.ts` - `packages/api/src/server/db/index.ts` - `packages/api/src/server/pipeline/index.ts` - `packages/api/src/server/pipeline/steps/index.ts` - `packages/api/src/server/services/backup/index.ts` - … eleven more Twenty files. Nineteen of them are noise. A larger enterprise TypeScript estate can easily have fifty or a hundred of these barrels, one per package-root and one per directory a team decided was package-shaped. ## Why it matters Each enrichment result adds tokens to the Coder's context. A barrel is small on its own, but twenty of them together is several thousand tokens of `export * from "./..."` lines the Coder should never see. On a hundred-barrel codebase it is over ten thousand tokens of re-exports dragged in for every ticket, every time (a different cause from [the redundant tool-call inflation](https://dontcodethisathome.com/how-claude-p-silently-inflates-your-pipeline-token-costs), but the same symptom). Worse, the twenty files look structurally identical to the Coder: same filename, same re-export pattern, different contents. If the manifest's path scope is ambiguous in any way, the Coder can read the wrong barrel and produce code that targets the wrong package. It has not happened yet (the `path` field has been clear enough), but the risk is structural, not a near miss. ## The fix One line changed in the enrichment code. The enrichment was extracting just the filename from the full path before passing it to the registry lookup, discarding the specificity already in the manifest. Passing the full path directly resolved it. The registry matches on path substrings, so `packages/api/src/server/pipeline/index.ts` matches exactly one file while `index.ts` matches all of them. Twenty results down to one: same call, same registry, same lookup, just a longer search string. > The bug is not that filenames are a bad key. It is that filenames were ever used when full paths were sitting right there in the manifest. ## The principle This is a context-pollution pattern that appears anywhere filenames are reused: - `index.ts` / `index.js` in every package-root barrel (TypeScript monorepos, Express projects) - `__init__.py` in every Python package - `main.go` in every Go service - `mod.rs` in every Rust module If your RAG pipeline or context-injection system looks up symbols by filename instead of full path, you are pulling in every symbol from every file with that name. The noise scales linearly with project size. The fix is always the same. Use the most specific identifier available. If you have a full path, use the full path. Do not discard specificity for convenience. ![Two-panel diagram contrasting filename lookup with full-path lookup. Left panel shows a query for index.ts branching to five result files (shared/src/index.ts, shared/src/parsers, api/src/server, pipeline, services/backup) plus a label indicating fifteen more, totalling 20 matches. 
Right panel shows a query for packages/api/src/server/pipeline/index.ts resolving to exactly one file.](https://dontcodethisathome.com/images/filename-lookup-fanout.svg) ## Polyglot by construction The fix does not check the file extension, the language, or the framework. It uses whatever path string the manifest put in. Works for `packages/api/src/server/pipeline/index.ts`, `src/auth/views.py`, `internal/handlers/search.go`, or `src/routes/mod.rs`. No language-specific logic needed: the specificity comes from the manifest, not from the enrichment code. Anywhere the pipeline has a choice between a more-specific and a less-specific identifier, the more-specific one is almost always free. It is usually already in the data. --- *The pipeline runs on real tickets against a ~100k line TypeScript monorepo. Still R&D.* ================================================================================ # The Lego Instructions: An Architectural Principle for AI Coding Agents URL: https://dontcodethisathome.com/the-lego-instructions-an-architectural-principle-for-ai-coding-agents Date: 2026-04-19 Description: Three properties of a Lego instruction set, mapped to an AI coding pipeline: why manifest quality matters more than builder quality. Tags: AI Engineering, Autonomous Engineering Pipeline, Agent Harness Dependencies: Language Server Protocol, Tree-Sitter Most AI coding agents improvise. A ticket arrives, the agent reads it, opens files it guesses are relevant, writes code, runs tests, and if the tests fail, tries again. That loop has a name in the popular imagination ("autonomous") and a track record that is roughly what you would expect from an engineer who treats every ticket as a fresh puzzle to explore. There is a different shape available. Instead of improvising, follow instructions. Instead of exploring, execute. ## Three properties of a Lego instruction set **Every piece is in the box.** Step 14 never says "find a blue 2×4 somewhere in your collection." It tells you which bag it came from, what colour it is, what shape. The builder never hunts. **Every step is unambiguous.** The instruction shows the exact piece, the exact position, the exact orientation. The builder never picks between two plausible interpretations. **No assembly is impossible.** The instruction designer verified fit before publishing. No step asks you to connect two pieces that physically cannot click together. Those three properties map one-to-one onto an agent pipeline. ![Two-column mapping diagram. Left column lists the three properties of a Lego instruction set: every piece in the box, unambiguous steps, no impossible assemblies. Right column lists their pipeline equivalents: a Synthesizer populates facts from the registry, the language server, and Tree-Sitter; each manifest field has a single source of truth; a feasibility gate rejects unsatisfiable criteria before any builder runs. Arrows connect each matched pair.](https://dontcodethisathome.com/images/lego-instructions-pipeline.svg) ## Mapped to the pipeline *Every piece in the box* becomes: file paths, symbols, signatures, mock targets, and dependency resolutions are populated into the manifest by a Synthesizer (drawn from [a symbol registry, a Tree-Sitter parse, and a language server](https://dontcodethisathome.com/stop-asking-the-model-what-the-code-already-knows)), not requested from the Planner. The builder does not search. *Unambiguous steps* becomes: each manifest field has a single source of truth. 
The Coder never chooses between two plausible interpretations of a field, because only one value is present. *No impossible assemblies* becomes: a feasibility gate resolves each acceptance criterion against the registry and the language server before any builder runs. Criteria that no test could satisfy on the current source topology (for example, "assert that `foo` does not call `bar` when both live in the same module and are therefore unmockable") are rejected at manifest-validate time, with a structured alternative, before a single token of builder work is spent. The rule that falls out: if a builder is searching, locating, or deciding, that is brain work leaking downstream. The manifest was incomplete. The fix is to push the work back up to the stage with the context to answer deterministically, not to prompt the builder harder. ## Why it matters The popular alternatives point the other way. AutoGPT-style loops hand the model a goal and let it improvise its way to an answer, which is the "find a blue piece somewhere" architecture scaled to a whole project. Devin's pitch of an autonomous software engineer is structurally closer to "hire a smart builder" than "give the builder good instructions," which is a reasonable bet if the bottleneck is builder quality. Measured on [248 real pipeline runs](https://dontcodethisathome.com/per-field-hallucination-fixes-hit-a-ceiling-248-runs-on-an-ai-coding-agent) against a TypeScript fixture, the bottleneck is manifest quality, not builder quality. > A better builder cannot follow an instruction that names a piece that is not in the box. ## One note on timing The [248 runs post](https://dontcodethisathome.com/per-field-hallucination-fixes-hit-a-ceiling-248-runs-on-an-ai-coding-agent) was written while the feasibility gate was still on paper. That post and two others described it as "an active design, not a shipped subsystem." It has since shipped. The measured outcome of the delivered architecture belongs in that post's follow-up. This post is about the principle. --- *The pipeline runs on real tickets against a ~100k line TypeScript monorepo. Still R&D.* ================================================================================ # Per-Field Hallucination Fixes Hit a Ceiling: 248 Runs on an AI Coding Agent URL: https://dontcodethisathome.com/per-field-hallucination-fixes-hit-a-ceiling-248-runs-on-an-ai-coding-agent Date: 2026-04-17 Description: Bernoulli model predicted 36% first-pass success across 248 pipeline runs. Measured: 21%. The gap explains why per-field hallucination fixes have a ceiling. Tags: AI Engineering, Autonomous Engineering Pipeline, LLM Hallucination Dependencies: Tree-Sitter, Language Server Protocol, TypeScript The simple model for LLM hallucination in an AI coding agent pipeline is a Bernoulli trial. Each field the Planner emits into a manifest is an independent roll of a weighted die. At twenty fields and a 0.95 per-field success rate, the joint probability of a fully correct manifest is 0.95 to the twentieth, roughly 36 percent. (An earlier post originally quoted this as about 60 percent, which is the math for ten fields; 36 percent is the correct number for twenty. That post is being corrected alongside this one.) [That earlier post](https://dontcodethisathome.com/stop-asking-the-model-what-the-code-already-knows) used this framing to argue that per-third-ticket failure in an agent pipeline is a schema problem, not a model problem. I went back through the run archive to see how close the model came to reality. 
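
Written out, the model's arithmetic at both field counts quoted above, with p the per-field success rate and n the number of LLM-emitted fields:

```latex
P(\text{fully correct manifest}) = p^{\,n}, \qquad
0.95^{10} \approx 0.599 \;(\approx 60\%), \qquad
0.95^{20} \approx 0.358 \;(\approx 36\%)
```
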
Across 248 tickets run through the pipeline between April 1 and April 14, 2026, the first-pass success rate was 21 percent. The prediction missed by 15 points. The gap has two plausible explanations, and both are probably true. Either per-field accuracy is lower than 0.95 in practice, or the fields are not independent. One wrong fact cascades into multiple broken acceptance criteria, and a single hallucination in the manifest takes several downstream checks with it when it fails. Together they set a ceiling on how far per-field accuracy work alone can carry the pipeline. ## What the 248 runs actually show The data covers several architectural iterations of the pipeline, so it is a historical aggregate, not a controlled A/B. The tickets come heavily from a small fixture family, which means the absolute rate is fixture-contingent. A project with a deeper import graph could push it higher; a simpler one could push it lower. The shape of the result, which classes of failure dominate, does not depend on the fixture. Field-level hallucination is a property of how the Planner interacts with the manifest schema, not of what any particular ticket asks for. The breakdown: - 21 percent (53 of 248) completed without the Debugger running at all. The manifest was accurate enough that the Test Writer, the Coder, and the Reviewer produced passing, reviewed code on first pass. - A further 3 percent recovered after the Debugger diagnosed the failure and the Planner re-planned. Total end-to-end completion: 24 percent. - 27 percent reached the Debugger and failed anyway. The Debugger named a cause, but no valid manifest emerged within the attempt budget. - The remaining 49 percent stalled before the Coder stage. Planner timeouts, pre-scope selection failures, iteration pauses. Not all are builder failures, but none produced working output. ![Horizontal stacked bar chart breaking 248 pipeline runs into four segments: 21 percent first-pass complete (53 tickets), 3 percent recovered after a Debugger re-plan (7 tickets), 27 percent Debugger-diagnosed failures (67 tickets), and 49 percent stalled before the Coder stage (121 tickets). A callout above reads that the Bernoulli hallucination model predicted 36 percent first-pass success.](https://dontcodethisathome.com/images/248-runs-breakdown-bar.svg) ## Why per-field fixes have a ceiling Of the 67 tickets where the Debugger diagnosed a failure in detail, about two-thirds were field-level hallucinations. The Planner emitted a wrong filepath, a missing export, a bad function signature, or a type field that did not exist on the object the test was written against. The Coder followed the bad instruction faithfully into a wall. This is exactly the class of failure machine extraction is designed to eliminate, and on the fields it has been applied to, it does. Every field moved from "the Planner emits it" to "the registry computes it" is a dice roll that never has to be re-rolled. > Every field moved from "the Planner emits it" to "the registry computes it" is a dice roll that never has to be re-rolled. But not every failure is a bad dice roll. The other third is where the ceiling comes from. Some failures are instruction ambiguity. The Planner's scope field leaves room for the Coder to interpret "implement this new function" as "replace the file with this new function." One such incident took out five working exports from a single module and cascaded into 44 failing tests across three suites. That is not a field-accuracy problem. 
Rewriting the filepath field does not stop it. Others are structurally unsatisfiable criteria. [An earlier post](https://dontcodethisathome.com/stop-asking-the-model-what-the-code-already-knows) used the `vi.mock` same-module-binding example: a test criterion that no builder can satisfy regardless of which field values are plugged in. Machine extraction of the mock target does not fix that either. At 20 fields and 0.95 per-field accuracy, moving five fields from LLM emission to machine extraction lifts the joint probability from 36 percent to 46 percent on the model. That is a real gain. But the observed first-pass rate already sits below the 36-percent prediction, and the failures that will remain after per-field work are not on the same curve. Another round of per-field fixes will not close the gap. The remaining third needs [structural validators](https://dontcodethisathome.com/prompt-rules-are-advisory-validators-are-binding), not tighter prompts. ## The Lego Instructions A Lego instruction set has three properties: 1. **Every piece is in the box.** You never need to find a piece. The instruction tells you exactly which bag it's in, what color, what shape, and where it connects. The builder never hunts. 2. **The instructions are unambiguous.** Step 14 never says "attach something blue-ish somewhere near the top." It shows the exact piece, exact position, exact orientation. The builder never decides. 3. **Impossible assemblies don't appear.** No instruction asks you to connect two pieces that physically can't click together. The instruction designer verified fit before publishing. Mapped to an agent pipeline: - Every piece in the box: file paths, symbols, signatures, dependency resolutions are machine-extracted and populated into the manifest by a Synthesizer, not requested from the Planner. - Unambiguous steps: each manifest field has a single source of truth. The Test Writer and the Coder never choose between two plausible interpretations of the same field. - No impossible assemblies: a feasibility gate resolves each acceptance criterion against the registry and the language server before any builder fires, and rejects criteria that no test could satisfy on the current source topology. The principle that follows is that the Planner's job is to make the builders' job trivial. Whenever the Coder is searching for a file, locating a symbol, or picking between two plausible interpretations, that is brain work leaking downstream. The manifest was incomplete. The fix is to push the work back up to the stage that has the context to answer the question deterministically, not to prompt the builder harder. The architectural reasoning behind this principle is developed in [The Lego Instructions](https://dontcodethisathome.com/the-lego-instructions-an-architectural-principle-for-ai-coding-agents). ## What comes next The design currently on paper splits the manifest into two disjoint surfaces. One surface holds the Planner's actual decisions: what to build, which symbols are in scope, what the acceptance criteria are. The other surface holds the facts: file paths, signatures, dependency resolutions, mock targets. The Planner only writes to the first. A Synthesizer populates the second from the registry, the language server (LSP), and Tree-Sitter before any builder sees the manifest. 
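
A sketch of what that split could look like as a schema. Every field name below is illustrative, drawn from the fields discussed in these posts rather than from the live manifest:

```typescript
// Sketch of a split-surface manifest: the Planner authors only the decision
// surface; a Synthesizer fills the fact surface from deterministic sources
// (symbol registry, language server, Tree-Sitter). Field names are illustrative.
interface PlannerDecisions {
  title: string;
  summary: string;
  symbolsToChange: string[];                      // what to modify, by name
  acceptanceCriteria: string[];                   // what "done" means
  subTickets: { scope: string; criteria: string[] }[];
}

interface SynthesizedFacts {
  symbolPaths: Record<string, string>;            // symbol -> defining file, from the registry
  dependencySignatures: Record<string, string>;   // symbol -> signature, from the language server
  blastRadiusCallers: Record<string, string[]>;   // symbol -> callers, from LSP references
  mockTargets: Record<string, string>;            // resolved mechanically, never guessed
  filesToRead: string[];                          // walked from the call graph
  testLayer: string;                              // e.g. unit vs integration, derived, not emitted
}

interface Manifest {
  decisions: PlannerDecisions;                    // LLM-authored
  facts: SynthesizedFacts;                        // machine-populated; the Planner never writes here
}
```
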
A feasibility gate runs in parallel, resolving each acceptance criterion against the same sources and rejecting unsatisfiable ones at manifest-validate time, with a structured alternative, before a single token is spent on the Test Writer. ![Flow diagram of the proposed split-surface manifest pipeline. The Planner emits only decisions (scope, acceptance criteria, sub-ticket plan). A Synthesizer populates fact fields (file paths, signatures, dependency resolutions, mock targets) from the symbol registry, the language server, and Tree-Sitter. A feasibility gate resolves each acceptance criterion against the same sources and rejects unsatisfiable criteria before the Test Writer, Coder, and Reviewer run.](https://dontcodethisathome.com/images/split-surface-manifest-diagram.svg) The gate is on paper. The registry, the language server, and the call-graph primitives are already live and already used to populate individual fields. The thing being built next is the gate itself and the split-surface manifest that feeds it. Whether it holds up on the next 248 runs is the measurement worth the follow-up post. --- *The pipeline runs on real tickets against a TypeScript fixture and a ~100k line TypeScript monorepo. The 248 runs are drawn from the pipeline's run archive. The feasibility gate described above has since shipped; the measured results belong in a follow-up post. Still R&D.* ================================================================================ # Stop Asking the Model What the Code Already Knows URL: https://dontcodethisathome.com/stop-asking-the-model-what-the-code-already-knows Date: 2026-04-16 Description: Every field a Planner emits that the codebase already knows is a dice roll. Machine extraction replaces those dice rolls with deterministic lookups. Tags: AI Engineering, Autonomous Engineering Pipeline, Agent Harness Dependencies: Language Server Protocol, Tree-Sitter, TypeScript [Non-determinism in an LLM-based agent pipeline](https://dontcodethisathome.com/llm-non-determinism-is-a-pipeline-failure-not-a-model-problem) is usually framed as a model problem. Better model, better prompt, better temperature setting. In practice, most of the non-determinism I see in a code-generation pipeline is a schema problem: the pipeline asks the model to emit fields the codebase already knows, and every one of those fields is a chance for the model to hallucinate. Every field an agent pipeline asks an LLM to emit is a Bernoulli trial. At ninety-five percent per-field success and twenty such fields per manifest, the joint probability of a fully correct manifest is about 36 percent. (Originally published as 60 percent, which is the correct figure at ten fields, not twenty. Corrected April 2026 alongside a [follow-up post measuring 248 runs against the model](https://dontcodethisathome.com/per-field-hallucination-fixes-hit-a-ceiling-248-runs-on-an-ai-coding-agent).) That matches what it feels like when every third ticket fails for no obvious reason. The fix is not a better model or a tighter prompt. It is to stop asking the model for fields the codebase already knows, and to supply those fields directly from a symbol registry, a Tree-Sitter parse, and a language server (LSP) instead. ## Facts versus decisions The rule for deciding which fields a Planner should emit is simple. Every fact a model copies is a fact it can hallucinate. The Planner decides *what* to build. The pipeline supplies *what exists*. 
When the two conflict, machine-extracted data wins over model-generated constraints, which win over model narrative. Most of the fields a Planner currently emits are facts, not decisions: the path a symbol lives at, the signature of the function it calls, the module a dependency resolves through. The Planner is making judgment calls about decomposition and intent. In the same output it is also being asked to remember the repo's file layout from the prompt it was given. It does the first job well and the second job probabilistically. The probabilistic part is where every third manifest fails. The infrastructure to make those fields deterministic is already in most coding agents' stacks. It is just not wired to replace the model's emission of them. Tree-Sitter gives an abstract syntax tree (AST) with exact byte ranges. A symbol registry built on top of it gives O(1) symbol-to-file lookups. A language server gives cross-file semantics: definitions, references, incoming calls, type resolution across barrel re-exports. Between those three primitives, most of the fields a Planner hallucinates are already sitting in process, addressable by a lookup. ![Diagram of the Planner's output schema split into two columns. The left column lists fields the model still decides, including title, summary, sub-ticket decomposition, the symbols that change, and acceptance criteria. The right column lists fields the pipeline now supplies from the registry and the language server, including symbol paths, dependency signatures, blast-radius callers, mock targets, files-to-read list, and test layer. A full-width arrow below both columns points left with the label "never emitted, never hallucinated."](https://dontcodethisathome.com/images/machine-extraction-planner-schema-diagram.svg) ## The case where machine extraction worked: blast-radius callers The first field removed from the Planner's output was the list of blast-radius callers. Before the change, a Planner reading a ticket like "change how `getAllPosts` returns data" had to guess which files called `getAllPosts`. Sometimes it got it right from context. Sometimes it missed `search.ts` because the ticket did not mention search, and the Coder shipped a return-type change that broke the type checker eleven places over. The Debugger then spent thirty thousand tokens recovering from a question the language server can answer in one request. After the change, the registry holds the symbol, the language server holds the callers, and the Planner is never asked to emit the list. The pipeline reads it. The run that proved it was a pagination ticket on a TypeScript blog fixture. Pre-LSP: eleven typecheck errors in `search.ts` that the Planner had missed as a caller, a Debugger cycle, 92,618 tokens, 122 seconds. Post-LSP: zero `search.ts` errors, clean typecheck, 95,170 tokens, 97 seconds. The blast-radius overhead of thirty-three thousand extra input tokens and seventy-four extra seconds disappeared from the post-LSP run. The Debugger still fired on the second run, but on a genuine logic error, a validation check placed after a clamping function so it could never fire. That is the kind of bug a Debugger is for. Type errors from a missed caller are not. A separate earlier post ("[What Calls This Function?](https://dontcodethisathome.com/what-calls-this-function-why-ai-coding-agents-need-a-language-server)") goes through the full before-and-after for readers who want the stage-by-stage breakdown. The pattern generalises. 
Any time the Planner is emitting a fact that Tree-Sitter, the registry, or the language server can compute directly, the field can be removed from the prompt's output schema entirely. The model does not emit it. The pipeline materialises it. There is no version where the model gets it wrong, because the model is not the one answering the question. ## The case still teaching me where the line is: same-module lexical bindings Not every field is as clean to mechanise as blast-radius callers. A recent run produced a manifest that declared a mock target at the module path `vi.mock("@api/pipeline/index", ...)`. That module is a barrel re-export of `./processor`. The test then imported the real `processItem` from the source module and asserted on mock call counts for its siblings `summarizeItem` and `buildOutput`. The pipeline shipped a correct Coder implementation. Two tests still went red: `processItem`'s internal call to `summarizeItem` is a same-module lexical binding, and `vi.mock` cannot intercept those regardless of which module it is pointed at. The cost of discovering this at test-run time rather than at manifest-validate time was a full Test Writer cycle, a full Coder cycle, and two retry attempts on tests that could never go green on this source topology. Roughly forty thousand tokens and several minutes of wall clock, per incident, burned on work that was doomed from the moment the criterion was written. Multiply that by how often the class fires across a real project and the case for catching it pre-builder is straightforward. The obvious-looking mechanisation, "let the pipeline derive the mock target from the registry," does not fix this. Pointing the mock at the source module either leaves the binding problem intact, or, if it also replaces `processItem` with a spy, silently breaks the test's real import. Both paths are structurally unsatisfiable. The actual defect is not in the mock-target field. It is in the shape of the criterion. An assertion that a function "does not call `summarizeItem`" is unachievable on this source topology. The fix is to reject the criterion at manifest-validate time with a structured alternative. Rewrite as an HTTP status assertion. Rewrite as a database state assertion. Drop as out of scope. The only way to reject that deterministically is to resolve the subject and object of the criterion against the registry and check whether they share a source file. When they do, no mock target saves the test. That is still machine extraction. It just operates on the criterion's semantics, not the manifest's surface syntax. And it is the piece I am currently designing rather than running. Honest statement of where this sits at the moment of writing. The registry, the reach data, and the language server are live. The feasibility gate that consumes them to reject unsatisfiable criteria is on paper. The scoped plan I was going to start from, mechanise one field, the mock target, turned out not to close this failure class. Working through why produced the design I actually want to build. ## How machine extraction differs from output validation Machine extraction is not the same as "[add a validator](https://dontcodethisathome.com/prompt-rules-are-advisory-validators-are-binding)." Validators run after the model has emitted a value and ask whether it was right. A good validator is a backstop. A great pipeline does not need it most of the time. Machine extraction means the model never emits the value. The field is not in the prompt's output schema. 
There is no value to validate, because there was no choice to make.

> That shift, from "check the model's work" to "do not give the model the work," is where the determinism comes from.

A validator at ninety-five percent recall still lets one miss in twenty through. A field removed from the output surface lets zero through, because the surface is smaller.

A concrete example of the replacement: the pipeline used to emit a list of "files the Coder needs to read" from the Planner, and a validator checked that every called symbol's defining file was in the list. Missed entries meant a rejection and a repeat Planner call. The validator caught most of them. It also loved to argue with the Planner about files the Planner had correctly judged irrelevant but the registry said belonged on the list. The field no longer exists in the Planner's output. The list is computed directly from the registry's call graph, starting at the symbols the Planner decided to modify. The validator went with it. Nothing to argue about, because nothing was emitted.

Every LLM-emitted field also carries a maintenance cost. When the schema changes, the prompt changes, the validator changes, the retry logic changes, the regression tests change. A field removed from the output surface is one fewer moving part in every one of those places. That is a separate win from the determinism one, and it compounds in the same direction.

The risks are real and worth naming. A field can look mechanical but turn out to carry a legitimate decision, and removing it strips judgment from a place that needed it. A field can be pre-populated correctly on the fixture it was tested against and wrong on the first polyglot project. Both have happened here. The answer is the same as with any other architectural constraint. Name what the primitive covers. Name what it does not. Write the incident down. Decide next time with the catalogue in hand.

The direction, though, is one-way. Every field moved from "the model emits it" to "the registry, the language server, or the syntax tree supplies it" is a dice roll that never has to be re-rolled. The next two fields on the list are the mock target, which moves once the feasibility gate has a first version running, and the dependency signatures, which the registry already stores and the Planner currently rewrites from the context it was given. Over a long enough pipeline, the compounding is the whole game. The architectural principle behind that direction is [The Lego Instructions](https://dontcodethisathome.com/the-lego-instructions-an-architectural-principle-for-ai-coding-agents).

---

*The pipeline runs inside Docker on real tickets against a TypeScript fixture and a ~100k line TypeScript monorepo. Numbers and incidents are from actual runs. The feasibility-gate piece is an active design, not a shipped subsystem, and is called out as such where it matters. Still R&D.*

================================================================================

# Why Architecture Gaps Need a Close Condition, Not a Backlog
URL: https://dontcodethisathome.com/why-architecture-gaps-need-a-close-condition-not-a-backlog
Date: 2026-04-15
Description: Why tracking known architectural gaps with specific close conditions is more useful than a backlog, and what makes each entry work.
Tags: AI Engineering, Autonomous Engineering Pipeline
Dependencies: Autonomous Engineering Pipelines

Most software projects track technical debt in a backlog.
Few track it with a record of why each problem exists, what risk it carries, and under what specific condition it must be closed. The architecture-gaps document is the second thing. I have been maintaining one for an autonomous engineering pipeline, an AI coding agent that plans, writes tests, writes code, and reviews its own output. A previous post on this blog ([Fixture-First Development as an Early Warning System for AI Pipelines](https://dontcodethisathome.com/fixture-first-development-as-an-early-warning-system-for-ai-pipelines)) introduced the catalogue-vs-backlog distinction in the context of a specific run. This post is the deeper look at what makes the document itself work. The document currently has over 240 entries. About half are closed, some are deferred, and the rest are open. Every entry follows the same structure, and that structure is the reason the document works. ## What a gap entry requires A useful gap entry has three fields that cannot be omitted. Without all three, the entry degrades into a backlog item that gets periodically reviewed and periodically deferred. **What the gap is.** Not a title. A description of the specific deficiency, including what was observed when the gap was discovered. "The Coder appended a duplicate symbol on retry instead of replacing the existing one. File grew past 1,600 lines. Next attempt timed out." That is not a summary. It is the incident that made the gap visible, preserved so the next reader understands the severity without having to reproduce it. **Why it is not closed yet.** This is the field most backlogs omit. "Fix this later" is not a reason. The reason has to be specific. "The fixture files are 50 to 200 lines, small enough that the workaround holds. The risk is deferred to the first production-sized file." That tells the reader exactly what is being bet on and what would break the bet. The reader can disagree, but the disagreement is informed. **The close condition.** This is what turns a gap into a tripwire rather than a wish. A close condition is not "eventually." It is a concrete trigger. "Must resolve before the pipeline runs against any file over 500 lines." When the trigger fires, the gap is no longer deferrable. [The decision to defer was correct at the time it was made](https://dontcodethisathome.com/intentional-technical-debt-building-features-in-the-wrong-order), and the close condition is how the document knows when "at the time" has expired. ![Diagram of a single architecture gap entry with three labelled sections. The top section describes the observed gap, with example text: Coder appended duplicate symbol on retry, file grew past 1,600 lines. The middle section explains why it is deferred: fixture files are 50 to 200 lines, workaround holds. The bottom section names the close condition: must resolve before any file over 500 lines. An arrow from the close condition leads to a trigger event, which branches to three possible outcomes: closed by fix, invalidated by a later decision, or closed by construction when the code path was deleted.](https://dontcodethisathome.com/images/architecture-gaps-entry-diagram.svg) ## Three shapes a gap takes after it is filed Not every gap gets fixed in the way the entry originally proposed. The document tracks three outcomes, and the reasoning for each is preserved, not just the status. **Closed by fix.** The trigger fired, the gap was fixed. The entry records what changed and when. This is the expected path and the least interesting one. 
**Invalidated.** A later architectural decision made the gap irrelevant. One early gap proposed building import-path resolution into the registry via Tree-Sitter walkers, so the Planner could tell the Coder which import paths to use. When the language server (LSP) was integrated, it provided the same resolution semantically and more accurately. The gap was marked `invalidated` with a one-line note: "LSP provides this. Do not build." Three other gaps in the same area were invalidated the same way, by the same architectural decision, on the same day. The entry stays in the closed archive so a reader who has the same idea later finds the reasoning rather than proposing the same work. **Closed by construction.** The code path the gap described was deleted entirely. When the pipeline moved from a file-reconstruction approach to an AST-based operation model, six gaps related to reconstruction corruption became unreachable: the code they described no longer existed. The entries were moved to the closed archive with a note linking to the design document that retired them. A reader examining "what happened to the reconstruction bugs" finds the answer without having to trace the git history. A small sample from the live document: | Gap | Why deferred | Close condition | Outcome | |---|---|---|---| | Coder appends duplicate symbols on retry instead of replacing. File grows past capacity, next attempt times out. | Fixture files are 50–200 lines; workaround holds at that scale. | First production-sized file (500+ lines). | **Closed by construction.** Code path deleted when the pipeline moved to AST-based operations. | | No structural gate preventing Coder from writing to test files. Prompt rule only. | Cost per occurrence is low on the fixture. | First real-project run where a test-file write burns an expensive retry. | **Closed by fix.** Sandbox enforcement added. | | Registry-based import-path resolution proposed via Tree-Sitter walkers. | Not yet needed; Planner infers paths from context. | When import resolution accuracy blocks a ticket. | **Invalidated.** Language server (LSP) provides the same resolution semantically. Do not build. | | Mock-target field is LLM-emitted; Planner can hallucinate the module path. | Backstop validator catches the common case. | When the validator's reject-and-retry loop fails to converge on a real ticket. | **Open.** Feasibility gate designed but not yet built. Close condition has not fired. | The fourth entry is included because it represents the state most gap entries spend the longest in: open, deferred, with a close condition that has not yet fired. The document's value is not in the entries that are closed. It is in the entries that are waiting. ## What the document is not The architecture-gaps document is not a backlog, not a sprint board, not a prioritisation tool. It does not answer "what to work on next." It answers "what is currently being bet will not break, and what would change that bet." The document is also not a replacement for a failure-modes document. (Earlier posts on this blog use the two names loosely; this post uses them precisely.) Gaps and failure modes are different kinds of things. A gap is a missing architectural feature. A failure mode is an observed failure. A failure mode can be caused by a gap, but it can also be caused by a model behavioural pattern, a prompt phrasing, or an environmental issue. The two documents cross-reference each other, but they track different dimensions. 
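For readers who prefer a shape to a description, the same three fields and outcomes could be kept as structured data. This is a sketch only; the field names are invented, and the live document is not claimed to look like this.

```typescript
// Invented shape, for illustration of the entry structure described above.
type GapOutcome =
  | { status: "open" }
  | { status: "deferred" }
  | { status: "closed_by_fix"; whatChanged: string; when: string }
  | { status: "invalidated"; reason: string }               // a later decision made it irrelevant
  | { status: "closed_by_construction"; designDoc: string } // the code path no longer exists
;

interface GapEntry {
  id: number;                    // e.g. 87
  gap: string;                   // the observed deficiency, including the incident that exposed it
  whyNotClosedYet: string;       // the specific bet being made, not "fix this later"
  closeCondition: string;        // the concrete trigger that ends the deferral
  outcome: GapOutcome;
  relatedFailureModes: number[]; // cross-references into the failure-modes document
}
```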
## Where the document earns its keep The document earns its keep at two specific moments. The first is when a new capability arrives and several existing gaps suddenly become relevant. When the language server was integrated, three gaps about import resolution were invalidated in one pass, and two gaps about blast-radius detection moved from "deferred" to "in progress" because the technology they had been waiting for was now available. Without the document, those two gaps would have required someone to remember they existed. With the document, the close condition was already written, and grepping for "LSP" or "language server" surfaced them immediately. The second is when a decision about scope has to be made under pressure. The pipeline hit a problem where the Planner's output contained a field that was structurally unsatisfiable, a mock target that could never intercept the call it was supposed to mock. The obvious fix was to mechanise the mock-target field from the registry. The gaps document had an entry that said the underlying problem was not the mock target but the criterion shape, with a close condition tied to the feasibility gate being designed. Without that entry, the quick fix would have shipped and the real fix would have been deferred indefinitely, because the quick fix would have looked like it worked. With the entry, the scope decision was informed by a written record of the actual root cause. ## The honest limitation The document works only if it is read. It is discipline, not enforcement. Writing a gap down does not prevent it from being ignored. It does not guarantee the close condition is checked when the trigger fires. It does not prevent someone from shipping the quick fix anyway. What it does is make the risk visible. > A gap that is documented and deferred is a conscious bet. A gap that is undocumented and deferred is an invisible one. The first can be evaluated, challenged, and re-evaluated when conditions change. The second is discovered in production, under pressure, with no prior analysis to accelerate the diagnosis. The difference is not between perfect and broken. It is between "this was known, and here is why it was deferred" and "this was not known." That gap in visibility is, in my experience, where most of the expensive engineering surprises come from: not from problems that are hard, but from problems that were known and lost. ## What I would tell someone starting one Start it the first time a decision is correct now but will be wrong later. Do not wait until there is a formal process. Do not wait until there are enough entries to justify a document. One entry is enough if it has the three fields. The overhead is small. An entry takes a few minutes to write if you write it while the decision is fresh. Writing it later, when the context has faded and the reasoning has to be reconstructed, takes much longer and produces a worse entry. The three fields are non-negotiable. Without the close condition, the entry is a note. Without the reasoning for deferral, the entry is a task. With all three, the entry is a decision record that remains useful long after the person who wrote it has moved on to something else. --- _The pipeline runs inside Docker on real tickets. Gap entries referenced in this post are from the project's live architecture-gaps document, which has over 240 entries spanning the full lifecycle of the system. 
Still R&D._ ================================================================================ # Fixture-First Development as an Early Warning System for AI Pipelines URL: https://dontcodethisathome.com/fixture-first-development-as-an-early-warning-system-for-ai-pipelines Date: 2026-04-13 Description: Fixture-first development as an early warning system for AI pipelines: the first real-project run confirmed three known gaps instead of discovering new ones. Tags: AI Engineering, Autonomous Engineering Pipeline, Testing Dependencies: AI Coding Agents, TypeScript, Autonomous Engineering Pipelines Testing an LLM-based coding agent on a toy fixture is usually treated as throwaway work. The fixture is small, the tickets are artificial, and the real proof is supposed to come from running the pipeline against a production codebase. In my experience building this pipeline, the fixture phase turned out to be the opposite of throwaway. It was the cheapest, most reliable early warning system the project had. By the time I pointed my autonomous engineering pipeline at a real codebase for the first time, it had been through a long stretch of development: eighteen tickets validated on a fixture project, more than eighty failure modes catalogued in the project's [architecture-gaps document](https://dontcodethisathome.com/why-architecture-gaps-need-a-close-condition-not-a-backlog), and the full stage chain running end to end. The pipeline itself is an AI coding agent that plans, writes tests, writes code, and reviews its own work. The target was a TypeScript monorepo: ~100,000 lines of code, 150 test files, 953 tests, production route files over 1,000 lines. Nothing about it was controlled. It was not designed to be easy for an agent. The Planner read the codebase, identified the right files, resolved thirty-nine dependency signatures from the registry, and produced a valid manifest for eight and a half cents. Then the builders ran, and the infrastructure fell apart in three places. ## Three things broke on the first real-project run **1. The Coder could not handle large files.** The target was `jobs.ts`, 1,385 lines. The pipeline's reconstruction step is designed to splice the Coder's output back into the original file at the right location. On retry, instead of replacing the symbol it had already written, it appended a duplicate. The file grew past 1,600 lines. The next Coder attempt timed out at 300 seconds with 21,000 input tokens just for the source file, before it had seen the tests, the manifest constraints, or the Debugger's notes. **2. The Test Writer and Coder disagreed on the function's signature.** The Test Writer called `serializeJobsToCsv(rows, "lightweight")`, a string parameter selecting a column preset. The Coder implemented `serializeJobsToCsv(rows, columns: string[])`, an array of explicit column names: same function, different API shape. The Debugger correctly diagnosed the mismatch, but the Coder's fix still used the wrong shape, because the Test Writer's output was not in the Coder's context in the first place. **3. The Coder could write to test files.** A prompt rule said "do not modify test files." There was no structural gate enforcing it specifically on test-file paths. 
The pipeline does have a write gate that stops writes to files outside the manifest's modify list, covered in an earlier post on this blog ("[The Write Gate](https://dontcodethisathome.com/correct-code-wrong-file-how-the-write-gate-contains-scope-creep)"), but test files live in a distinct field of the manifest schema and the gate at the time of this run did not extend to them. The model could read the instruction and choose to follow it, or not. On the fixture the cost per occurrence was small. On a real project with expensive stages, "the model mostly follows instructions" is not a cost control. ## All three were already in the backlog What made this real-project run different from a typical "ship it and see what breaks" story is that none of the three failures were surprises. The large-file problem was catalogued as Gap #87 in the project's architecture-gaps document, logged long before the real-project run, the first time the Coder tried to modify one function in a fixture file and broke three others in the process. The fixture files were 50 to 200 lines, small enough that a workaround held. The entry in the document said: "this will break on production-sized files." It did, exactly as predicted, on a file seven times larger than anything the fixture contained. The signature mismatch was a known data-path gap. The Test Writer's output was not being injected into the Coder's context. On the fixture this was a minor annoyance, because the fixture's functions were simple enough that the Coder usually guessed the same signature. On a real project with domain-specific APIs, the divergence was immediate and unrecoverable without the data-path fix. The test-file write issue was catalogued as a failure mode early in the project's life. I had [deferred the structural fix](https://dontcodethisathome.com/intentional-technical-debt-building-features-in-the-wrong-order), a sandbox or API-mode enforcement, because the cost per occurrence was low on the fixture. The structural vulnerability was documented, the workaround was known, and the trigger condition was named. The real project confirmed all three; it did not discover any of them. ![Diagram showing the fixture phase on the left producing catalogue entries in the centre, and the first real-project run on the right. Arrows from each of the three failures point left to their pre-logged catalogue entries, showing the failures were predicted before the run happened. Bottom row shows fixture cost of roughly four dollars versus no cold debugging on the real project.](https://dontcodethisathome.com/images/fixture-first-gap-catalogue-diagram.svg) ## Why the gap catalogue was worth keeping The fixture was not a toy demo. It was a systematic early warning system, and the gap catalogue was what converted fixture-run incidents into predictions about the first real-project run. Every failure mode I catalogued during fixture runs was a prediction about what would go wrong at scale or on unfamiliar code. Some predictions were wrong, and some problems I expected never appeared. But the three biggest issues on the first real-project run were already in the backlog with gap numbers, root cause analyses, and proposed solutions. This changes the economics of the first real-project run. Instead of an open-ended debugging effort on unknown failures, the work collapsed to confirming known ones. The diagnostic was already written and the fix designs sketched; the run was informational, not exploratory. ## The catalogue effect A backlog tracks what a team plans to fix. 
> A gap catalogue tracks what the team has chosen *not* to fix yet, with the root cause, the workaround currently holding the line, and the specific condition that will force the fix. The first is a list of future work. The second is a running bet on which risks are safe to defer, and what signal will tell you when they are not. Most engineering teams keep the first. Almost none keep the second in writing. The difference shows up the first time the signal fires. Writing down every failure mode, including the ones that cost an extra attempt, produced slightly wrong output, or only appeared at low model effort, creates a searchable database of infrastructure risk. When the real project broke, I did not debug from scratch. I grepped the failure-modes document. "Reconstruct appends instead of replacing" was a specific instance of Gap #87. "Test Writer and Coder disagree on signatures" was a new failure-mode number, but the underlying gap was already documented as a missing data-path injection. The new number was five minutes of writing, not five hours of investigation. The fixture project cost roughly four dollars across eighteen tickets. The real-project Planner-only run cost $0.085. The full pipeline attempt that failed cost under a dollar. The ticket eventually shipped green across twenty-five pipeline runs that each exposed and fixed one infrastructure bug, for about $0.76 end-to-end on the final attempt. The total cost of proving that those three gaps were real blockers was under five dollars, and the diagnostic came pre-written. For reference, the same three gaps discovered cold on a production incident would cost at minimum an hour of engineering time each before a root cause was even named, and significantly more than that if the engineer handling the incident is new to the codebase. ## The scale gap: what a fixture cannot test Fixtures are small by design. Their small size is the point, not the limitation. The fixture proved that the brain layer works. The Planner reasons correctly about codebases, the Debugger diagnoses logic bugs, the Reviewer catches scope violations. These are the hardest things to build and the most expensive to debug. What the fixture could not test was scale: files over 500 lines, test suites that take four minutes to run, monorepo workspace configurations, CRLF line endings from a Windows clone. These are infrastructure concerns, not architectural ones. The real project proved the infrastructure did not scale. That is actually the better outcome. The brain, the part that is hard to fix, works. The plumbing, the part that is mechanical and diagnosable, did not. I would rather have it that way around. ## The brain transfers, the plumbing does not The Planner produced a valid manifest on its first attempt against a codebase it had never seen. It correctly identified the target file, the dependency chain, the test file location, and the acceptance criteria. Thirty-nine machine-extracted dependency signatures, all correct. The nine pipeline code fixes that landed alongside the first real-project run were all mechanical: shell operator handling in test commands, CRLF normalisation after git clone, workspace hoisting conflicts, LSP log paths, timestamped output directories. Each was a short edit; none required rethinking the architecture. This is the argument for fixture-first development. Build and validate the hard part in a controlled environment where failures are cheap and fast to diagnose. 
Then when the real project breaks, the failures are in the easy part, and the hard part already works. This is not TDD. TDD exercises the code under test. It is not canary either. Canary stages a rollout of already-shipped code. Fixture-first development exercises the *pipeline's reasoning* on a controlled project, so the expensive-to-debug architectural layers have already been proven by the time they meet unfamiliar code. One honest caveat before the next section: this is a single real project. The claim that the brain layer transfers across codebases survives or not as more real projects land. What this run proved is that it *can*, not that it *will*. The catalogue discipline is what makes the next real project cheap to run, regardless of whether this one ticket's success generalises directly. ## Where this stops being true Fixture-first works to the extent the fixture exercises the capabilities the pipeline claims to have. A fixture with only short files produces false confidence about large-file handling. A fixture with only one API style produces false confidence about signature variance across a real codebase. Every gap I predicted correctly was one I had already written down as a known limitation of the fixture itself. "This will break on production-sized files" is only a prediction because I knew the fixture did not contain files anywhere near the size a real codebase carries, not because the break itself was hard to foresee. The discipline that made the catalogue valuable was not writing down failures. It was writing down what the fixture did *not* cover. The gap catalogue does both; a backlog usually does only the first. ## What I would tell someone building an AI pipeline Do not skip the fixture phase. Build a small, controlled project that exercises every capability the pipeline claims to have. Run every ticket type against it. Write down every failure, including the ones worked around. And, more important than any of that, write down every capability the fixture *cannot* exercise, and why, even with no plan to fix it yet. The gap between "this was tested" and "this was admitted as untested" is where production surprises live. Closing that gap is cheap when it is writing and expensive when it is debugging. > Eighty failure modes feel like overhead when they are being written. They feel like foresight when the real project breaks and the diagnosis is already on the screen. --- *The pipeline runs inside Docker on real tickets. Numbers in this post are from actual runs against a TypeScript blog-API fixture and a ~100k line TypeScript monorepo. Session counts and gap numbers are snapshots as of the run being described, not current totals. Still R&D.* ================================================================================ # How claude -p Silently Inflates Your Pipeline Token Costs URL: https://dontcodethisathome.com/how-claude-p-silently-inflates-your-pipeline-token-costs Date: 2026-04-11 Description: Using claude -p in a pipeline? The model has bash access you never granted. Each tool call re-sends your full context. One sentence cuts token spend by 52%. Tags: AI Engineering, Autonomous Engineering Pipeline Dependencies: Claude Code CLI, LLM APIs One of the pipeline stages was using 34,691 input tokens per call. After adding one sentence to the prompt, it dropped to 16,520. Same model, same input, same context. The model had been running bash commands I never asked it to run. 
## The discovery I was checking the call log after a run and noticed `turns: 4` on what should have been a single-turn call. In Claude subscription mode, `turns > 1` means the CLI made internal tool calls (bash commands, file reads, writes) inside the session. I had not asked for any of that. The prompt tells the model to analyse the provided context and return a JSON response. Nothing about running commands. But `claude -p` is Claude Code. It has tools: bash, read, write, web search. Even with `--system-prompt ""` suppressing the default instructions, the model still has access to those tools. This is easy to miss if you treat `claude -p` as a simple text-in/text-out interface. When you give it a prompt that mentions code symbols and file paths, it decides to be helpful and look things up itself. ![Two-column flowchart comparing expected single-turn model call against actual four-turn call. Left column "What you expect": pipeline prepares context, single model call (1 turn), returns response. Right column "What actually happens": same start, then 3 bash tool calls each re-sending ~16k tokens across turns 2–4, before finally returning a response.](https://dontcodethisathome.com/images/redundant-tool-loop.svg) ## The cost Each internal turn re-sends the full context. This stage's context is around 16k tokens: system prompt, symbol data, task description, project rules. With 4 turns, that is around 64k tokens of input, though the log reports 34k because some comes from cache. The real cost is not just tokens. It is the 3 redundant lookups that the pipeline had already done before the model call. The pipeline [resolves all the symbols](https://dontcodethisathome.com/what-calls-this-function-why-ai-coding-agents-need-a-language-server) mentioned in the task, passes the results as context, and then the model goes and looks up the same symbols again on its own. Duplicate work, invisible unless you check `turns` in the call log. ## The fix One sentence added to the prompt, injected only in subscription mode: > **Do not use any tools.** Do not run bash commands, do not look up symbols, do not read files. All the context you need is provided below. Work from it directly. In API mode, tools are only available if you explicitly pass them in the request. The model gets a single HTTP request with the prompt and returns a response. The fix is conditional: subscription mode gets the constraint, API mode gets nothing. ## The result | Metric | Before | After | Delta | |--------|--------|-------|-------| | Input tokens | 34,691 | 16,520 | **-52%** | | Output tokens | 2,754 | 1,760 | **-36%** | | Duration | 54.0s | 34.0s | **-37%** | | Turns | 4 | **1** | Fixed | One turn is the floor: the model receives the prompt and returns the response. The 3 extra turns were redundant tool calls it initiated on its own. The model found the same files, the same symbols, and produced the same structure. It did not need the extra lookups. ## Applied across all stages The Planner stage was the worst case (largest context, most temptation to explore), but every stage had the same exposure. All run in subscription mode with bash available. The constraint was applied to every prompt template and wired through a shared helper. In API mode, it does not apply and adds no overhead. ## The principle > Treating `claude -p` as a text-in/text-out interface does not make it one. Claude Code has tools. Those tools cost tokens, add latency, and produce side effects your pipeline did not account for. 
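Wired through a shared prompt helper, the sentence-level fix from earlier is roughly this shape. A sketch with illustrative names, not the pipeline's actual code:

```typescript
// Illustrative only; the real helper and mode detection are not shown here.
type CallMode = "subscription" | "api";

const NO_TOOLS_PREAMBLE =
  "Do not use any tools. Do not run bash commands, do not look up symbols, " +
  "do not read files. All the context you need is provided below. Work from it directly.";

// Every stage's prompt template passes through this before the model call.
function finalizePrompt(stagePrompt: string, mode: CallMode): string {
  // API mode gets nothing: without tools in the request there is nothing
  // to suppress, so the constraint would only add tokens.
  if (mode === "api") return stagePrompt;
  // Subscription mode (claude -p) always has tools available, so the
  // constraint is injected into every stage's prompt.
  return `${NO_TOOLS_PREAMBLE}\n\n${stagePrompt}`;
}
```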
Three options, from cheapest to most structural: 1. **Prompt suppression** (what I did here): tell the model not to use tools. Works, but relies on prompt compliance. 2. **`--system-prompt ""`**: suppresses the default Claude Code system prompt. Does not remove tool access, it just removes the instructions that tell the model how to use them. 3. **API mode**: send a stateless HTTP request. Tools are only available if you explicitly include them. By default, the model has no bash, no file access, nothing beyond generating text. This is the structural fix. API mode output quality for these stages is unconfirmed until tested under production conditions. For R&D and subscription-based development, option 1 is sufficient. For production, option 3 is the right answer. ## What to check in your own pipeline Look for `num_turns` or `turns` in your call logs. If any stage shows `turns > 1`, the model is making internal tool calls. Each extra turn re-sends your full context and adds latency. You may be paying 2-4x what you think. ================================================================================ # Silent Data Destruction: The Write Path Bug in Agentic Pipelines URL: https://dontcodethisathome.com/silent-data-destruction-the-write-path-bug-in-agentic-pipelines Date: 2026-04-04 Description: The Coder added a new function to an existing file. The pipeline reported success. All seven existing functions were gone. Tags: Autonomous Engineering Pipeline, Agent Safety Dependencies: AI Agents, TypeScript, File I/O The Coder added a new function to an existing file. The pipeline told me it succeeded. All the existing functions in that file were gone. No error, no warning, no exception in the pipeline output. The pipeline wrote the file, ran the tests, and 48 tests failed with `_reset is not a function`. The file that used to export seven functions now contained one. ## What happened The pipeline has a reconstruction step. When the Coder targets a specific function in an existing file, the pipeline looks up that function's location in the registry, splices in the new version, and leaves everything else untouched. This has worked reliably on every ticket that modified existing functions. The bug appeared when the Coder added a *new* function. The reconstruction logic assumed every symbol already existed in the registry. When it did not find one, it fell back to writing the Coder's raw output as the entire file. Everything else was overwritten. The fix was adding fallback logic so the reconstruction step handles new symbols, not just existing ones. If the registry has no record of the symbol, the pipeline now finds the right insertion point rather than replacing the entire file. ## Why this class of bug matters > This is the scariest category of failure in agentic systems: silent data destruction. The pipeline reported success. The file was written. If the test suite had not caught it, the commit would have gone through with a destroyed module. The test suite caught it because the destroyed functions were imported by other test files. If the destroyed code had no tests, the damage would have been invisible until someone opened the file. Every write path in an agentic pipeline needs to be tested against the "symbol does not exist yet" case. The happy path (modify existing code) works. The edge case (add new code to an existing file) is where data gets destroyed. 
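A sketch of the failure and the fix, using an invented registry and splice interface rather than the pipeline's real reconstruction step:

```typescript
// Illustrative shapes; character offsets stand in for the real location data.
interface SymbolLocation { start: number; end: number }

interface Registry {
  locate(filePath: string, symbolName: string): SymbolLocation | undefined;
}

function reconstruct(
  registry: Registry,
  filePath: string,
  original: string,    // current contents of the file on disk
  symbolName: string,
  newSource: string    // the Coder's output for this one symbol
): string {
  const loc = registry.locate(filePath, symbolName);
  if (loc) {
    // Happy path: the symbol already exists, so splice the new version in place.
    return original.slice(0, loc.start) + newSource + original.slice(loc.end);
  }
  // Edge case: the symbol is new. The buggy version returned `newSource` here,
  // silently replacing the whole file. The fix keeps the existing content and
  // inserts the new symbol; appending at the end is a simplification of
  // "find the right insertion point".
  return original + "\n\n" + newSource;
}
```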
![Stats card showing three consecutive runs hit the same bug, 48 tests failed on each run from imports of the destroyed module, zero errors or warnings in pipeline output, and a 15-line fix to the reconstruction logic.](https://dontcodethisathome.com/images/bug-ate-file-stats.png) The bug existed because the reconstruction path was only tested with existing symbols. Adding new symbols to existing files was a case nobody had written a test for, because every previous ticket either created new files or modified existing functions. --- _Numbers are from real runs against a TypeScript blog API fixture. The pipeline runs inside Docker with no human in the loop between ticket input and reviewed, committed code._ ================================================================================ # Four Pipeline Bugs That Only Surface With Less Capable Models URL: https://dontcodethisathome.com/four-pipeline-bugs-that-only-surface-at-lower-model-effort Date: 2026-03-31 Description: A ticket that passed twice failed four times at lower model effort, exposing four structural pipeline bugs the higher-effort run had masked. Tags: AI Engineering, Autonomous Engineering Pipeline, Testing Dependencies: LLM APIs, TypeScript, Autonomous Engineering Pipelines A ticket that had passed cleanly twice before failed four times in a row. Same ticket, same fixture, same pipeline. The only change: a lower-effort model setting. Not a different model. The same model, told to spend less compute per token. Each failure was a different pipeline bug. All four had existed during the previous runs. The higher-effort setting had worked around every one of them. ## The setup The pipeline runs five agents in sequence: Planner, Test Writer, Coder, Debugger, Reviewer. Each agent runs in its own isolated session with no shared state. The ticket was a category filter on a blog API endpoint. It had passed twice before, producing clean one-attempt approvals in about 55 seconds. On the third run, the same ticket hit four distinct failures across four consecutive attempts. ![Table showing four runs of the same ticket. The first two passed cleanly. The third failed four times, each at a different pipeline stage. After fixing all four bugs, the same run passed in 54 seconds.](https://dontcodethisathome.com/images/dumber-model-session-comparison.png) ## Failure 1: constraint in the wrong location The Test Writer produced CommonJS `require()` statements in a project that uses ESM imports. The pipeline had a constraint saying "use ESM imports" injected into the prompt. The higher-effort run had followed the constraint anyway, inferring intent from the surrounding context. At lower effort, the model ignored it. The fix was moving the same constraint to a more prominent position in the prompt, where the model follows it regardless of effort level. Same words, different position. The fix is invisible in a diff. This is the hardest class to find. The constraint exists, the tests pass at higher effort, and nothing in the output tells you the model is compensating for poor placement. It only surfaces when [the model stops compensating](https://dontcodethisathome.com/llm-non-determinism-is-a-pipeline-failure-not-a-model-problem). ## Failure 2: no recovery for unexpected output format The Debugger stage returned XML markup instead of JSON. The output parser got an empty string and crashed. The error handler fired a cleanup routine with a misleading message about leftover files when it was actually cleaning up the current run's output. 
The higher-effort run had never produced XML output, so the parser had never been tested on it. The fix: constrain the Debugger's output format, add defensive parsing for unexpected markup, and make the cleanup message report the actual cause instead of assuming one. ## Failure 3: missing context The Coder failed all three attempts on the same test. The implementation needed a pagination utility that existed in the codebase but was not in the Coder's context. The higher-effort run had inferred the import by reasoning about the surrounding code. At lower effort, the Coder tried to reimplement pagination from scratch and got the edge case wrong every time. The fix was making the context explicit. The utility belonged in the Coder's input, not in the model's ability to discover it. Every time the pipeline relies on the model "figuring out" something that could be stated directly, that is a structural gap waiting for a weaker run to expose it. ## Failure 4: insufficient error feedback The Test Writer produced a syntax error. The retry mechanism fired, but it sent back a generic "syntax error, try again" with no details. The higher-effort run had never produced a syntax error at this stage, so the retry path had never been exercised with real data. The fix was including the actual test runner output in the retry message. ![Pipeline flow diagram with six stages. Failures 1 and 4 hit the Test Writer. Failure 2 hit the Debugger. Failure 3 hit the Coder after exhausting all three attempts. Each failure is at a different stage.](https://dontcodethisathome.com/images/dumber-model-pipeline-failures.svg) ## The principle > On a previous project, I tested on the oldest, weakest phone with the highest usage share in the analytics. If the app ran smoothly on that device, it ran smoothly on everything newer. You test on the floor, not the ceiling, because the floor is where the structural problems live. The same applies to AI pipelines. A pipeline that only works because the model is smart enough to fill in what you forgot to specify is not a robust pipeline. The correctness should live in the structure, not in the model's ability to recover from your gaps. The lower-effort setting did not create these bugs. It revealed them. Each fix made the pipeline more reliable at every effort level, including the one that had been passing all along. ![Table classifying four failures by stage, visible symptom, and structural root cause. Each failure has a different root cause requiring a different class of fix.](https://dontcodethisathome.com/images/dumber-model-failure-taxonomy.png) If your pipeline only shows clean runs at high effort, you do not know where the real guardrails are. --- _Numbers are from real runs against a TypeScript blog API fixture. The pipeline runs inside Docker with no human in the loop between ticket input and reviewed, committed code._ ================================================================================ # LLM Non-Determinism Is a Pipeline Failure, Not a Model Problem URL: https://dontcodethisathome.com/llm-non-determinism-is-a-pipeline-failure-not-a-model-problem Date: 2026-03-28 Description: Same ticket, same pipeline config, different result two days apart. Why the first run passing was not confirmation that the constraint was enforced. Tags: AI Engineering, Autonomous Engineering Pipeline, Testing Dependencies: LLM APIs, TypeScript, ES Modules A category filter ticket passed on the first run. 55,556 tokens, one Coder attempt, clean APPROVE. 
Two days later, same ticket, same codebase, same pipeline config: it failed at the Test Writer stage. The instinct when this happens is to call it a regression, but a regression implies something changed. Nothing changed. What the second run revealed was that the first run had been relying on the model making the right call, and on the second run it did not. That is a different kind of failure. ## The feature The ticket added category filtering to the posts API. `GET /api/posts?category=typescript` returns only posts tagged with that category. The Planner had to identify the posts handler and the categories lib as the relevant scope, and reason that this was a cross-domain change: the posts handler would import from the categories lib for the first time. On the first run, it handled all of that correctly. The Test Writer produced a test file for the new route, the Coder implemented the handler change on attempt 1, the Reviewer approved. 61 seconds from ticket to committed code. ## What the Test Writer produced on the second run The second run failed at the Test Writer stage. The test file it produced included this line inside a test body: ```typescript const categories = require('@/lib/categories') ``` This is valid JavaScript syntax. On a TypeScript project configured for ESM with `"module": "esnext"` in the compiler config, it is the wrong tool for the job. The test runner executes files as ES modules. `require()` is a CommonJS function. It runs, but the resolver it uses is Node's native CommonJS resolver, which has no knowledge of path aliases like `@/`. The test runner's alias configuration only applies to `import` statements processed through its own transform. `require()` calls bypass that entirely and go directly to the native resolver. ## The error ```text FAIL tests/posts-category-filter.test.ts > GET /api/posts with category filter > returns posts filtered by category slug Error: Cannot find module '@/lib/categories' Require stack: - /workspace/tests/posts-category-filter.test.ts ❯ tests/posts-category-filter.test.ts:7:30 Test Files 1 failed (1) Tests 1 failed (1) ``` `lib/categories.ts` existed. It was correctly indexed in the registry. The Planner had used it to scope the manifest two minutes earlier in the same run. The error is identical to what you would see if the file did not exist at all. ## The failure chain The Debugger received the test output and the code on disk. Its job is to diagnose logic failures from evidence: the failure message, the implementation, the manifest scope. What it received said `Cannot find module '@/lib/categories'`. The most direct reading of that evidence is that `lib/categories.ts` is missing. The Debugger diagnosed a missing file. The Coder, following the diagnosis, attempted to create `lib/categories.ts`. The write gate compared the attempted path against the manifest. `lib/categories.ts` was not listed. The pipeline halted. Three stages ran correctly before the Debugger produced a plausible diagnosis from misleading evidence. The source of the misleading evidence was a single `require()` call in the test file, which produced an error indistinguishable from a genuinely absent module. ## Why the first run was not confirmation The Test Writer has a general rule about import style. On the first run, the model applied it correctly and used `import` throughout the test file. On the second run, the same model with the same prompt used `require()` in one location inside a test body. LLM outputs are not deterministic. 
A rule in a prompt is a rule the model tends to follow. It is not a constraint the pipeline enforces. That distinction only becomes visible when the model does not follow the rule, which happens non-deterministically across runs, with no warning and no change in configuration. > The first run passing was evidence that the prompt rule usually works. It was not evidence that the pipeline enforces it. The difference between those two things is invisible until the second run. ## The fix The pre-flight stage now detects the project's module format automatically. If the project uses ESM, the Test Writer receives a specific constraint for that run: this is an ESM project, use top-level `import` statements throughout, never `require()` inside test bodies. This is not a general rule about good practice. It is a specific instruction derived from the project itself, applied at the start of every run. The model does not have to recall the rule or generalise from a prior example. The system reads the project and tells it what applies. On the re-run after the fix: APPROVE, one Coder attempt, 56,267 tokens, 61 seconds. The Test Writer produced correct `import` statements throughout. ![Terminal output showing a category filter pipeline run: Planner, Test Writer, Coder attempt 1, Reviewer, all stages passing, verdict APPROVE, 56,267 tokens, 61 seconds total runtime.](https://dontcodethisathome.com/images/the-regression-that-was-not-a-code-change-og.png) ## The generalizable part A prompt rule and a structural constraint look identical inside a prompt. The difference is in the failure mode. A prompt rule can fail on any run where [the model's output drifts](https://dontcodethisathome.com/four-pipeline-bugs-that-only-surface-at-lower-model-effort) from what the rule intended. The failure is silent: the output is syntactically valid, it passes no gate until the test suite runs, and the resulting error looks like a different problem entirely. You are debugging a missing-file error when the actual cause is an import style violation two stages upstream. A structural constraint derived from the project's own configuration works differently. If the config file is present, the constraint fires on every run, without relying on the model's consistency across invocations. If the config file is missing, the constraint is not injected and the pipeline behaves as before, which is honest, because there is nothing to derive. The practical question when writing a prompt rule is: what happens when the model ignores it? If the answer is a silent failure that looks like a different problem downstream, the rule is carrying less weight than it appears to be. For a property that must hold on every run, the reliable approach is to derive it from something the system can read, not something the model is expected to recall. ## What this does not solve Project config detection only works when there is a config file to read and the relevant constraint is expressed in it. A project with a non-standard build setup, or one where the build tooling diverges from what the compiler config declares, produces no injected rule. Prompt coverage is the only protection in those cases, and it carries the same non-determinism as before. The blog fixture is well-structured and uses standard TypeScript configuration. Whether this approach holds on projects where the build tooling and compiler config are out of sync is not yet tested. That is the next class of failure to find. --- *Numbers are from real runs against a TypeScript blog API fixture. 
The pipeline runs inside Docker. Still R&D.* ================================================================================ # Intentional Technical Debt: Building Features in the Wrong Order URL: https://dontcodethisathome.com/intentional-technical-debt-building-features-in-the-wrong-order Date: 2026-03-21 Description: The pipeline committed code before branch isolation existed. The risk was real, named, given a close condition. That is what makes it different from a shortcut. Tags: AI Engineering, Autonomous Engineering Pipeline, Agent Safety Dependencies: Git, Autonomous Engineering Pipelines, Software Architecture The pipeline commits code after a Reviewer approves it. That feature was built before the feature that controls which branch those commits land on. The risk was real and known. For several sessions, a pipeline APPROVE meant agent-written code committed to whatever branch happened to be active. On the test fixture project that was acceptable, because the fixture is disposable. On a real codebase, it would mean agent-written commits landing directly on master with no feature branch to delete and no clean rollback path. The reason for the ordering was not impatience. It was that without commits, the pipeline could not prove it worked. ## What a commit actually proves By the time the pipeline reaches a commit, a lot of things have already been verified: the Planner decomposed the ticket correctly, the Test Writer produced tests that passed the quality gate, the Coder made those tests pass within its authorized scope, the Reviewer evaluated the diff and issued APPROVE. What has not been verified is whether any of that produced real, committed output. A pipeline run that produces changed files but no committed state has not proven the full cycle. It has proven parts of the cycle. > The git commit is not just record-keeping. It is the point at which the pipeline proves it can deliver a finished unit of work that survives a branch checkout, a `git log`, a code review. Until there is a commit, you have a system that transforms input files into output files. The commit is what makes it a software delivery system. That proof matters early, before other components are built on top of it. If the commit logic was broken or the commit message format was wrong or the git hooks rejected the commit, those failures needed to be discovered before sessions of other work accumulated on top of an unproven delivery step. Building commit support first meant that from that point onwards, every pipeline run produced verifiable committed output. Every subsequent session of work was validated against that proof. ## The specific risk Without branch isolation, every APPROVE commit lands on the active branch. For a single test run on a clean fixture, the risk is theoretical. For a sequential multi-ticket run, it compounds. The scenario that warranted documenting: a sequential run starts on master, runs five tickets, two succeed and three fail in ways that produce wrong commits before the failure is detected. Recovery requires identifying exactly which commits to revert, in which order, without overshooting into commits that were correct. On a test fixture with a short history and no other contributors, that is recoverable. On a project with active development, it is a problem. ![Git history showing four commits on master in sequence: human work, APPROVE ticket 1, APPROVE ticket 2 (wrong), APPROVE ticket 3 (wrong). 
The wrong commits are tangled into the main branch with no clean rollback path.](https://dontcodethisathome.com/images/git-commits-no-branch-isolation.svg) With branch isolation, each ticket gets its own branch. A bad run produces a branch to delete, not commits to unpick from master. ![Git history showing master with one commit and two separate feature branches, pipeline/ticket-1 and pipeline/ticket-2, each with a single APPROVE commit. A bad run means deleting a branch, not reverting commits on master.](https://dontcodethisathome.com/images/git-commits-with-branch-isolation.svg) The risk was written down immediately with a specific close condition: must resolve before pointing the pipeline at anything real. Branch isolation was built shortly after, and it was closed before that trigger arrived. > Intentional technical debt and accidental technical debt look identical in the codebase. What makes them different is whether the risk was named and given a close condition at the time it was accepted. One gets paid on schedule. The other compounds silently. --- *Dates and sequencing are from actual project history.* ================================================================================ # Why a Warning Is Worse Than a Hard Stop URL: https://dontcodethisathome.com/why-a-warning-is-worse-than-a-hard-stop Date: 2026-03-13 Description: When the pipeline detects zero test files, logging a warning and continuing produces output that looks correct but cannot be caught by any downstream gate. Tags: Autonomous Engineering Pipeline, Agent Safety, Testing Dependencies: TypeScript, Jest, Test-Driven Development The first version of the preflight check logged a warning and continued. Zero test files detected. Warning written to the log. All five agents proceed. That felt reasonable. Greenfield projects have no tests yet. The pipeline is designed for exactly that situation: a ticket comes in, the Test Writer writes failing tests, the Coder implements against them, the pipeline proves the code works. Why block a greenfield project at the gate? The answer is in what the Test Writer does before it writes a single test. ## What the Test Writer reads first Before producing any test output, the Test Writer reads the existing test suite. Not to check coverage or find gaps. To learn conventions. It is looking at things like: how tests in this project are structured (`describe`/`it` nesting, `test` vs `it` preference), how imports are written (relative paths, barrel exports, path aliases), what assertion style is used (`expect(res.status).toBe(200)` vs `expect(res).toMatchObject({status: 200})`), how async tests are handled (`async/await` vs `.resolves`/`.rejects`), whether tests use a shared server setup or create a fresh instance per test, and whether there are common mock patterns for the database or external services. These are not things the Test Writer guesses from the ticket. They are things it reads from the codebase. The quality of its output is directly bounded by the quality of the examples it can observe. With zero examples, it cannot observe anything. It writes tests that look structurally plausible based on its training data. For a TypeScript project, those tests will probably use Jest syntax, probably use `describe` blocks, probably have reasonable imports. They will be internally consistent. They just will not match the actual conventions of the project, because there is nothing to match against. ## Why that matters downstream The Coder receives the test file the Test Writer produced. 
It implements against those tests. If the Test Writer used relative import paths like `'../../lib/posts'` but the actual project uses path aliases like `'@/lib/posts'`, the Coder will write implementation code that follows the pattern in the tests. The tests pass. TypeScript might not complain. The pipeline reports success. The problem surfaces later, when the code is reviewed or deployed, or when another ticket's Coder follows the same conventions and adds to the divergence. Or it does not surface at all, and the codebase accumulates a pattern that does not match the rest of the project. The key point: no stage in the pipeline flags this. The Test Writer did not fail. The Coder did not fail. The tests passed. The Reviewer sees a green run and approves. The pipeline produced output that is wrong in a way that is invisible to every gate it passed through. ![Flowchart showing two paths from a preflight check: the warn-and-continue path leads to invented conventions and undetected wrong behaviour; the hard-stop path leads to adding a bootstrap test and correct output.](https://dontcodethisathome.com/images/hard-stop-preflight-flowchart.svg) ## Why a warning does not catch this A warning in the log does not interrupt the pipeline. In a pipeline running inside Docker with no human in the loop, a warning at the preflight stage is visible only if someone reads the full run output afterwards. The pipeline still runs. Tokens are still spent. The Test Writer still generates tests from guesswork. The code still gets committed. The point of the preflight check is to stop before anything is spent. Zero test files means the Test Writer is about to operate without the input it needs to do its job correctly. Logging that as a warning and continuing is the same as logging "warning: fuel gauge reads empty" and continuing the drive. The warning is accurate. The outcome is unchanged. A hard stop changes the outcome. The pipeline halts with a specific, actionable error before any agent runs, before any tokens are spent, before any code is written. Nothing is committed. The state of the repository is unchanged. ```text Pre-flight FAILED: no test files found. Add at least one bootstrap test before running the loop. ``` That error costs nothing to produce and nothing to act on. ## The bootstrap test The correct response to a hard stop is not to weaken the gate. It is to add one test. Not a full suite, not comprehensive coverage. One test that establishes the conventions the Test Writer needs to learn from. For a TypeScript API project, something like this is sufficient: ```typescript import request from 'supertest' import { app } from '../lib/server' describe('smoke', () => { it('API responds', async () => { const res = await request(app).get('/api/health') expect(res.status).toBe(200) }) }) ``` That single test tells the Test Writer several things: the project imports from `'../lib/server'`, it uses `supertest` for HTTP testing, it uses `describe`/`it` blocks, it uses `async/await`, it accesses the response status with `res.status`, and it uses `expect().toBe()`. That is enough context to write tests that match the project's conventions. Three lines of setup is not a burden. It is a prerequisite. A pipeline that can infer conventions from one example is more useful than a pipeline that runs without examples and produces output that looks correct but is not. 
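The gate itself is small. A sketch, assuming the `glob` package and an invented helper name; the real preflight check is not shown here:

```typescript
import { globSync } from "glob";

// Hard stop before any agent runs: no tokens spent, nothing written.
function preflightTestFileCheck(projectRoot: string): void {
  const testFiles = globSync("**/*.{test,spec}.{ts,tsx,js}", {
    cwd: projectRoot,
    ignore: ["node_modules/**"],
  });
  if (testFiles.length === 0) {
    throw new Error(
      "Pre-flight FAILED: no test files found. " +
        "Add at least one bootstrap test before running the loop."
    );
  }
  // Deliberately a count check, not a quality check.
}
```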
## Where hard lines belong The preflight check is one instance of a general decision: where in a system does a hard stop produce a better outcome than a warning? The answer depends on what the warning allows through. If the warning allows through a run that produces output which looks correct but is wrong in a way [the downstream gates cannot detect](https://dontcodethisathome.com/the-quality-gate-that-passed-when-it-failed), the warning is not a softer version of a hard stop. It is a mechanism for producing undetectable failures at the cost of appearing flexible. For the Test Writer, the downstream gates are the test results themselves. If the tests pass but were written from invented conventions, the tests are not a valid check. They are a tautology: the Coder implemented against what the Test Writer wrote, so of course the tests the Test Writer wrote pass. The quality gate is only meaningful if the tests were written from real conventions that reflect what the code should actually do. > The hard stop exists because a loud error at startup is recoverable. A quiet failure that passes every gate is not. One honest caveat: the hard stop checks count, not quality. A project with ten poorly-written tests will satisfy the check. The Test Writer will learn from whatever examples exist, good or bad. The gate establishes a floor (at least one example), not a ceiling. If the existing tests are misleading, the Test Writer will be misled by them. The gate cannot fix that. It can only ensure that the pipeline does not run with no examples at all. --- *The pipeline runs inside Docker on real tickets. Still R&D.* ================================================================================ # Correct Code, Wrong File: How the Write Gate Contains Scope Creep URL: https://dontcodethisathome.com/correct-code-wrong-file-how-the-write-gate-contains-scope-creep Date: 2026-03-06 Description: On attempt 3, the Coder tried to write a file that was not in the manifest. The write gate stopped it before anything hit disk. This is what it is for. Tags: AI Engineering, Autonomous Engineering Pipeline, Agent Safety Dependencies: AI Agents, TypeScript, File I/O On attempt 3, the Coder tried to write `lib/users.ts`. That file was not in the manifest. The pipeline stopped before the write reached disk: ```text ERROR: Coder attempted to write outside manifest scope: ['lib/users.ts'] ``` This is what the write gate is for: not wrong code, but correct code written to the wrong files. ## The incident BLOG-010: add comment author validation to the comment handler. The manifest authorized the Coder to touch two files: `pages/api/comments.ts` and `lib/comments.ts`. The Test Writer produced tests that called the handler with `authorName: 'Frank'` and expected a 201 response. Frank was not in the seed data. The validation logic returned 400 for any author not in the database. There was no path to 201 with Frank as the author. The test assertion was wrong from the start. The Coder did not know this. From inside the loop, a failing test looks the same whether the test is wrong or the implementation is wrong. The Coder made two normal attempts: adjusting import paths, reworking the handler logic. Neither resolved it. The test kept failing. On attempt 3, the Coder found a solution that would work. If Frank existed in the seed data, the validation would pass, the handler would return 201, the test would pass. So the Coder wrote `lib/users.ts` to add Frank as a seed user. The write gate fired before anything hit disk. 
Pipeline halted. No file was written. ## Why the reasoning was correct The Coder's solution was logically valid. Given the test as written, adding Frank to the seed data was a correct fix. The implementation in `pages/api/comments.ts` was already correct: the handler validated authors against the database and rejected unknown authors exactly as designed. The only thing preventing 201 was that Frank did not exist. This is what makes containment hard. The out-of-scope action was not irrational. It was a correct response to an incorrect constraint. The Coder was not malfunctioning. It was doing exactly what it was designed to do: find a set of changes that makes the tests pass. That is precisely why the containment cannot rely on the agent recognising that its solution is wrong: from inside the loop, it cannot make that judgment. It found a valid path to green tests and took it. ## What would have happened without the gate `lib/users.ts` would have been modified. Frank would appear in seed data. The tests would pass. The Reviewer would see a clean run and approve. The PR would merge. Later: a different feature has tests that create a fresh database state and assert the user count. The count is off by one and no one knows why. Or a feature that processes all users encounters unexpected state: a user with no creation timestamp, no activity history, no associated data beyond a name. Or a security audit finds the anomaly and asks where it came from. The corruption is invisible at the point it is introduced. It surfaces as a mystery elsewhere. This is the specific failure mode the gate prevents. Not obviously wrong output. Correctly-passing output that embeds a wrong assumption into shared state, where it will cause problems that are completely disconnected from the commit that introduced them. ## The containment model Every manifest declares `files_to_modify`: the list of files the Coder is authorized to touch. The Planner produces this list before the Coder runs, based on what the ticket actually requires. The write gate compares every attempted write against this list. Any file not on the list is blocked, regardless of whether the write would produce passing tests. The same list drives the enrichment step that populates the Coder's source context, and a separate failure mode exists where [over-broad path matching floods the context with noise](https://dontcodethisathome.com/how-filename-lookups-flood-an-ai-coding-agents-context-window). For BLOG-010, the manifest declared: ```json { "files_to_modify": [ "pages/api/comments.ts", "lib/comments.ts" ] } ``` `lib/users.ts` was not on the list. That is the entire check. ![Flowchart showing the write gate decision: a write from the Coder goes to the write gate, which either allows it to disk if the file is in the manifest, or raises an out-of-scope error and halts the pipeline if it is not.](https://dontcodethisathome.com/images/write-gate-flowchart.svg) The gate runs before the write reaches disk. It does not check the content of the write or reason about whether the change is appropriate. It checks one thing: is this file in scope? That simplicity is intentional. > A gate that tries to reason about correctness can be argued with. A gate that checks a list cannot. ## The actual fix The test was wrong. `tests/comments.test.ts` had been written with `authorName: 'Frank'`, an arbitrary name that happened not to be in the seed data. Changed to `authorName: 'alice'`. BLOG-010 re-run: APPROVE on attempt 1, 45,116 tokens, 41 seconds. 
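The gate that fired on attempt 3 is simple enough to sketch. A minimal illustration of the check, assuming the manifest is available as parsed JSON (names are illustrative, not the pipeline's actual code):

```typescript
// Illustrative write gate: every attempted write is compared against the
// manifest's files_to_modify list before it reaches disk.
interface Manifest {
  files_to_modify: string[]
}

function assertWriteInScope(manifest: Manifest, targetPath: string): void {
  if (!manifest.files_to_modify.includes(targetPath)) {
    // The gate does not inspect the content of the write or reason about
    // intent. It checks one thing: is this file in scope?
    throw new Error(
      `Coder attempted to write outside manifest scope: ['${targetPath}']`
    )
  }
}

// For BLOG-010: 'pages/api/comments.ts' and 'lib/comments.ts' pass through,
// 'lib/users.ts' throws and halts the pipeline.
```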
The Coder never needed to touch `lib/users.ts`. The implementation in `pages/api/comments.ts` was correct the entire time. Three Coder attempts and a write gate firing were caused by a single wrong string in a test assertion. The write gate did not fix the test. A human fixed the test. What the gate did was prevent the pipeline from producing a passing result that masked the real problem, and surface it as an explicit error instead. ## Agents expand scope when blocked This is not a quirk of one model or one ticket. It is an emergent property of any system that optimises toward an objective without hard boundaries. The objective is to make the tests pass. If the tools available within scope cannot achieve that, the agent will look outside scope. It will not decide to stop trying. It will find a way. The write gate does not prevent the agent from trying to expand scope. It prevents the try from succeeding. The distinction matters because the behaviour will recur. Different tickets, different agents, different models. The pattern is the same. Block a goal-directed system from achieving its goal within defined limits and it will push against those limits. The design implication is that [containment has to be structural](https://dontcodethisathome.com/prompt-rules-are-advisory-validators-are-binding), checked at the point of action, not at the point of decision. By the time the Coder decided to write `lib/users.ts`, it had reasoned its way to that decision through three attempts. Asking it to reconsider at the point of decision would require giving it information it does not have: that the test was wrong, not the implementation. The gate does not need that information. It checks the list. > The manifest scope is not just a performance optimisation that limits the blast radius of a bad run. It is a safety mechanism. The two properties are the same property: an agent that cannot write outside its scope cannot corrupt shared state outside its scope, regardless of what it decides to do. --- *The write gate is a hard constraint, not a heuristic. It will block correct writes if the manifest is wrong, and it will not catch incorrect writes within the manifest scope. The Planner's scoping decision is load-bearing: a badly-scoped manifest means a badly-scoped Coder. The gate enforces the boundary. It does not validate the boundary itself.* *Numbers are from actual runs against a TypeScript blog API fixture. Pipeline runs inside Docker. Still R&D.* ================================================================================ # Why the Debugger Never Inherits the Coder's Reasoning URL: https://dontcodethisathome.com/why-the-debugger-never-inherits-the-coders-reasoning Date: 2026-02-27 Description: The Debugger receives the test failure and the code on disk, not the Coder's reasoning. That isolation is not a constraint. It is the design. Tags: AI Engineering, Autonomous Engineering Pipeline, Testing Dependencies: AI Agents, TypeScript, Test-Driven Development ![Card showing a failing test assertion (Expected: 100, Received: 200) and the Debugger's diagnosis: the utility already clamps perPage to 100, so the handler's guard is unreachable. Corrective brief says to remove the redundant guard and trust the parsed value. 
561 tokens.](https://dontcodethisathome.com/images/debugger-test-failure-diagnosis.png) The test output was specific: ```text FAIL tests/posts.test.ts ● paginated posts › clamps perPage to 100 maximum expect(received).toBe(expected) Expected: 100 Received: 200 ``` The Coder had written a handler that accepted a `perPage` parameter, clamped it at 100, and returned paginated results. The guard was there. The logic looked correct. The test said it was not. The Debugger's diagnosis was specific: the `parsePaginationParams` utility already clamps `perPage` to 100 before returning, so the handler's guard fires on the post-clamp value and can never trigger. The endpoint returns 200 items because the guard is unreachable. The corrective brief: remove the redundant guard and trust the parsed value. 561 output tokens. The Coder read the brief, made a targeted patch, the tests passed. That is the Debugger's job. ```diff // pages/api/posts.ts const { page, perPage } = parsePaginationParams(req.query) - if (perPage > 100) { - return res.status(400).json({ error: 'perPage max is 100' }) - } const posts = await getPosts({ page, perPage }) ``` ## What the Debugger receives The Debugger's inputs are fixed: - The manifest: the ticket's declared scope, files, and acceptance criteria (produced by the Planner, whose accuracy depends on [knowing the blast radius](https://dontcodethisathome.com/what-calls-this-function-why-ai-coding-agents-need-a-language-server)) - The test file: the full content of the failing test suite - The test output: the exact failure from the most recent run - The last implementation the Coder wrote - The attempt history: what the Coder tried on previous passes That last one is worth pausing on. The Debugger can see the code the Coder produced on each attempt. What it cannot see is the Coder's reasoning: the chain of thought, the assumptions, the decision process that led to each attempt. That is intentional. > An agent that inherits another agent's reasoning inherits that agent's wrong assumptions. If the Coder convinced itself that the problem was in the import path, and the Debugger reads that reasoning, it will be pulled toward the same conclusion. The Debugger starts from the test output and the code on disk, not from a narrative the Coder constructed about why its approach should work. The other bound is the codebase itself. The Debugger does not receive the full project, only the files the manifest authorizes. It cannot look outside that scope to explain the failure. An agent with the full codebase will find explanations that are plausible but not relevant. The manifest scope forces the diagnosis to stay grounded in what the Coder was actually working on. The output is structured JSON. The schema is not ceremonial. An agent asked to produce free-form diagnosis will produce free-form diagnosis: observations, possibilities, suggestions. Structured output forces a specific claim ("the guard fires on the post-clamp value") instead of a general one ("there may be an ordering issue in the validation logic"). The Coder can act on the first. The second produces another Coder attempt that guesses. ## Structural failures and logic failures The Debugger is for logic failures. It is not for structural ones, and the distinction matters. A structural failure is a type error, a missing import, a function called with the wrong signature, a caller that was not updated when an interface changed. The cause is a gap in scope or a resolved-name mismatch. 
These do not require reasoning about what the code does. They require knowing what the code references. The language server handles these before the test suite runs. A diagnostic pass runs between every Coder write and the tests. Import errors and type mismatches get fixes applied in-place, without spending a Debugger pass. What reaches the Debugger is a logic failure: the code is structurally correct, the types check, and the tests still fail because the implementation does the wrong thing. The `perPage` clamping example is a logic failure. The guard was present, the types were correct, the logic was wrong. That separation matters for the retry budget. A Debugger pass costs a full inference cycle. Before LSP, Debugger passes were spent on both categories: structural failures that a diagnostic pass would now catch automatically, and logic failures that actually need reasoning. Post-LSP, the Debugger is reserved for the second category. The budget goes further. ## A second example BLOG-009: migrate `Post.tags` from `string[]` to `Tag[]`. The Coder updated the type definition, updated the seed data, and updated the implementation. The type migration was correct. The tests failed with a TypeScript error: ```text tests/posts.test.ts:124:24 - error TS2345: Argument of type 'string' is not assignable to parameter of type 'Tag'. 124 expect(p.tags.includes('typescript')).toBe(true) ``` The Coder had changed the type. The existing tests had not been updated to match. They were still asserting `p.tags.includes('typescript')`, passing a string to `includes()` on a `Tag[]` array. TypeScript rejected it. Two errors, same pattern, different assertions. The Debugger identified the mismatch: the migration code was correct, and the tests were written against the old string shape and needed to match the new `Tag` object shape. It also flagged reverting the type migration as an approach to avoid. The fix was to the assertions, not the implementation. `p.tags.includes('typescript')` became a check against `tag.slug`. APPROVE on the next Coder attempt. 99,682 tokens total, 134 seconds. This category of failure is worth naming separately: type migration consequence. The Coder changed the type and the data shape correctly. The existing tests were written against the old shape. The Debugger caught the mismatch the Coder left behind. ## What the Debugger cannot do The Debugger cannot fix a wrong acceptance criterion. If the ticket says "return 404 when post not found" and the acceptance criterion was written as "return 400 when post not found," the Debugger will diagnose the implementation as correct and flag the test as wrong. That is accurate, but it is not what the Test Writer intended. That failure surfaces either at the Reviewer stage or not at all. It cannot fix a test written against a wrong interface. If the Test Writer assumed a function signature that the implementation does not match, the Debugger will identify the mismatch. It cannot determine which side is authoritative. It cannot recover from a Planner that scoped the wrong files. If the Planner missed a caller and the Coder's implementation broke it, the Debugger will see a failing test for a file that is not in its inputs. It will flag that it cannot diagnose the failure. It does not have access to the broken caller. That failure surfaces as a pipeline halt, not a wrong fix.
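Those bounds are visible in the output itself. A hedged sketch of what the structured diagnosis might look like (the field names are illustrative; the post does not publish the actual schema):

```typescript
// Illustrative shape for the Debugger's structured output. Every field forces
// a specific, evidence-backed claim rather than a free-form narrative.
interface DebuggerDiagnosis {
  evidence: string[]       // verbatim quotes from the test output or the code on disk
  diagnosis: string        // e.g. "the guard fires on the post-clamp value and never triggers"
  correctiveBrief: string  // the targeted change the Coder should make
  avoid: string[]          // approaches flagged as wrong, e.g. "revert the type migration"
  outOfScope?: string      // set when the failure involves files outside the manifest;
                           // the pipeline halts instead of acting on a guess
}
```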
> The Debugger's scope is exactly as narrow as the manifest scope. That narrowness is what makes it reliable. An agent with unlimited reach will find a fix for any failure, including the wrong ones. The Debugger can only fix what is inside its authorized scope, and it can only explain failures using evidence the test output provides. For failures outside those bounds, the Reviewer is the backstop. The Debugger is not trying to cover every failure mode. It is trying to cover the failure mode it can actually verify: logic errors in correctly-scoped, structurally-valid implementations. --- *The pipeline runs inside Docker on real tickets. Numbers are from actual runs against a TypeScript blog API fixture. The pipeline is still R&D. The Debugger's evidence-quoting requirement catches most logic failures, but there are classes of failure it will not diagnose correctly, and the Reviewer stage exists partly for that reason.* ================================================================================ # What Calls This Function? Why AI Coding Agents Need a Language Server URL: https://dontcodethisathome.com/what-calls-this-function-why-ai-coding-agents-need-a-language-server Date: 2026-02-20 Description: Tree-Sitter tells you where a symbol is defined. It cannot tell you where it is called. That gap cost one pipeline run 33,000 tokens to find out. Tags: AI Engineering, Autonomous Engineering Pipeline, Architecture Dependencies: Language Server Protocol, Tree-Sitter, TypeScript The language server has a startup cost. One Coder retry costs more than that startup, in tokens, every single time. That asymmetry is why it ended up as an architectural requirement and not an optional optimisation. The economics are not close. ## What the pipeline used before The first version of the symbol indexer used regular expressions. It worked well enough to get started. Function definitions, class declarations, export statements: regex can extract those reliably from well-formatted code. For a proof of concept, it was sufficient. But it was brittle from the start and I knew it. A teacher told me something when I was learning to program that stuck: > If you are solving a problem with regular expressions, you now have two problems. The regex indexer was always a starting point, not a destination. Minified files, unusual formatting, nested constructs. Any of those could break a regex pattern silently, producing missing symbols or wrong line numbers with no error to investigate. Tree-Sitter replaced it. Tree-Sitter is a fast incremental syntax parser that produces a proper syntax tree rather than pattern-matching against text. It handles formatting variations, nested structures, and edge cases that would require an ever-growing set of regex patches to cover. It is also error-tolerant: it can parse files with syntax errors and still extract the symbols it can identify. The results go into a registry: a record per symbol with its location, type, and metadata. When the Planner reads a ticket, it queries the registry for a map of what exists and where, reasons about which symbols are relevant, and passes precise details for those specific symbols to the downstream agents. The registry is persistent and queryable, giving the agents a live index of the codebase [without reading entire files](https://dontcodethisathome.com/how-claude-p-silently-inflates-your-pipeline-token-costs). For most tickets (additive changes, new endpoints, new functions), that is sufficient. 
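To make the registry concrete, here is a sketch of what one record and one lookup might look like. The field names are assumptions drawn from features this article names (location, type, metadata, domain tagging, content hashing), not the actual schema:

```typescript
// Illustrative registry entry: one record per symbol extracted by Tree-Sitter.
interface SymbolRecord {
  name: string          // e.g. "getAllPosts"
  kind: 'function' | 'class' | 'interface' | 'type' | 'const'
  file: string          // e.g. "lib/posts.ts"
  startLine: number
  exported: boolean
  contentHash: string   // lets unchanged files skip re-indexing
  domainTags: string[]  // e.g. ["posts", "pagination"]
}

// The Planner queries by name or domain and gets a precise, current location
// instead of guessing from trained intuition.
function findSymbols(registry: SymbolRecord[], query: string): SymbolRecord[] {
  return registry.filter(
    (s) => s.name === query || s.domainTags.includes(query)
  )
}
```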
## Where Tree-Sitter stops Tree-Sitter is a syntax parser. It reads one file's text and produces a syntax tree. It has no concept of what a name resolves to across files. It can tell you where `getAllPosts` is defined. It cannot tell you where it is called. It can extract `Post` as a parameter annotation. It cannot tell you what `Post` resolves to, or whether a change to it breaks three other files. This is not a gap in the implementation. It is a property of what Tree-Sitter is. A syntax parser does not do semantic analysis. That requires a language server. ![Tree-Sitter knows where getAllPosts is defined. The language server knows every file that calls it: posts.ts, search.ts, and api/posts/index.ts, leading to a known blast radius.](https://dontcodethisathome.com/images/lsp-vs-tree-sitter.svg) ## The question that breaks the registry approach The Planner's job, when decomposing an interface-changing ticket, is to determine blast radius: which other parts of the codebase depend on the symbol being changed, and which of those need to change too. For "add a comments endpoint", an additive change with no existing interface touched, the registry is sufficient. The Planner looks at what exists, decides what to create, produces a manifest. The check passes cleanly. For "change how `getAllPosts` returns data", the Planner must know what calls `getAllPosts` to determine whether an "update callers" sub-ticket is needed. Without that, it either [guesses wrong scope](https://dontcodethisathome.com/why-ai-coding-agents-fail-on-evolving-codebases) and leaves callers broken, or [over-scopes and touches files it should not](https://dontcodethisathome.com/correct-code-wrong-file-how-the-write-gate-contains-scope-creep). Neither is acceptable. > `textDocument/references` answers this directly. Send the server the file and position of a symbol, get back every location in the codebase that references it. Nothing else does this reliably across barrel re-exports, TypeScript path aliases, and index files. ## The economic case A Coder retry cycle costs more than it appears to. A missed import breaks the type checker. Tests fail. The pipeline re-runs the full Coder prompt: reconstructed context, full generation cycle, another test run, potentially a Debugger pass. On a moderately complex ticket that is 20,000 to 30,000 additional input tokens and another 30 to 60 seconds of wall clock time. A language server diagnostic fix for the same import error: one local process request, zero tokens, one targeted patch, one test re-run. The test re-run is unavoidable. Everything else disappears. The blast-radius result from the first pre-LSP run on a pagination ticket is concrete. The Planner missed that `search.ts` was a caller of `getAllPosts`. The Coder changed the return type. Eleven typecheck errors. A [Debugger pass](https://dontcodethisathome.com/why-the-debugger-never-inherits-the-coders-reasoning). 92,618 tokens. 122 seconds. The same ticket with LSP active: no `search.ts` errors. Typecheck clean. The blast-radius failure did not occur. | | Pre-LSP | Post-LSP | |---|---|---| | Result | APPROVE | APPROVE | | `search.ts` typecheck errors | 11 | 0 | | `lib/posts.ts` modified | no | yes | | Debugger fired for | blast-radius type mismatch | logic bug, unrelated to LSP | | Tokens | 92,618 | 95,170 | | Runtime | 122s | 97s | The token counts are similar because both runs had a Debugger cycle, just for different reasons. In the pre-LSP run, the Debugger fixed a structural failure caused by a missed blast-radius caller.
In the post-LSP run, it fixed a logic error: a validation check placed after a clamping function, so it never fired. That is a different category of failure, one that requires understanding call flow, not just knowing which files were touched. LSP cannot prevent those. That is still the Debugger's job. The blast-radius overhead in the pre-LSP run was 33,364 extra input tokens and 74 extra seconds, the cost of missing one caller. That overhead does not appear in the post-LSP run. > The pre-LSP Debugger spent its cycles on structural failures that LSP now prevents. The post-LSP Debugger spent its cycle on something it is actually for. ## What changes The language server starts at boot, not lazily. Decomposition needs reference data before any manifest is produced, so the server has to be ready at preflight, not on first use. A diagnostic pass runs between every Coder write and the test suite. The server checks the modified files for type and import errors before the tests run. Structural problems get caught and fixed before they reach the test suite. The test suite handles logic failures. The diagnostic pass handles structural ones. > They are different problems and should not share the same retry budget. The Debugger is now reserved for genuine logic failures. Import and type errors that previously reached it no longer do. ## What does not change Tree-Sitter and the registry stay. Bulk indexing, domain tagging, content hashing, the context assembler. None of that changes. The registry is still how the Planner navigates the codebase at the start of every ticket. The language server extends it, it does not replace it. The pre-write syntax gate also stays. Tree-Sitter is faster for checking whether a file is structurally valid before writing it to disk. That check runs in the same process, no network hop, sub-millisecond. Using the language server for that would be slower with no benefit. The test runner stays. LSP diagnostics are a pre-test filter. They catch structural problems early. They do not verify behaviour. That is still the test suite's job. ## What this does not solve Reference lookup prevents the specific failure where a caller is missed during blast-radius analysis. It does not prevent logic errors, scope misjudgements, or multi-file side effects that do not manifest as type errors. Those are still the Reviewer's problem to catch. The fixture for this work is a TypeScript blog API, which is small and well-structured. Whether the same approach holds on a larger or messier codebase is not yet tested. --- The registry answers "what exists and where." The language server answers "what calls this and what does it resolve to." Both are needed. The right question was never which one to use. It was when the second one becomes worth its startup cost. The first retry that hit a missed caller answered it. --- *The pipeline runs inside Docker on real tickets. Numbers are from actual runs against a TypeScript blog API fixture. Still R&D.* ================================================================================ # The Quality Gate That Passed When It Failed URL: https://dontcodethisathome.com/the-quality-gate-that-passed-when-it-failed Date: 2026-02-13 Description: A Haiku optimization made the L2 quality gate silently pass on every run. The fix was removing the LLM call entirely. Tags: Autonomous Engineering Pipeline, Testing, Architecture Dependencies: LLM APIs, TypeScript, Autonomous Engineering Pipelines The optimization made every metric worse. 
The L2 quality check (the gate that verifies every acceptance criterion in a ticket maps to at least one test) was running a small Sonnet call: a prompt of around 300 tokens, eight tokens out. Swapping it to Haiku produced this: ```text test_quality_l2 in=6,866 out=5,027 31.8s WARNING: no closing fence — output may be truncated or malformed ``` 5,027 output tokens. 31.8 seconds. And the gate passed anyway, silently on every run, because the parsing code couldn't make sense of the output and treated a parse failure as a clean result. That is the shape of the problem. The rest of this post is about why it happened, what replaced the LLM call entirely, and what it says about where LLMs belong in a pipeline and where they don't. ## The gate The pipeline has a layered quality gate that runs between the Test Writer and the Coder. It checks the generated tests before any implementation code is written, and it operates in three tiers. L1 is about import scope. The tests must only import from files listed in the ticket manifest. This is enforced through a prompt rule rather than code, because static import analysis across TypeScript barrel exports and path aliases produces too many false positives to be useful. L2 is about coverage. Every acceptance criterion in the ticket must map to at least one test description. If a criterion has no matching test, the Test Writer is asked to retry before the Coder sees the file. L3 is about assertion depth. A test file full of `.toBeDefined()` calls is not useful. If the file contains only weak assertions and no strong ones, it fails. Neither L1 nor L3 involves a model call of its own: L1 is a prompt rule, L3 is a deterministic regex pass over the test content. L2 was the exception, and that is what this post is about. ![Pipeline quality gate flow: Test Writer feeds into L1 import scope check, then L2 criteria mapping check, then L3 assertion depth check, then Coder](https://dontcodethisathome.com/images/quality-gate-diagram.svg) ## The original design The problem L2 is solving is that acceptance criteria and test descriptions are often written in different language. A criterion might say: > Return 404 when the post is not found. The test might be called "handles missing resource." Those mean the same thing, but a keyword match would miss the connection. The decision was to use a small LLM call for the mapping. Send the list of criteria and the list of test descriptions, ask the model which criteria have no matching test, get back a JSON object. The prompt is around 300 tokens. The expected response is `{"missing": []}` or `{"missing": ["Return 404 when post not found"]}`. The check passes or fails, and either the Test Writer retries or the pipeline moves on to the Coder. When I ran the first successful proof run of a pagination ticket through the pipeline, the Sonnet call for this check looked like this: ```text test_quality_l2 in=6,857 out=8 3.6s ``` Eight output tokens. It returned `{"missing": []}`, the gate passed, and the pipeline continued. Total cost: negligible. Total time: 3.6 seconds out of a 68-second run. ## The optimization The reasoning was: this is a binary yes/no task. The model produces eight tokens. Sonnet is the most capable model in the pipeline and it is doing nothing interesting here. Switch it to Haiku. Same task, fraction of the cost, probably faster. The next run: ```text test_quality_l2 in=6,866 out=5,027 31.8s WARNING: no closing fence — output may be truncated or malformed ``` Haiku produced 5,027 output tokens for a task that should produce eight. It took 31.8 seconds.
The warning in the logs meant the model had wrapped its output in a code fence and either failed to close it or produced content long enough that something downstream broke. What happened: Haiku did not follow the output format instruction. Instead of returning a compact JSON object, it produced a long analysis: > Let me examine each criterion in turn. The first criterion states that... The test descriptions include... Based on this analysis... Then, somewhere in that output, the actual JSON. Then more text. The parsing code caught the failure and treated it as a clean result. The gate passed. The pipeline continued to the Coder. The quality check had been silently disabled. ## The failure mode matters more than the failure Haiku performing worse than Sonnet is not the point. The shape of the failure is. When the L2 check fails to parse the model's response, the code treats it as a pass. The reasoning behind that design was to avoid false positives: if the model returns something unexpected, it is better to let the pipeline continue than to block it on a malformed response. That reasoning is sound when the model occasionally produces slightly malformed JSON on an otherwise correct response. It breaks down completely when the model consistently produces the wrong format, because then every run silently bypasses the quality gate. The pipeline logs show `[agent call]` for the L2 stage, a result appears, the gate passes. Nothing in the output tells you the check did not actually run. You would only know by looking at the token count and noticing 5,027 output tokens where you expected eight. The Haiku switch did not just cost more time and money than Sonnet. It made the quality gate invisible. > A check that passes when it fails is worse than no check at all, because no check at all is at least honest about what it is doing. ## Why the fix is not "use a better model" The obvious response is to switch back to Sonnet. That works. The Sonnet call returned the right format. But the problem runs deeper than model selection. The L2 check is answering a structural question: does every criterion have at least one test description that covers it? LLMs are good at understanding whether two sentences mean the same thing. They are less reliable at following strict output format constraints, and the failure mode when they do not is a silent pass rather than an explicit error. The right tool for this task is not semantic understanding. It is token overlap. ## Token overlap The replacement compares meaningful terms extracted from each acceptance criterion against meaningful terms extracted from each test description. A criterion is considered covered if it shares enough terms with at least one test description to suggest they are addressing the same thing. This approach has real limitations. A criterion phrased very differently from its corresponding test description will not match. But when acceptance criteria and test descriptions are both generated by the same pipeline for the same ticket, the language tends to be close enough that term overlap catches the obvious gaps. The failure mode is explicit. If a criterion has no matching test, the check fails with a message listing the unmatched criteria. There is no model output to parse, no JSON to validate, no fallback that silently promotes a failure to a pass. The check either finds a match or it does not. ## The threshold problem Shipping the token overlap approach immediately exposed another edge case. 
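To see where, here is a minimal sketch of the check with the threshold pulled out as a parameter. The tokenisation and stop-word list are assumptions for illustration, not the pipeline's actual code:

```typescript
// Illustrative L2 coverage check: plain term overlap between acceptance
// criteria and test descriptions, with the overlap threshold as a parameter.
const STOP_WORDS = new Set(['the', 'is', 'are', 'to', 'when', 'of', 'and'])

function meaningfulTerms(text: string): Set<string> {
  return new Set(
    text
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((term) => term.length > 2 && !STOP_WORDS.has(term))
  )
}

// Returns the criteria whose overlap with every test description is at most
// `minShared` terms. The failure is explicit: a list of unmatched criteria,
// no model output to parse, no fallback that promotes a failure to a pass.
function uncoveredCriteria(
  criteria: string[],
  testDescriptions: string[],
  minShared: number
): string[] {
  const descriptionTerms = testDescriptions.map(meaningfulTerms)
  return criteria.filter((criterion) => {
    const terms = meaningfulTerms(criterion)
    return !descriptionTerms.some(
      (desc) => [...terms].filter((t) => desc.has(t)).length > minShared
    )
  })
}
```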
The initial threshold was calibrated to require more than one shared term, on the reasoning that a single common domain word is too coincidental to count as real coverage. That broke on the very next run. One of the ticket's criteria was a regression check: > Existing POST /api/posts behaviour is unchanged. The meaningful terms in that sentence are "existing", "behaviour", "unchanged", and "posts". Test descriptions for a pagination feature will not use "unchanged" or "behaviour". The only overlap with any pagination test was "posts": one term, not enough to meet the threshold. The check failed, the Test Writer retried, failed again, and the pipeline stopped. Dropping the threshold fixed it. The check now flags criteria with zero term overlap. A single shared domain word is enough to say "there is at least a test in this area." The regression criterion passes because any pagination test mentioning "posts" shares that term. This is a known weakening of the check. In a posts API, "posts" appears in almost every criterion and almost every test description. A criterion sharing only "posts" with a test is barely verified at all. But the alternative is false positives that block the pipeline on legitimate test files, which is worse. ## What this is and what it is not The token overlap check is a stopgap. It is better than the Haiku LLM call because it fails explicitly and runs in zero time. It is worse than the Sonnet LLM call because it cannot handle vocabulary differences between criteria and test descriptions, and the threshold calibration is fragile. The right long-term answer is probably embeddings. Compute a vector for each acceptance criterion and each test description, check cosine similarity, flag criteria with no test description above a similarity threshold. That handles synonyms, stemming, and phrasing differences properly without touching an LLM in the hot path. A small local embedding model would work fine for ten to thirty short strings. That is not built yet. For now, the check catches the obvious case: a criterion with no test in the same general area at all. The Reviewer stage, which reads the full diff and test output at the end, is the real backstop for coverage gaps that L2 misses. L2 is a cheap early signal, not a guarantee. ## What this run actually cost The comparative numbers for the same ticket, same fixture: | | Run 1 (Sonnet) | Run 2 (Haiku) | Run 3 (token overlap) | |---|---|---|---| | test_quality_l2 output tokens | 8 | 5,027 | 0 | | test_quality_l2 duration | 3.6s | 31.8s | <1ms | | Total runtime | 68s | 95s | n/a¹ | | Gate actually ran | yes | no | yes | ¹ *The run that validated the token overlap replacement hit a separate blast-radius failure: the Coder changed a function signature that broke two callers the Planner had not included in scope. That added a Debugger retry cycle and 74 seconds. It is a known pre-LSP problem, not a symptom of the L2 change, so the total is not comparable to runs 1 and 2.* Run 2 took 27 seconds longer, spent more tokens, and did not actually check anything. Run 3 spent no tokens, took no measurable time, and the gate ran correctly. The optimization made every metric worse. The replacement made every metric better. ## The generalizable part When you have a task that requires a specific output format and the penalty for a format violation is a [silent pass](https://dontcodethisathome.com/why-a-warning-is-worse-than-a-hard-stop), the model's tendency toward verbosity becomes a correctness problem rather than a quality issue. 
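The anti-pattern is easy to write down. A hedged reconstruction of the kind of fallback that produces the silent pass, not the pipeline's actual code:

```typescript
// Reconstruction of the failure mode: a parse error on the model's
// response is promoted to a passing result.
function parseCoverageResponse(modelResponse: string): { missing: string[] } {
  try {
    return JSON.parse(modelResponse) as { missing: string[] }
  } catch {
    // Intended to avoid blocking the run on slightly malformed JSON.
    // When the model consistently returns prose instead of JSON, this branch
    // runs on every ticket and the gate is silently disabled.
    return { missing: [] }
  }
}
```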
Smaller models tend to be more verbose, not less. They hedge more, explain more, and are less reliable at following strict structural constraints when those constraints conflict with their training toward helpfulness. The right question before adding an LLM call to a pipeline is not "which model is cheapest for this?" It is "what happens when the model gets this wrong?" If the answer is "the check passes anyway," that is not a check. It is a logged no-op with latency attached. For semantic matching between human-written text, LLMs are the right tool. For structural verification of whether a set of tokens overlaps with another set, they are not. The acceptance criteria mapping check is the second problem, dressed up as the first. --- *Numbers are from real runs against a TypeScript blog API fixture. The pipeline runs inside Docker with no human in the loop between ticket input and reviewed, committed code.* ================================================================================ # Why AI Coding Agents Fail on Evolving Codebases URL: https://dontcodethisathome.com/why-ai-coding-agents-fail-on-evolving-codebases Date: 2026-02-06 Description: Not a model capability problem. An agent with the wrong codebase version produces output that is plausible but wrong in ways that are hard to catch. Tags: AI Engineering, Autonomous Engineering Pipeline, Architecture Dependencies: AI Agents, Code Indexing, Software Architecture A codebase that works today is not the same codebase an AI agent was trained to understand. It has been refactored, extended, renamed. Functions that used to live in one file now live in three. An interface that used to take two arguments takes five. The agent does not know this. It works from whatever it last saw, or from a general model of how code tends to be structured, and it produces output that is plausible but wrong in the specific ways that only someone who knows this particular codebase would catch. > The model is not failing because it lacks intelligence. It is failing because it is reasoning from an approximation of the codebase: part context window, part trained intuition about how code tends to be structured. Neither is the same as knowing where things actually are right now. I started building a pipeline to see if this could be fixed. The core idea is that before any agent writes a single line of code, the system builds a [live index](https://dontcodethisathome.com/what-calls-this-function-why-ai-coding-agents-need-a-language-server) of every symbol in the codebase: every function, every type, every interface, the file it lives in, the line it starts on. This index is rebuilt before every task. Agents do not guess at where things are. They query the index and get a precise, current answer. The token cost of a smaller context is a side effect of this, not the goal. The goal is correctness. A precise context produces better output than a large one, on the same model, every time. The other thing I knew from shipping real software: the environment has to be clean before anyone starts work. > You would not onboard a new engineer onto a project where the tests are already failing. The pipeline applies the same standard. Before any agent runs, the baseline is confirmed green. If it is not, the pipeline stops. No agent time, no API cost, no output to review. The problem was already there before the pipeline touched anything. These two ideas, precise context and a clean starting state, are where the project began. 
Whether they hold up at scale is what the rest of this series is about. --- _Still R&D. The project is running on a test fixture. The harder tests are ahead._