From LLM Luck to Structurally Guaranteed: One Ticket Across Four Architectural Eras

12 MIN READ

What this post is

Seven runs of the same ticket, spaced across a month of architectural work on the autonomous engineering pipeline I am building. Each run is captured in a per-run loop log with cost data, run IDs, and a description of what landed in the pipeline since the previous green. Read in sequence, the seven runs document four distinct architectural eras of that pipeline, and they trace a specific shift: each era removed one more decision from the LLM’s responsibility and replaced it with something the project’s existing machine-extracted facts could compute on their own.

In plain terms: each era is a deliberate effort to strip the LLM out of decisions where its hallucination surface has nothing to do with language understanding and everything to do with structural derivation. LLM hallucination on test authoring, on assertion targets, on schema-shape derivation, on deduplication identity: each of these is a place the pipeline used to ask the model and now answers from data.

The work is still R&D. The pipeline runs against a ~100k line TypeScript monorepo using Tree-Sitter and a language server (LSP) for the project’s structural facts; the trajectory is real but not a benchmark claim.

The headline numbers:

| Date | Model | Per-stage effort | Cost | Tests | $/test | Era |
|---|---|---|---|---|---|---|
| 2026-04-08 | Opus 4.6 | medium | $3.04 (incomplete) | n/a | n/a | Opus comparison |
| 2026-04-10 | Sonnet 4.6 | low/medium mix | $1.538 | 4/4 | $0.385 | Wiring patcher |
| 2026-04-13 | Sonnet 4.6 | low/medium mix | $1.024 | 4/4 | $0.256 | Single-write-mechanism |
| 2026-04-14 | Sonnet 4.6 | low/medium mix | $1.114 | 4/4 | $0.279 | Mid-pipeline-fix |
| 2026-04-20 | Sonnet 4.6 | low/medium mix | $0.587 | 4/4 | $0.147 | Deterministic materializer |
| 2026-05-06 | Sonnet 4.6 | low/medium mix | $0.543 | 4/4 | $0.136 | Validator + transformer |
| 2026-05-08 | Sonnet 4.6 | low/medium mix | $0.666 | 9/9 | $0.074 | Discriminating coverage |

Each pipeline stage carries its own model and effort configuration; “low/medium mix” above means stages on the Brain layer (Planner, Ticket Generator) and the Coder run at medium while Reviewers, prose, and routing stages run at low. The configuration is the topic of an earlier post. The April 8 Opus row is a one-off cost comparison, cancelled mid-Coder, included here for contrast against the Sonnet baseline.

Per-test cost dropped from $0.385 to $0.074. Total cost dropped from $1.538 to $0.666. The test count nearly doubled at the end, not because the ticket got harder, but because the architecture started exercising coverage that earlier runs had silently lost.

The cost trajectory is the second-order signal. The first-order signal is what each era removed from the LLM’s responsibility, and what that did to the pipeline’s correctness guarantees.

The ticket

The ticket threads a new optional flag through four layers of a real production codebase: a Zod discriminated union with per-variant options, two anonymous closures inside route handlers (a regular POST handler and a streaming POST handler, each parsing the request through a slightly different path), a 140-line business-logic function with branching dispatch, and a pipeline orchestrator that must conditionally skip a downstream call.

Target files: a 1,387-line route module and a 480-line orchestrator module. The work is not greenfield. It is surgical editing inside large existing modules: leaving every symbol the ticket does not touch byte-identical, preserving the surrounding control flow, and not rewriting code that already does the right thing. The patterns inside those files are the ones that break code generation: discriminated unions, conditional spreads, anonymous callbacks inside asyncPool, two separate request-parsing paths for the same endpoint. The test file requires vi.mock of the entire pipeline module with eight named exports.
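Reduced to a toy, the threading shape looks something like this. All names here are invented, and the real schema layer uses Zod rather than plain type unions; the point is only the shape: a flag on one variant of a discriminated union that must survive every layer down to a conditional skip in the orchestrator.

```typescript
// Hypothetical reconstruction of the ticket's threading shape (names invented).
type JobRequest =
  | { kind: "batch"; items: string[]; skipEnrichment?: boolean } // the new optional flag
  | { kind: "stream"; source: string };

interface JobResult { enriched: boolean; processed: number }

// Business-logic layer: branching dispatch on the discriminant.
function runJob(req: JobRequest): JobResult {
  const items = req.kind === "batch" ? req.items : [req.source];
  // Orchestrator layer: conditionally skip the downstream call, driven by the
  // threaded flag rather than a value hardcoded somewhere in the chain.
  const enriched = !(req.kind === "batch" && req.skipEnrichment === true);
  return { enriched, processed: items.length };
}
```

A correct implementation makes the skip depend on the flag's actual value at every hop; the failure modes discussed below are exactly the cases where a test cannot tell that apart from a hardcoded skip.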

This is not a “give the LLM the file and trust it” ticket. The work asks two questions: does the data flow correctly across four layers, and does the assertion check the observable thing rather than the wrong thing. It is the right test case for an autonomous pipeline because it requires both mechanical rigour and structural subtlety: the data flow has to be correct, and the test has to assert against the right symbol in the right module.

Era 1: The wiring patcher (Sonnet 4.6, low/medium effort, $1.538, 4/4)

The first green run came after about seventy red attempts spread across two weeks. The breakthrough was not “make the Coder write better code.” It was structural: stop asking the Coder to write the anonymous-closure edits.

The pipeline’s wiring patcher is a deterministic structural-edit step that handles a class of repeated-shape insertions the Coder used to be asked to author. When the manifest stripped the closure references from the Coder’s instructions and the wiring patcher took the closure-level edits downstream without an LLM call, the run went green on the first attempt.

The lesson, in the run’s own words: “The Coder writes symbols; the pipeline handles everything else deterministically.” This was the first instance of the Lego Instructions Principle materialising in code. Before it, the Coder was being asked to make decisions that downstream Tree-Sitter analysis could make better.
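The split itself is small enough to sketch. This is a toy, not the pipeline's manifest format; the entry shape and kind names are invented:

```typescript
// Hypothetical manifest entry: each planned edit is tagged with who executes it.
interface ManifestEntry { target: string; kind: "symbol" | "closure-insertion" }

// Era 1's structural move: symbol-level work goes to the Coder LLM,
// repeated-shape closure insertions go to the deterministic wiring patcher.
function partitionManifest(entries: ManifestEntry[]) {
  return {
    coder: entries.filter((e) => e.kind === "symbol"),
    patcher: entries.filter((e) => e.kind === "closure-insertion"),
  };
}
```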

What the LLM still owned at the end of Era 1: test authoring (Test Writer), test updating per sub-ticket (Test Updater), assertion targeting, mock setup, and almost every structural decision inside the test file. The pipeline had a deterministic write path for a narrow class of code edits and nothing else.

Era 2: Deterministic materializer (Sonnet 4.6, low/medium effort, $0.587, 4/4)

Three consecutive successful runs with byte-identical materialized test files. Output tokens dropped 79% versus the previous era because the LLM Test Writer and Test Updater stages had been retired entirely. Earlier posts on this blog describe both as active LLM stages; both were replaced in this era by the deterministic materializer described below.

In their place: a deterministic materializer that takes a structured test plan (fixtures, mocks, structured scenarios, and acceptance criteria) and renders it to TypeScript via Tree-Sitter. The plan is authored upstream; the rendering is mechanical. Three runs of the same input produce the same output, every time, to the byte.
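A toy version of the property, not the pipeline's renderer (which works through Tree-Sitter rather than string templates, and whose plan shape is richer than this invented one). What matters is that the function is pure: identical plans render to identical bytes, with no model in the path.

```typescript
// Hypothetical test-plan shape; the real structured plan is richer.
interface TestPlan {
  suite: string;
  mocks: Record<string, string>; // module path → mock factory expression
  scenarios: { name: string; call: string; expectation: string }[];
}

function materialize(plan: TestPlan): string {
  const lines: string[] = [];
  // Sort mock modules so output is independent of object-key insertion order.
  for (const mod of Object.keys(plan.mocks).sort()) {
    lines.push(`vi.mock(${JSON.stringify(mod)}, ${plan.mocks[mod]});`);
  }
  lines.push(`describe(${JSON.stringify(plan.suite)}, () => {`);
  for (const s of plan.scenarios) {
    lines.push(`  it(${JSON.stringify(s.name)}, async () => {`);
    lines.push(`    await ${s.call};`);
    lines.push(`    ${s.expectation};`);
    lines.push(`  });`);
  }
  lines.push(`});`);
  return lines.join("\n");
}
```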

Era 2 is also where the Era 3 retrospective lands hardest. From the later loop log, looking back at the Era 2 run:

The earlier 4/4 passing tests were an artefact of the LLM happening to author observable assertion targets and correct mock import paths on this ticket variant. The architecture had no validator or transformer to catch a structurally invalid assertion before render time.

The 4/4 was not a vindication of the architecture. It was a vindication of the LLM’s authoring on this specific ticket variant. If the LLM had hallucinated a different assertion target, one that compiled and looked plausible but pointed to a function the call chain never actually reached, the test would still have rendered, type-checked, and shipped. It would have passed at runtime or failed silently, and either way the pipeline had no structural defence against it.

The deterministic materializer eliminated authoring variance. It did not eliminate authoring correctness as a problem; it just stopped the problem from masking itself behind LLM non-determinism. Whatever the upstream test plan said, the renderer faithfully turned into TypeScript. If the upstream test plan had a wrong assertion in it, you got a wrong assertion in the rendered file with run-to-run consistency.

Deterministic materialization is structural progress, but it is not yet a structural guarantee.

Era 3: Validator + transformer (Sonnet 4.6, low/medium effort, $0.543, 4/4)

The variant changed. A second variant of the same ticket family ran with a deeper call chain and eight sub-tickets instead of three. Different scenario shape, different mock topology, different observable boundaries. The Era 2 architecture should have produced a clean run on it. It did not.

The previous attempt on this variant, two days earlier, failed with two unreachable assertions at runtime. The LLM had authored assertions targeting leaf functions directly, but each leaf was internal to its module and not exposed through the mock target’s import path. Both assertions compiled, both tests ran, and both spies were attached to symbols the runtime call chain did not pass through. Each failure looked like a code bug but was something else: a test that did not test what it claimed to.

The Era 3 work landed a validator-and-transformer pass at ticket-generation time. The validator runs against the project’s call graph and mock topology to check that every assertion the LLM authored will actually observe something at runtime. Where it finds one that won’t, the transformer rewrites the assertion to target a function the test’s actual call chain does pass through.

For the run that produced the Era 3 green, two transforms fired:

unreachable assertion: import path resolves to wrong module
  [rewrote] case 4: 'processItem' → '@server/index'

unreachable assertion: target not observable from test entry
  [rewrote] case 4: 'processItem' → 'runPipeline'

The LLM authored what it would have authored anyway, and the pipeline rewrote the assertion before the materializer saw it, so the runtime ran clean.
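The mechanism can be sketched as a reachability check plus a rewrite. Everything here is an invented reduction: assume the call graph is the project's machine-extracted caller→callee map, and `observable` is the set of symbols actually exposed through the mock boundary.

```typescript
type CallGraph = Record<string, string[]>; // caller → callees

// BFS from the test's entry point: the functions the runtime call chain
// can actually pass through, in discovery order.
function reachable(graph: CallGraph, entry: string): string[] {
  const order: string[] = [entry];
  const seen = new Set<string>([entry]);
  for (let i = 0; i < order.length; i++) {
    for (const callee of graph[order[i]] ?? []) {
      if (!seen.has(callee)) { seen.add(callee); order.push(callee); }
    }
  }
  return order;
}

function transformTarget(
  graph: CallGraph,
  entry: string,
  target: string,          // assertion target as the LLM authored it
  observable: Set<string>, // symbols exposed through the mock boundary
): string {
  const live = reachable(graph, entry);
  if (live.includes(target) && observable.has(target)) return target; // valid as authored
  // Rewrite: nearest observable function whose own call chain reaches the target.
  for (const fn of live) {
    if (observable.has(fn) && reachable(graph, fn).includes(target)) return fn;
  }
  return entry; // assumed here: the entry point is always observable from the test
}
```

On a toy graph `runPipeline → dispatch → processItem` where only `runPipeline` is observable, an authored assertion on `processItem` gets rewritten to `runPipeline`, which is the shape of the second transform in the log above.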

The Era 3 run is the first where the test passing was not LLM luck. The same architecture would have caught the earlier variant’s failure modes too, if any had fired. They did not fire because the LLM’s authoring happened to align with what the project’s mock topology permitted on that variant. The deeper variant flushed the LLM-luck dependence into the open by being the variant where alignment did not happen.

Six of eight sub-tickets in this run committed without the LLM Coder being invoked. They shipped through structured operations: “add a field to a schema variant,” “add a short-circuit guard at a function entry,” “add a field at caller sites,” each a deterministic Tree-Sitter transformation expressed as a manifest op. The Coder LLM ran twice, once for each genuine type extension. Everything else was the pipeline executing instructions, not interpreting them.
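A sketch of what "expressed as a manifest op" might look like. The op names and fields are invented, and the real ops drive Tree-Sitter transformations rather than producing a description; the point is that each one is data an executor can dispatch on, not prose an LLM must interpret.

```typescript
// Hypothetical structured-op union mirroring the three op shapes named above.
type ManifestOp =
  | { op: "add-schema-field"; variant: string; field: string; tsType: string }
  | { op: "add-entry-guard"; fn: string; condition: string }
  | { op: "add-caller-field"; callee: string; field: string; value: string };

// Exhaustive dispatch: every op kind is handled mechanically, no interpretation.
function describeOp(o: ManifestOp): string {
  switch (o.op) {
    case "add-schema-field": return `add field ${o.field}: ${o.tsType} to variant ${o.variant}`;
    case "add-entry-guard":  return `guard ${o.fn} with ${o.condition}`;
    case "add-caller-field": return `pass ${o.field}=${o.value} at call sites of ${o.callee}`;
  }
}
```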

Era 4: Discriminating coverage (Sonnet 4.6, low/medium effort, $0.666, 9/9)

The Era 3 architecture was correctness-preserving. It was not coverage-discriminating.

The value being threaded is a flag. With the flag on, the pipeline short-circuits; with it off, the pipeline proceeds normally. A test suite that exercises only one of those values can pass even if the implementation hardcodes that value internally. A test asserting the short-circuit fires when the flag is true proves nothing about whether the chain actually threads the flag, because a hardcoded if (true) body would also pass it.

Era 4 added bivalent threading twins: every threading-hop acceptance criterion now generates two cases, one per flag value. The test count rose from four to nine for the same ticket, the same correctness, the same implementation.
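A minimal sketch of the twin expansion, with invented names; the real generator works over the structured test plan rather than this toy criterion shape.

```typescript
interface ThreadingCriterion { hop: string; flag: string }
interface GeneratedCase { name: string; flagValue: boolean; expectShortCircuit: boolean }

// One threading-hop criterion expands to a case per flag value, so an
// implementation that hardcodes the flag fails exactly one of the pair.
function expandBivalent(c: ThreadingCriterion): GeneratedCase[] {
  return [true, false].map((v) => ({
    name: `${c.hop} with ${c.flag}=${v}`,
    flagValue: v,
    // For a short-circuit flag the expectation follows the value:
    // the downstream call must fire exactly when the flag is off.
    expectShortCircuit: v,
  }));
}
```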

Five fixes shipped together, all the same anti-pattern. One was a search step using a permissive heuristic where exact identity was available in the project’s existing data. One let the LLM guess a count the test plan already pinned. One missed set-semantics in a deduplication pass and produced silent duplicates downstream. One compared a lossy intermediate representation where the full structure was available, collapsing distinct cases into one. One asserted on the wrong observable target where the manifest already named the right one. Each fix was the same shape: a heuristic where the data already had the answer. Each fix removed an entire class of latent failure, not just the specific instance.
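Two of those fixes share a shape small enough to sketch: deduplication by a lossy key versus the full structural identity the data already carries. The names here are invented; only the fix shape is from the runs.

```typescript
interface AuthoredAssertion { module: string; symbol: string; scenario: string }

// Generic key-based dedup; which key you feed it is the whole fix.
function dedupeBy<T>(items: T[], key: (item: T) => string): T[] {
  const seen = new Set<string>();
  return items.filter((item) => {
    const k = key(item);
    if (seen.has(k)) return false;
    seen.add(k);
    return true;
  });
}

// Heuristic identity: symbol name only — silently collapses distinct assertions.
const lossyKey = (a: AuthoredAssertion) => a.symbol;
// Exact identity: the full structural key the test plan already provides.
const exactKey = (a: AuthoredAssertion) => `${a.module}::${a.symbol}::${a.scenario}`;
```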

The combined run cost $0.666. The marginal cost over Era 3’s run was $0.123, of which $0.054 was the new Test Reviewer LLM stage and $0.038 was the new Impl Reviewer LLM stage. The marginal correctness signal was substantial: Era 4 produces evidence for both flag-value branches; Era 3’s run could not have caught a hardcoded-false-skip implementation.

The pattern

Each era removed one more LLM dependence:

[Diagram: four architectural eras of the autonomous engineering pipeline. Each row shows what the LLM owned before the era on the left, and what the pipeline derives mechanically after the era on the right. Era 1: closure-level code edits handed to a deterministic structural-edit step via Tree-Sitter. Era 2: LLM-authored test files replaced by a deterministic materializer over a structured test plan. Era 3: structurally invalid assertions caught by a validator and rewritten by a transformer before the materializer sees them. Era 4: heuristic intermediate representations and LLM-guessed thresholds replaced by exact data-driven derivations. Footer: the Data Path Principle.]

The principle is not “isolate the LLM” or “give the LLM better context.”

Wherever a decision is mechanically computable from machine-extracted facts, the pipeline must compute and apply it, not surface the facts and trust the LLM to apply the rule.

Every era is the same move applied to a different decision. That move is what the Data Path Principle names.

The cost trajectory is real. Per-test cost dropped roughly fivefold across the seven runs. But the cost is downstream. What changed first, in each era, was the architecture’s defence against being wrong in a particular way. The cost dropped because the pipeline stopped asking the LLM to do things it could do itself.

What this is not

The seven runs are not a benchmark claim. They are the same ticket family on the same TypeScript codebase. The trajectory is real evidence; it is not a leaderboard position.

This is not a linear-progress story. The third green cost slightly more than the second because a cumulative-diff Reviewer addition expanded the Reviewer’s input. Architecture changes have second-order cost effects, and not all of them are reductions.

The architecture is not eternal. The Era 4 work hardened one class of latent failures; the broader anti-pattern, a heuristic where a derivation is possible, will keep producing instances. The principle is the durable artefact; the specific fixes are not.

The pipeline is not complete. The LLM still authors per-case intent, the ticket prose, and the code inside type extensions. What changed across these four eras is not “the LLM has been removed.” It is “the LLM is no longer responsible for decisions the data already answers.”

The headline number is per-test cost dropping from $0.385 to $0.074. The structural number is what the LLM stopped being asked to do.

The pipeline runs inside Docker on real tickets against a ~100k line TypeScript monorepo. The seven runs in this post are drawn from the pipeline’s run archive across April and May 2026. Still R&D.