What changed
Two days after the zero-LLM-code-generation post, the same ticket shape ran again on the AI coding agent pipeline. It cleared every gate on the first loop attempt, same as before, but with three refinements that update the prior post’s claims.
The two “filed but not done” items from the original post are closed. The operator’s design decisions (variant scope, gate location, response shape, placement) now flow through typed fields on the structured handoff instead of free-form question-and-answer text. The grounding chat stage that authors the human-facing ticket prose, which the original post did not yet know had an LLM hallucination surface, now has a four-layer structural defense against drift on resolved decisions.
The pipeline now structurally captures the operator’s intent at chat time, not just the implementation at op-handler time.
The Coder still does not fire. The total cost went up by 13 cents per ticket (26%). The result is strictly more correct.
The numbers
| Metric | Original post (2026-05-18) | This run (2026-05-20) |
|---|---|---|
| Result | FULL_COMPLETION, 6 of 6 ticket tests | FULL_COMPLETION, 5 of 5 ticket tests |
| Full target suite | 959 of 959 | 958 of 959 plus 1 skipped, environment-conditional (LaTeX-renderer integration test, skipped when the compiler binary is not installed, pre-existing) |
| Coder LLM cycles | 0 | 0 |
| Test cases in generated ticket | 6 (2 pairs functionally duplicate) | 5 (all distinct; the duplicate-dedup pass landed) |
| Dead-guard pruning observable | Silent (verify by counting guards in the diff) | Stderr log line in production identifying the dropped guard |
| Operator decisions captured as typed handoff fields | 0 (free-form Q&A text only) | 4 (variant scope, gate location, response shape, placement) |
| Chat LLM cost | $0.194 | $0.320 (full interactive chat) |
| Ticket-generation LLM cost | $0.261 | $0.232 |
| Loop LLM cost (Test Reviewer + Impl Reviewer only) | $0.048 | $0.081 |
| Combined cost (chat + ticket-generation + loop) | $0.503 | $0.633 (+26%) |
One note on the prior post’s numbers: the zero-LLM-code-generation post reported a total cost of $0.308 under the label “chat + ticket-generation + loop.” That figure covered ticket-generation ($0.261) and the loop ($0.048) only; the chat cost ($0.194) was paid in the earlier session that authored the structured handoff and was omitted from that post’s total. This post uses a fully-loaded $0.503 baseline for the same run to make the comparison meaningful.
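The combined figures can be re-derived from the per-stage rows of the table. A minimal sketch that uses only the dollar amounts reported above; the type and variable names are illustrative and are not taken from the pipeline:

```typescript
// Per-stage LLM cost in US dollars, copied from the table above.
type StageCosts = { chat: number; ticketGeneration: number; loop: number };

const originalRun: StageCosts = { chat: 0.194, ticketGeneration: 0.261, loop: 0.048 };
const thisRun: StageCosts = { chat: 0.32, ticketGeneration: 0.232, loop: 0.081 };

// Sum the three stages and round to a tenth of a cent to suppress float noise.
function combinedCost(costs: StageCosts): number {
  return Math.round((costs.chat + costs.ticketGeneration + costs.loop) * 1000) / 1000;
}

const baselineTotal = combinedCost(originalRun); // 0.503
const newTotal = combinedCost(thisRun); // 0.633
const deltaDollars = Math.round((newTotal - baselineTotal) * 1000) / 1000; // 0.13
const deltaPercent = Math.round((deltaDollars / baselineTotal) * 100); // 26

console.log({ baselineTotal, newTotal, deltaDollars, deltaPercent });
```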
Iterations to green this round: one chat-prompt tightening, one accepted chat run, one ticket-generation run, one loop run. Down from seven, because the structural fixes from the original post compound.
What the four decisions produced
The operator answered four design questions in a chat. They did not open a file. They did not locate the schema definition, identify which variants existed, find the dispatch function, or determine which line in each branch was the first side-effecting call. The pipeline extracted all of that from the codebase and used the operator’s answers to decide what to do with it. The operator’s contribution was design intent: scope to all three variants, place the field at the top level, gate in the dispatch function, return the already-fetched record. Everything downstream was mechanical.
Here is the entire committed diff, anonymised. The endpoint is POST /api/items/actions; the request schema is a discriminated union with three variants; the dispatch function branches over the action discriminator.
```diff
--- a/src/server/api/routes/items.ts
+++ b/src/server/api/routes/items.ts
@@ const itemActionRequestSchema = z.discriminatedUnion("action", [
+    dryRun: z.boolean().optional(), // variant A, top-level sibling
+    dryRun: z.boolean().optional(), // variant B, top-level sibling
+    dryRun: z.boolean().optional(), // variant C, top-level sibling
@@ type ItemActionExecutionOptions = {
+  dryRun?: boolean;
@@ async function executeItemActionForItem(
   if (action === "variant-a") {
+    if (options?.dryRun) return { itemId, ok: true, item };
     await sideEffectA(item, ...);
   } else if (action === "variant-b") {
+    if (options?.dryRun) return { itemId, ok: true, item };
     await sideEffectB(item, ...);
   } else {
+    if (options?.dryRun) return { itemId, ok: true, item };
     await sideEffectDefault(item, ...);
@@ itemsRouter.post("/actions", async (req, res) => {
+  const { dryRun } = parsed;
+  ...(dryRun !== undefined ? { dryRun } : {}),
```
Eight added lines. An engineer with exact working knowledge of which files to open, which schema variants exist, and where in each branch the first side-effecting call lands writes this in five minutes. In a ~100k line codebase that knowledge is rarely instant, even for engineers who built the code. That is exactly the point, and it is worth being precise about why.
The value of this run is not the eight lines. It is that every one of them fell out of structured data with no LLM authoring a single character, and that the specific shape of those lines encodes the operator’s four design decisions made earlier at chat time. The field sits at the variant top level rather than nested inside the existing options object, because the operator answered the placement question that way. It appears on all three variants rather than one, because the operator scoped it to all three. The guards live in the dispatch function rather than threaded deeper into the side-effecting callee, because the operator chose that gate location. The guard’s return expression reuses the already-fetched record rather than issuing a second database read, because the operator chose that response shape.
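To make the mapping from the four decisions to code shape concrete, here is a self-contained sketch of the same pattern. It is an illustration under assumptions, not the project's code: plain TypeScript types stand in for the zod discriminated union, the side-effect functions only append to an in-memory log, and `handleItemAction` and `sideEffectLog` are hypothetical names.

```typescript
// Hypothetical stand-ins: the real route uses a zod schema and real side effects.
type Item = { id: string; label: string };

// Placement + variant scope: dryRun is a top-level sibling on all three variants.
type ItemActionRequest =
  | { action: "variant-a"; itemId: string; dryRun?: boolean }
  | { action: "variant-b"; itemId: string; dryRun?: boolean }
  | { action: "variant-c"; itemId: string; dryRun?: boolean };

type ItemActionExecutionOptions = { dryRun?: boolean };
type ItemActionResult = { itemId: string; ok: true; item: Item };

const sideEffectLog: string[] = [];
async function sideEffectA(item: Item): Promise<void> { sideEffectLog.push(`a:${item.id}`); }
async function sideEffectB(item: Item): Promise<void> { sideEffectLog.push(`b:${item.id}`); }
async function sideEffectDefault(item: Item): Promise<void> { sideEffectLog.push(`default:${item.id}`); }

async function executeItemActionForItem(
  action: ItemActionRequest["action"],
  item: Item,
  options?: ItemActionExecutionOptions,
): Promise<ItemActionResult> {
  const itemId = item.id;
  if (action === "variant-a") {
    // Gate location: guard in the dispatch function, before the first side effect.
    // Response shape: reuse the already-fetched record, no second read.
    if (options?.dryRun) return { itemId, ok: true, item };
    await sideEffectA(item);
  } else if (action === "variant-b") {
    if (options?.dryRun) return { itemId, ok: true, item };
    await sideEffectB(item);
  } else {
    if (options?.dryRun) return { itemId, ok: true, item };
    await sideEffectDefault(item);
  }
  return { itemId, ok: true, item };
}

// Forward dryRun only when the caller actually supplied it.
async function handleItemAction(request: ItemActionRequest, fetched: Item): Promise<ItemActionResult> {
  const { dryRun } = request;
  return executeItemActionForItem(request.action, fetched, {
    ...(dryRun !== undefined ? { dryRun } : {}),
  });
}
```

Calling `handleItemAction` with `dryRun: true` returns the fetched record and leaves `sideEffectLog` empty; omitting the flag runs the branch's side effect as before.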
The reason a small diff is the interesting case, not the trivial one, is that small structurally-repetitive changes are where an LLM coding agent wastes the most money relative to the value it adds. Three near-identical guard insertions across three branches, each of which must land before the first side-effecting call in that branch and nowhere else, is precisely the kind of edit an LLM gets subtly wrong (one guard placed after the side-effecting call, one variant of the schema missed) and bills a full authoring cycle for. Producing this diff deterministically, correct across every branch and every variant, verified green, with the operator’s intent structurally encoded, is the work.
Refinement one: the two open items closed
The original post named two open quality items.
The first concerned the body short-circuit op that places dryRun guards before each side-effecting call in the dispatch function. The op’s algorithm was structurally correct (each branch independently guarded against its first side-effecting call), but it produced a redundant guard inside an inner branch whose entry point was already dominated at runtime by an earlier branch’s guard. The post called this “filed but not done.”
That work landed. A dominator-pruning pass identifies guards made unreachable by an earlier guard in the same function and drops them. The pruner now emits a structured stderr line naming the dropped insert so the effect is observable in production logs, not only by counting guards in the post-edit diff. In the new run the line fired exactly once, on the same dispatch function and the same dominator pattern the original post described.
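The pruning idea can be sketched on a simplified model. This assumes structured control flow (no loops or jumps re-entering a block) and addresses each guard by its enclosing block and statement index; the types, the `parents` map, and the log line format are all hypothetical and not the pipeline's actual representation.

```typescript
// A position is a statement index inside a named block.
type Position = { block: string; index: number };
type GuardInsert = { id: string; at: Position };
// Maps a nested block to the position of the statement that contains it.
type BlockParents = Record<string, Position | undefined>;

// `earlier` dominates `later` if, walking outward from `later`, we reach the
// block holding `earlier` at a statement index after it. The guard returns
// whenever dryRun is set, so a dominated guard can never fire.
function dominates(earlier: GuardInsert, later: GuardInsert, parents: BlockParents): boolean {
  let cursor: Position | undefined = later.at;
  while (cursor) {
    if (cursor.block === earlier.at.block && earlier.at.index < cursor.index) return true;
    cursor = parents[cursor.block];
  }
  return false;
}

function pruneDominatedGuards(
  guards: GuardInsert[],
  parents: BlockParents,
): { kept: GuardInsert[]; dropped: GuardInsert[] } {
  const kept: GuardInsert[] = [];
  const dropped: GuardInsert[] = [];
  for (const guard of guards) {
    let dominator: GuardInsert | undefined;
    for (const other of guards) {
      if (other !== guard && dominates(other, guard, parents)) { dominator = other; break; }
    }
    if (dominator) {
      dropped.push(guard);
      // console.error writes to stderr in Node; the JSON shape is illustrative.
      console.error(JSON.stringify({ event: "guard-pruned", dropped: guard.id, dominatedBy: dominator.id }));
    } else {
      kept.push(guard);
    }
  }
  return { kept, dropped };
}

// Two sibling branches, plus an inner block nested at statement 2 of branch a.
const exampleParents: BlockParents = {
  "branch-a": { block: "body", index: 0 },
  "branch-b": { block: "body", index: 0 },
  "branch-a/inner": { block: "branch-a", index: 2 },
};
const exampleGuards: GuardInsert[] = [
  { id: "guard-a", at: { block: "branch-a", index: 0 } },
  { id: "guard-a-inner", at: { block: "branch-a/inner", index: 0 } },
  { id: "guard-b", at: { block: "branch-b", index: 0 } },
];
const pruned = pruneDominatedGuards(exampleGuards, exampleParents);
```

In this example only `guard-a-inner` is dropped: it sits inside a block that `guard-a` has already short-circuited, while `guard-b` lives in a sibling branch and stays.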
The second concerned the deterministic materializer producing duplicate scenarios. The materialised test file in the original run contained six test cases, two of which were functional duplicates of two others (same request body, same assertions, different prose labels). The post called this “filed but not done.”
That work also landed. The dedup pass normalises test cases against their structural content rather than their prose labels and drops cases that are duplicates or strict subsets of others. In the new run the dedup ran. It emitted no drops, not because the dedup was a no-op, but because the prose-authoring LLM emitted three distinct cases this time and a coverage-expansion pass expanded them to five cases covering distinct branches. The dedup correctly identified all five as already distinct. The reproducer pattern from the original post (six cases, two pairs duplicate) would now be collapsed to four cases automatically.
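A rough sketch of structural dedup, assuming a hypothetical test-case shape (`label`, `requestBody`, `assertions`); the real materializer's types are not shown in this post. Cases are compared on request body and assertion set, never on their prose label.

```typescript
type TestCase = { label: string; requestBody: unknown; assertions: string[] };

// Serialise with sorted object keys so key order does not affect equality.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const keys = Object.keys(record).sort();
    return `{${keys.map((k) => `${JSON.stringify(k)}:${canonical(record[k])}`).join(",")}}`;
  }
  return JSON.stringify(value);
}

function unique(values: string[]): string[] {
  return values.filter((v, i) => values.indexOf(v) === i);
}

// Drop a case when another case with the same request body either asserts a
// strict superset of its assertions, or asserts the same set and came first.
function dedupeTestCases(cases: TestCase[]): TestCase[] {
  const keyed = cases.map((c) => ({ testCase: c, body: canonical(c.requestBody), asserts: unique(c.assertions) }));
  return keyed
    .filter((current, i) =>
      !keyed.some((other, j) => {
        if (i === j || other.body !== current.body) return false;
        const covered = current.asserts.every((a) => other.asserts.indexOf(a) !== -1);
        if (!covered) return false;
        const sameSet = other.asserts.length === current.asserts.length;
        return sameSet ? j < i : true;
      }),
    )
    .map((k) => k.testCase);
}

// The original post's reproducer shape: six cases, two pairs duplicate.
const sixCases: TestCase[] = [
  { label: "dry run on A", requestBody: { action: "variant-a", dryRun: true }, assertions: ["ok", "no side effect"] },
  { label: "A with dryRun set", requestBody: { dryRun: true, action: "variant-a" }, assertions: ["no side effect", "ok"] },
  { label: "dry run on B", requestBody: { action: "variant-b", dryRun: true }, assertions: ["ok", "no side effect"] },
  { label: "B with dryRun set", requestBody: { action: "variant-b", dryRun: true }, assertions: ["ok", "no side effect"] },
  { label: "dry run on C", requestBody: { action: "variant-c", dryRun: true }, assertions: ["ok", "no side effect"] },
  { label: "A without dryRun", requestBody: { action: "variant-a" }, assertions: ["ok", "side effect ran"] },
];
const dedupedCases = dedupeTestCases(sixCases);
```

On this input the two relabelled duplicates are dropped and four cases remain, which is the collapse described above.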
Refinement two: the operator’s design decisions are now structured data
The original post said the grounding chat “turned operator design decisions into the structured handoff the downstream stages read as ground truth.” That was partially true. The chat captured a free-form question-and-answer audit trail (question text, operator response, resolved option), but the typed fields on the structured handoff that downstream stages read were narrower. Some fields were already typed; the design decision fields were not. They were either implicit in the prose, inferred downstream from prose, or left to the LLM at the prose-authoring stage.
This was a real failure surface. A reproducer surfaced two days after the original post went out. The operator picked “add the flag to all three variants of the discriminated union,” but the chat’s prose-authoring LLM produced text saying “scoped to only one variant,” anchoring to the LLM’s own pre-override recommendation rather than the operator’s resolved answer. The chat’s downstream consumers read the prose. Without typed fields capturing the resolved variant scope, the pipeline would have authored a ticket for the wrong scope.
The grounding chat now writes four typed fields onto the structured handoff: the variant scope (which variants of the discriminated union accept the new field), the placement (which sub-object property to nest the new field under, or empty for top-level), the verbatim resolved option for each design decision, and the full candidate set the chat showed for each decision (persisted for audit).
Two new deterministic steps surface design decisions the LLM-driven design-review step might miss: one that activates on schema-extension ops when variant scope has not yet been captured, one that activates when the target schema has an inconsistent nesting pattern across variants and the placement choice is not yet captured. Both derive their options from the schema abstract syntax tree (AST), capture the answer in the typed handoff field, and feed the downstream populator deterministically.
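As a sketch of how the placement-review step could derive its options without an LLM, assume the schema AST has already been reduced to a list of variants and the object-typed properties each one declares. The `SchemaVariant`, `TypedHandoff`, and `DesignQuestion` shapes and the option wording below are hypothetical.

```typescript
type SchemaVariant = { discriminator: string; objectProperties: string[] };
type TypedHandoff = { variantScope?: string[]; placement?: string };
type DesignQuestion = { topic: "placement"; options: string[] };

// Object properties declared by some variants but not by all of them.
function unevenProperties(variants: SchemaVariant[]): string[] {
  const seen: string[] = [];
  for (const variant of variants) {
    for (const prop of variant.objectProperties) {
      if (seen.indexOf(prop) === -1) seen.push(prop);
    }
  }
  return seen.filter((prop) => !variants.every((v) => v.objectProperties.indexOf(prop) !== -1));
}

// Activates only when nesting is inconsistent AND placement is not yet captured.
function placementReview(variants: SchemaVariant[], handoff: TypedHandoff): DesignQuestion | null {
  if (handoff.placement !== undefined) return null;
  const uneven = unevenProperties(variants);
  if (uneven.length === 0) return null;
  return { topic: "placement", options: ["top level"].concat(uneven.map((p) => `nest under ${p}`)) };
}

const unevenSchema: SchemaVariant[] = [
  { discriminator: "variant-a", objectProperties: ["options"] },
  { discriminator: "variant-b", objectProperties: [] },
  { discriminator: "variant-c", objectProperties: [] },
];
const placementQuestion = placementReview(unevenSchema, {});
```

Because the options come from the schema rather than from model output, the question cannot offer a placement the code does not have.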
A feasibility check at ticket-generation time verifies the operator’s placement choice is compatible with the variants in scope. If the operator picks “nest under the existing options object” with scope “all three variants” but only one variant declares that object, the check rejects the ticket at emit time with a structured error naming the offending variants. The original post had no such check. An incompatible combination would have surfaced as a failure downstream, with no structured recovery.
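The check described above reduces to a small pure function. A minimal sketch, assuming hypothetical shapes for the schema variants and the placement decision; the error code `PLACEMENT_INFEASIBLE` is invented for illustration.

```typescript
type SchemaVariant = { discriminator: string; objectProperties: string[] };
// nestUnder === "" means top-level placement.
type PlacementDecision = { variantScope: string[]; nestUnder: string };

type FeasibilityResult =
  | { ok: true }
  | { ok: false; code: "PLACEMENT_INFEASIBLE"; nestUnder: string; offendingVariants: string[] };

function checkPlacementFeasibility(variants: SchemaVariant[], decision: PlacementDecision): FeasibilityResult {
  // A top-level field fits any variant, so there is nothing to verify.
  if (decision.nestUnder === "") return { ok: true };
  // A scoped variant offends if it is unknown or does not declare the target object.
  const offendingVariants = decision.variantScope.filter((scoped) => {
    const matches = variants.filter((v) => v.discriminator === scoped);
    return matches.length === 0 || matches[0].objectProperties.indexOf(decision.nestUnder) === -1;
  });
  if (offendingVariants.length === 0) return { ok: true };
  return { ok: false, code: "PLACEMENT_INFEASIBLE", nestUnder: decision.nestUnder, offendingVariants };
}

const schemaVariants: SchemaVariant[] = [
  { discriminator: "variant-a", objectProperties: ["options"] },
  { discriminator: "variant-b", objectProperties: [] },
  { discriminator: "variant-c", objectProperties: [] },
];
const infeasible = checkPlacementFeasibility(schemaVariants, {
  variantScope: ["variant-a", "variant-b", "variant-c"],
  nestUnder: "options",
});
```

For the combination in the example (nest under `options`, scope all three variants, only `variant-a` declares `options`), the result names `variant-b` and `variant-c` as offending, so the ticket can be rejected before any op runs.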
The chat now asks four design questions instead of the original session’s three. The fourth is the deterministic placement review.
Refinement three: the chat-authoring drift surface is structurally closed
The original post described the chat as producing the operator’s structured handoff. It did, but the prose half of that handoff (the human-facing refined ticket) had an LLM hallucination surface the post did not yet know about. The reproducer surfaced in two consecutive runs where the operator overrode the LLM’s recommendation. The LLM authored prose contradicting the operator’s resolved answer, anchoring to its own pre-override pick. The pipeline’s downstream consumers read the structured handoff, not the prose, so the pipeline behaviour was unaffected. The prose drift was a correctness issue for human readers and a load-bearing risk for any future stage that reads the prose for diagnosis.
The fix is four structural layers, each closing the drift surface at a different cost level.
The primary layer is prompt redaction. The chat-authoring prompt now shows the LLM only the questions resolved in design review, not the resolutions. The LLM cannot drift on content it cannot see. This cuts the semantic-drift surface at its source.
The second layer is a deterministic post-LLM rewrite. After the LLM authors prose, a deterministic step scans for references to options the operator did not pick and removes them. The matching operates against the stored candidate list rather than natural language heuristics.
The third layer is a hard validator backstop. It re-runs the structural drift check on the rewritten prose. If anything survives the rewrite (a rewrite edge case), the chat raises a structured exit naming the violation. Under normal operation this never fires; when it does, it points at a structural bug.
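Layers two and three can be sketched together, since the validator re-runs the same match the rewrite uses. This is a simplification under assumptions: the `DecisionRecord` shape is hypothetical, matching is a case-insensitive substring test against the stored candidate strings, and the rewrite drops whole sentences. It would misfire if one candidate's text were a substring of another's.

```typescript
type DecisionRecord = { topic: string; resolvedOption: string; candidates: string[] };

// Every candidate the operator was shown but did not pick.
function rejectedOptions(decisions: DecisionRecord[]): string[] {
  const rejected: string[] = [];
  for (const decision of decisions) {
    for (const candidate of decision.candidates) {
      if (candidate !== decision.resolvedOption) rejected.push(candidate.toLowerCase());
    }
  }
  return rejected;
}

// Structural drift check: which rejected options does the prose still mention?
function findDrift(prose: string, decisions: DecisionRecord[]): string[] {
  const lower = prose.toLowerCase();
  return rejectedOptions(decisions).filter((option) => lower.indexOf(option) !== -1);
}

// Layer two: deterministic rewrite that removes sentences naming a rejected option.
function rewriteProse(prose: string, decisions: DecisionRecord[]): string {
  const rejected = rejectedOptions(decisions);
  const sentences = prose.match(/[^.!?]+[.!?]*\s*/g) || [];
  return sentences
    .filter((sentence) => !rejected.some((option) => sentence.toLowerCase().indexOf(option) !== -1))
    .join("")
    .trim();
}

// Layer three: hard backstop. Anything that survives the rewrite is a bug.
function validateProse(prose: string, decisions: DecisionRecord[]): void {
  const surviving = findDrift(prose, decisions);
  if (surviving.length > 0) {
    throw new Error(JSON.stringify({ code: "PROSE_DRIFT", survivingOptions: surviving }));
  }
}

const scopeDecision: DecisionRecord[] = [
  { topic: "variant scope", resolvedOption: "all three variants", candidates: ["all three variants", "only one variant"] },
];
const driftedProse = "The flag is added to all three variants. It is scoped to only one variant. Tests follow.";
const cleanedProse = rewriteProse(driftedProse, scopeDecision);
validateProse(cleanedProse, scopeDecision); // passes silently after the rewrite
```

Here the middle sentence is removed because it names the unpicked option, and the validator accepts the cleaned text. Running `validateProse` on the unrewritten prose throws with the surviving option in the payload.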
The fourth layer is a deterministic appendix. The chat appends a verbatim “Resolved Design Decisions” block to the human-facing refined ticket, rendered from the typed decision records. The LLM’s narrative cannot contradict resolutions it cannot see; the appendix is the structural source of truth that humans cross-check against.
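Layers one and four are two renderings of the same typed records: one that withholds the resolutions from the prompt, one that prints them verbatim for the reader. A minimal sketch; the `DecisionRecord` fields and the exact wording of both blocks are assumptions made for illustration.

```typescript
type DecisionRecord = { topic: string; question: string; resolvedOption: string; candidates: string[] };

// Layer one: the prompt names which questions are resolved, never the answers.
function renderRedactedPromptBlock(decisions: DecisionRecord[]): string {
  const lines = decisions.map(
    (d) => `- ${d.topic}: ${d.question} (resolved; defer to the Resolved Design Decisions section)`,
  );
  return ["Questions resolved in design review (resolutions withheld):"].concat(lines).join("\n");
}

// Layer four: appendix rendered straight from the typed records, no LLM involved.
function renderAppendix(decisions: DecisionRecord[]): string {
  const lines = decisions.map((d) => `- ${d.topic}: ${d.resolvedOption}`);
  return ["Resolved Design Decisions"].concat(lines).join("\n");
}

const exampleDecisions: DecisionRecord[] = [
  {
    topic: "variant scope",
    question: "Which variants accept the new field?",
    resolvedOption: "all three variants",
    candidates: ["all three variants", "only one variant"],
  },
  {
    topic: "placement",
    question: "Where does the new field sit?",
    resolvedOption: "top level",
    candidates: ["top level", "nest under options"],
  },
];
const promptBlock = renderRedactedPromptBlock(exampleDecisions);
const appendix = renderAppendix(exampleDecisions);
```

The prompt block contains neither the resolved option nor the rejected candidates, so the narrative has nothing to anchor on; the appendix carries the resolved option text unchanged.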
In the new run, the LLM-authored narrative used the deferred form for every resolved topic. Each sub-ticket’s description text says things like “the variant scope is resolved in the Resolved Design Decisions section below” or “the placement of the field relative to existing properties is resolved in the Resolved Design Decisions section below.” The four-decision appendix at the bottom carries the resolutions verbatim. The drift surface is gone.
What stayed LLM
The same three stages as the original post: the grounding chat (which now includes design review and prose authoring), the Planner (test plan and sub-ticket decomposition), and the two Reviewers (Test Reviewer over the test plan, Impl Reviewer over the final diff). Both Reviewers approved on the first cycle, same as the original.
The chat got more elaborate, and its cost rose from $0.194 to $0.320. The additional cost came from three sources. The redaction block lengthens the prose-authoring prompt. The new placement-review deterministic step adds no LLM cost itself, but is preceded by an LLM-driven design review that emits one extra question on average. The chat now loads the same shared configuration at the design-review and prose-authoring stages that the Planner, Coder, Debugger, and Reviewer stages already use.
The ticket-generation cost moved the other way, dropping 11% from $0.261 to $0.232. The ticket-generation stage detected the structured handoff’s typed fields and skipped an upstream inference pass it would otherwise run. A cleaner handoff means fewer downstream LLM calls.
The honest trade-off
The combined cost went from $0.503 (the original session’s chat plus the second session’s ticket-generation plus loop) to $0.633 (this session’s chat plus ticket-generation plus loop). That is plus 26%, or plus 13 cents per ticket. The increase is not a regression. It is a deliberate trade.
What it buys: Four design questions get captured structurally from a guided conversation. The drift surface in the only remaining LLM-authoring stage is closed. The pipeline’s failure modes shrank. The reproducer that ran two days after the original post (incompatible variant-scope and placement combination, op apply-time failure, Coder fall-through) is structurally impossible now. The operator’s placement choice is captured upfront, the feasibility check verifies compatibility pre-emit, and the schema-extension op only sees combinations the operator and the AST both agree on.
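What it costs: 13 cents per ticket net. Chat rose 12.6 cents and the loop rose 3.3 cents, partly offset by a 2.9-cent drop in ticket-generation. The chat increase is mostly cache writes on first-time prompt shapes, which will shrink on subsequent similar runs.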
The trajectory is the same as the original post described. Each structural fix removes a class of downstream LLM fallback rather than tuning a prompt rule. The fixes compound, because once a behavioural shape is structuralised, it stays structuralised. This round of fixes happened at the chat layer specifically, which the original post had treated as already structurally tight because the operator was in the loop. The drift reproducer proved that was not quite true. The four-layer defense plus the typed handoff fields plus the two new deterministic chat steps push the chat layer closer to the structural floor the implementation layers already cleared.
What this still doesn’t mean
The original post’s caveats hold. One ticket shape. One project. Tree-Sitter and a language server (LSP). The pipeline is still R&D. The Coder is still not retired. It did not fire in this run. It will fire on tickets whose structural floor is lower than this one.
The cost per ticket of this shape is now bounded by the chat, the Planner, and the verification Reviewers, with the operator’s design decisions flowing through typed data rather than free-form prose.
The pattern from the original post (the structural floor rises, the LLM’s responsibilities shrink, the failure surface contracts) still holds. The floor rose another layer this round.
The pipeline runs against a real ~100k line TypeScript monorepo under Docker with Tree-Sitter, a language server, and Vitest. Still R&D.