Raising the Floor under the Coder: Generalizing an AI Coding Agent across Ticket Types

An earlier post on this pipeline made a deliberately narrow claim. The pipeline had reached the point where it could ship a production feature with zero LLM-generated code, but only for one ticket shape, and that result had not generalized. “The trajectory has not generalised across ticket shapes,” it said: the same threading-flag ticket was overrepresented in the testing, and a ticket that needed a new algorithm, a different schema, or a different framework would still fire the Coder. That post was honest about being one data point.

This is the follow-up, and it supersedes that limit. Over the following weeks the code-generation pipeline went from that single ticket shape to six distinct ticket shapes that build end to end, and the zero-Coder result held for four of them. The two shapes that introduce genuinely new code still call the Coder, which is exactly where the earlier post predicted the boundary would sit. Between June 9 and June 16, 2026, five new shapes each reached their first full completion: a flag that cascades across two call hops, a new flag that collides with an existing one, a removal, two independent flags on one chain, and a brand-new endpoint built from scratch. The count of six is not the result worth noting. What matters is why the second through sixth came fast: the expensive work was front-loaded into a single idea, which is to move more of each shape below the line where the Coder is needed.

A companion post tracked one ticket across four architectural eras, which is the depth axis: one shape made progressively more rigorous until it was structurally guaranteed. This post is the orthogonal axis: same discipline, applied to widening the range of shapes rather than deepening one.

A five-rung ladder of ticket shapes from additive flags at the bottom to rename/refactor at the top, each labelled with a first-completion date in June 2026 and whether it built with zero Coder cycles or one, with rename/refactor greyed out as deferred. A side caption reads "the breadth axis; the depth axis is one shape across four eras."

Why one ticket shape does not generalize to the next

Each ticket shape is a different decomposition, not a different parameter. Additive flag threading walks a call chain forward and adds an option at each hop, then proves the option reached the boundary and changed what the function does. A removal walks the same chain backward and takes the option out, and its test has to prove a call no longer carries the removed key, which is a different kind of assertion entirely. Multi-flag is N of those threads on one chain at once, with the extra burden of proving the flags do not interfere. A greenfield endpoint has no chain to walk, because the code it threads through does not exist yet.

A pipeline tuned to one of these can be completely unable to build the next, and the tempting fix is to special-case each shape at every stage it touches: a branch in the Planner, a branch in the test generator, a branch in the operation handlers. That is the move that turns a pipeline into a pile of conditionals. The cost is not the first special case. It is that every stage now carries a per-shape branch, and the next shape multiplies through all of them. For anyone who has maintained a system where the special cases outnumber the rules, that is the failure mode to design against, because it is the one that looks like progress while it is happening.

What “raising the floor under the Coder” means

There is a line in this pipeline below which work is deterministic. Below the line, changes are made by structured operations: a typed instruction that names a function, a field, and a call, and drives a machine edit with no model authoring anything. Above the line, the Coder writes code. Call that line the floor.

Generalizing across shapes was not a matter of teaching the Coder more shapes. It was moving more of each shape below the floor, so the Coder was needed less or not at all. Four moves did most of that work, and each one is a generalization rather than a feature.

The first move was making the ticket’s intent carry a direction, add or remove, as a fact separate from the structure of the change. Once removal is a direction rather than its own ticket type, the Planner walks the chain backward and emits inverse operations, and every downstream stage treats the result like any other ticket. Removal stopped being a second pipeline and became the subtractive dual of threading.
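As a minimal sketch of direction as a fact separate from structure, the planner below serves both directions from one code path. The names `ChainHop`, `ThreadOp`, and `planThread` are hypothetical and chosen for illustration; they are not the pipeline's actual schema.

```typescript
// Hypothetical types for illustration only.
type ThreadDirection = "add" | "remove";

interface ChainHop {
  fn: string;     // function at this hop of the call chain
  callee: string; // call the option is forwarded to
}

interface ThreadOp {
  kind: "addOption" | "removeOption";
  fn: string;
  callee: string;
  field: string;
}

// Plan one field across a chain. "remove" walks the same hops backward and
// emits the inverse operation; nothing else about the plan changes.
function planThread(
  chain: ChainHop[],
  field: string,
  direction: ThreadDirection,
): ThreadOp[] {
  const hops = direction === "add" ? chain : chain.slice().reverse();
  const kind = direction === "add" ? "addOption" : "removeOption";
  return hops.map((hop) => ({ kind, fn: hop.fn, callee: hop.callee, field }));
}
```

Because the output is the same `ThreadOp` list in both cases, downstream stages in this sketch need no removal-specific branch.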

The second move was making the unit of work a list. A ticket used to carry one target field; now it carries a list of them, and each field runs through the same single-field machinery once. Multi-flag is not new code. It is the one-flag path run N times, with a merge step that combines the sibling edits by resolved symbol identity rather than by name. Merging by name is the bug; two fields that happen to share a name are not the same field. This is the Data Path Principle applied to decomposition: the registry holds symbol identity, so the merge keys on identity, not on a string that can collide.
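The difference between merging by name and merging by identity can be sketched as follows. The `FieldEdit` shape and the `symbolId` strings are assumptions made for the example; the post does not specify how the registry represents identity.

```typescript
// Hypothetical edit record. "symbolId" stands in for whatever resolved
// identity the registry holds; "name" is the surface name, which can collide.
interface FieldEdit {
  symbolId: string;
  name: string;
  callees: string[]; // calls this edit touches
}

// Merge sibling edits keyed on resolved identity, never on name.
function mergeEdits(edits: FieldEdit[]): FieldEdit[] {
  const bySymbol = new Map<string, FieldEdit>();
  for (const edit of edits) {
    const existing = bySymbol.get(edit.symbolId);
    if (existing === undefined) {
      // Copy so the caller's input is not mutated by later merges.
      bySymbol.set(edit.symbolId, { ...edit, callees: edit.callees.slice() });
    } else {
      for (const callee of edit.callees) {
        if (existing.callees.indexOf(callee) === -1) existing.callees.push(callee);
      }
    }
  }
  return Array.from(bySymbol.values());
}
```

Two edits that both target a field named `force` on different option types stay separate here, whereas a map keyed on `name` would fuse them.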

The third move was extending the registry as the shapes demanded new facts. Deterministic resolution only works when the machine fact it needs is already stored, and new shapes needed facts the registry did not yet have: the guard conditions already present on a call edge, so a new flag colliding with an existing one is detected rather than guessed; the field types behind a named return type, so the success branch is resolved rather than assumed. The registry is rebuilt on every sync, so extending it is cheap, and each new fact kept one more shape below the floor instead of pushing the decision up to the Coder.
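To illustrate why storing guard conditions on a call edge turns collision detection into a lookup, here is a hedged sketch. `EdgeGuard`, `CallEdge`, and `detectCollision` are hypothetical names, and the two-valued `polarity` is a simplification assumed for the example.

```typescript
// Hypothetical registry fact: the guards already present on a call edge.
interface EdgeGuard {
  flag: string; // option that gates the call
  polarity: "skipWhenTrue" | "runWhenTrue";
}

interface CallEdge {
  caller: string;
  callee: string;
  guards: EdgeGuard[];
}

type GuardCollision =
  | { kind: "none" }
  | { kind: "opposing"; existing: EdgeGuard };

// A new flag collides when a different flag already gates the same call
// in the opposite direction. With the fact stored, this is a pure lookup.
function detectCollision(edge: CallEdge, incoming: EdgeGuard): GuardCollision {
  for (const guard of edge.guards) {
    if (guard.flag !== incoming.flag && guard.polarity !== incoming.polarity) {
      return { kind: "opposing", existing: guard };
    }
  }
  return { kind: "none" };
}
```

Without the `guards` fact on the edge, the same decision would have to be inferred from source text, which is the kind of guess the post describes pushing up to the Coder.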

The fourth move was typing the operations themselves. An operation that threads a field and one that creates a file are different types, and the Planner dispatches on the type. Greenfield create added a new operation class without disturbing the others. The create path emits a handler skeleton and registers the route deterministically, and only the handler body crosses the floor to the Coder.
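One way to picture typed operations is a discriminated union with an exhaustive dispatch. The sketch below is an assumption about shape, not the pipeline's real types: `PipelineOp`, `PlanStep`, and `planSteps` are invented names, and only two operation kinds are shown.

```typescript
// Hypothetical operation types, discriminated on "type".
type PipelineOp =
  | { type: "threadField"; fn: string; field: string; call: string }
  | { type: "createFile"; path: string; route: string };

interface PlanStep {
  action: string;
  author: "machine" | "coder"; // who performs the step
}

// Dispatch on the operation type. Adding a new variant to PipelineOp makes
// the "never" assignment fail to compile until a case is written for it.
function planSteps(op: PipelineOp): PlanStep[] {
  switch (op.type) {
    case "threadField":
      return [
        { action: `add ${op.field} to ${op.call} in ${op.fn}`, author: "machine" },
      ];
    case "createFile":
      return [
        { action: `emit handler skeleton at ${op.path}`, author: "machine" },
        { action: `register route ${op.route}`, author: "machine" },
        { action: `write handler body in ${op.path}`, author: "coder" },
      ];
    default: {
      const unhandled: never = op;
      throw new Error(`unhandled operation: ${JSON.stringify(unhandled)}`);
    }
  }
}
```

In this sketch the create path contributes exactly one step authored by the Coder, and the existing threading case is untouched by its addition.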

A diagram with a horizontal deterministic-floor line. Four ticket shapes (additive threading, colliding-flag, removal, multi-flag) sit entirely below it in the no-LLM region; cascade and greenfield each have one block above the line, labelled "short-circuit return body" and "handler body", representing the new code the Coder authors.

The ladder: a curriculum, not a backlog

The shapes were not attempted in arbitrary order. They were climbed as a ladder, each rung chosen so it reused the rung below and exposed exactly one new gap. Rung 1 is additive flags, in three increasingly awkward variants: a flag that skips one call, a flag that cascades to skip a second call in a parent function, and a new flag that lands on an options object already carrying a flag that gates the same call in the opposite direction. Rung 2 is removal. Rung 3 is multi-flag. Rung 4 is greenfield create. Rung 5, rename and refactor, is not built.

Each rung was attempted only after the one below was proven, and each was filed with a close condition before any code was written, in the same gap-catalogue discipline that governs the rest of this project. A rung is done when a real ticket of that shape reaches full completion, with the generated test failing before the change exists and passing once it does (genuine confirm-RED) and both Reviewers approving, not when the code looks plausible. A ladder built this way is a running bet about which risks are safe to defer. Rung 5 and a few harder shapes are deferred on purpose, with written triggers, rather than half-built and carried as silent debt. That is a deliberate choice about where to spend effort, and it is the part of this work that maps most directly onto running a team rather than writing code.
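The confirm-RED condition can be stated in a few lines. This is a sketch under the assumption that a generated test can be run against two versions of the code; `confirmRed` and its result shape are hypothetical names.

```typescript
// A test is modelled as a function that returns true when it passes.
type GeneratedTest<T> = (impl: T) => boolean;

interface ConfirmRedResult {
  redBefore: boolean;  // the test fails before the change exists
  greenAfter: boolean; // the test passes once the change is applied
  genuine: boolean;    // both hold, so the test observes the change
}

function confirmRed<T>(
  test: GeneratedTest<T>,
  before: T,
  after: T,
): ConfirmRedResult {
  const redBefore = !test(before);
  const greenAfter = test(after);
  return { redBefore, greenAfter, genuine: redBefore && greenAfter };
}
```

A test that passes against both versions is reported as not genuine, which is the case the close condition is written to exclude.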

What each new shape broke

The value of a rung was rarely the green run. It was the gap the green run forced into the open.

Removal exposed that the entire oracle vocabulary was built to prove a call happens. A behavioral oracle for an additive change asserts that the new behavior fires. A removal has to assert the opposite: that a call no longer carries the removed option, and that the option is gone from the body. A “do nothing” implementation passes a presence test trivially, so without a subtractive class of oracle, a removal that deletes nothing is a green run. The pipeline grew that class, asserting the call is made without the key and the key is absent. The first removal reached full completion in mid-June: eight sub-tickets, every one committed by a structured operation with zero Coder cycles, confirm-RED genuine on all eight assertions, full target suite at 961 passing.

A two-pane code comparison. The top pane adds a force key to a call and asserts the call was made with it; the bottom pane removes the key and asserts the call was made without it, showing removal as the inverse of threading with the assertion flipped.
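The two assertions can be sketched with a hand-rolled spy. The helper names and the `retries` option are hypothetical; a real suite would presumably use its test framework's mocks rather than this minimal recorder.

```typescript
type CallOptions = Record<string, unknown>;
interface RecordedCall {
  callee: string;
  options: CallOptions;
}

// Minimal spy: records each call's options into a shared log.
function makeSpy(callee: string, log: RecordedCall[]) {
  return (options: CallOptions): void => {
    log.push({ callee, options: { ...options } });
  };
}

// Additive oracle: some call to the callee carries the key.
function calledWithKey(log: RecordedCall[], callee: string, key: string): boolean {
  return log.some((c) => c.callee === callee && key in c.options);
}

// Subtractive oracle: the call still happens, and no call carries the key.
function calledWithoutKey(log: RecordedCall[], callee: string, key: string): boolean {
  const calls = log.filter((c) => c.callee === callee);
  return calls.length > 0 && calls.every((c) => !(key in c.options));
}

// Code under test, before and after the removal (illustrative only).
const beforeRemoval = (sync: (o: CallOptions) => void): void =>
  sync({ force: true, retries: 3 });
const afterRemoval = (sync: (o: CallOptions) => void): void =>
  sync({ retries: 3 });
```

A removal that deletes nothing behaves like `beforeRemoval`, so it fails `calledWithoutKey` even though it would satisfy any presence-style check.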

Multi-flag exposed name-versus-symbol. Two independent flags on one chain produce sibling edits that have to be merged, and merging by field name silently fused edits that only shared a name. Merging by resolved symbol identity fixed it, and the shape needed an assertion no single-flag shape did: that flag A does not change flag B’s call, an independence cross-assertion generated alongside the per-flag tests. The first multi-flag run carried two flags through one chain to two different calls, with ticket-test at 37 passing, the behavioral oracle at 6 passing including the cross-assertion, the full suite at 970, and zero Coder cycles.
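An independence cross-assertion can be sketched as a check that toggling one flag never changes whether the other flag's call is made. Everything named here (`FlagSet`, `flagDoesNotAffect`, the `skipFetch` and `skipNotify` flags) is hypothetical.

```typescript
type FlagSet = Record<string, boolean>;
// A runner executes the chain and returns the names of the calls it made.
type ChainRunner = (flags: FlagSet) => string[];

// Cross-assertion: for every value of the other flag, toggling `toggled`
// leaves the presence of `otherCall` unchanged.
function flagDoesNotAffect(
  runChain: ChainRunner,
  toggled: string,
  other: string,
  otherCall: string,
): boolean {
  for (const otherValue of [false, true]) {
    const off = runChain({ [other]: otherValue, [toggled]: false });
    const on = runChain({ [other]: otherValue, [toggled]: true });
    const calledOff = off.indexOf(otherCall) !== -1;
    const calledOn = on.indexOf(otherCall) !== -1;
    if (calledOff !== calledOn) return false;
  }
  return true;
}

// Illustrative chain where the two flags gate two different calls.
const independentChain: ChainRunner = (flags) => {
  const calls: string[] = [];
  if (!flags["skipFetch"]) calls.push("fetch");
  if (!flags["skipNotify"]) calls.push("notify");
  return calls;
};
```

For `independentChain`, the check holds in both directions; a chain where one guard accidentally wraps both calls would fail it even if each per-flag test passed.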

Greenfield exposed where the floor stops. There is no chain to thread and no existing code to anchor a deterministic edit against, so the handler body is genuinely new logic, and new logic is exactly the work the floor cannot do. Greenfield is the first shape where the Coder authors a body. Everything around the body still runs deterministically: the pipeline emits the handler skeleton, registers the route on the central router, derives the response oracle from the schema’s own column defaults so the test seeds one record into every status, and runs confirm-RED before the body exists. The greenfield endpoint, a route returning a count of items in each status, builds to full completion with the handler authored by the Coder, ticket-test at 3 passing, and the full suite at 956.
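A rough sketch of an oracle derived from the schema rather than from the handler follows. It assumes the set of status values can be read off the schema, which the post implies but does not detail; the status names, `StatusItem`, and all three functions are hypothetical.

```typescript
interface StatusItem {
  id: number;
  status: string;
}

// Seed exactly one record per status known to the schema.
function seedOnePerStatus(statuses: readonly string[]): StatusItem[] {
  return statuses.map((status, i) => ({ id: i + 1, status }));
}

// The expected response follows from the seed alone: every status counts 1.
function expectedCounts(statuses: readonly string[]): Record<string, number> {
  const expected: Record<string, number> = {};
  for (const status of statuses) expected[status] = 1;
  return expected;
}

// Stand-in for the Coder-authored handler body.
function countByStatus(items: StatusItem[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const item of items) {
    counts[item.status] = (counts[item.status] ?? 0) + 1;
  }
  return counts;
}

// Stand-in for the emitted skeleton, before the body exists.
function emptySkeleton(_items: StatusItem[]): Record<string, number> {
  return {};
}
```

Because the expectation is computed without looking at the handler, the skeleton fails it and a correct body passes it, which is what lets confirm-RED run before any body is written.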

What still fires the Coder, and what is not built

Two of the six shapes still call the Coder, and both calls are honest. Greenfield’s handler body is new logic. The cascade’s secondary guard writes a short-circuit return that did not exist before, which is also new logic rather than a structural edit to existing code. Both match the earlier post’s caveat precisely: a ticket that needs a new algorithm fires the Coder. The zero-Coder floor covers the four shapes where every change is a structural edit to code that already exists, which are the plain skip flag, the colliding-flag variant, removal, and multi-flag.

Three shapes are not built. Rename and refactor is Rung 5, deferred. A single ticket that touches multiple independent call chains is unsupported, because the intent model still assumes one chain per ticket; the seam for it exists but is not deployed. File deletion has the operation but no ticket shape to drive it. Each is filed with a concrete close condition rather than attempted speculatively.

And one limit is structural rather than deferred. When a gated call sits in the same file as its caller, no unit oracle can observe it under the test framework’s module model, because mocking the module replaces a binding the local call never reads. The pipeline records an honest skip there and leans on the full suite and the Reviewer, rather than manufacturing a green oracle that proves nothing. That trade is covered in the oracle sufficiency post, and it is the same principle throughout: record what cannot be proven instead of faking the proof.
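The same-file limit can be reproduced without any test framework by modelling a module as a closure that returns its exports. This is an analogy under that assumption, not the framework's actual mechanism; `makeModule`, `gated`, and `caller` are hypothetical names.

```typescript
// Model a module: local function declarations plus an exports object.
function makeModule() {
  function gated(): string {
    return "real";
  }
  // The caller invokes the local binding, not a property of the exports.
  function caller(): string {
    return gated();
  }
  return { gated, caller };
}

const moduleExports = makeModule();

// "Mocking the module" replaces the exported binding only.
moduleExports.gated = () => "mocked";

// The local call inside caller() never reads the replaced export.
const observed = moduleExports.caller();
```

Here `observed` is `"real"` even though `moduleExports.gated()` now returns `"mocked"`, so a unit oracle built on the mock would observe nothing about the gated call.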

The shape of the result

Six ticket shapes, four of them building with zero Coder cycles, were climbed in the span of a week because the costly part was paid once, in the four moves that raised the floor. The bet going forward is that the next shape needs a probe, a registry extension or two, and a handful of targeted fixes, not a rewrite. So far that has held for every rung, and “so far” is carrying real weight in that sentence, because the only evidence is the rungs already climbed. This is one TypeScript codebase, and still R&D.

What transfers from this is not the list of shapes but where the generalization came from. The pipeline did not get better at handling variety by making the Coder smarter or the prompts longer. It got better by moving work below the floor, one shape at a time, until the Coder was left with only the work that is genuinely new. That is the same instinct a good engineering org runs on: when the same class of problem keeps arriving in slightly different forms, you do not staff up to handle each form; you find the one move that makes the next form cheap.


The pipeline runs inside Docker against a ~100k-line TypeScript monorepo. The first completions described here span June 9–16, 2026. Still R&D.