Testing an LLM-based coding agent on a toy fixture is usually treated as throwaway work. The fixture is small, the tickets are artificial, and the real proof is supposed to come from running the pipeline against a production codebase. In my experience building this pipeline, the fixture phase turned out to be the opposite of throwaway. It was the cheapest, most reliable early warning system the project had.
By the time I pointed my autonomous engineering pipeline at a real codebase for the first time, it had been through a long stretch of development: eighteen tickets validated on a fixture project, more than eighty failure modes catalogued in the project’s architecture-gaps document, and the full stage chain running end to end. The pipeline itself is an AI coding agent that plans, writes tests, writes code, and reviews its own work.
The target was a TypeScript monorepo: ~100,000 lines of code, 150 test files, 953 tests, production route files over 1,000 lines. Nothing about it was controlled. It was not designed to be easy for an agent.
The Planner read the codebase, identified the right files, resolved thirty-nine dependency signatures from the registry, and produced a valid manifest for eight and a half cents.
Then the builders ran, and the infrastructure fell apart in three places.
Three things broke on the first real-project run
1. The Coder could not handle large files.
The target was jobs.ts, 1,385 lines. The pipeline’s reconstruction step is designed to splice the Coder’s output back into the original file at the right location. On retry, instead of replacing the symbol it had already written, it appended a duplicate. The file grew past 1,600 lines. The next Coder attempt timed out at 300 seconds with 21,000 input tokens just for the source file, before it had seen the tests, the manifest constraints, or the Debugger’s notes.
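To make the failure concrete, here is a minimal sketch of the invariant the reconstruction step has to hold. The names and shape are mine, not the pipeline’s actual code; the point is only which branch the retry path took.

```typescript
// Hypothetical splice logic. The invariant: if the symbol already
// exists in the file (including a copy written by a previous attempt),
// replace its span; only append when the symbol is genuinely new.
interface SymbolSpan {
  start: number; // character offset where the symbol's text begins
  end: number;   // character offset just past its text
}

function spliceSymbol(
  source: string,
  span: SymbolSpan | null, // result of looking the symbol up in source
  replacement: string
): string {
  if (span === null) {
    return source + "\n" + replacement; // new symbol: append is correct
  }
  return source.slice(0, span.start) + replacement + source.slice(span.end);
}
```

The jobs.ts failure was the retry taking the append branch: the lookup did not find the copy written on the previous attempt, so the file gained a duplicate and kept growing.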
2. The Test Writer and Coder disagreed on the function’s signature.
The Test Writer called serializeJobsToCsv(rows, "lightweight"), a string parameter selecting a column preset. The Coder implemented serializeJobsToCsv(rows, columns: string[]), an array of explicit column names: same function, different API shape. The Debugger correctly diagnosed the mismatch, but the Coder’s fix still used the wrong shape, because the Test Writer’s output was not in the Coder’s context in the first place.
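Stated as types, the divergence is stark. The row type here is invented for illustration; the two signatures are as described above.

```typescript
// Hypothetical row type standing in for the real one.
interface JobRow { id: string; [column: string]: unknown }

// What the Test Writer's tests called: a string selecting a preset.
type TestWriterShape = (rows: JobRow[], preset: "lightweight") => string;

// What the Coder implemented: explicit column names.
type CoderShape = (rows: JobRow[], columns: string[]) => string;

// Incompatible second parameters: every test call site fails against
// the implementation, and nothing inside the Coder's context reveals
// which shape the tests actually expect.
```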
3. The Coder could write to test files.
A prompt rule said “do not modify test files.” There was no structural gate enforcing it specifically on test-file paths. The pipeline does have a write gate that stops writes to files outside the manifest’s modify list, covered in an earlier post on this blog (“The Write Gate”), but test files live in a distinct field of the manifest schema and the gate at the time of this run did not extend to them. The model could read the instruction and choose to follow it, or not. On the fixture the cost per occurrence was small. On a real project with expensive stages, “the model mostly follows instructions” is not a cost control.
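The structural fix is small once the schema fields are in view. This is a sketch under assumed field names, not the pipeline’s actual gate:

```typescript
// Hypothetical manifest fields. Test files live in their own field,
// separate from the modify list, which is why the original gate
// did not cover them.
interface ManifestPaths {
  modify: string[];    // paths the Coder may write
  testFiles: string[]; // paths owned by the Test Writer
}

function assertWriteAllowed(manifest: ManifestPaths, path: string): void {
  if (manifest.testFiles.includes(path)) {
    // "Do not modify test files" as a structural rule:
    // the model never gets a vote.
    throw new Error(`write gate: ${path} is a test file`);
  }
  if (!manifest.modify.includes(path)) {
    throw new Error(`write gate: ${path} is outside the modify list`);
  }
}
```

Throwing instead of instructing is the point: a structural gate converts “the model mostly follows instructions” into “the write never happens.”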
All three were already in the backlog
What made this real-project run different from a typical “ship it and see what breaks” story is that none of the three failures were surprises.
The large-file problem was catalogued as Gap #87 in the project’s architecture-gaps document, logged long before the real-project run, when the Coder first modified one function in a fixture file and broke three others in the process. The fixture files were 50 to 200 lines, small enough that a workaround held. The entry in the document said: “this will break on production-sized files.” It did, exactly as predicted, on a file seven times larger than anything the fixture contained.
The signature mismatch was a known data-path gap. The Test Writer’s output was not being injected into the Coder’s context. On the fixture this was a minor annoyance, because the fixture’s functions were simple enough that the Coder usually guessed the same signature. On a real project with domain-specific APIs, the divergence was immediate and unrecoverable without the data-path fix.
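The fix is a data-path change, not a prompt change: whatever assembles the Coder’s context has to carry the Test Writer’s output forward. A sketch with invented stage and field names:

```typescript
// Hypothetical stage outputs.
interface TestWriterOutput {
  testFilePath: string;
  testSource: string; // the tests define the API shape the Coder must match
}

interface CoderContext {
  sourceFile: string;
  manifestConstraints: string;
  testSource: string; // previously missing from the Coder's context
}

function buildCoderContext(
  sourceFile: string,
  manifestConstraints: string,
  tests: TestWriterOutput
): CoderContext {
  // With the tests in context, the Coder implements the signature the
  // tests actually call instead of guessing one from the function name.
  return { sourceFile, manifestConstraints, testSource: tests.testSource };
}
```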
The test-file write issue was catalogued as a failure mode early in the project’s life. I had deferred the structural fix, a sandbox or API-mode enforcement, because the cost per occurrence was low on the fixture. The structural vulnerability was documented, the workaround was known, and the trigger condition was named.
The real project confirmed all three; it did not discover any of them.
Why the gap catalogue was worth keeping
The fixture was not a toy demo. It was a systematic early warning system, and the gap catalogue was what converted fixture-run incidents into predictions about the first real-project run.
Every failure mode I catalogued during fixture runs was a prediction about what would go wrong at scale or on unfamiliar code. Some predictions were wrong, and some problems I expected never appeared. But the three biggest issues on the first real-project run were already in the backlog with gap numbers, root cause analyses, and proposed solutions.
This changes the economics of the first real-project run. Instead of an open-ended debugging effort on unknown failures, the work collapsed to confirming known ones. The diagnostic was already written and the fix designs sketched; the run was informational, not exploratory.
The catalogue effect
A backlog tracks what a team plans to fix.
A gap catalogue tracks what the team has chosen not to fix yet, with the root cause, the workaround currently holding the line, and the specific condition that will force the fix.
The first is a list of future work. The second is a running bet on which risks are safe to defer, and what signal will tell you when they are not. Most engineering teams keep the first. Almost none keep the second in writing. The difference shows up the first time the signal fires.
Writing down every failure mode, including the ones that cost an extra attempt, produced slightly wrong output, or only appeared at low model effort, creates a searchable database of infrastructure risk.
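Concretely, each entry carries enough structure to grep for and act on later. The shape below is illustrative, and the example paraphrases Gap #87 as described in this post; the field values are my summary, not the document’s text.

```typescript
// Illustrative shape of a gap catalogue entry.
interface GapEntry {
  id: number;
  failureMode: string; // what was observed on the fixture
  rootCause: string;   // why it happens
  workaround: string;  // what is holding the line today
  trigger: string;     // the condition that will force the real fix
}

const gap87: GapEntry = {
  id: 87,
  failureMode: "editing one function in a file broke three others",
  rootCause: "the Coder cannot safely modify large files symbol-by-symbol",
  workaround: "fixture files kept at 50-200 lines",
  trigger: "this will break on production-sized files",
};
```

Grep-ability is the design goal: when a real-project failure matches an entry, the diagnosis is a search, not an investigation.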
When the real project broke, I did not debug from scratch. I grepped the failure-modes document. “Reconstruct appends instead of replacing” was a specific instance of Gap #87. “Test Writer and Coder disagree on signatures” was a new failure-mode number, but the underlying gap was already documented as a missing data-path injection. The new number was five minutes of writing, not five hours of investigation.
The fixture project cost roughly four dollars across eighteen tickets. The real-project Planner-only run cost $0.085. The full pipeline attempt that failed cost under a dollar. The ticket eventually shipped green after twenty-five pipeline runs, each of which exposed and fixed one infrastructure bug; the final, successful attempt cost about $0.76 end-to-end. The total cost of proving that those three gaps were real blockers was under five dollars, and the diagnostic came pre-written. For reference, the same three gaps discovered cold in a production incident would each cost at minimum an hour of engineering time before a root cause was even named, and significantly more if the engineer handling the incident is new to the codebase.
The scale gap: what a fixture cannot test
Fixtures are small by design. Their small size is the point, not the limitation. The fixture proved that the brain layer works. The Planner reasons correctly about codebases, the Debugger diagnoses logic bugs, the Reviewer catches scope violations. These are the hardest things to build and the most expensive to debug.
What the fixture could not test was scale: files over 500 lines, test suites that take four minutes to run, monorepo workspace configurations, CRLF line endings from a Windows clone. These are infrastructure concerns, not architectural ones.
The real project proved the infrastructure did not scale. That is actually the better outcome. The brain, the part that is hard to fix, works. The plumbing, the part that is mechanical and diagnosable, did not. I would rather have it that way around.
The brain transfers, the plumbing does not
The Planner produced a valid manifest on its first attempt against a codebase it had never seen. It correctly identified the target file, the dependency chain, the test file location, and the acceptance criteria. Thirty-nine machine-extracted dependency signatures, all correct.
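For reference, the manifest the Planner produced carries roughly the following information. The field names are illustrative; the real schema is not shown in this post.

```typescript
// Illustrative manifest shape, based on what this post describes.
interface PlannerManifest {
  targetFile: string;             // the file the ticket modifies
  modify: string[];               // paths the Coder may write
  testFiles: string[];            // where the Test Writer's tests go
  acceptanceCriteria: string[];   // what "done" means for the ticket
  dependencySignatures: string[]; // machine-extracted; 39 on this run
}
```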
The nine pipeline code fixes that landed alongside the first real-project run were all mechanical: shell operator handling in test commands, CRLF normalisation after git clone, workspace hoisting conflicts, LSP log paths, timestamped output directories. Each was a short edit; none required rethinking the architecture.
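As an example of how mechanical these fixes were, CRLF normalisation is a few lines of Node. A sketch, not the pipeline’s code:

```typescript
import { readFileSync, writeFileSync } from "node:fs";

// Normalise CRLF to LF after git clone so that character-offset
// splicing, diffing, and snapshot comparisons behave the same on
// every platform.
function normalizeLineEndings(path: string): void {
  const source = readFileSync(path, "utf8");
  const normalized = source.replace(/\r\n/g, "\n");
  if (normalized !== source) writeFileSync(path, normalized, "utf8");
}
```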
This is the argument for fixture-first development. Build and validate the hard part in a controlled environment where failures are cheap and fast to diagnose. Then when the real project breaks, the failures are in the easy part, and the hard part already works.
This is not TDD; TDD exercises the code under test. It is not a canary deployment either; a canary stages the rollout of already-shipped code. Fixture-first development exercises the pipeline’s reasoning on a controlled project, so the expensive-to-debug architectural layers have already been proven by the time they meet unfamiliar code.
One honest caveat before the next section: this is a single real project. The claim that the brain layer transfers across codebases will survive or fall as more real projects land. What this run proved is that it can transfer, not that it will. The catalogue discipline is what makes the next real project cheap to run, regardless of whether this one ticket’s success generalises directly.
Where this stops being true
Fixture-first works to the extent the fixture exercises the capabilities the pipeline claims to have. A fixture with only short files produces false confidence about large-file handling. A fixture with only one API style produces false confidence about signature variance across a real codebase. Every gap I predicted correctly was one I had already written down as a known limitation of the fixture itself. “This will break on production-sized files” is only a prediction because I knew the fixture did not contain files anywhere near the size a real codebase carries, not because the break itself was hard to foresee.
The discipline that made the catalogue valuable was not writing down failures. It was writing down what the fixture did not cover. The gap catalogue does both; a backlog usually does only the first.
What I would tell someone building an AI pipeline
Do not skip the fixture phase. Build a small, controlled project that exercises every capability the pipeline claims to have. Run every ticket type against it. Write down every failure, including the ones worked around. And, more important than any of that, write down every capability the fixture cannot exercise, and why, even with no plan to fix it yet. The gap between “this was tested” and “this was admitted as untested” is where production surprises live. Closing that gap is cheap when it is writing and expensive when it is debugging.
Eighty failure modes feel like overhead when they are being written. They feel like foresight when the real project breaks and the diagnosis is already on the screen.
The pipeline runs inside Docker on real tickets. Numbers in this post are from actual runs against a TypeScript blog-API fixture and a ~100k-line TypeScript monorepo. Session counts and gap numbers are snapshots as of the run being described, not current totals. Still R&D.