No Stage Runs Forever: Retry Budgets and Escalation in an Agent Pipeline
How per-stage retry budgets, wall-clock timeouts, and a global token cap keep any stage from running indefinitely, with the Debugger as the most complex case.
Autonomous Engineering Pipeline
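A minimal sketch of the mechanism the article describes, with hypothetical names (`StageBudget`, `TokenBudget`, `runStage`): each attempt runs under a wall-clock deadline, every attempt draws from a shared token cap, and the retry loop is finite, so any one of the three ending the stage is enough.

```typescript
// Hypothetical sketch: per-stage retry budget, wall-clock timeout,
// and a global token cap shared across all stages. Names are illustrative.
interface StageBudget {
  maxRetries: number; // attempts before the stage gives up and escalates
  timeoutMs: number;  // wall-clock ceiling per attempt
}

class TokenBudget {
  constructor(private remaining: number) {}
  spend(tokens: number): void {
    this.remaining -= tokens;
    if (this.remaining <= 0) throw new Error("global token cap exhausted");
  }
}

async function runStage(
  attempt: (signal: AbortSignal) => Promise<{ ok: boolean; tokensUsed: number }>,
  budget: StageBudget,
  tokens: TokenBudget,
): Promise<boolean> {
  for (let i = 0; i < budget.maxRetries; i++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), budget.timeoutMs);
    let result = { ok: false, tokensUsed: 0 };
    try {
      result = await attempt(controller.signal); // rejects if aborted
    } catch {
      // wall-clock timeout or attempt error: one retry consumed
    } finally {
      clearTimeout(timer);
    }
    tokens.spend(result.tokensUsed); // throws when the global cap is gone
    if (result.ok) return true;
  }
  return false; // retry budget exhausted: escalate, don't loop
}
```

Escalation is whatever the caller does when `runStage` returns false; the point is that it always returns. The Debugger layers escalation steps on top of this, but the bound works the same way.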
When the model has a strong prior, naming the failure mode in the prompt doesn't prevent it. Prompt rules are advisory; validators are binding.
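A sketch of the distinction, with a hypothetical rule: the prompt can state "never write outside `src/`", but only a check like this one makes the rule binding.

```typescript
// Hypothetical validator: the prompt may state this rule, but only a
// deterministic check enforces it. Rejecting here is binding; the
// sentence in the prompt is advisory.
function validateOutputPath(path: string): void {
  if (!path.startsWith("src/")) {
    throw new Error(`rejected: ${path} is outside src/`);
  }
}
```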
Looking up symbols by filename instead of full path pulls every `index.ts` in the project into the agent's context. One line changed. 20 results down to 1.
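Roughly the one-line difference, assuming a hypothetical symbol index keyed by file path: matching on the basename returns every `index.ts` in the project; matching on the full path returns one entry.

```typescript
interface SymbolEntry { name: string; filePath: string }

// Before: basename match pulls every index.ts in the project.
const byFilename = (index: SymbolEntry[], file: string) =>
  index.filter((e) => e.filePath.endsWith(`/${file}`));

// After: full-path match returns exactly the file asked for.
const byFullPath = (index: SymbolEntry[], path: string) =>
  index.filter((e) => e.filePath === path);
```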
Three properties of a Lego instruction set, mapped to an AI coding pipeline: why manifest quality matters more than builder quality.
Bernoulli model predicted 36% first-pass success across 248 pipeline runs. Measured: 21%. The gap explains why per-field hallucination fixes have a ceiling.
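The model behind the prediction, hedged: treat each Planner-emitted field as an independent Bernoulli trial. The per-field reliability and field count below are illustrative assumptions, not values taken from the 248 runs.

$$
\Pr[\text{first pass}] = \prod_{i=1}^{n} p_i \approx p^n,
\qquad \text{e.g. } p = 0.96,\ n = 25 \;\Rightarrow\; 0.96^{25} \approx 0.36.
$$

A measured rate well below the independence prediction means some failure mass is correlated across fields, and that mass is untouched by fixing fields one at a time.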
Every field a Planner emits that the codebase already knows is a dice roll. Machine extraction replaces those dice rolls with deterministic lookups.
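A sketch of the swap, with hypothetical names: any field a lookup can answer is resolved deterministically, and the Planner is only trusted for what the codebase cannot know.

```typescript
// Hypothetical sketch: fields the codebase already knows are filled by
// lookup; the model only emits what no lookup can answer.
interface PlannerDraft { symbolName: string; intent: string }
interface ResolvedPlan extends PlannerDraft { filePath: string }

function resolvePlan(
  draft: PlannerDraft,
  symbolIndex: Map<string, string>, // symbol name -> defining file
): ResolvedPlan {
  const filePath = symbolIndex.get(draft.symbolName);
  if (!filePath) throw new Error(`unknown symbol: ${draft.symbolName}`);
  return { ...draft, filePath }; // a lookup, not a dice roll
}
```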
Why tracking known architectural gaps with specific close conditions is more useful than a backlog, and what makes each entry work.
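A hypothetical entry showing the shape argued for: a named risk with a specific, checkable close condition, not a backlog item that ages in place. Field names and values here are illustrative; the gap itself is the branch-isolation one mentioned later in this series.

```typescript
// Hypothetical gap-register entry; field names are illustrative.
const gap = {
  name: "commits-before-branch-isolation",
  risk: "pipeline commits land on a shared branch and can collide",
  closeCondition: "every run executes inside its own isolated branch",
};
```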
Fixture-first development as an early warning system for AI pipelines: the first real-project run confirmed three known gaps instead of discovering new ones.
Using claude -p in a pipeline? The model has bash access you never granted. Each tool call re-sends your full context. One sentence cuts token spend by 52%.
The Coder added a new function to an existing file. The pipeline reported success. All seven existing functions were gone.
A ticket that passed twice failed four times at lower model effort, exposing four structural pipeline bugs the higher-effort run had masked.
Same ticket, same pipeline config, different result two days apart. Why the first run passing was not confirmation that the constraint was enforced.
The pipeline committed code before branch isolation existed. The risk was real, named, given a close condition. That is what makes it different from a shortcut.
When the pipeline detects zero test files, logging a warning and continuing produces output that looks correct and a defect that no downstream gate can catch.
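The fix the teaser implies, as a sketch: zero test files has to halt the run at the point of detection, because a run with no tests passes every later test gate vacuously.

```typescript
// Hypothetical sketch: a warning here is invisible downstream, because a
// run with zero tests "passes" every test gate vacuously. Halting is the
// only signal a later stage cannot miss.
function assertTestsExist(testFiles: string[]): void {
  if (testFiles.length === 0) {
    throw new Error("no test files found: failing fast instead of warning");
  }
}
```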
On attempt 3, the Coder tried to write a file that was not in the manifest. The write gate stopped it before anything hit disk. This is what it is for.
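A sketch of the gate, assuming the manifest is an allowlist of every file the Coder may touch; the check runs before the write, so nothing reaches disk.

```typescript
// Hypothetical write gate: the manifest is the allowlist, and the check
// runs before the write, so an out-of-manifest path never hits disk.
function gateWrite(manifest: Set<string>, path: string, write: () => void): void {
  if (!manifest.has(path)) {
    throw new Error(`write blocked: ${path} is not in the manifest`);
  }
  write(); // only reached for manifest-approved paths
}
```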
The Debugger receives the test failure and the code on disk, not the Coder's reasoning. That isolation is not a constraint. It is the design.
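The isolation expressed as a type, with hypothetical field names: the input carries the failure and the files on disk, and there is simply no field for the Coder's reasoning.

```typescript
// Hypothetical shape of the Debugger's input: the test output plus the
// code as it exists on disk. There is deliberately no field for the
// Coder's transcript; the isolation is enforced by what the type carries.
interface DebuggerInput {
  failingTestOutput: string;
  filesOnDisk: Map<string, string>; // path -> current contents
}
```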
Tree-sitter tells you where a symbol is defined. It cannot tell you where it is called. That gap cost one pipeline run 33,000 tokens to find out.
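A hedged sketch of what filling the gap takes: definitions fall out of parsing one file, but callers require scanning every file in the project, and the plain text scan used here for brevity over-approximates by matching comments and strings.

```typescript
// Hypothetical fallback: Tree-sitter yields definitions from a single
// parse, but finding callers means scanning every file. A text scan is
// an over-approximation (comments, strings), which is exactly the gap.
function findCandidateCallSites(
  files: Map<string, string>, // path -> contents
  symbol: string,
): string[] {
  const hits: string[] = [];
  for (const [path, text] of files) {
    if (text.includes(`${symbol}(`)) hits.push(path);
  }
  return hits;
}
```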
A Haiku optimization made the L2 quality gate silently pass on every run. The fix was removing the LLM call entirely.
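The shape of the fix, hedged (the actual checks behind the L2 gate are not shown here): a deterministic predicate over the artifact, with no model call left to be optimized into always answering pass.

```typescript
// Hypothetical sketch of the fix: the gate is a deterministic predicate
// over the artifact. With no LLM in the loop, nothing can be "optimized"
// into silently passing every run.
function l2Gate(artifact: { compiles: boolean; testsPass: boolean }): boolean {
  return artifact.compiles && artifact.testsPass;
}
```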
Not a model capability problem. An agent with the wrong codebase version produces output that is plausible but wrong in ways that are hard to catch.