The Decision an AI Coding Agent Can't Make Alone: Operator-Grounded Intent Capture
A deterministic resolver that passed every unit test then failed 40% of runs. The fix was operator confirmation, not a smarter algorithm.
This series documents an autonomous engineering pipeline in Python that takes a ticket from input to tested, reviewed code with no human in the execution loop. The architecture behaves more like a compiler for LLM behavior than a conventional agent framework: LLMs handle the parts that require interpretation, and deterministic code handles everything that can be computed. The pipeline runs inside Docker against a ~100k-line TypeScript monorepo. On certain ticket shapes the pipeline now commits code with zero LLM source-code authoring. The run archive spans ~1,000 runs, almost all of them the same handful of ticket families re-run to shake out non-determinism. All posts document real runs, real failures, and the architectural decisions behind them. Still R&D.
The architectural overview: four eras, one ticket, per-test cost dropping from $0.385 to $0.074.
How the pipeline's structural floor rose high enough to commit a multi-layer feature without any LLM source-code authoring. The current state of the system.
Why the pipeline uses language-server context instead of loading whole files.
The registry that keeps agent context precise across a changing codebase.
Why every field a Planner emits that the codebase already knows is a dice roll, and how machine extraction eliminates it.
Why manifest quality determines outcome more than builder quality: the principle behind the Synthesizer's role.
How write scope is constrained so the right code lands in the right place.
Why structural validation matters more than instructions alone.
The three rules for deciding which operator decisions the pipeline derives, which it asks for, and which it refuses to default, and why the answer must stay as structured data.
How the pipeline derives a behavioral test below the flat-stub boundary from the same operation that writes the guard, commits it as a gate, and proves it non-vacuous with mutation testing.
How the Debugger router dispatches each failure type to the stage that owns the repair, and why half the original destinations were already gone when it shipped.
How operator design decisions moved from free-form Q&A into typed handoff fields, and how the grounding chat gained a four-layer structural defense against LLM drift.
How operator intent becomes a structured handoff before autonomous execution starts.
Observability: why structured run traces are required to diagnose pipeline failures.
Reliability limits, retry budgets, and escalation boundaries.
Empirical validation across 248 runs showing why patch fixes were not enough.
Why warning-and-continuing produces output that looks correct but cannot be caught by any downstream gate.
How the Debugger's isolation from the Coder's reasoning is a design requirement, not a constraint.
A deterministic resolver that passed every unit test then failed 40% of runs. The fix was operator confirmation, not a smarter algorithm.
The pipeline derives a behavioral test from the same operation that writes the guard, runs it below the flat-stub boundary, and proves it via mutation testing.
Five techniques that turn a raw compiler error into a frequency-ordered hypothesis ladder, and why the pattern matters more for LLM debuggers than human ones.
The Debugger in the autonomous engineering pipeline now routes each failure to the stage that owns the fix. Half the original destinations no longer exist.
Two open quality items are closed. Operator design decisions flow through typed handoff fields. The grounding chat has a structural defense against LLM drift.
The pipeline completed a multi-layer TypeScript feature with zero LLM code generation: 6/6 ticket tests, 959/959 suite, $0.308 versus $0.681.
A new pre-autonomous chat stage lets the operator ground design decisions in the registry before the pipeline runs, adding a new top to the trust hierarchy.
Seven pipeline runs, one ticket, four architectural eras. Per-test cost dropped from $0.385 to $0.074 by replacing LLM guesswork with structural derivation.
Swarm parallelism is a throughput solution applied to a reliability problem. Probabilistic verification of probabilistic output does not converge.
What shipped in the three weeks after the 248-run hallucination ceiling: removing the LLM from computable decisions and validating everything else.
When a ticket has two equally plausible interpretations, a deterministic stage stops the pipeline and asks before any Coder agent runs.
Each stage in the pipeline runs against its own model and vendor config. How that design enables per-stage cost control, model swaps, and vendor flexibility.
The run archive is where pass/fail becomes diagnostic: per-stage operation logs, reasoning traces, and a correlation token that spans every stage.
How per-stage retry budgets, wall-clock timeouts, and a global token cap keep any stage from running indefinitely, with the Debugger as the most complex case.
The mechanics behind a binding validator: why synchronous pre-commit timing, structured rejections, and retry folding are each individually load-bearing.
When the model has a strong prior, naming the failure mode in the prompt doesn't prevent it. Prompt rules are advisory; validators are binding.
The data model behind the symbol registry: per-symbol records, file-level hashes, call-graph edges, and the invalidation strategy that keeps it current.
Looking up symbols by filename instead of full path pulls every `index.ts` in the project into the agent's context. One line changed. 20 results down to 1.
Three properties of a Lego instruction set, mapped to an AI coding pipeline: why manifest quality matters more than builder quality.
A Bernoulli model predicted 36% first-pass success across 248 pipeline runs. Measured: 21%. The gap explains why per-field hallucination fixes have a ceiling.
Every field a Planner emits that the codebase already knows is a dice roll. Machine extraction replaces those dice rolls with deterministic lookups.
Why tracking known architectural gaps with specific close conditions is more useful than a backlog, and what makes each entry work.
Fixture-first development as an early warning system for AI pipelines: the first real-project run confirmed three known gaps instead of discovering new ones.
Using `claude -p` in a pipeline? The model has bash access you never granted. Each tool call re-sends your full context. One sentence cuts token spend by 52%.
The Coder added a new function to an existing file. The pipeline reported success. All seven existing functions were gone.
A ticket that passed twice failed four times at lower model effort, exposing four structural pipeline bugs the higher-effort run had masked.
Same ticket, same pipeline config, different result two days apart. Why the first run passing was not confirmation that the constraint was enforced.
The pipeline committed code before branch isolation existed. The risk was real, named, and given a close condition. That is what makes it different from a shortcut.
When the pipeline detects zero test files, logging a warning and continuing produces output that looks correct but cannot be caught by any downstream gate.
On attempt 3, the Coder tried to write a file that was not in the manifest. The write gate stopped it before anything hit disk. This is what it is for.
The Debugger receives the test failure and the code on disk, not the Coder's reasoning. That isolation is not a constraint. It is the design.
Tree-sitter tells you where a symbol is defined. It cannot tell you where it is called. That gap cost one pipeline run 33,000 tokens to find out.
A Haiku optimization made the L2 quality gate silently pass on every run. The fix was removing the LLM call entirely.
Not a model capability problem. An agent with the wrong codebase version produces output that is plausible but wrong in ways that are hard to catch.