No Stage Runs Forever: Retry Budgets and Escalation in an Agent Pipeline

Most write-ups of agent pipelines stop at the happy path. The Planner produced a manifest, the Coder wrote code, the tests passed. What happens when a stage fails, and what stops the pipeline from burning the entire token budget on a stuck retry loop, is the part the public literature mostly skips. The honest version of the design lives in those failure boundaries, not in the happy-path diagram.

The unwritten rule of any pipeline that runs without a human in the loop is that no stage runs forever. Every stage has a retry budget. Every retry has a wall-clock cap. The whole run sits inside a global token cap. When any of those is exhausted, the pipeline halts and the failure surfaces, rather than quietly spending more money on attempts that are not converging.

This post walks through how that boundary is configured, stage by stage, on the pipeline I am building. The Debugger is the showcase, because it has the most failure complexity, but the same shape applies everywhere.

The contract: bounded at every layer

Three layers of bounds apply to every ticket:

The outermost is the per-run token cap. The pipeline counts input tokens minus cache reads, plus output tokens, against that ceiling on every model call. When the running total crosses it, the call that would have exceeded the budget raises and the run aborts. The cap exists so a runaway stage cannot drain a wallet on its own.
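
A minimal sketch of that accounting, in TypeScript. The class and field names are mine, not the live pipeline's:

class TokenBudgetExceeded extends Error {}

class RunTokenBudget {
  private spent = 0;

  constructor(private readonly cap: number) {}

  // Charge one model call: input tokens minus cache reads, plus output tokens.
  // The call whose usage pushes the running total over the cap raises.
  charge(usage: { input: number; cacheRead: number; output: number }): void {
    const billable = usage.input - usage.cacheRead + usage.output;
    this.spent += billable;
    if (this.spent > this.cap) {
      throw new TokenBudgetExceeded(
        `per-run token cap ${this.cap} exceeded (${this.spent} counted)`
      );
    }
  }
}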

The next layer is the per-stage wall-clock timeout. Every stage has a timeout, in seconds, and the pipeline refuses to start if any stage is missing it. There is no hidden default. A missing wall-clock cap is the kind of thing that looks fine in a happy-path run and silently lets a stuck stage hang for half an hour the first time something goes wrong.
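
A sketch of that startup check, assuming stage configs arrive as a plain record; the field name is illustrative:

function assertWallClockCaps(
  stages: Record<string, { timeoutSeconds?: number }>
): void {
  const missing = Object.entries(stages)
    .filter(([, cfg]) => cfg.timeoutSeconds === undefined)
    .map(([name]) => name);
  if (missing.length > 0) {
    // No hidden default: refuse to start rather than invent a timeout.
    throw new Error(`refusing to start: no wall-clock cap for ${missing.join(", ")}`);
  }
}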

The innermost is the per-stage retry budget. The Coder gets three attempts by default. The Reviewer’s redo cycles are capped. The Planner gets one re-plan when the Debugger routes a manifest issue back to it. Each one is a different field, but the shape is the same: a bounded number of attempts, a defined exit when the budget is gone.
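
That common shape, sketched as a generic bounded loop. The attempt callback and the exit names are illustrative:

type OnExhaust = "halt" | "surface_to_human" | "route_upstream";

// A bounded number of attempts, a defined exit when the budget is gone.
async function runBounded<T>(
  retryBudget: number,
  attempt: (n: number) => Promise<T | null>, // null means this attempt failed
  onExhaust: OnExhaust
): Promise<T> {
  for (let n = 1; n <= retryBudget; n++) {
    const result = await attempt(n);
    if (result !== null) return result;
  }
  // The budget is gone; the answer is never "keep going".
  throw new Error(`retry budget exhausted, exit: ${onExhaust}`);
}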

[Diagram: three concentric rings, one per failure boundary. Outer ring: PER-RUN TOKEN CAP. Middle ring: PER-STAGE WALL-CLOCK TIMEOUT. Inner ring: PER-STAGE RETRY BUDGET. Dashed arrows from the centre outward, labelled HALT, SURFACE TO HUMAN, and ROUTE UPSTREAM.]

Per-stage configuration is not optional

Every stage in the pipeline has its own configuration entry. A stage entry pins the vendor surface (which CLI or SDK the call goes through), the specific model within that vendor, the wall-clock timeout, the retry budget where applicable, the rung the stage moves to on retry, and what the pipeline does when the budget is exhausted. Stages can be on different vendors and different models in the same run. The Planner could run on Gemini Pro while the Debugger runs on Claude Sonnet, if cost and capability per stage suggest that split.

The shape, in pseudocode (illustrative, not a 1:1 with the live config):

planner:
  vendor: anthropic
  model: opus                    # the most capable model in the run
  timeout: 300s
  retry_budget: 1                # one re-plan on a routed manifest issue
  on_exhaust: surface_to_human

coder:
  vendor: anthropic
  model: sonnet
  timeout: 600s
  retry_budget: 3
  on_retry:
    compute: increased           # more compute on the same model
  on_exhaust: halt

debugger:
  vendor: anthropic
  model: sonnet
  timeout: 300s
  attaches_to: any_stage         # invoked on demand wherever runtime evidence exists
  # no retry budget of its own: its attempts are bounded by the calling stage's loop
  on_retry:
    compute: increased
  can_route_to: any_stage        # mini-brain: routes the fix to whichever upstream stage owns it

reviewer:
  vendor: google
  model: gemini-flash            # cheap-and-fast for a scanning role
  timeout: 300s
  retry_budget: 1                # cap on the redo cycles it can trigger
  on_exhaust: close_run          # close the run, keep the latest reviewed state

The point of the snippet is the axes, not the syntax. Four stages, the same shape repeated, different values per stage. The vendor and model are independent of each other and independent of the other stages. The retry budget is bounded per stage. The retry rung names what changes on the next attempt. The exhaust behaviour names what happens when the budget is gone, and the answer is never “keep going.”
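
Written as a type, those axes look something like this. The field names are assumptions, not the live schema:

interface StageConfig {
  vendor: "anthropic" | "google";                         // independent per stage
  model: string;                                          // independent of the other stages
  timeoutSeconds: number;                                 // mandatory, no hidden default
  retryBudget?: number;                                   // bounded, where applicable
  onRetry?: { compute?: "increased"; model?: string };    // the rung: more compute, or a stronger model
  onExhaust?: "halt" | "surface_to_human" | "close_run";  // never "keep going"
}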

Configuring stages independently is not cosmetic. The right model, the right cost-per-call, and the right retry budget depend on the work, not on a uniform setting. A previous post on pipeline bugs that only surface at lower model effort showed what goes wrong when there is one global knob: lowering it exposes structural bugs the higher setting was silently compensating for. The fix in that case was structural. The per-stage config is the surface that makes the fix expressible. The Reviewer can run on a cheaper, lower-effort setting and the Coder can run with a higher escalation rung, in the same run, against the same ticket, because the failure profiles of those stages are not the same.

The Debugger: most failure complexity in one stage

The Debugger is the most interesting stage to look at, because it is the one that runs precisely because something else has already failed. It does not fail on its own, and it does not sit at a fixed step in the pipeline. The Debugger is invoked on demand from any stage that has hit a failure with runtime evidence to inspect, and it returns a diagnosis the calling stage acts on. The most common attachment point is the Coder’s retry loop, where the flow is: the Coder writes code, the tests run, the tests fail, the Debugger diagnoses, the Coder retries on the next attempt. But the same call shape applies anywhere a stage produces output that fails a runtime check.

The Debugger acts as a mini-brain across stages, not as a sub-stage of any one of them. It is the only stage with access to runtime evidence: the actual test output, the type-checker errors, the code on disk, the failure shape from whichever stage called it. It uses that evidence to decide which upstream stage owns the fix. A diagnosis can route to the Coder (“the implementation is wrong, here is a corrective brief and what to avoid on the next attempt”), or to the Planner (“the manifest is broken, no Coder retry will fix it”). The same diagnostic call routes the work to whichever upstream stage is the actual source of the problem. That routing role, plus the on-demand attachment, is the difference between a retry helper and a mini-brain. It is also why the Debugger can rescue runs that no single stage retrying on its own could.
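
One way that routing could be expressed, as a discriminated union over the two routes named above. The shape is mine, not the pipeline's:

type Diagnosis =
  | { routeTo: "coder"; brief: string; avoid: string[] }  // implementation is wrong
  | { routeTo: "planner"; manifestIssue: string };        // no Coder retry will fix it

function dispatch(d: Diagnosis): "retry_coder" | "replan" {
  switch (d.routeTo) {
    case "coder":
      // the corrective brief feeds the next Coder attempt
      return "retry_coder";
    case "planner":
      // exit the Coder loop early; the finding goes upstream for a re-plan
      return "replan";
  }
}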

So “the Debugger fails” can mean three different things, and each one is configured separately.

The first kind of failure is that the diagnosis was right, the Coder followed it, and the tests still failed. A sharper diagnosis on the next attempt might rescue the run. That is what the escalation rung is for. The first Debugger call inside a stuck Coder loop runs at the base configuration; the retry runs at a more expensive one, reserved for the runs that did not yield to the cheaper diagnosis. The expensive rung might be more compute on the same model, or a stronger model entirely. The schema does not care which.
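
The rung selection itself can be as small as the sketch below. The budgets and the model name are placeholders, and as said, the schema does not care whether the expensive rung is more compute or a stronger model:

interface DebuggerCall { model: string; thinkingTokens: number }

function rungFor(attempt: number): DebuggerCall {
  return attempt === 1
    ? { model: "sonnet", thinkingTokens: 4_000 }   // base: the cheap first diagnosis
    : { model: "sonnet", thinkingTokens: 16_000 }; // expensive: reserved for runs that did not yield
}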

The second kind is that the diagnosis was wrong, and the Coder is now off track. The retry budget contains the damage. Three attempts is a hard ceiling. The pipeline does not retry forever, and it does not extend the budget when the Debugger thinks it is close.

The third kind is the most interesting. The Debugger is correct that the manifest itself is broken, and no amount of Coder retries will fix it. The earlier post on why the Debugger never inherits the Coder’s reasoning covers the input contract that lets the Debugger see this clearly. It also covers the failure class the Debugger cannot handle at all: missing callers outside its authorized scope surface as pipeline halts, not re-plans. What routes back to the Planner is different, a case where the manifest content is wrong but the failure is inside the Debugger’s scope. When that is the diagnosis, the pipeline exits the Coder loop early, sending the finding upstream to the Planner for a re-plan instead of burning the remaining attempts.

That last path is the difference between a retry budget and an escalation strategy. A retry budget caps cost. An escalation strategy decides where to spend the next attempt. Inside the Debugger alone, both are configured: a more expensive call when the cheaper diagnosis did not land, and an early exit from the calling stage’s loop when staying inside it is provably wasted.
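
Put together inside the Coder loop, the two mechanisms look roughly like this. writeCode, runTests, and diagnose are stand-ins for the real stages:

type Outcome = "passed" | "halted" | "replan";

async function coderLoop(
  writeCode: (brief?: string) => Promise<void>,
  runTests: () => Promise<boolean>,
  diagnose: (attempt: number) => Promise<{ routeTo: "coder" | "planner"; brief: string }>,
  retryBudget = 3
): Promise<Outcome> {
  let brief: string | undefined;
  for (let attempt = 1; attempt <= retryBudget; attempt++) {
    await writeCode(brief);
    if (await runTests()) return "passed";
    const d = await diagnose(attempt); // escalates to the expensive rung after attempt 1
    if (d.routeTo === "planner") return "replan"; // early exit: staying here is provably wasted
    brief = d.brief; // corrective brief for the next attempt
  }
  return "halted"; // budget gone with no manifest-issue diagnosis
}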

The same shape applies to every other stage

The Coder has three attempts, with a more expensive configuration on attempts two and three. When the third fails without a manifest-issue diagnosis, the run halts.

The Reviewer has a cap on Coder redo loops triggered by review verdicts. The Reviewer can request changes; it cannot request changes indefinitely. When the cap is reached the run is closed, with the latest reviewed state preserved.

The Planner has a one-shot re-plan when the Debugger routes a manifest issue back to it. The re-plan gets the Debugger’s diagnosis as input. After that, if the rebuilt manifest still does not produce a passing run, the pipeline surfaces the failure to a human. The Planner does not get a third try.

None of these caps is the most expensive part of running the pipeline. They are usually idle. They exist for the runs where something has already gone wrong and the next call is not going to fix it.

Escalation in practice: bad upstream context, not a capability ceiling

The temptation when a stage gets stuck is to throw a stronger model at it. The retry rung exists for the cases where that instinct is right.

The observation from running this pipeline against real tickets is that the instinct is usually wrong. When a stage fails repeatedly, the cause is almost always bad context coming in from upstream, not a capability ceiling on the stage itself. A Coder that cannot make tests pass after two attempts has not usually run out of model capability. It has been handed a manifest that is missing a symbol, or constraints that contradict the type system, or a test that asserts behaviour the data model cannot produce. Swapping the Coder to a more expensive model on attempt three lets the model burn more compute trying to bridge the gap, but the gap is upstream. A stronger model on the same bad context produces a more confident wrong answer slightly more often than a weaker model on the same bad context.

This is the reason the Debugger’s most useful exit, in practice, is not a more expensive retry inside the Coder loop. It is the early route back to the Planner when the manifest is the actual problem. That route does not escalate the model. It moves the fix upstream, where the bad input was generated. Most of the time, that is what unsticks the run.

That does not mean the escalation rungs are useless. There are cases where the cheaper call gave a vague diagnosis, the more expensive call gave a precise one, and the next attempt landed. The rung pays for itself when that happens. But those cases are a smaller share of the stuck runs than the design instinct suggests. The honest reading is that the schema lets you escalate the model as a fallback; the more useful design move, by a wide margin in observed runs, is the upstream route.

The final hard stop

When every retry budget has been spent and the failure has not resolved, the pipeline halts and resets. There is no fourth Coder attempt, no second Planner re-plan, and no automatic switch to a different model. The run fails, the failure surfaces, and the ticket is flagged for human review.

The reset is part of the halt, not an afterthought. The workspace is restored to the state it was in before the run began. Partially-applied edits are reverted, half-written tests are dropped, orphan files are cleaned up. A failed run leaves nothing behind for the next ticket to inherit. The next ticket runs against the same starting point as if the failed run had never happened, which is the only way a ticket queue can survive an unattended failure without compounding the damage across runs.
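
For a git workspace, the reset can be as blunt as the sketch below. The post does not say git is the mechanism; this is one way to get the guarantee:

import { execFileSync } from "node:child_process";

function resetWorkspace(repoDir: string, preRunCommit: string): void {
  // revert partially-applied edits and half-written tests
  execFileSync("git", ["reset", "--hard", preRunCommit], { cwd: repoDir });
  // drop orphan files the failed run left behind (-d removes untracked directories too)
  execFileSync("git", ["clean", "-fd"], { cwd: repoDir });
}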

This is intentional, and the thing it is preventing is the headline. A pipeline that retries until it succeeds, with no exhaustion path, is a pipeline that loops on its own bill until something external trips it. The retry budgets, the wall-clock timeouts, and the global token cap exist so the loop ends. When the budget is gone, the run stops, every time, regardless of how close the next attempt looks. The escalation rungs inside those budgets are tradeoffs about where to spend the next attempt, not promises that some attempt will eventually work.

The contract is not there to make stuck runs fast. It is there to stop them from becoming endless.

The reason this is worth describing in detail is that “what happens when each stage fails” is the part of an agent pipeline that actually has to work in production. The happy path is a presentation. The retry budgets, the timeouts, the escalation rungs, and the hard stops live in one config file because they are the same kind of decision applied to different stages, not because the file was easier to organise that way.


The pipeline runs inside Docker on real tickets against a ~100k line TypeScript monorepo. Still R&D. The design favours fixing bad upstream context over throwing a stronger model at a stuck stage; the schema supports both, and the upstream route has been more reliable.