Vendor-Agnostic by Configuration: Per-Stage Model Setup in an LLM Coding Agent

The pipeline runs multiple stages, each calling a language model: the Planner, the Coder, the Debugger, the Reviewer, and others. Each stage has its own model configuration, set independently of every other stage.

One config file, one entry per stage (illustrative, not literal syntax):

planner:
  vendor: anthropic
  model: opus                    # the most capable model in the run
  timeout: 300s
  retry_budget: 1                # one re-plan on a routed manifest issue
  on_exhaust: surface_to_human

coder:
  vendor: anthropic
  model: sonnet
  timeout: 120s
  retry_budget: 3
  on_retry:
    compute: increased           # more compute on the same model
  on_exhaust: halt

debugger:
  vendor: anthropic
  model: sonnet
  timeout: 300s
  attaches_to: any_stage         # invoked on demand wherever runtime evidence exists
  on_retry:
    compute: increased
  can_route_to: any_stage        # mini-brain: routes the fix to whichever upstream stage owns it

reviewer:
  vendor: google
  model: gemini-flash            # cheap-and-fast for a scanning role
  timeout: 120s
  retry_budget: 1

Multiple vendor surfaces are supported, covering both subscription CLIs and direct SDK access. Each stage routes to whichever backend its own vendor entry names. The Reviewer in the example runs on Google while the Planner, Coder, and Debugger run on Anthropic. That is a valid configuration. Stages do not share routing state.

[Diagram: the four stages with their model labels (Planner/opus, Coder/sonnet, Debugger/sonnet, Reviewer/gemini-flash) routed to vendor backends. Planner, Coder, and Debugger connect to Anthropic; Reviewer connects to Google.]

Why each stage in an LLM coding pipeline has a different cost profile

Different stages do fundamentally different work, and the cost profiles do not line up.

The Planner decides what to build: which files, which symbols, what operations, what order. It is the highest-stakes reasoning step in the pipeline. You want the model most capable of structured planning, a timeout generous enough to let that reasoning finish, and a budget to match.

The Reviewer checks whether the Coder’s output follows project conventions. It is a structured pass over a diff, closer to pattern-matching than to reasoning. The cheapest model that reliably handles this is the right choice.

The Debugger fires only after the Coder has exhausted its retry budget. At that point you need the model with the highest diagnostic accuracy, and the config reflects that: the Debugger runs at a higher reasoning budget than the builder stages, and escalates further on retries.

A single global model config means paying Planner prices for the Reviewer, or running the Debugger at Reviewer-grade reasoning depth. Per-stage configuration is how you avoid both.

A previous post on pipeline bugs that only surface at lower model effort showed what happens when there is one global knob: lowering it exposes structural bugs the higher setting was silently compensating for. The per-stage config is the surface that makes the fix expressible without touching every other stage.

Escalation: a second spend tier inside the retry budget

Each stage can define a separate configuration for retries, distinct from its first-attempt setting. That retry configuration might mean more compute, a different model, or the same setting again. It is optional. Omitting it means every attempt runs identically.
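
In the same illustrative syntax as the config above, the three shapes look like this (the opus escalation is a hypothetical variant, not the earlier Coder entry):

# Same model, more compute on retry (as configured above)
coder:
  on_retry:
    compute: increased

# Switch to a stronger model on retry (hypothetical variant)
coder:
  on_retry:
    model: opus

# No on_retry block at all: every attempt runs identically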

The two-tier design exists because retry context is fundamentally different from a first attempt. When the Coder retries, it already has test failure output and a corrective brief in context. That is a harder reasoning task than the initial attempt, and the spend allocation reflects it. You are spending more at the specific moment where the cheaper attempt already failed.

For stages like the Reviewer, where a retry is a straightforward re-evaluation with additional constraints, a constant tier is often sufficient. For the Coder and Debugger, where retries carry increasing context complexity, the escalating tier matters.

[Diagram: first attempt vs. on retry for the Coder and Debugger. Coder goes from base to increased compute, Debugger from medium to high; an arrow labelled "if failed" links the two columns.]

This design is covered briefly in a previous post, "no stage runs forever", in the context of retry budgets. The per-stage config is what makes it expressible. The schema does not care whether escalation means more compute on the same model or switching to a stronger one entirely. The retry rung names what changes on the next attempt; the rest of the pipeline does not need to know.

What per-stage model configuration enables in practice

Cost optimisation without touching pipeline code. The config is the complete surface for tuning cost. To reduce spend on a batch of tickets, adjust model and reasoning settings per stage. No code changes, no redeployment.

Model swaps scoped to one stage. If a new model is released that performs significantly better at code review but not at planning, update the Reviewer entry. The Planner, Coder, and Debugger are unaffected. Each stage’s configuration is read independently.

Vendor swaps scoped to one stage. To route one stage through a different vendor while keeping the rest unchanged, update that stage’s entry. The routing layer dispatches each stage call against its own config independently; a sketch of that dispatch follows below.

Explicit timeouts with no hidden defaults. The pipeline refuses to start if any stage is missing a timeout value. The check runs before any LLM call is made. Every stage declares what it will wait for, and the pipeline holds it to that. There is no fallback a stage might silently inherit, and no timeout that looks fine in a happy-path run and quietly lets a stuck stage hang the first time something goes wrong.
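
A minimal sketch of those two behaviours, with invented TypeScript names (the post never shows the pipeline's real types or clients): dispatch against per-stage entries, and a startup check that fails fast on a missing timeout.

type Vendor = "anthropic" | "google";

interface StageConfig {
  vendor: Vendor;
  model: string;
  timeoutMs?: number; // optional in the type only so the startup check has something to catch
}

type PipelineConfig = Record<string, StageConfig>;

interface VendorClient {
  complete(model: string, prompt: string, opts: { timeoutMs: number }): Promise<string>;
}

// Stub clients standing in for the real CLI and SDK surfaces.
const clients: Record<Vendor, VendorClient> = {
  anthropic: { complete: async (model) => `[anthropic/${model}] ...` },
  google:    { complete: async (model) => `[google/${model}] ...` },
};

// Runs before any LLM call: refuse to start if any stage omits a timeout,
// so no stage can silently inherit a fallback.
function validateTimeouts(config: PipelineConfig): void {
  const missing = Object.entries(config)
    .filter(([, stage]) => stage.timeoutMs === undefined)
    .map(([name]) => name);
  if (missing.length > 0) {
    throw new Error(`stages missing a timeout: ${missing.join(", ")}`);
  }
}

// Each call dispatches against that stage's own entry; no shared routing
// state, so pointing one stage at another vendor is a one-entry change.
async function callStage(config: PipelineConfig, name: string, prompt: string): Promise<string> {
  const stage = config[name];
  if (!stage) throw new Error(`unknown stage: ${name}`);
  return clients[stage.vendor].complete(stage.model, prompt, { timeoutMs: stage.timeoutMs! });
}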

The real limit of vendor swaps: feature parity is asymmetric

The architecture is vendor-agnostic. The config structure supports mixed vendor assignments across stages. What it does not guarantee is that every vendor exposes the same features.

Each vendor exposes its own set of controls, and not all of them map across. Effort control is a Claude concept: a per-stage knob that sets the reasoning budget explicitly. Switch a stage to a different vendor and that specific control is no longer available, though the other vendor may have its own equivalent settings. Stage routing stays clean across vendors, but the feature set a stage can use is shaped by whichever vendor it is pointed at.
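
If the reasoning-budget knob surfaced in the config as an effort key (an assumed spelling; the post shows no literal syntax for it), the asymmetry would look like this:

debugger:
  vendor: anthropic
  model: sonnet
  effort: high                   # hypothetical key; meaningful while pointed at Anthropic

reviewer:
  vendor: google
  model: gemini-flash
  # effort: high                 # no direct mapping here; Google exposes its own controls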

This asymmetry means a vendor swap is not purely mechanical. You can reassign a stage to a different vendor and the pipeline will run, but the stage is operating with a different capability envelope than it was before. Whether that matters depends on the stage. The Reviewer is less sensitive to these differences than the Debugger. The Planner, which uses MCP-backed registry lookups in some configurations, is more sensitive.

The config as a cost model for the full run

The cost model for a full pipeline run follows directly from all stage entries, each with a vendor, a model, a possible escalation tier, and a timeout. If you want to understand what a run will cost before it starts, and where the budget goes when something goes wrong, the config is where that answer lives.
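
A sketch of reading the config that way, with placeholder prices (real pricing is per token and per vendor, and escalated retries would carry their own rate; both are flattened here for brevity):

interface StageCostEntry {
  model: string;
  retryBudget: number;  // retries on top of the first attempt
  pricePerCall: number; // placeholder unit
}

// Worst case: every stage burns its full retry budget.
function worstCaseCost(stages: Record<string, StageCostEntry>): number {
  return Object.values(stages).reduce(
    (total, s) => total + (1 + s.retryBudget) * s.pricePerCall,
    0,
  );
}

// Invented numbers: the Planner dominates, the Reviewer barely registers.
const estimate = worstCaseCost({
  planner:  { model: "opus",         retryBudget: 1, pricePerCall: 0.9 },
  coder:    { model: "sonnet",       retryBudget: 3, pricePerCall: 0.3 },
  reviewer: { model: "gemini-flash", retryBudget: 1, pricePerCall: 0.02 },
});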

What the config cannot do yet: computed token caps

The global token cap in the current config is a fixed conservative number. It exists as a circuit breaker, not a cost model. It stops runaway runs without being calibrated to the actual complexity of any specific ticket.

The architectural groundwork for something better is already in place. Because the structural decomposition step runs before any expensive LLM call, the pipeline knows the complexity of what it is about to build before it starts spending. Sub-ticket count, chain depth, and operation count are all known quantities at that point. A token ceiling derived from those figures would be tighter and more honest than a fixed global cap, and deviation from it would be a detectable signal: if a stage significantly exceeds the expected token count for the work it was given, that is more likely rambling than legitimate output.
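
As a shape only, with every coefficient invented (the next paragraph is about why none of them can be trusted yet):

// Hypothetical computed ceiling from the quantities the decomposition
// step already knows. Nothing here is calibrated.
function tokenCeiling(subTickets: number, chainDepth: number, opsPerSubTicket: number): number {
  const base = 4_000;                        // assumed fixed overhead per sub-ticket
  const perOp = 1_500;                       // assumed tokens per operation
  const depthFactor = 1 + 0.25 * chainDepth; // assumed: deeper chains carry more context
  return Math.round(subTickets * (base + perOp * opsPerSubTicket) * depthFactor);
}

// Deviation as a signal: usage far above the ceiling is more likely
// rambling than legitimate output. The 1.5 threshold is also invented.
const looksLikeRambling = (actualTokens: number, ceiling: number) => actualTokens > 1.5 * ceiling;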

The blocker is calibration data. A formula like the one sketched above is illustrative until it is fitted to real run distributions. Each project has different characteristics: a TypeScript monorepo with heavy type extensions generates different token distributions than a Python service with simple CRUD operations. Early runs on any new project still need a conservative fallback until enough data accumulates to derive a project-specific ceiling.

The fixed cap is the right default for R&D. Enough logged runs on the same project will eventually make the computed version viable. Until then, the config exposes the knob; the value is a best guess.


The pipeline runs inside Docker on real tickets against a ~100k line TypeScript monorepo. Still R&D.