Who Reviews the Swarm? Why Probabilistic Verification Fails at Scale


Figure: parallel agents converge with arrows onto a single human reviewer while a queue of pending diff cards accumulates faster than it can be drained. Labels: 10 diffs per run; reviewer capacity roughly 3 diffs per hour.

The dominant direction in AI coding agent tooling is parallelism: a swarm of agents working simultaneously on the same codebase. Run more agents at once, get more output faster. Ten agents produce ten diffs in the time one agent produces one. The throughput argument is obvious. The review problem is not.

If ten agents produce ten diffs simultaneously, someone or something has to verify that those diffs are correct before they reach the codebase. On a complex evolving monorepo, that verification is not trivial. A diff that passes type-check and tests can still hallucinate a function name, write to the wrong file, introduce dead code, or implement the wrong interpretation of the ticket. The structural failures are the ones that survive automated gates.

This post is a thesis, not a proof. It draws on what building a sequential autonomous pipeline has revealed about where failures actually live, and on what that implies for verification at scale.

The math of human review

A developer can meaningfully review one complex multi-file diff in roughly 20 minutes. That is long enough to understand the intent, trace the call sites, verify the scope, and catch what is subtly wrong. That works out to three diffs per hour, at best, for work that requires genuine attention.

A swarm of ten parallel agents produces ten diffs simultaneously. The human reviewer is immediately the bottleneck. The agents generate faster than any human can verify. The speed gain from parallelism accumulates in a queue that the reviewer cannot drain.
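A blunt model makes the queue dynamics concrete. The review rate comes from the 20-minutes-per-diff figure above; the cadence of one swarm run per hour is an assumption chosen for illustration:

```typescript
// A blunt queue model: 10 diffs per run, ~3 meaningful reviews per hour.
// The run cadence (one swarm run per hour) is an assumption.

const diffsPerRun = 10;
const reviewsPerHour = 3;
const runsPerHour = 1;

let backlog = 0;
for (let hour = 1; hour <= 8; hour++) {
  backlog += runsPerHour * diffsPerRun - reviewsPerHour;
  console.log(`hour ${hour}: ${backlog} diffs waiting`);
}
// After one working day the reviewer is 56 diffs behind, and the gap
// grows by 7 for every hour the swarm keeps running.
```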

At that point one of two things happens. The reviewer rubber-stamps diffs they cannot fully evaluate, accepting the agent’s output on faith, or they context-switch across ten partial reviews, losing the depth of understanding that makes review meaningful in the first place. Neither outcome preserves the reliability the review was supposed to provide.

The obvious counter: let agents verify each other

The natural response is to add a verification layer. Ten agents produce ten diffs. Ten verification agents check the ten diffs. No human required at that stage.

This leads directly to the question: who verifies the verification agents?

The answer cannot be more agents without infinite regress. At some point the chain must terminate in a trust decision, and the question is only where in the chain a human makes it. Moving that decision upstream does not eliminate it.

More importantly, probabilistic verification of probabilistic output does not converge on reliability. If the primary agent has a non-trivial failure rate on structurally subtle errors (wrong call sites, scope creep into plausible-but-wrong files, dead destructuring that passes type-check), a second agent reading the same diff inherits the same blind spots. Two probabilistic systems checking each other do not produce a deterministic outcome. They produce a joint failure probability, and that joint probability only approaches zero if the verifier's errors are independent of the primary's. Shared blind spots make them correlated, so the errors most likely to slip past the primary are exactly the ones most likely to slip past the verifier. And if the verification agent really is significantly more reliable than the primary, the question becomes why it is not being used for the primary task.
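A back-of-the-envelope illustration of the correlation problem. Every rate below is invented for illustration, not measured:

```typescript
// Why a second agent reviewing the same diff helps far less than
// independence would suggest. All rates are illustrative assumptions.

// Probability the primary agent ships a structurally subtle error.
const primaryMissRate = 0.1;

// Probability the verifier misses an error, *if* its failures were
// independent of the primary's.
const verifierMissRate = 0.1;

// Fraction of the primary's errors that fall inside blind spots the
// verifier shares (same diff, same class of model, same missing context).
const sharedBlindSpotFraction = 0.8;

// Independent case: both systems must miss for the error to ship.
const independentMiss = primaryMissRate * verifierMissRate; // 0.01

// Correlated case: errors inside shared blind spots pass the verifier
// unconditionally; only the rest get a genuinely independent second look.
const correlatedMiss =
  primaryMissRate *
  (sharedBlindSpotFraction + (1 - sharedBlindSpotFraction) * verifierMissRate); // 0.082

console.log({ independentMiss, correlatedMiss });
// Independence predicts a 10x improvement; shared blind spots leave the
// shipped-error rate close to the primary's rate alone.
```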

Where the failures actually live

Building a sequential pipeline and running it across hundreds of tickets has shown that the failure classes that survive automated gates are not the obvious ones. Type-check passes. Tests pass. The diff looks plausible.

The failures are: a symbol that exists in the registry but not in the file the manifest named. An assertion against a mock that cannot intercept an intra-module call. An unused destructuring pattern that the TypeScript server classifies as a warning rather than an error. Code that satisfies the test oracle but misses the intent of the ticket.
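As a concrete instance of the dead-destructuring class, here is a minimal TypeScript fragment (a hypothetical handler, not code from the pipeline) that compiles cleanly under default settings:

```typescript
// Dead destructuring that type-checks. Hypothetical example.

interface Config {
  retries: number;
  timeoutMs: number;
}

function runTask(config: Config): string {
  // `timeoutMs` is destructured but never used. With default compiler
  // settings this compiles without complaint; editors typically surface
  // it as a grayed-out hint, not a build-breaking error.
  const { retries, timeoutMs } = config;
  return `running with ${retries} retries`;
}

console.log(runTask({ retries: 3, timeoutMs: 5_000 }));
```

With `noUnusedLocals` enabled the compiler would reject this, but a repository that does not run with that flag ships it, and editor hints alone do not block a merge.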

These are the failures a human reviewer with full context would catch. They are also the failures a second LLM agent reading the same diff is likely to miss, for the same reasons the first agent missed them: the diff looks structurally valid, the tests pass, and the subtle wrongness requires either codebase knowledge the agent does not have or a rule the agent was not given.

The structural verification alternative

The alternative is not more agents. It is deterministic enforcement at the point of generation.

A write gate that rejects out-of-scope files before they hit disk does not have a reliability curve. It either fires or it does not. A feasibility gate that checks acceptance criteria against the registry before any agent runs does not hallucinate. A validator that matches file paths against excluded patterns is not probabilistic.
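A minimal sketch of such a write gate, assuming a manifest shape and names invented for illustration (`Manifest` and `checkWrite` are not the pipeline's actual API):

```typescript
// A minimal write gate sketch. The manifest shape and function names are
// illustrative assumptions, not the pipeline's actual interface.

interface Manifest {
  // Files the ticket explicitly authorizes the agent to touch.
  allowedFiles: Set<string>;
  // Path prefixes that are always off-limits (e.g. generated code).
  excludedPrefixes: string[];
}

interface WriteRequest {
  path: string;
  contents: string;
}

// Deterministic: for a given manifest and path, the gate always gives
// the same answer. It either fires or it does not.
function checkWrite(manifest: Manifest, req: WriteRequest): void {
  if (manifest.excludedPrefixes.some((p) => req.path.startsWith(p))) {
    throw new Error(`write gate: ${req.path} matches an excluded pattern`);
  }
  if (!manifest.allowedFiles.has(req.path)) {
    throw new Error(`write gate: ${req.path} is outside the ticket's scope`);
  }
  // Only now would the write reach disk.
}

// Usage: the gate sits between the agent's proposed edit and the filesystem.
const manifest: Manifest = {
  allowedFiles: new Set(["src/billing/invoice.ts"]),
  excludedPrefixes: ["src/generated/"],
};

checkWrite(manifest, { path: "src/billing/invoice.ts", contents: "..." }); // passes
// checkWrite(manifest, { path: "src/auth/session.ts", contents: "..." }); // throws
```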

Deterministic verification of probabilistic output does converge.

Each rule that moves from “the agent should know this” to “the pipeline enforces this” removes one class of failure from the probabilistic surface entirely.

The tradeoff is specificity. A deterministic gate only catches what it was built to catch. Adding a new rule requires understanding the failure class first. That understanding comes from running the system and observing what slips through, which is the empirical work the pipeline has been doing.

Figure: two-panel comparison. Left, "Probabilistic verification": a primary agent and a verifier agent read the same diff and share blind spots (wrong call sites, intra-module mocks, dead code); the output is a joint probability. Right, "Deterministic verification": registry, language server, and Tree-Sitter feed a single gate with two binary outputs, fire or no fire.

The unit economics question

The throughput argument for swarms is clear. Ten parallel agents produce ten diffs before any verification happens. The cost argument is less often stated explicitly.

A failed run has a cost beyond the API bill. There is the API spend for the run itself, the developer time to review and reject the diff, the time to diagnose why it failed, and the cost of the rerun. In a sequential pipeline that cost is bounded: one run, one reset, one retry. The retry budget is explicit and the failure surfaces with a structured diagnosis.

In a parallel swarm, ten agents running simultaneously means ten times the API spend before any verification happens. If the verification step rejects half the diffs, the cost of those rejected runs does not disappear. It accumulates. And if the rejected diffs require human diagnosis to understand why they failed, because the failure is subtle enough to survive automated gates, the developer time cost multiplies with the agent count.
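A back-of-the-envelope model makes the shape of that cost visible. Every number below is an assumption chosen for illustration; substitute your own:

```typescript
// A rough cost model for one swarm run. All dollar figures and the
// diagnosis time are assumptions; the 20-minute review figure comes
// from the review-math section above.

const apiCostPerRun = 2.0;            // $ per agent run (assumed)
const reviewMinutesPerDiff = 20;      // per the review-math section
const diagnoseMinutesPerReject = 15;  // assumed time to understand a subtle failure
const developerCostPerMinute = 1.5;   // $ (assumed)

function swarmRunCost(agents: number, rejectRate: number): number {
  const api = agents * apiCostPerRun;
  const review = agents * reviewMinutesPerDiff * developerCostPerMinute;
  const rejected = agents * rejectRate;
  const diagnosis = rejected * diagnoseMinutesPerReject * developerCostPerMinute;
  const reruns = rejected * apiCostPerRun; // one retry per rejected diff
  return api + review + diagnosis + reruns;
}

// Ten agents, half the diffs rejected:
console.log(swarmRunCost(10, 0.5).toFixed(2)); // "442.50"
// Under these assumptions the API spend ($30 including reruns) is the
// small term; the developer time to review and diagnose plausible-but-wrong
// output ($412.50) dominates.
```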

This is not a claim about how any particular team runs their swarms. Different teams have different configurations, different retry budgets, different verification layers. The question is worth asking explicitly: what does a failed swarm run cost, end to end, including the human time to verify and reject output that looked plausible but was wrong?

The answer depends entirely on the reliability of the output reaching review. Which brings the argument back to the probabilistic verification problem.

The same question in other domains

The review bottleneck and the probabilistic verification problem are not unique to software engineering. Wherever agents are used to produce consequential output, the same structural question applies: who verifies the output, and what does that verification actually guarantee?

In legal, medical, and financial domains the failure modes are less visible and more expensive. A hallucinated function name breaks a build. A hallucinated legal precedent, a missed drug contraindication, or an incorrect tax calculation may pass every automated check because the equivalent of a “test suite” in those domains is much harder to define. The swarm model applied to those domains inherits the same verification problem at higher stakes.

The architectural answer is the same regardless of domain: deterministic enforcement against a domain-specific source of truth, before the output reaches the person who has to sign off on it. In code that source of truth is the symbol registry. In legal it is the clause registry. In medical it is the protocol and patient record database. The registry changes. The principle does not.
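A sketch of that domain-agnostic shape. The interface, claim types, and registry contents below are illustrative, not an existing implementation:

```typescript
// Domain-agnostic deterministic gating. Illustrative sketch only.

interface SourceOfTruth<Claim> {
  // Deterministic lookup: is this claim present in the authoritative record?
  verify(claim: Claim): boolean;
}

// In code, the claim is "symbol X lives in file Y", checked against a registry.
type SymbolClaim = { name: string; file: string };

const knownSymbols = new Map([["createInvoice", "src/billing/invoice.ts"]]);

const symbolRegistry: SourceOfTruth<SymbolClaim> = {
  verify: ({ name, file }) => knownSymbols.get(name) === file,
};

// The gate is identical regardless of domain: collect violations before
// the output reaches the person who has to sign off.
function gate<C>(truth: SourceOfTruth<C>, claims: C[]): C[] {
  return claims.filter((c) => !truth.verify(c));
}

// "createInvoice" exists in the registry, but not in the file the diff named.
console.log(
  gate(symbolRegistry, [{ name: "createInvoice", file: "src/billing/legacy.ts" }]),
);
// → [{ name: "createInvoice", file: "src/billing/legacy.ts" }]
```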

A future post will cover what that architecture looks like across domains in more detail.

The thesis

Swarm parallelism is a throughput solution applied to a reliability problem.

Generating more output faster does not make any individual unit of output more reliable. It produces more output that requires the same verification overhead per unit. When verification cannot keep pace with generation, reliability either degrades (rubber-stamping) or throughput gains are lost (bottlenecked review).

Probabilistic verification agents do not solve this. They shift the verification burden without eliminating the reliability gap.

The alternative is to invest in deterministic enforcement at the point where the agent makes decisions, so that the output reaching review has already had its structural failure classes eliminated. Less output, more reliable output, and review that is meaningful because the reviewer is not drowning.

This is a thesis based on building one sequential pipeline, not a study of swarm architectures. It may be wrong. But the question it raises, who reviews the swarm and what does that review actually guarantee, seems worth asking before the swarm becomes the default.