Proving a generated test can fail: mutation testing as a sufficiency gate for an AI coding agent

When an AI coding agent writes the code AND the test that checks it, you have a problem that human teams mostly do not. The agent can author a test that passes no matter what the implementation does. It compiles, it runs green, the Reviewer approves, and the test proves nothing. This is a vacuous test, and it is worse than no test, because a green vacuous test reads as evidence. The standard defense against a vacuous test is mutation testing: introduce a deliberate bug into the implementation and confirm the test fails on it. The hard part is making that cheap enough to run on every change the pipeline ships.

The pipeline I work on generates a behavioral oracle for each behavioral change: a unit test derived from the same operation that drives the deterministic code edit, not authored freehand by a language model. To trust that oracle, the assembly stage already runs confirm-RED. It checks out the committed code, reverts the change, runs the new test, and requires it to fail. That proves the test reacts to the change being absent.

confirm-RED proves necessity. It does not prove sufficiency. A test can fail when you delete the whole feature and still pass against a subtly wrong version of it. An oracle that asserts only “the enrichment call is skipped when the flag is set” will go red if you remove the guard entirely, and stay green if the guard skips on the wrong condition, for example one that skips whether or not the flag is set. confirm-RED catches the first. Only mutation testing catches the second.
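The gap between necessity and sufficiency can be sketched in a few lines. This is an illustration only: the three "implementations" and both oracles are hypothetical stand-ins, each reduced to a function that reports whether the enrichment call ran for a given flag value.

```typescript
// Hypothetical stand-ins: each implementation returns true when the
// enrichment call would run for the given flag value.
type Impl = (skipEnrichment: boolean) => boolean;

const committed: Impl = (skip) => !skip; // correct guard: runs unless the flag is set
const guardRemoved: Impl = () => true;   // confirm-RED state: the change is absent
const alwaysSkips: Impl = () => false;   // subtly wrong: skips on every input

// One-sided oracle: checks only the flag-set branch.
const oneSided = (impl: Impl): boolean => impl(true) === false;

// Bivalent oracle: checks both branches.
const bivalent = (impl: Impl): boolean =>
  impl(true) === false && impl(false) === true;

// true means the oracle is green against that implementation.
const results = {
  oneSidedOnCommitted: oneSided(committed),
  oneSidedOnGuardRemoved: oneSided(guardRemoved), // red: necessity is shown
  oneSidedOnAlwaysSkips: oneSided(alwaysSkips),   // green: the wrong version slips through
  bivalentOnAlwaysSkips: bivalent(alwaysSkips),   // red: the second branch catches it
};
```

The one-sided oracle passes confirm-RED and still cannot tell the committed guard from one that never runs the call, which is the case only a mutant exposes.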

So the gate is mutation testing, scoped to one job: take the committed, passing implementation, introduce one targeted wrong version of the change, run the bound oracle, and require it to go red. A surviving mutant means the oracle is insufficient, and it fails the ticket the same way a red test does. The mutant is applied in a sandbox, the oracle runs against it, and the tree is reverted before anything else sees it. Nothing the gate writes is ever committed.
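A minimal sketch of that loop, under loud assumptions: the tree is an in-memory map of path to source text, the oracle is a stand-in function rather than a real test run, and a throwaway copy takes the place of apply-then-revert. Every name here (`Tree`, `Mutant`, `runGate`) is hypothetical.

```typescript
// Hypothetical shapes for illustration; none of these names come from the pipeline.
type Tree = Record<string, string>; // path -> source text

interface Mutant {
  id: string;
  apply: (sandbox: Tree) => void;     // edits the sandbox copy in place
  oracle: (sandbox: Tree) => boolean; // stand-in for the bound test; true = green
}

type Verdict = "killed" | "survived";

function runMutant(committed: Tree, mutant: Mutant): Verdict {
  const sandbox: Tree = { ...committed }; // throwaway copy; the committed tree is never written
  mutant.apply(sandbox);
  return mutant.oracle(sandbox) ? "survived" : "killed";
}

function runGate(
  committed: Tree,
  mutants: Mutant[],
): { pass: boolean; verdicts: Record<string, Verdict> } {
  const verdicts: Record<string, Verdict> = {};
  for (const mutant of mutants) {
    verdicts[mutant.id] = runMutant(committed, mutant);
  }
  // Any survivor fails the ticket, the same way a red test would.
  // Note: an empty mutant list passes trivially in this sketch.
  const pass = mutants.every((m) => verdicts[m.id] === "killed");
  return { pass, verdicts };
}

const PATH = "processItem.ts";
const GUARD = "if (!options?.skipEnrichment) {";
const committedTree: Tree = { [PATH]: `${GUARD}\n  enrichItem(item);\n}` };

const polarityFlip: Mutant = {
  id: "polarity-flip",
  apply: (sandbox) => {
    sandbox[PATH] = (sandbox[PATH] ?? "").replace(
      GUARD,
      "if (!(!options?.skipEnrichment)) {",
    );
  },
  // Stand-in oracle: green only while the original guard text is present.
  oracle: (sandbox) => (sandbox[PATH] ?? "").includes(GUARD),
};

const gateResult = runGate(committedTree, [polarityFlip]);
```

Working on a copy makes "nothing is committed" hold by construction. A gate that mutates a real working tree needs an explicit revert step instead.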

The scope of that job is narrower than “mutate the change.” Mutants land only on the lines the ticket added or changed, even when those lines sit inside a method that also carries older, unrelated code. The gate does not mutate that older code, and a survivor from this gate says exactly one thing: this oracle would not catch a wrong version of the ticket’s own change. It says nothing about whether the rest of the method is correct. Writing mutants for pre-existing lines would mean writing new tests for behavior that has no spec beyond the code itself, characterization tests that pin whatever the method currently does, bugs included, as if it were the contract. That coverage belongs to the target suite, the full-suite run, and the Reviewer, the same three defenses that already cover every file the ticket never touched.
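The delta scope amounts to a filter over candidate mutation sites. The sketch below assumes the ticket's changed lines are available as a per-file list of line numbers. `MutationSite`, `ChangedLines`, and `scopeToDelta` are hypothetical names, and the example line numbers are made up.

```typescript
// Hypothetical shapes; the real pipeline's representation is not shown in this post.
interface MutationSite {
  file: string;
  line: number;
  mutation: string;
}

type ChangedLines = Record<string, number[]>; // file -> lines the ticket added or changed

// Keep only the sites that fall on lines the ticket itself touched.
function scopeToDelta(sites: MutationSite[], changed: ChangedLines): MutationSite[] {
  return sites.filter((site) =>
    (changed[site.file] ?? []).some((line) => line === site.line),
  );
}

const candidates: MutationSite[] = [
  { file: "processItem.ts", line: 12, mutation: "older, unrelated code in the same method" },
  { file: "processItem.ts", line: 42, mutation: "polarity flip on the new guard" },
  { file: "report.ts", line: 7, mutation: "file the ticket never touched" },
];

const inScope = scopeToDelta(candidates, { "processItem.ts": [42, 43, 44] });
```

Only the site on line 42 remains, so a survivor can only ever be a statement about the ticket's own change.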

The cost is deliberately small, which matters because mutation testing has a reputation for being slow enough to skip. The gate runs no language model, by design: an LLM-authored mutant would reintroduce exactly the hallucination surface the pipeline exists to remove. The mutants are derived and applied deterministically, so there is no marginal model cost on top of the build. Each mutant runs only the single oracle it is bound to, one scoped test file that finishes in seconds, not the full target suite. The gate sits at the end of the ticket, after the change is committed so there is real code to mutate, and before the slow full-suite run, so an insufficient oracle fails the ticket in a few short test runs rather than after a full regression pass. Binding each mutant to one oracle is also what makes a kill mean something precise: the test that went red is the exact test whose sufficiency was in question, not some unrelated assertion elsewhere in the suite that happened to trip.

An earlier post, The Behavioral Oracle, derived this behavioral oracle and proved a single one non-vacuous by running two mutants against an isolated copy of the codebase by hand. confirm-RED is the simplest mutant of all: the no-guard case where the implementation does not exist yet, and it already ran on every assembly-stage run. The rest was a manual exercise. This post is about turning that exercise into a gate: an automated stage that runs a derived mutant on every behavioral change, across three different ticket types, and that passed when it should have failed five times before I trusted it.

I did not trust this gate until I had watched it kill a mutant I understood.

“One targeted wrong version of the change” is not generic. Classic mutation testing mutates source lines blindly, which creates two problems at scale: a flood of mutants, and among them a large fraction of equivalent mutants, mutations that change the source text without changing its behavior. An equivalent mutant cannot be killed by any test, so it surfaces as a survivor and either wastes review time or fails the gate for a defect that does not exist. That noise is the main reason mutation testing rarely runs in a real pipeline.
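For a concrete picture of an equivalent mutant, here is a small illustrative case that is not taken from the pipeline. A blind operator swap changes the source text, but because a string's length is never negative, no input can tell the two versions apart.

```typescript
// Original predicate.
const isEmpty = (s: string): boolean => s.length === 0;

// Blind mutant: `===` replaced by `<=`. The text differs, the behavior does not,
// because s.length can never be below zero.
const isEmptyMutant = (s: string): boolean => s.length <= 0;

// No test over strings can kill this mutant.
const samples = ["", " ", "a", "queued", "\n"];
const distinguishing = samples.filter((s) => isEmpty(s) !== isEmptyMutant(s));
```

`distinguishing` is empty for any set of sample strings, so a gate that scored this mutant would report a survivor against correct code.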

Because the pipeline knows which operation wrote the behavior, it derives a mutant that is a real, plausible defect for that exact operation, which keeps the set small and the equivalents rare. The residual equivalents are handled structurally rather than by review: a mutant whose apply changes no bytes is treated as inconclusive and discarded, never counted as a survivor. That is the no-diff check in the failures below, and it is the equivalent-mutant problem reappearing in a system that derives its mutants instead of spraying them. Over three different ticket types, the op-derived approach produced three different mutator families.

Family one: the gating flag (cross-file guard)

The first type threads a boolean flag through a call chain so that one downstream side-effect is skipped when the flag is set. The committed code, anonymised, ends up as a guard around the gated call:

// processItem
if (!options?.skipEnrichment) {
  // ...the enrichment block that calls enrichItem(...)
}

The behavioral oracle is bivalent: with the flag set, assert enrichItem is not called; with the flag clear, assert it is. The mutant that tests sufficiency is a polarity flip. Invert the guard in place so the suppression runs on the wrong condition:

if (!(!options?.skipEnrichment)) {

Now the flag set means the call runs, and the flag clear means it is skipped, the exact opposite of the contract. A sufficient bivalent oracle fails on both branches. The gate ran this against the committed code, the oracle went red, and the mutant was killed. The result was killed=1, with the rest of the ticket green, the full target suite at 962 passing, and both Reviewers approving.
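The polarity flip and the bivalent oracle can be run side by side in a self-contained sketch. It assumes `enrichItem` is injected as a callback so a hand-rolled counter can stand in for a spy. The real code calls `enrichItem` directly and the real oracle uses a test framework.

```typescript
interface Options {
  skipEnrichment?: boolean;
}

// Assumption for the sketch: enrichItem is passed in so it can be observed.
type ProcessItem = (enrichItem: () => void, options?: Options) => void;

const committed: ProcessItem = (enrichItem, options) => {
  if (!options?.skipEnrichment) enrichItem();
};

// The mutant: the guard inverted in place.
const polarityFlip: ProcessItem = (enrichItem, options) => {
  if (!(!options?.skipEnrichment)) enrichItem();
};

// Bivalent oracle: returns one message per failing branch, empty when green.
function bivalentOracle(processItem: ProcessItem): string[] {
  const failures: string[] = [];
  let calls = 0;
  const spy = (): void => {
    calls += 1;
  };

  processItem(spy, { skipEnrichment: true });
  if (calls !== 0) failures.push("flag set: enrichItem was called");

  calls = 0;
  processItem(spy, { skipEnrichment: false });
  if (calls !== 1) failures.push("flag clear: enrichItem was not called");

  return failures;
}

const onCommitted = bivalentOracle(committed);  // no failures: green
const onMutant = bivalentOracle(polarityFlip);  // both branches fail: killed
```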

The gate actually planned four mutants for that one guard: the polarity flip, plus three wrong-target variants that try to suppress a sibling call instead. Three of the four changed no bytes, because the sibling they targeted sat in a position with no guard to invert, so they were discarded as equivalents and never run. That left one real mutant and one kill, where a blind line mutator would have produced dozens, most of them equivalent, and left a human to triage the survivors.

[Figure: code diff showing the guard condition if (!options?.skipEnrichment) changed to if (!(!options?.skipEnrichment)). Two test result lines below both read FAIL: one for the flag-set branch, where enrichItem should not be called, and one for the flag-clear branch, where it should. A final line reads mutant KILLED.]

This run also exposed an honest limit. The same ticket cascades: setting the flag is supposed to also skip a second side-effect, buildReport, in the parent function. But buildReport lives in the same file as its caller, and an intra-file call is structurally un-mockable in Vitest. A spy on the module namespace replaces a property the local call never reads, and re-importing the module returns the real binding with its local references already baked in. There is no unit oracle to mutate, so the gate records that relation as oracle-skipped rather than inventing a vacuous one. That relation is covered by the full suite and the Reviewer instead. Recording an honest skip is the whole point. A gate that manufactured a green oracle there would be the exact failure it exists to prevent.
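The binding problem can be reproduced without any test framework. The sketch below is a plain-TypeScript simulation, not Vitest's actual mechanism: a module is modelled as a closure whose returned object plays the role of the module namespace, and `createModule` and `parent` are hypothetical names.

```typescript
// Simulation only: the closure stands in for a module, the returned object
// for its namespace.
function createModule() {
  function buildReport(): string {
    return "real report";
  }

  // The caller resolves buildReport through its local binding,
  // not through the namespace object.
  function parent(): string {
    return buildReport();
  }

  return { buildReport, parent };
}

const ns = createModule();

// What a spy on the namespace effectively does: replace the property.
ns.buildReport = () => "spy report";

const viaNamespace = ns.buildReport(); // the replacement is visible here
const viaParent = ns.parent();         // the local call never reads the property
```

`viaParent` still returns the real result, so an assertion that the spy "was not called" would pass whether or not the guard exists. That is the vacuous oracle the gate declines to invent.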

Family two: removal (the inverse mutant)

The second type is a removal ticket: take an existing option key out of a call so a downstream behavior stops firing. The oracle here is an inertness assertion. It checks that the call no longer carries the removed key, using a not-called-with-key assertion against the spied call.

The naive mutant, re-running the removal operation, does nothing. The key is already gone from the committed code, so re-removing it is a no-op against the post-removal state and the oracle stays green by default. A no-op mutant that survives is a false survivor, and it would fail the gate for the wrong reason. The mutant that actually tests this oracle is the inverse: re-add the removed key at the call site the inertness assertion watches.

// committed: the key is gone from the call
moveItem(id, {});

// the re-add-key mutant puts it back
moveItem(id, { force: true });

The inertness oracle is the assertion that goes red when the key reappears:

expect(moveItem).toHaveBeenCalledWith(
  id,
  expect.not.objectContaining({ force: true }),
);

If the oracle is sufficient, the key reappearing makes the assertion fail. killed=1. Removal sub-tickets that are covered only by this ticket-level inverse are labelled as an expected skip, not flagged as a missing oracle, because the kill happens once at the chokepoint rather than per sub-ticket.
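The difference between the naive mutant and the inverse mutant can be shown with a hand-rolled recorder in place of a framework spy. This is a simplified stand-in under stated assumptions: `recordCalls`, `CallSite`, and the `"item-1"` id are hypothetical, and the oracle here requires every recorded call to be free of the key, which is stricter than the matcher above.

```typescript
type MoveOptions = { force?: boolean };
type MoveItem = (id: string, options: MoveOptions) => void;
type CallSite = (moveItem: MoveItem) => void;
type RecordedCall = [string, MoveOptions];

// Minimal spy: run the call site and record every call it makes.
function recordCalls(callSite: CallSite): RecordedCall[] {
  const calls: RecordedCall[] = [];
  callSite((id, options) => {
    calls.push([id, options]);
  });
  return calls;
}

const committedSite: CallSite = (moveItem) => moveItem("item-1", {});

// Naive mutant: re-running the removal leaves the committed code unchanged.
const reRemoveMutant: CallSite = committedSite;

// Inverse mutant: the removed key is put back.
const reAddKeyMutant: CallSite = (moveItem) => moveItem("item-1", { force: true });

// Inertness oracle: green only if the call happened and never carried the key.
const inertnessOracle = (calls: RecordedCall[]): boolean =>
  calls.length > 0 && calls.every(([, options]) => options.force !== true);

const verdict = (site: CallSite): "killed" | "survived" =>
  inertnessOracle(recordCalls(site)) ? "survived" : "killed";

const naive = verdict(reRemoveMutant);   // survives, but only because nothing changed
const inverse = verdict(reAddKeyMutant); // killed: the oracle reacts to the key
```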

Family three: greenfield endpoint (contract mutation)

The third type builds a brand-new endpoint that returns a per-status count object, something like { queued: 3, active: 1, done: 7 }. There is no guard to flip and nothing to remove. The oracle asserts the response contract: each status key present, each count correct. The mutant zeroes one member of the contract. If the oracle is sufficient, a count that should be non-zero coming back as zero fails the assertion. The first version of this oracle could only verify the default status, because the status enum is server-assigned and the test could not seed a row into the other states. That gap was closed by seeding one record into every status through the target’s own create-then-update idiom and asserting each count is one.

// the oracle, after seeding one record into every status
expect(body).toEqual({ queued: 1, active: 1, done: 1 });

// the mutant zeroes one member; the handler now returns
// { queued: 1, active: 0, done: 1 }, and the assertion fails on `active`

With that, the gate produced killed=3, survived=0 against three contract mutants, full suite 956 passing.
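The three contract mutants can be sketched as one mutator applied per key. The handler below is a hypothetical stand-in that returns the seeded counts directly, and the oracle is a simplified per-key check rather than the deep-equality assertion above.

```typescript
type Counts = { queued: number; active: number; done: number };
type Handler = () => Counts;

// Stand-in for the endpoint after one record is seeded into every status.
const committedHandler: Handler = () => ({ queued: 1, active: 1, done: 1 });

// Contract mutant: wrap the handler so one member comes back as zero.
function zeroMember(handler: Handler, key: keyof Counts): Handler {
  return () => {
    const body: Counts = { ...handler() };
    body[key] = 0;
    return body;
  };
}

const KEYS: (keyof Counts)[] = ["queued", "active", "done"];

// Simplified oracle: every status count must be exactly one.
function contractOracle(handler: Handler): boolean {
  const body = handler();
  return KEYS.every((key) => body[key] === 1);
}

const tally = { killed: 0, survived: 0 };
for (const key of KEYS) {
  if (contractOracle(zeroMember(committedHandler, key))) tally.survived += 1;
  else tally.killed += 1;
}
```

In this sketch, zeroing a member can only be detected when that member is non-zero to begin with, which is one reason seeding a record into every status matters.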

The five times the gate passed when it should have failed

A mutation gate is itself a quality gate, which means it can be vacuous in exactly the way it exists to catch. Mine was, five separate times, and every time it announced a confident result while testing nothing.

On the removal type, the gate’s mutant inputs were read before they were written. The step that freezes what to mutate ran at the top of the assembly stage, and the step that produces those facts ran later in the same cycle, so the gate saw an empty plan and passed vacuously. Freezing the plan once, at the point the facts exist, and reading from there closed it.

On the flag type, the gate re-derived its plan from the sub-tickets at the end of the run, after execution had already consumed the operation that wrote the guard. The relation with a real oracle had become invisible, so the gate only ever saw the one un-mockable same-file relation and passed. Capturing the plan at the moment the operation is still present, and carrying it forward rather than reconstructing it from a tree that execution had mutated, closed this.

Then a subtler one. With the plan now captured, the gate planned the right mutant, but applying it failed because the captured operation was incomplete: a field populated during normal execution was absent in the captured form. The apply reported zero changes, the gate treated that as “no diff, nothing to test,” and discarded it silently. The run logged killed=0 again. This one hid from every isolated reproduction I wrote, because my hand-built test operations always included that field where the real captured one lacked it. It only showed up in a full assembly-stage run, and only because I had added a line that logs every discarded mutant with the reason. Filling the field when capturing the operation resolved it.

And one false survivor in the other direction. A wrong-target mutant that tries to wrap a call with no surrounding guard makes no change, but the apply reports success, so the oracle ran against unmutated code, passed, and the mutant was recorded as survived. That would fail the gate for a defect that does not exist. Confirming a real working-tree diff before trusting any apply, and treating a no-change mutant as inconclusive rather than a survivor, closed this.
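A minimal sketch of the no-diff check and the discard log together, assuming mutators are pure functions over source text that report their own success flag. `applyVerified`, `Mutator`, and the `skipNotify` guard are hypothetical, introduced only for illustration.

```typescript
interface ApplyResult {
  ok: boolean;   // what the apply step claims
  after: string; // the source text it produced
}
type Mutator = (source: string) => ApplyResult;

type Prepared =
  | { kind: "ready"; mutated: string }
  | { kind: "discarded"; reason: string };

// Kept on permanently: every discarded mutant is logged with its reason.
const discardLog: string[] = [];

function applyVerified(id: string, before: string, mutate: Mutator): Prepared {
  const result = mutate(before);
  let reason: string | null = null;
  if (!result.ok) reason = "apply failed";
  else if (result.after === before) reason = "no diff"; // do not trust ok alone

  if (reason !== null) {
    discardLog.push(`${id}: discarded (${reason})`);
    return { kind: "discarded", reason };
  }
  return { kind: "ready", mutated: result.after };
}

const source = "if (!options?.skipEnrichment) {\n  enrichItem(item);\n}";

const flipGuard: Mutator = (s) => ({
  ok: true,
  after: s.replace("if (!options?.skipEnrichment) {", "if (!(!options?.skipEnrichment)) {"),
});

// Wrong-target mutant: looks for a guard that is not there, changes nothing,
// and still reports success.
const wrongTarget: Mutator = (s) => ({
  ok: true,
  after: s.replace("if (!options?.skipNotify) {", "if (!(!options?.skipNotify)) {"),
});

const flipped = applyVerified("polarity-flip", source, flipGuard);
const missed = applyVerified("wrong-target", source, wrongTarget);
```

Only `flipped` reaches the oracle. `missed` is inconclusive and leaves one line in `discardLog`, so a silent discard becomes something a full run can surface.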

The fifth was the opposite problem: a result that looked like a kill but was not one. A malformed mutant, one whose applied form does not even compile, exits with a failure code, and a naive predicate reads any non-zero exit as the test catching the bug. That credits the oracle for a kill it never made. Closing it meant no longer trusting the exit code and reading the actual test output instead: a kill requires a parsed assertion count showing a real test failed, not just a process that failed. A compile error, an import error, zero tests collected, none of those are a kill. They go in a third bucket, inconclusive, and get discarded rather than scored either way.
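The three-bucket predicate can be sketched against an already-parsed run report. The `RunReport` fields are assumptions made for this example. How they are extracted from the test runner's output is outside the sketch.

```typescript
// Hypothetical parsed form of one oracle run against one mutant.
interface RunReport {
  exitCode: number;
  compileError: boolean;  // the mutated code or the test failed to build or import
  testsCollected: number; // how many tests actually ran
  testsFailed: number;    // how many of those failed an assertion
}

type Bucket = "killed" | "survived" | "inconclusive";

// Naive predicate: any non-zero exit is credited to the oracle.
const naiveClassify = (r: RunReport): Bucket =>
  r.exitCode !== 0 ? "killed" : "survived";

// A kill needs evidence that a real test ran and failed.
function classify(r: RunReport): Bucket {
  if (r.compileError || r.testsCollected === 0) return "inconclusive";
  if (r.testsFailed > 0) return "killed";
  if (r.exitCode === 0) return "survived";
  return "inconclusive"; // the process failed, but no test did
}

const malformedMutant: RunReport = { exitCode: 1, compileError: true, testsCollected: 0, testsFailed: 0 };
const realKill: RunReport = { exitCode: 1, compileError: false, testsCollected: 2, testsFailed: 2 };
const greenRun: RunReport = { exitCode: 0, compileError: false, testsCollected: 2, testsFailed: 0 };

const buckets = {
  naiveOnMalformed: naiveClassify(malformedMutant), // false kill
  malformed: classify(malformedMutant),
  realKill: classify(realKill),
  greenRun: classify(greenRun),
};
```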

The pattern across all five is the same. A confident-looking result that has not been checked against what actually ran is the most dangerous output this gate can produce, because it looks like a pass or a kill either way. I keep the discarded-mutant log on permanently now. It is the line that turned three days of “the gate passes but I do not believe it” into a one-line root cause.

A later post found the mirror failure: an equivalent mutant no real input could kill, which failed a ticket whose code was correct.

What is still open

Threading sub-tickets get no mutant. The available mutations for an add-a-field-to-callers operation are either idempotent on already-threaded code or add-without-remove, which leaves the asserted call site intact and produces a false survivor. The mutant that would actually test threading sufficiency is “drop the threaded argument,” an inverse mutator that does not exist yet. The gate records those as a known skip rather than pretending to cover them. That is the next mutator to build, and until it exists, threading sufficiency rests on the oracle’s positive assertions and confirm-RED, not on a kill.

This is still R&D. But the pattern holds: prove necessity with confirm-RED, prove sufficiency by mutation, derive the mutant from the operation that wrote the behavior so it is a real defect rather than line noise, and record a known skip wherever no sufficient mutant exists. The hardest part was accepting that the gate built to catch vacuous tests had to be held to the same standard, and watching it fail that standard five times before it earned the green.

The pipeline runs inside Docker on real tickets against a ~100k-line TypeScript monorepo. Updated 2026-07-23 with a fifth false-pass and a note on the gate’s delta scope.