The Quality Gate That Passed When It Failed


The optimization made every metric worse.

The L2 quality check (the gate that verifies every acceptance criterion in a ticket maps to at least one test) was running a small Sonnet call: just under 7,000 tokens in, eight tokens out.

Swapping it to Haiku produced this:

```
test_quality_l2  in=6,866  out=5,027  31.8s
WARNING: no closing fence — output may be truncated or malformed
```

5,027 output tokens. 31.8 seconds. And the gate passed anyway, silently on every run, because the parsing code couldn’t make sense of the output and treated a parse failure as a clean result.

That is the shape of the problem. The rest of this post is about why it happened, what replaced the LLM call entirely, and what it says about where LLMs belong in a pipeline and where they don’t.

The gate

The pipeline has a layered quality gate that runs between the Test Writer and the Coder. It checks the generated tests before any implementation code is written, and it operates in three tiers.

L1 is about import scope. The tests must only import from files listed in the ticket manifest. This is enforced through a prompt rule rather than code, because static import analysis across TypeScript barrel exports and path aliases produces too many false positives to be useful.

L2 is about coverage. Every acceptance criterion in the ticket must map to at least one test description. If a criterion has no matching test, the Test Writer is asked to retry before the Coder sees the file.

L3 is about assertion depth. A test file full of .toBeDefined() calls is not useful. If the file contains only weak assertions and no strong ones, it fails.

L1 and L3 are both deterministic. L1 is a prompt rule, L3 is a regex pass over the test content. L2 was the exception, and that is what this post is about.
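
To make the deterministic tiers concrete, here is a minimal sketch of what an L3-style assertion-depth pass could look like in TypeScript. The assertion lists and the pass/fail rule are illustrative, not the pipeline's actual rules:

```typescript
// Illustrative weak vs. strong assertion patterns (Jest-style).
const WEAK_ASSERTIONS = /\.(toBeDefined|toBeTruthy|not\.toBeNull)\(/g;
const STRONG_ASSERTIONS = /\.(toBe|toEqual|toMatchObject|toHaveBeenCalledWith|toThrow)\(/g;

function assertionDepthCheck(testSource: string): { pass: boolean; reason?: string } {
  const weak = testSource.match(WEAK_ASSERTIONS)?.length ?? 0;
  const strong = testSource.match(STRONG_ASSERTIONS)?.length ?? 0;
  // Fail only when the file has assertions and all of them are weak.
  if (strong === 0 && weak > 0) {
    return { pass: false, reason: `only weak assertions found (${weak})` };
  }
  return { pass: true };
}
```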

[Diagram: pipeline quality gate flow — Test Writer → L1 import scope check → L2 criteria mapping check → L3 assertion depth check → Coder]

The original design

The problem L2 is solving is that acceptance criteria and test descriptions are often written in different language. A criterion might say:

Return 404 when the post is not found.

The test might be called “handles missing resource.” Those mean the same thing, but a keyword match would miss the connection.

The decision was to use a small LLM call for the mapping. Send the list of criteria and the list of test descriptions, ask the model which criteria have no matching test, get back a JSON object.

The prompt template is around 300 tokens; with the ticket's criteria and test descriptions included, the full input runs to just under 7,000. The expected response is {"missing": []} or {"missing": ["Return 404 when post not found"]}. The check passes or fails, and either the Test Writer retries or the pipeline moves on to the Coder.
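
As a sketch, the call's contract looks something like this. The names and prompt wording are illustrative, not the pipeline's actual prompt:

```typescript
// The expected response shape: criteria with no matching test.
interface L2Result {
  missing: string[];
}

// Build the L2 prompt from the ticket's criteria and the generated
// test descriptions. The instruction text here is an illustration.
function buildL2Prompt(criteria: string[], testDescriptions: string[]): string {
  return [
    "For each acceptance criterion, decide whether any test description covers it.",
    'Respond with only a JSON object of the form {"missing": [...]}.',
    `Criteria:\n${criteria.map((c) => `- ${c}`).join("\n")}`,
    `Test descriptions:\n${testDescriptions.map((t) => `- ${t}`).join("\n")}`,
  ].join("\n\n");
}
```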

On the first successful proof run of a pagination ticket, the Sonnet call for this check looked like this:

```
test_quality_l2  in=6,857  out=8  3.6s
```

Eight output tokens. It returned {"missing": []}, the gate passed, and the pipeline continued. Total cost: negligible. Total time: 3.6 seconds out of a 68-second run.

The optimization

The reasoning was: this is a binary yes/no task. The model produces eight tokens. Sonnet is the most capable model in the pipeline and it is doing nothing interesting here. Switch it to Haiku. Same task, fraction of the cost, probably faster.

The next run:

```
test_quality_l2  in=6,866  out=5,027  31.8s
WARNING: no closing fence — output may be truncated or malformed
```

Haiku produced 5,027 output tokens for a task that should produce eight. It took 31.8 seconds. The warning in the logs meant the model had wrapped its output in a code fence and either failed to close it or produced content long enough that something downstream broke.

What happened: Haiku did not follow the output format instruction. Instead of returning a compact JSON object, it produced a long analysis:

Let me examine each criterion in turn. The first criterion states that… The test descriptions include… Based on this analysis…

Then, somewhere in that output, the actual JSON. Then more text.

The parsing code caught the failure and treated it as a clean result. The gate passed. The pipeline continued to the Coder. The quality check had been silently disabled.

The failure mode matters more than the failure

Haiku performing worse than Sonnet is not the point. The shape of the failure is.

When the L2 check fails to parse the model’s response, the code treats it as a pass. The reasoning behind that design was to avoid false positives: if the model returns something unexpected, it is better to let the pipeline continue than to block it on a malformed response. That reasoning is sound when the model occasionally produces slightly malformed JSON on an otherwise correct response.

It breaks down completely when the model consistently produces the wrong format, because then every run silently bypasses the quality gate. The pipeline logs show [agent call] for the L2 stage, a result appears, the gate passes. Nothing in the output tells you the check did not actually run. You would only know by looking at the token count and noticing 5,027 output tokens where you expected eight.
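
In code, the anti-pattern looks roughly like this. This is a sketch, not the pipeline's actual parser; extractJson is a hypothetical stand-in for whatever pulls JSON out of the raw model text. The fail-closed variant is what the reasoning above argues for:

```typescript
// Hypothetical helper: grab the first {...} span in raw model output.
function extractJson(raw: string): string {
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) throw new Error("no JSON object found");
  return match[0];
}

// The anti-pattern: a parse failure is promoted to a pass.
function parseL2Open(raw: string): { pass: boolean } {
  try {
    const result = JSON.parse(extractJson(raw));
    return { pass: result.missing.length === 0 };
  } catch {
    return { pass: true }; // silent pass on malformed output
  }
}

// Fail closed instead: a malformed response is a loud failure.
function parseL2Closed(raw: string): { pass: boolean; error?: string } {
  try {
    const result = JSON.parse(extractJson(raw));
    if (!Array.isArray(result.missing)) throw new Error("no missing array");
    return { pass: result.missing.length === 0 };
  } catch (e) {
    return { pass: false, error: `unparseable L2 response: ${e}` };
  }
}
```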

The Haiku switch did not just cost more time and money than Sonnet. It made the quality gate invisible.

A check that passes when it fails is worse than no check at all, because no check at all is at least honest about what it is doing.

Why the fix is not “use a better model”

The obvious response is to switch back to Sonnet. That works. The Sonnet call returned the right format. But the problem runs deeper than model selection.

The L2 check is answering a structural question: does every criterion have at least one test description that covers it? LLMs are good at understanding whether two sentences mean the same thing. They are less reliable at following strict output format constraints, and the failure mode when they do not is a silent pass rather than an explicit error.

The right tool for this task is not semantic understanding. It is token overlap.

Token overlap

The replacement compares meaningful terms extracted from each acceptance criterion against meaningful terms extracted from each test description. A criterion is considered covered if it shares enough terms with at least one test description to suggest they are addressing the same thing.

This approach has real limitations. A criterion phrased very differently from its corresponding test description will not match. But when acceptance criteria and test descriptions are both generated by the same pipeline for the same ticket, the language tends to be close enough that term overlap catches the obvious gaps.

The failure mode is explicit. If a criterion has no matching test, the check fails with a message listing the unmatched criteria. There is no model output to parse, no JSON to validate, no fallback that silently promotes a failure to a pass. The check either finds a match or it does not.
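
A minimal sketch of the idea, with an illustrative tokenizer and stopword list; minShared is the threshold the next section is about:

```typescript
// Illustrative stopword list; the real list would be longer.
const STOPWORDS = new Set([
  "the", "a", "an", "is", "are", "when", "and", "or", "to", "of", "in", "it", "that",
]);

// Lowercase, split on non-alphanumerics, drop stopwords and short tokens.
function meaningfulTerms(text: string): Set<string> {
  return new Set(
    text
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((t) => t.length > 2 && !STOPWORDS.has(t)),
  );
}

// A criterion is covered if some test description shares at least
// minShared meaningful terms with it. Returns the uncovered criteria.
function uncoveredCriteria(
  criteria: string[],
  testDescriptions: string[],
  minShared = 1,
): string[] {
  const testTerms = testDescriptions.map(meaningfulTerms);
  return criteria.filter((criterion) => {
    const terms = meaningfulTerms(criterion);
    return !testTerms.some(
      (t) => [...terms].filter((term) => t.has(term)).length >= minShared,
    );
  });
}
```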

The threshold problem

Shipping the token overlap approach immediately exposed another edge case. The initial threshold was calibrated to require more than one shared term, on the reasoning that a single common domain word is too coincidental to count as real coverage.

That broke on the very next run. One of the ticket’s criteria was a regression check:

Existing POST /api/posts behaviour is unchanged.

The meaningful terms in that sentence are “existing”, “behaviour”, “unchanged”, and “posts”. Test descriptions for a pagination feature will not use “unchanged” or “behaviour”. The only overlap with any pagination test was “posts”: one term, not enough to meet the threshold. The check failed, the Test Writer retried, failed again, and the pipeline stopped.

Dropping the threshold fixed it. The check now flags criteria with zero term overlap. A single shared domain word is enough to say “there is at least a test in this area.” The regression criterion passes because any pagination test mentioning “posts” shares that term.

This is a known weakening of the check. In a posts API, “posts” appears in almost every criterion and almost every test description. A criterion sharing only “posts” with a test is barely verified at all. But the alternative is false positives that block the pipeline on legitimate test files, which is worse.
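
To see the calibration problem in miniature, run the uncoveredCriteria sketch from above against the regression criterion. The test description here is an invented pagination test, not one from the real run:

```typescript
const criteria = ["Existing POST /api/posts behaviour is unchanged."];
const tests = ["returns the first page of posts with a nextCursor"];

// With minShared = 2, the single shared term "posts" is not enough:
uncoveredCriteria(criteria, tests, 2); // ["Existing POST /api/posts behaviour is unchanged."]

// With the final zero-overlap rule (minShared = 1), it passes:
uncoveredCriteria(criteria, tests, 1); // []
```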

What this is and what it is not

The token overlap check is a stopgap. It is better than the Haiku LLM call because it fails explicitly and runs in under a millisecond. It is worse than the Sonnet LLM call because it cannot handle vocabulary differences between criteria and test descriptions, and the threshold calibration is fragile.

The right long-term answer is probably embeddings. Compute a vector for each acceptance criterion and each test description, check cosine similarity, flag criteria with no test description above a similarity threshold. That handles synonyms, stemming, and phrasing differences properly without touching an LLM in the hot path. A small local embedding model would work fine for ten to thirty short strings.
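
A sketch of that version, assuming a hypothetical local embed() function that returns one vector per string; the 0.7 threshold is a guess, not a calibration:

```typescript
// Assumed: a small local embedding model behind this signature.
declare function embed(text: string): Promise<number[]>;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Flag criteria with no test description above the similarity threshold.
async function uncoveredBySimilarity(
  criteria: string[],
  testDescriptions: string[],
  threshold = 0.7,
): Promise<string[]> {
  const testVecs = await Promise.all(testDescriptions.map(embed));
  const out: string[] = [];
  for (const criterion of criteria) {
    const v = await embed(criterion);
    if (!testVecs.some((t) => cosine(v, t) >= threshold)) out.push(criterion);
  }
  return out;
}
```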

That is not built yet. For now, the check catches the obvious case: a criterion with no test in the same general area at all. The Reviewer stage, which reads the full diff and test output at the end, is the real backstop for coverage gaps that L2 misses. L2 is a cheap early signal, not a guarantee.

What this run actually cost

The comparative numbers for the same ticket, same fixture:

|                               | Run 1 (Sonnet) | Run 2 (Haiku) | Run 3 (token overlap) |
|-------------------------------|----------------|---------------|-----------------------|
| test_quality_l2 output tokens | 8              | 5,027         | 0                     |
| test_quality_l2 duration      | 3.6s           | 31.8s         | <1ms                  |
| Total runtime                 | 68s            | 95s           | n/a¹                  |
| Gate actually ran             | yes            | no            | yes                   |

¹ The run that validated the token overlap replacement hit a separate blast-radius failure: the Coder changed a function signature that broke two callers the Planner had not included in scope. That added a Debugger retry cycle and 74 seconds. It is a known pre-LSP problem, not a symptom of the L2 change, so the total is not comparable to runs 1 and 2.

Run 2 took 27 seconds longer, spent more tokens, and did not actually check anything. Run 3 spent no tokens, took no measurable time, and the gate ran correctly. The optimization made every metric worse. The replacement made every metric better.

The generalizable part

When you have a task that requires a specific output format and the penalty for a format violation is a silent pass, the model’s tendency toward verbosity becomes a correctness problem rather than a quality issue. Smaller models tend to be more verbose, not less. They hedge more, explain more, and are less reliable at following strict structural constraints when those constraints conflict with their training toward helpfulness.

The right question before adding an LLM call to a pipeline is not “which model is cheapest for this?” It is “what happens when the model gets this wrong?” If the answer is “the check passes anyway,” that is not a check. It is a logged no-op with latency attached.

For semantic matching between human-written text, LLMs are the right tool. For structural verification of whether a set of tokens overlaps with another set, they are not. The acceptance criteria mapping check is the second problem, dressed up as the first.


Numbers are from real runs against a TypeScript blog API fixture. The pipeline runs inside Docker with no human in the loop between ticket input and reviewed, committed code.