The first version of the preflight check logged a warning and continued.
Zero test files detected. Warning written to the log. All five agents proceed.
That felt reasonable. Greenfield projects have no tests yet. The pipeline is designed for exactly that situation: a ticket comes in, the Test Writer writes failing tests, the Coder implements against them, the pipeline proves the code works. Why block a greenfield project at the gate?
The answer is in what the Test Writer does before it writes a single test.
What the Test Writer reads first
Before producing any test output, the Test Writer reads the existing test suite. Not to check coverage or find gaps. To learn conventions.
It is looking at things like: how tests in this project are structured (describe/it nesting, test vs it preference), how imports are written (relative paths, barrel exports, path aliases), what assertion style is used (expect(res.status).toBe(200) vs expect(res).toMatchObject({status: 200})), how async tests are handled (async/await vs .resolves/.rejects), whether tests use a shared server setup or create a fresh instance per test, and whether there are common mock patterns for the database or external services.
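To make the last two of those concrete, here is a sketch of the two server-setup patterns the Test Writer has to tell apart. The import paths and export names are hypothetical:

import request from 'supertest'
// hypothetical exports; a real project typically exposes one or the other
import { app, createApp } from '../lib/server'

// Pattern A: one shared app instance for every test in the file
describe('posts (shared app)', () => {
  it('lists posts', async () => {
    const res = await request(app).get('/api/posts')
    expect(res.status).toBe(200)
  })
})

// Pattern B: a fresh app instance per test
describe('posts (fresh app)', () => {
  let freshApp: ReturnType<typeof createApp>
  beforeEach(() => {
    freshApp = createApp()
  })
  it('lists posts', async () => {
    const res = await request(freshApp).get('/api/posts')
    expect(res.status).toBe(200)
  })
})

Either pattern is fine in isolation; the point is that a generated test using the wrong one will fight the rest of the suite.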
These are not things the Test Writer guesses from the ticket. They are things it reads from the codebase. The quality of its output is directly bounded by the quality of the examples it can observe.
With zero examples, it cannot observe anything. It writes tests that look structurally plausible based on its training data. For a TypeScript project, those tests will probably use Jest syntax, probably use describe blocks, probably have reasonable imports. They will be internally consistent. They just will not match the actual conventions of the project, because there is nothing to match against.
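As a purely hypothetical illustration, convention-blind output might look like this: internally consistent Jest, with a guessed import path and a guessed assertion style:

// Hypothetical Test Writer output with zero examples to read.
// Every convention below is a guess: the relative import path,
// the assertion style, the response shape.
import request from 'supertest'
import { app } from '../../src/app' // guessed; the project may export from '@/lib/server'

describe('POST /api/posts', () => {
  it('creates a post', async () => {
    const res = await request(app).post('/api/posts').send({ title: 'hello' })
    expect(res).toMatchObject({ status: 201 }) // guessed; the project may use expect(res.status).toBe(201)
  })
})

Nothing here is broken. It is simply anchored to nothing.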
Why that matters downstream
The Coder receives the test file the Test Writer produced. It implements against those tests. If the Test Writer used relative import paths like '../../lib/posts' but the actual project uses path aliases like '@/lib/posts', the Coder will write implementation code that follows the pattern in the tests. The tests pass. TypeScript might not complain. The pipeline reports success.
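For context, that alias style usually comes from a paths mapping in tsconfig.json along these lines (a sketch; the actual mapping is project-specific):

// tsconfig.json (sketch)
{
  "compilerOptions": {
    "baseUrl": ".",
    "paths": {
      "@/*": ["./src/*"]
    }
  }
}

The test runner typically needs a matching mapping as well (Jest's moduleNameMapper, for example). That wiring lives in config files and existing tests, which is exactly where the Test Writer would have read it from, had there been anything to read.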
The problem surfaces later, when the code is reviewed or deployed, or when another ticket’s Coder follows the same conventions and adds to the divergence. Or it does not surface at all, and the codebase accumulates a pattern that does not match the rest of the project.
The key point: no stage in the pipeline flags this. The Test Writer did not fail. The Coder did not fail. The tests passed. The Reviewer sees a green run and approves. The pipeline produced output that is wrong in a way that is invisible to every gate it passed through.
Why a warning does not catch this
A warning in the log does not interrupt the pipeline. In a pipeline running inside Docker with no human in the loop, a warning at the preflight stage is visible only if someone reads the full run output afterwards. The pipeline still runs. Tokens are still spent. The Test Writer still generates tests from guesswork. The code still gets committed.
The point of the preflight check is to stop before anything is spent. Zero test files means the Test Writer is about to operate without the input it needs to do its job correctly. Logging that as a warning and continuing is the same as logging “warning: fuel gauge reads empty” and continuing the drive. The warning is accurate. The outcome is unchanged.
A hard stop changes the outcome. The pipeline halts with a specific, actionable error before any agent runs, before any tokens are spent, before any code is written. Nothing is committed. The state of the repository is unchanged.
Pre-flight FAILED: no test files found.
Add at least one bootstrap test before running the loop.
That error costs nothing to produce and nothing to act on.
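A minimal sketch of such a gate, assuming a Node-based pipeline and a conventional *.test.ts naming scheme (both are assumptions; the real check may look different):

// preflight.ts (sketch): halt the pipeline if the repo has zero test files
import { readdirSync, statSync } from 'node:fs'
import { join } from 'node:path'

function countTestFiles(dir: string): number {
  let count = 0
  for (const entry of readdirSync(dir)) {
    if (entry === 'node_modules' || entry === '.git') continue
    const full = join(dir, entry)
    if (statSync(full).isDirectory()) {
      count += countTestFiles(full)
    } else if (/\.test\.tsx?$/.test(entry)) {
      count++
    }
  }
  return count
}

if (countTestFiles(process.cwd()) === 0) {
  console.error('Pre-flight FAILED: no test files found.')
  console.error('Add at least one bootstrap test before running the loop.')
  process.exit(1) // hard stop: no agent runs, no tokens spent
}

Exit code 1 is the entire mechanism. Everything after the preflight stage only runs if this script never reaches process.exit.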
The bootstrap test
The correct response to a hard stop is not to weaken the gate. It is to add one test.
Not a full suite, not comprehensive coverage. One test that establishes the conventions the Test Writer needs to learn from.
For a TypeScript API project, something like this is sufficient:
import request from 'supertest'
import { app } from '../lib/server'

describe('smoke', () => {
  it('API responds', async () => {
    const res = await request(app).get('/api/health')
    expect(res.status).toBe(200)
  })
})
That single test tells the Test Writer several things: the project imports from '../lib/server', it uses supertest for HTTP testing, it uses describe/it blocks, it uses async/await, it accesses the response status with res.status, and it uses expect().toBe(). That is enough context to write tests that match the project’s conventions.
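With those conventions in view, a later ticket's tests can plausibly come out matching all of them. A hypothetical example (the endpoint and payload are invented):

// Hypothetical output for a later ticket, mirroring the bootstrap conventions
import request from 'supertest'
import { app } from '../lib/server'

describe('posts', () => {
  it('creates a post', async () => {
    const res = await request(app).post('/api/posts').send({ title: 'hello' })
    expect(res.status).toBe(201)
  })
})

Same imports, same structure, same assertion style: the divergence described above never gets a chance to start.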
One small test file is not a burden. It is a prerequisite. A pipeline that can infer conventions from one example is more useful than one that runs without examples and produces output that looks correct but is not.
Where hard lines belong
The preflight check is one instance of a general decision: where in a system does a hard stop produce a better outcome than a warning?
The answer depends on what the warning allows through. If the warning lets through a run that produces output which looks correct but is wrong in a way the downstream gates cannot detect, the warning is not a softer version of a hard stop. It is a mechanism for producing undetectable failures in exchange for the appearance of flexibility.
For the Test Writer, the downstream gates are the test results themselves. If the tests pass but were written from invented conventions, the tests are not a valid check. They are a tautology: the Coder implemented against what the Test Writer wrote, so of course the tests the Test Writer wrote pass. The quality gate is only meaningful if the tests were written from real conventions that reflect what the code should actually do.
The hard stop exists because a loud error at startup is recoverable. A quiet failure that passes every gate is not.
One honest caveat: the hard stop checks count, not quality. A project with ten poorly written tests will satisfy the check. The Test Writer will learn from whatever examples exist, good or bad. The gate establishes a floor (at least one example), not a ceiling. If the existing tests are misleading, the Test Writer will be misled by them. The gate cannot fix that. It can only ensure that the pipeline does not run with no examples at all.
The pipeline runs inside Docker on real tickets. Still R&D.