A codebase that works today is not the same codebase an AI agent was trained to understand. It has been refactored, extended, renamed. Functions that used to live in one file now live in three. An interface that used to take two arguments takes five. The agent does not know this. It works from whatever it last saw, or from a general model of how code tends to be structured, and it produces output that is plausible but wrong in the specific ways that only someone who knows this particular codebase would catch.
This is not a model capability problem. It is a context problem. The model is not failing because it lacks intelligence. It is failing because it is reasoning about the wrong version of the codebase.
I started building a pipeline to see if this could be fixed.
The core idea is that before any agent writes a single line of code, the system builds a live index of every symbol in the codebase: every function, every type, every interface, the file it lives in, the line it starts on. This index is rebuilt before every task. Agents do not guess at where things are. They query the index and get a precise, current answer.
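The post doesn't show the indexer itself, but the idea is simple enough to sketch. Here is a minimal, hypothetical version for a Python tree, using the standard `ast` module: walk every source file, record each function and class with its file and starting line, and rebuild from scratch each time so the answer is always current. The function name `build_index` and the dict shape are my own illustration, not the project's actual API.

```python
import ast
from pathlib import Path

def build_index(root: str) -> dict[str, dict]:
    """Index every function and class in a Python tree:
    symbol name -> {"file": path, "line": start line}.

    Rebuilt from scratch on every call, so agents query the
    codebase as it is now, not as it was at training time.
    """
    index: dict[str, dict] = {}
    for path in Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                index[node.name] = {"file": str(path), "line": node.lineno}
    return index
```

An agent asking "where does `parse_config` live?" then gets one precise record back instead of guessing from a stale mental model. A real implementation would also track signatures and handle name collisions across modules; this sketch only shows the shape of the idea.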
The token savings of a smaller context are a side effect of this, not the goal. The goal is correctness. A precise context produces better output than a large one, on the same model, every time.
The other thing I knew from shipping real software: the environment has to be clean before anyone starts work. You would not onboard a new engineer onto a project where the tests are already failing. The pipeline applies the same standard. Before any agent runs, the baseline is confirmed green. If it is not, the pipeline stops. No agent time, no API cost, no output to review. The problem was already there before the pipeline touched anything.
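That gate is cheap to build. A sketch of what such a preflight check might look like, assuming the project exposes its test suite as a single command (the function names here are illustrative, not the pipeline's real interface):

```python
import subprocess
import sys

def baseline_is_green(test_cmd: list[str]) -> bool:
    """Run the project's test suite before any agent starts.

    A non-zero exit code means the baseline is already broken,
    and nothing the pipeline does should be blamed for it.
    """
    result = subprocess.run(test_cmd, capture_output=True, text=True)
    return result.returncode == 0

def preflight(test_cmd: list[str]) -> None:
    """Abort the whole run, before any agent time or API cost,
    if the starting state is not clean."""
    if not baseline_is_green(test_cmd):
        sys.exit("baseline not green: fix the failing tests before running agents")
```

The point of running this first is ordering: a failing test discovered after an agent has made changes is ambiguous, while a failing test discovered before proves the problem predates the pipeline.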
These two ideas, precise context and a clean starting state, are where the project began. Whether they hold up at scale is what the rest of this series is about.
Still R&D. The project is running on a test fixture. The harder tests are ahead.