A previous post argued that an AI coding agent should stop asking the model what the code already knows and supply those facts from a symbol registry, a Tree-Sitter parse, and a language server (LSP). That post described the architectural move but was light on what the registry actually contains and how it stays in sync with the workspace as files change. This post fills that gap.
A registry that powers machine extraction has two jobs that pull in opposite directions. It needs to answer lookups in constant time without reading source files: where is getAllPosts defined, what calls it, what other symbols share its file. And it needs to be correct: a registry that returns yesterday’s symbol locations is worse than no registry, because every consumer that trusts it will then lie to the Planner with confidence. The schema and the cache invalidation strategy together are how those two jobs coexist.
What the registry stores
The schema has four load-bearing categories of record. The shape of the records matters more than the storage engine.
Symbol records. One record per top-level symbol the parser finds. Each record captures enough to locate the symbol precisely (file, scope, type, position) and a content hash at the symbol level rather than the file level. That granularity is what makes targeted invalidation possible: when a function changes, only the records for that function are stale, not every symbol in the file.
File-level records. One record per scanned file, tracking the file’s state at the last sync. The file-level record is the unit of cache invalidation: if it matches, nothing in the file needs re-examining. It also carries the counts the pipeline uses for token-budget telemetry without requiring the file to be re-read.
Call-graph edges. One record per resolved call site, capturing which symbol called which and where the call appears. This is what answers “who calls this function” without round-tripping to the language server. The language server still owns cross-file semantic resolution, but the registry caches resolved edges so common call-graph queries do not hit the LSP every time.
Sync state. A small set of registry-wide records covering the git state at the time of the last sync, the schema version, and housekeeping counters. The git state is what determines whether an incremental sync is enough or a full reconciliation is required.
Framework-specific records (HTTP routes, JSX components, etc.) fall out of the same parse pass and are stored alongside, but they are populated by language drivers rather than by the core schema. The four categories above are the polyglot core.
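The four categories can be sketched as plain record types. This is a minimal illustration, not the actual schema: every field name here is an assumption, chosen to match the facts each category is described as carrying.

```python
from dataclasses import dataclass

# Hypothetical record shapes. Field names are illustrative; the post
# describes what each category captures, not its exact layout.

@dataclass(frozen=True)
class SymbolRecord:
    file: str          # path relative to the workspace root
    name: str          # e.g. "getAllPosts"
    kind: str          # "function", "class", ...
    scope: str         # enclosing scope; "" for top level
    start_line: int
    end_line: int
    body_hash: str     # hash of the symbol's own source bytes

@dataclass(frozen=True)
class FileRecord:
    file: str
    file_hash: str     # whole-file hash at last sync: the invalidation unit
    symbol_count: int  # token-budget telemetry without re-reading the file
    line_count: int

@dataclass(frozen=True)
class CallEdge:
    caller: str        # qualified id of the calling symbol
    callee: str        # qualified id of the called symbol
    file: str          # file containing the call site
    line: int

@dataclass(frozen=True)
class SyncState:
    git_head: str      # commit recorded at the last sync
    schema_version: int

def callers_of(edges: list[CallEdge], callee: str) -> list[CallEdge]:
    """'Who calls this function' as a pure edge lookup, no LSP round-trip."""
    return [e for e in edges if e.callee == callee]
```

With edges cached this way, the common call-graph query is a filter over local records; only unresolved or cross-file semantic questions need to fall through to the language server.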
The data model deliberately stops at facts that can be re-derived from the source. There is no “is this symbol important” field, no priority scoring, no semantic embedding. The registry’s job is to answer factual questions about the code, not to summarise the code. Anything that requires interpretation belongs upstream of the registry, not in it.
Three levels of hash, three levels of detection
The registry tracks change at three different granularities, and each one detects a different kind of drift. Conflating them is the source of most subtle bugs in registry invalidation logic.
The file level detects whether a file has been touched at all. It is the cheapest check: most files match on most syncs, and a match means nothing inside that file needs re-examining.
The body level detects whether a particular symbol’s implementation changed, separate from whether its file did. A function that moved because an import was added is byte-identical to what the registry recorded; only its position in the file changed. A function whose body was edited has a different body hash even if it did not move. This split is what lets downstream consumers invalidate against one symbol’s behaviour without treating every symbol in the same file as dirty.
The structural level detects whether a symbol’s shape changed. Two functions that are formatted differently but parse to the same abstract syntax tree (AST) have the same structural hash despite different body hashes. The structural hash is the right key for caches that should survive cosmetic edits (whitespace changes, comment additions) without treating them as behaviour changes.
These three levels are not redundant. They cost almost nothing to compute and to store, and they pay back in every consumer that needs to invalidate against one specific axis without invalidating against the others.
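The three levels can be demonstrated in a few lines. This sketch uses Python’s own `ast` module as a stand-in for the Tree-Sitter parse: the structural hash is a hash of the AST dump, which discards whitespace, comments, and positions, while the body hash is a hash of the raw bytes.

```python
import ast
import hashlib

def _h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()[:16]

def file_hash(source: str) -> str:
    # Level 1: has the file been touched at all?
    return _h(source.encode())

def body_hash(symbol_source: str) -> str:
    # Level 2: did this symbol's bytes change, independent of position?
    return _h(symbol_source.encode())

def structural_hash(symbol_source: str) -> str:
    # Level 3: did the symbol's shape change? Hashing the AST dump
    # ignores formatting and comments but not behaviour.
    return _h(ast.dump(ast.parse(symbol_source)).encode())

# Two cosmetically different versions of the same function:
a = "def f(x):\n    return x + 1\n"
b = "def f(x):  # add one\n    return x+1\n"
```

A cache keyed on `structural_hash` survives the comment and whitespace edits between `a` and `b`; a cache keyed on `body_hash` does not, which is exactly the split the two levels exist for.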
Walking only what changed
A sync brings the registry up to date by comparing the current state of every file against what was recorded at the last sync. Files that have not changed are skipped. Files that have changed are re-parsed and their records updated. The parse cost is proportional to what actually changed, not to the workspace size, which is what makes a sync on a hundred-thousand-line monorepo a matter of seconds rather than minutes.
Call-graph edges and framework-specific records are derived from the same parse pass as symbol records, so a changed file re-emits all three. Edges that cross file boundaries are only recomputed when one of the two files involved actually changed; the rest survive untouched.
Git HEAD movement is the one event that can invalidate the registry without any individual file’s hash changing. A branch switch, a reset, or a fast-forward pull can leave file hashes matching while the project is in a different state than the registry recorded. The registry tracks the git state at the time of the last sync, so this class of drift is detectable at the start of a sync and triggers a broader reconciliation before any per-file comparison begins. The parse cost is still proportional to the diff; only the comparison scope is wider.
A full re-walk regardless of file or git state fires only when the schema itself changes. That is rare. In normal operation, the sync is incremental.
The 409 hash guard
Cache invalidation at sync time catches drift between syncs. A separate guard catches drift at read time, for the case where the workspace has changed since the last sync and a downstream stage is about to read line ranges that no longer correspond to the symbol the registry says they do.
Every read through the pipeline’s IO layer compares the current file hash against the registry’s stored hash before returning bytes. When the two disagree, the read raises a structured error: a 409 Conflict pointing at the file and the mismatched hashes, with the corrective action of running a sync. Downstream stages do not see stale lines; they see the conflict, and the pipeline halts the offending stage with a precise diagnosis.
The reason for catching this at read time, not just at sync time, is that the registry can be valid at the moment a stage starts and stale by the moment the stage reads. A Coder write to one file invalidates that file’s registry records; a subsequent inspect on a different symbol in the same file would otherwise read whatever line range the registry still has cached. The hash guard turns that race into a halt, which a retry can recover from.
A bad read with no halt is a silent corruption that surfaces three stages later in a failure that is hard to attribute.
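The guard amounts to one comparison on the read path. A sketch under stated assumptions: the error type, field names, and the string-based read are illustrative, standing in for the pipeline’s IO layer reading bytes from disk.

```python
import hashlib

class RegistryConflict(Exception):
    """Structured 409-style error: the file drifted since the last sync."""
    def __init__(self, path: str, stored_hash: str, actual_hash: str):
        self.status = 409
        self.path = path
        self.stored_hash = stored_hash
        self.actual_hash = actual_hash
        self.action = "run a sync, then retry the read"
        super().__init__(f"409 Conflict: {path} changed since last sync")

def guarded_read(path: str, contents: str, stored_hash: str) -> str:
    """Return contents only if they still match the registry's record.

    `contents` stands in for the bytes just read from disk; hashing
    happens before anything is returned, so a stale registry produces
    a halt with a diagnosis rather than a silently wrong line range.
    """
    actual = hashlib.sha256(contents.encode()).hexdigest()
    if actual != stored_hash:
        raise RegistryConflict(path, stored_hash, actual)
    return contents
```

A downstream stage that catches `RegistryConflict` has everything it needs to recover: the file, both hashes, and the corrective action, which is what turns the race into a retryable halt instead of a three-stages-later mystery.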
What this is, and is not
The registry is a structured cache over the source. It does not understand the project, does not know which symbols matter, and does not infer intent. Every consumer that asks the registry a question is asking a factual question with a deterministic answer.
There are gaps. Languages without Tree-Sitter coverage fall back to a coarser parse that loses per-symbol detail; file-level tracking still works, but the symbol-level fidelity is lower. Cross-repository edges are not modelled. Symbols that exist only after runtime registration (a route handler wired up through a decorator chain at module load, for example) are visible to the language server but not always to the parser, and the registry takes the parser’s view.
Where the registry is good, it is because the data model deliberately stops short of interpretation. Where it is incomplete, the gaps are the same gaps that any static-analysis layer carries: the parser does not see runtime, and runtime is where some of the truth lives.
The pipeline runs inside Docker on real tickets against a TypeScript fixture and a ~100k line TypeScript monorepo. The registry is populated incrementally during the pipeline’s preflight stage and queried by the Planner, the Coder, the Debugger, and the validators throughout the run. Still R&D.