Steering a Coding Agent Across Long Sessions: The Rules It Can't Drift From

The first post in this pair was about memory: three plain-markdown documents that carry the moving state of a project across sessions, so each new session doesn’t have to rebuild what the last one worked out. Those documents answer one question, where are we? They are only half of what keeps an agent on course. This is the other half, and it answers a different question: how do we work here, and what must never happen?

How much an agent carries from one session to the next depends on the setup: some persist a memory, and mine runs in a fresh container that keeps none. Either way, you onboard the agent at the start of each session, the way you onboard a new engineer, and a new engineer needs two things: the current state of the work, and the house rules. The three documents hand over the first. The standing rules hand over the second, and they fail in the opposite way to the documents. The documents rot if you let them; the rules fail by never being written down at all, so every session the agent rediscovers them, or violates them, and you correct the same thing for the hundredth time. The fix is to make the rules a durable artifact the agent loads every session, and then to make them enforce themselves rather than sit there hoping to be read.

The always-loaded contract (CLAUDE.md / AGENTS.md)

There is one file the agent reads at the start of every session automatically. Whatever your agent picks up by default, CLAUDE.md for Claude Code or AGENTS.md for several others, that file is the contract the work runs under: what is always true about the project, and what is always required of the agent. It is the opposite kind of document to the three from the first post. Those carry state, which changes every session. The contract carries the rules, which barely change at all.

Because the agent loads the contract at the start of every session, it pays a context cost on every run, and that single fact sets the shape of the whole file. The contract has to stay short, or it becomes the bloat it was meant to prevent, charged not once but on every run. So it holds rules and pointers, not prose. A rule states a constraint in one line. A pointer sends the agent to a deeper document for the detail, and only when it needs the detail. Anything that reads like an explanation belongs in that deeper document, not in the file that loads every time.

The contract carries three things. The first is what the project is, in one or two sentences, enough to orient an agent at the start of a session, when its working context is fresh. The second is the hard constraints, the rules the agent must not break, each written so it can act on the rule without having to interpret it. The third is the standard procedures, the steps a new engineer would need told: how to locate code before reading it, when to regenerate a derived file, what counts as done. The agent contract template in the repo (templates/AGENTS.template.md) is laid out around these three.

A diagram of the agent contract file, CLAUDE.md or AGENTS.md, loaded at the start of every session. It holds three things: a one-or-two-sentence statement of what the project is, the hard constraints as one line each with a pointer to a deeper document, and the standard procedures. A loop arrow notes it loads every session, so it stays short.

The discipline that keeps the contract honest is the same one that keeps the worklist honest in the first post, aimed at a different failure. The worklist rots when you write history into it; the contract rots when you write session state into it. A note about what this session is doing, a status, a half-finished thought, all of it has somewhere else to live, and none of it belongs in a file whose whole job is to hold what is true on every session rather than what is true today. When session detail leaks in, the file grows, the per-session cost grows with it, and the agent spends more of its opening context reading the contract than acting on it.

Hard constraints, and the failure behind each rule

The hard constraints are the non-negotiables: the rules the agent must not break, each written plainly enough that it can act on the rule without having to interpret it. What makes a constraint hold up over time is not how it is phrased. It is what is recorded behind it.

A constraint with no recorded failure reads, to a later session, as an arbitrary preference. A session that takes a rule for a preference will drop it the first time it is inconvenient, usually with a reasonable-sounding justification.

So behind each rule I keep the failure that produced it, written as the concrete incident rather than a general principle. This is the same move as the reasoning column on a closed gap from the first post: the cost of the failure is already paid, so the thing to do is write down what it bought. In practice the rule lives as one line in the always-loaded contract, and the incident lives in a deeper document the agent only opens when that rule is in play.

The rule I lean on most is that every operation that derives a fact has exactly one home, and everything routes through it. Before writing a function that resolves, normalises, or looks something up, the agent is required to search for the concern first, and either call the implementation that already exists or add the new one to its canonical place. A near-duplicate written somewhere else is a violation even when it looks slightly different.

That rule exists because of a failure I have watched happen more than once. The agent reads narrowly, which is usually the right instinct for keeping context small, and in reading narrowly it does not find the helper that already does the job, so it writes a second one inline. Now there are two functions with the same responsibility and slightly different behaviour, and that is the dangerous case, because each one looks correct on its own and a reviewer has no signal that the other exists. The two drift, and a bug eventually surfaces in whichever caller depended on the copy that was wrong. A single canonical home does more than tidy the code. It makes the duplicate impossible to write without first deleting the original, which forces the conflict into the open where someone will see it. The worked example carries this rule with its own incident attached (examples/marker-api/docs/HARD_CONSTRAINTS.md), so you can see the shape a recorded failure takes.

Generated facts, not maintained ones

Some of what the agent needs every session is a fact about the codebase, most often a map of what the modules are and what each one is for. The instinct is to keep that map by hand in the contract, but don’t.

A hand-maintained fact rots the moment the code moves and the map does not move with it, and the failure is worse than having no map at all, because the agent trusts the stale map over the code in front of it. So the map is generated and never hand-edited. The agent changes a module’s own docstring, and the map is rebuilt from the docstrings. This is the same principle as the rotation script in the first post: the work that has to be exact is done by code, and the judgment is left in the text a human writes. On the state side that kept detailed history from being paraphrased; here it keeps a fact from drifting away from the code it describes.

A generated map has a second benefit too. A module with no docstring shows up in the map as a blank, so the rule that every module states its one job stops being something the agent has to remember and becomes a visible gap in a file I already look at. The wider rule underneath both is to derive a fact from the source whenever the agent needs it, instead of asking the agent to carry it, because a model asked to restate a fact from memory can restate it wrong and a generated artifact cannot. The generator and a real generated map are in the repo (scripts/gen_structure.py and examples/marker-api/docs/PROJECT_STRUCTURE.md).

A generated project structure map for the Marker example. Five of six modules show a one-line purpose pulled from their docstring; the sixth, search.py, shows "- (no description)," surfacing its missing docstring as a visible to-do.

Making the rules enforce themselves

A rule the agent only reads is one it forgets the moment it is moving fast, three hours into a session, which is exactly when the rule matters. So the rules that can be checked are not left as text. They are wired into a skill the agent runs against its own work: locate code before reading it, confirm a write landed before building on it, regenerate any derived map after a structural change, run the tests that cover what changed before calling a task done, and stay inside the context budget.

This is the standing-rules version of the move that made the handoff a command rather than a ritual in the first post. The point is not to trust the agent, or myself, to remember the disciplined thing under pressure. It is to make the disciplined thing the path of least resistance, so the checks happen by default. What can be checked mechanically is checked mechanically, and the rules end up being applied during the work instead of read once at the start of it. That skill is in the repo (.claude/skills/guard/).

Auditing the whole codebase, not just your diff

The guard skill enforces the rules in the moment, while the agent works. But some violations don’t arrive in a single action. They accumulate.

Convention drift gathers in the code nobody has touched recently. A constraint gets bent slightly in one session, a second responsibility creeps into a file across three, a helper gets quietly reinvented in some corner, and none of it shows up in a single diff. The instinct, for the agent and for a tired operator both, is to scope any review to what just changed, and “pre-existing, out of scope” is exactly how the drift survives. So there is a second enforcement skill on a slower cadence. It audits the whole codebase against the constraints rather than the current diff, and it runs every few sessions instead of every edit. The checks split into two kinds: the mechanical, greppable ones, like a banned name or a missing docstring or a forbidden call shape, which a small collector can surface, and the ones that need judgment, like whether a file mixes concerns or whether the same fact is derived in two places. The skill produces a table and I decide what to act on (.claude/skills/architecture-review/).

The most common finding is a file that has grown a second responsibility, and that has a skill of its own. The decision to split is about responsibilities, never length. A 5,000-line module with a single job is healthy, and a 200-line file that mixes three concerns is not. I did not start there. The first version of this rule was a line-count limit, a hard ceiling on how long a file could be, and it failed in a way worth recording. When a file sat a line or two over the ceiling, the agent would hunt for something to cut, and the cheapest thing to cut was almost always a comment or a block of documentation. So it would quietly remove the explanation that made the code readable to satisfy a number that did not matter. The rule was guarding length and spending documentation to do it, which is backwards, and that is why it is now about responsibilities, with length only a hint about where to look. Length also costs me less than it might, because I read code through a symbol registry built from Tree-Sitter and a language server (LSP), not by opening files. The agent pulls the exact function it needs, so a long single-purpose module never enters the context whole, and the window-filling cost a big file would otherwise impose mostly does not apply. A cohesion test settles it: state the file’s purpose in one sentence, check whether its function groups actually call each other, look for anything that has drifted past the docstring, and ask whether the file will grow by more of the same concern or along a different axis. If the file has one purpose, the skill says so and stops (.claude/skills/refactor/). One detail from that skill is worth pulling out, because it has caught me more than once: moving a function breaks references in more forms than a grep for from X import Y can see. A string inside a mock patch and a module-attribute access both reach the moved name too, and those are the references that still pass at the time of the move and break a test several sessions later.

That is enforcement at two cadences. The guard skill works per action, the audit and its paired fix work across every few sessions, and between them they cover both the rate at which a rule gets broken in the moment and the rate at which drift sets in unnoticed.

A diagram of enforcement at two cadences. Per action, the guard skill checks each edit: locate before read, confirm the write landed, regenerate derived state, run the tests. Every few sessions, an architecture review audits the whole codebase against the constraints and sends cohesion findings to the refactor skill. The two cadences catch in-the-moment breaks and slow drift respectively.

Where the two halves meet

State and rules are separate documents with separate jobs, but they are one system, and they share one instinct: keep the judgment in the text a human reads, and push the mechanical, must-be-exact work into scripts.

The rotation script moves history verbatim, the structure generator derives the map, and the guard skill runs the checks. None of those is a judgment call, so none of them is left to a model working under context pressure. What is left over is the part that is actually yours to steer: the rules themselves, the reasoning behind a gap, the decision about what to build next. Both halves of the system exist to make that part the only part that needs your attention, whether the project runs for one session or three hundred.

Where this stops

None of this is enforced beyond what can be checked mechanically, and the gaps are worth naming. There is no lint that recognises a semantically duplicate function, so the canonical-home rule rests on the habit of searching for the concern before writing, which an agent can skip and a person can forget. Some constraints are review-only, with nothing structural behind them yet. And the whole arrangement is still a discipline held by the person at the keyboard. It holds for as long as they keep applying it, and the machinery here does not remove that dependence. It only lowers the cost of doing the disciplined thing far enough that I keep doing it. That is the same limit the first post ends on: these documents and skills make the careful path cheap enough to stay on, they do not force it.

The contract template, the constraints doc, the structure generator, and the three skills (the per-action guard, the periodic architecture audit, and the refactor it hands cohesion findings to) are in the steering-coding-agents repo, alongside the three handoff documents from the first post. Both halves are there. Copy the ones you need into your own project and adapt the first few entries.