Steering a Coding Agent Across Long Sessions: Three Documents and One Rule

14 MIN READ

An AI coding agent has a context window, and real work is bigger than it. A project that runs past a single session accumulates more state than fits in one window, so each new session begins with only a fraction of what the last one had in front of it. Agents do persist something across sessions now, a CLAUDE.md or AGENTS.md file, a built-in memory, whatever the tool provides, and those hold up well for the stable things like conventions and project rules. What they do not hold is the moving state of the work: what the last session worked out, what it tried and ruled out, what it decided to do next. That part gets re-derived every session, and you spend the opening stretch rebuilding it by hand. This is the practical limit most people hit the first time they take Claude Code, or any other agent, past a one-off script, and it shows up as three separate problems.

The first is that the working state is gone between sessions: the next session re-derives what the last one already knew. The second is context bloat inside a session, where history, dead ends, and stale notes crowd out the task until the agent is reasoning over noise. The third is the one most people miss, which is that the notes you keep to fix the first two problems rot. Whatever progress document you start with turns into an unreadable changelog within a week, and then nobody reads it, agent or human.

I have run one project this way for a few hundred sessions, steering Claude Code by hand the whole time. The system that keeps it coherent is three plain-markdown documents, one rule that keeps them clean, and scripts for the mechanical upkeep. None of it is specific to Claude or to any one codebase. It is closer to how you would run handoffs on a team than to anything model-specific.

Why I keep three handoff documents instead of one

The instinct is to keep a single notes file. It always becomes the same thing: part todo list, part diary, part backlog, too long to load and too noisy to read. The agent skims it and misses the line that mattered.

The fix is to split the file by tense and by who reads it.

The forward worklist (I call the file CONTINUE_FROM_HERE.md) is present and future tense. It is the only file the next session is required to read in full, so it holds the current state and what to do next, and nothing else. The session history is past tense. It records what happened each session and why. Both a human and the agent read it, but only when looking back, never to decide what to do next, and it is not loaded into every session by default, so it is free to be long. I can ask the agent what we changed a few sessions ago and it reads the history to answer. The backlog (an architecture-gaps document) is a table, not prose. Each row is a known gap between where the project is and where it is going, with a status, and the agent reads it when choosing what to build next.

Three cards comparing the handoff documents. CONTINUE_FROM_HERE holds present-and-future current state and next actions and is read first every session; SESSION_HISTORY holds the past per-session narrative and is read on demand; ARCHITECTURE_GAPS is a backlog table of open gaps with status and reasoning, read when choosing what to build.

The backlog has a second use that took a while to show up, and it turned out to be the one I rely on most. The reasoning column on each row matters as much as the status. When a gap closes, the row moves to a closed-gaps file instead of being deleted, and that file becomes a searchable record of why the system is built the way it is. (An earlier post covers what makes each architecture-gaps entry work; the close condition matters as much as the gap itself.) Months later, when I have forgotten why something was designed or fixed a particular way, I grep the closed ledger for it, and the row that recorded the gap also recorded the reasoning at the time it was made. The operator and the agent both use it that way. It is often faster than reading the code or the git log, because the row states the intent that the code only implies.

The session history and closed-gaps ledger turned out to have a second use: they are most of a draft post. Between them, the session history and the closed-gaps ledger record a problem as it was lived: what broke, what I tried, what failed, and the fix that finally held, each entry written at the time with the detail still fresh. Months later that is enough to write a detailed technical post or a post-mortem from the record alone, without me having stopped mid-implementation to gather material for it. I can work a hard problem across several sessions, ship the fix, and only then write it up, because the history and the ledger already hold the sequence of attempts and the reasoning behind the one that worked. The notes were kept to steer the work, not to publish, and they still turn out to be most of a draft.

The split works because each document now has exactly one reason to be read, so each can be optimised for that one reason. The worklist stays short because it never also has to be a record. The history is allowed to grow because nothing loads it to make a decision.

The rule that stops the worklist rotting

The forward worklist has one constant enemy, which is the urge to write “here is what we did this session” into it. Give in once and it starts becoming a diary. Give in for a week and it is the same unreadable pile the split was meant to prevent.

The rule that protects it is mechanical, so you can apply it when you are tired and not thinking clearly. Before saving any bullet to the worklist, check it for a session marker (a session number, a date, a run id) or a closure verb (closed, done, landed, fixed, added). A bullet with either is a changelog entry, not a worklist entry. Delete it, or rewrite it as the present-tense fact. “We closed the validation gap in session 214 by adding a pre-commit check” becomes “the pre-commit validation check is now mandatory,” or it goes to the history, or it goes nowhere. A bullet that only makes sense as a record of what changed does not belong in a forward document.

A worklist bullet shown twice. The rejected version, in red, reads "We closed the validation gap in session 214 by adding a pre-commit check," with the session number and the word "closed" marked as disqualifiers. The accepted rewrite, in green, reads "the pre-commit validation check is now mandatory."

Before saving any bullet to the worklist, check it for a session marker (a session number, a date, a run id) or a closure verb (closed, done, landed, fixed, added). A bullet with either is a changelog entry, not a worklist entry.

This rule matters more than the split itself. The three-document split without it just gives you three files that each decay into changelogs at their own pace.

Keeping the files small enough to load

There is a trap in this. Every file the next session loads costs context, and history grows without bound while the backlog accumulates closed items. Left alone, the files you load at session start grow until they crowd out the work, which is the context-bloat problem again, now caused by the system meant to prevent it.

So the live files are kept small by rotation. When the history crosses a size threshold, the oldest block of sessions moves to a numbered archive and the live file keeps a one-line pointer to it. Closed backlog items leave the live backlog at closure and move to a closed-gaps file, which rotates the same way. The rule is that a file’s length should track what is live, not what is total. Everything historical is one pointer away, never in the hot path, and a small script does the threshold check so it happens without my attention.

A diagram of rotation. The live session history keeps recent sessions; its oldest block is moved verbatim by a script into a numbered archive, leaving a one-line pointer. The closed-gaps ledger rotates the same way. Live files stay small and load every session; archives keep growing, stay greppable, and are not loaded by default.

The rotation is a script for a specific reason, and it is not mainly about tokens. Before the script existed, the agent did the archiving by hand, and that is where I lost information I needed later. Moving an entry is byte-for-byte work, and an LLM does not move text byte-for-byte. Handed a long, detailed gap entry to relocate, it would paraphrase it, trim the detail it judged redundant to save tokens, or mangle it under context pressure, and the part it dropped was often the reasoning I would want months later and could not reconstruct from the code. A script moves the bytes exactly as written. Keeping the live files small is the second benefit. The first is fidelity: the most detailed entries are the ones worth keeping, and a script never rewrites them.

Archiving is not deletion, and that distinction is the whole point of the previous section. The archived sessions and closed gaps stay in the repo and stay greppable. Rotation only moves them out of the files that load by default, so the searchable decision record keeps growing while the context cost of a session stays flat. The worked example in the repo ships already-rotated, so you can read the steady state rather than take my word for it.

One practical question this raises is where the files live. I commit them in the project’s own git repository, in a docs directory, as part of the project. They then version with the code, so a gap row and the commit that closed it share one history, the agent has them checked out and loads them with no extra wiring, and one grep covers code and notes together. The honest cost is that your session notes become part of the repo’s history, which is fine for a solo project or a team that wants them, and wrong for a shared repo where colleagues would rather not carry your worklog. In that case keep them in a separate repository or a gitignored local directory, and accept that you lose the shared history and the single grep. None of this is a hard rule. The only real requirement is that the files persist across sessions and stay reachable by you and the agent. Past that, keep them wherever you are comfortable.

There is one case where it stops being a free choice. I run each session in a fresh, ephemeral container, so nothing the agent writes to its own memory survives the teardown. The built-in memory is useless under those conditions, and the committed documents are the memory that carries across sessions. If your environment is ephemeral too, persisting these files somewhere that outlives the run stops being a preference, and version control is one of the few options that does it well.

Where this stops

None of this is enforced. The worklist rule is a discipline, not a gate: nothing structurally prevents a bad bullet, the system only makes the check cheap enough that I actually do it. The rotation script is simple text handling that depends on the files following a fixed heading format, and it breaks if the format drifts. The handoff system has the same weakness as a process rule on a human team: it holds exactly as long as the person at the keyboard keeps applying it. The thing that makes it stick in practice is wiring the close routine into a single command, so doing the handoff well is less work than skipping it.

The other half: standing rules

These three documents are only half of what keeps an agent on course across sessions. They carry state: what happened, what is still open, what to do next. State is the half that changes every session, and it is the half that rots if you let it, which is why most of this post is about keeping it clean.

The other half does not change session to session, and it does not rot the way a worklist does, because it is not a record of anything that happened. It is the standing rules: the conventions, the libraries, and the constraints the agent loads at the start of every session and is held to while it works. In my setup that is the always-loaded rules file (CLAUDE.md, or AGENTS.md depending on the agent), a generated map of the codebase that a script keeps current so it can never drift or be hallucinated, and a small enforcement skill that makes the agent check its own work against the rules before it calls a task done. The same instinct runs through both halves: keep the mechanical, must-be-exact work in scripts, and keep the judgment in the documents you read.

That half is its own machinery and its own discipline, enough that it gets its own post alongside this one rather than a paragraph here, and I am keeping this post to the state half. The steering-coding-agents repository carries both halves: the handoff documents from this post, and the standing-rules pieces from that one.

Why steered, not autonomous

This whole setup assumes a human in the loop. The three documents are the shared memory between me and the agent across sessions, and they are where my steering decisions get written down so they survive a context reset. The next session inherits them instead of relitigating them. A hands-off loop needs different machinery, because it cannot ask me what comes next. This is the hands-on version, and its only job is to make sure I have to make each decision once rather than every session.

Most of what I write on this blog is about a coding pipeline. The operator and the model settle the design together in a grounding stage first, and from there the pipeline builds the feature end to end, with no human in the loop while it executes. I build that pipeline with the hand-steered workflow described here, rather than pointing the pipeline at itself, and that is deliberate. The pipeline runs work whose shape is settled before execution begins. Building the pipeline is the other kind of work: open-ended, where the shape keeps changing as you go and the judgment calls are the whole task. That is work a human steers directly. The long-term goal is to need less of this over time, by moving more of the work into the form the pipeline can run on its own, and that part is not finished yet.

The autonomous pipeline has the same cross-session memory problem this workflow solves by hand, and for the same underlying reason: it runs on a fresh slate every time, the way each of my sessions boots a clean container. The pipeline starts every run blind, so I onboard it the way you onboard a new engineer, feeding it the conventions, the libraries, and the standard procedures it needs to act correctly. Project history is the same kind of onboarding material, the institutional reasoning a long-tenured engineer carries that a new hire lacks, and a system like this one, mine or something off the shelf, could hand the pipeline that history the way I hand it everything else. What I run today is manual and works well, and at some point it makes sense to build that discipline into the pipeline rather than run it alongside by hand.

The templates, the worked example, and the rotation script are in the steering-coding-agents repo. Copy them into your own project’s docs and adapt the first few entries.