This is part 3 of a series. Part 1 introduced the hypothesis, replacing accumulating conversation history with a constant-size projection of current state, and showed it holding up on 11 tasks. Part 2 scaled that to 23 tasks and found the aggregate token ratio flip in projection's favor once tasks got long enough. This post moves the experiment off the synthetic harness and onto the real Claude Code path, and splits one agent into two to get there.
Parts 1 and 2 proved the idea inside a benchmark harness I built myself: my own agent loop, executing tools, measuring tokens. That's a fair test of the projection mechanism, but it isn't Claude Code. I wanted to know whether constant context still works when a real coding agent, running its real loop, is the thing being measured, not a simulation of one.
The complication is that Claude Code is two things at once: a coding agent and a conversation. You ask it questions, it asks you questions, it shows you diffs and waits for your judgment. That interactive layer needs the full, faithful history, exactly what the user actually said, not a reconstruction of it. Projection works for the part of the session that grinds through tool calls alone; it breaks down where a human is in the loop. So I stopped trying to project one agent and split it into two.
Two agents, one projected
The setup: synaxi-chat is a passthrough orchestrator. It's the agent you actually talk to. It asks clarifying questions one at a time, each with a recommended default, turns your answers into a self-contained brief, and delegates the coding to a second agent. It never touches a file directly, so it never accumulates the kind of history that's expensive to carry, and its own context is left completely alone: full fidelity, nothing projected.
synaxi-worker is the one that does the work. It reads files, writes files, runs commands, and iterates against test failures, the exact loop that made projection worth testing in Parts 1 and 2. Its context is the one I shrink to constant size, rebuilt fresh on every turn from three things: the goal, the current state of the files it's touched, and the last thing it did.
The two agents need to be told apart by the routing layer that sits between Claude Code and the model, and I didn't want that to depend on guessing from the shape of a prompt. A custom agent's markdown body becomes its system prompt verbatim, so synaxi-worker's prompt carries a fixed sentinel string. The routing layer projects a request if and only if the system text contains that sentinel. It's a stamp I control, not a fingerprint that can drift the next time Claude Code's own prompts change shape, and every logged request records whether it matched so the split is auditable after the fact. synaxi-chat runs as the actual session agent so it can own the interactive question flow (a feature Claude Code reserves for the top-level agent, not for anything it spawns); the headless worker beneath it never needs that and never gets it.
The suite, run for real
I ran the same 31-task coding suite from Parts 1 and 2 (10 T2, 12 T3, 9 T4, spanning debugging, generation, and refactor tasks) through the real claude -p path this time, not the harness loop, with the routing layer sitting in front of the model on every call.
| metric | Haiku baseline (growing) | Haiku, two-agent split (cache-aware projection) |
|---|---|---|
| pass rate | 30/31 (97%) | 31/31 (100%) |
| raw context processed | 18,996,236 | 7,441,062 |
| effective billed input | 2,896,477 | 1,893,922 |
| cache-hit ratio | 97% | 92% |
| turns (total / avg) | 460 / 14.8 | 816 / 26.3 |
"Raw context processed" is input plus cache write plus cache read, the tokens the model actually looked at each turn. "Effective billed input" weights those by Anthropic's cache pricing (1x fresh input, 1.25x to 2x cache write, 0.1x cache read), which is the number that actually shows up on the invoice.
The worker matched the baseline on quality and beat it on both cost measures: 2.6x less raw context (7.44M against 19.0M) and 1.53x cheaper billed input (1.89M against 2.90M), even though the baseline is exactly the workload prompt caching was built for: a stable, ever-growing prefix that caches at 97% and re-reads at a tenth of the cost. The baseline's one miss was t3_diff_engine; the projected worker passed it.
I didn't expect the cost win to survive contact with a caching-enabled frontier model. Baseline's growing history is close to the ideal shape for Anthropic's prompt cache: append-only, byte-stable, cached prefix growing every turn. Projection wins anyway, because it pushes so much less raw context through the model that even a worse cache-hit ratio (92% against 97%) doesn't close the gap.
Caching is not automatic, you have to earn it
The first cache-aware version of the projected worker actually lost on billed cost. Because projection rebuilds the entire message array from scratch every turn, a single inherited cache breakpoint sitting on the volatile last message forces Anthropic to reprocess almost the whole prefix each time:
| first attempt, no explicit breakpoints | fixed, breakpoints on stable segments | |
|---|---|---|
| billed input | 3,423,795 | 1,893,922 |
| cache-hit ratio | 85% | 92% |
| raw context | 8,176,934 | 7,441,062 |
Same engine, same 31 tasks. The only change was where I told Anthropic to cache. I moved the explicit breakpoints onto the parts of the rebuilt context that are actually stable turn to turn: the system contract block, the fixed four-tool list, and everything in the world cache up to (but not including) the newest entry, since the world is append-only and everything above the newest addition is byte-identical to last turn. The volatile operational-memory block, which does change turn to turn, moved to the tail so it stops busting the cached prefix. That one change cut billed input by another 1.8x (3.42M down to 1.89M) with nothing else touched. It's the same incremental-caching shape Claude Code already uses on its own growing history; the projected worker just had to reproduce it deliberately instead of inheriting it by accident.
What it costs you
This came with two costs, and I don't want to bury them under the headline numbers.
Turns went up 1.8x (816 against 460, or 26.3 against 14.8 per task). Compressing history means the worker occasionally has to re-derive a fact it had already established two turns ago, because the fact isn't sitting in a scrollback it can glance at. Net cost still wins because each of those extra turns is cheap, but if you're optimizing for wall-clock time instead of dollars, this is the number that matters to you.
Rare tasks can still spiral. On one earlier run, t4_btree ran to 118 turns under projection before finishing. The token-weighted eviction bounds the size of each individual context; it does not bound how many turns the worker takes to converge. A compressed world can still send an agent in a slow, unproductive loop if the eviction policy keeps trading away the wrong entry at the wrong moment. This didn't happen in the 31-task run reported above, but it's happened before and it will happen again on some task I haven't tried yet.
There's also a smaller local-model result worth mentioning even though it isn't the headline: running the same suite through the projected worker against gemma4, an 8B model with no prompt caching available at all, passed 20 of 31 tasks (65%), including 8 of 10 T2 tasks and 10 of 12 T3 tasks. Without projection, the growing transcript (roughly 10 KB of system prompt plus around 60 tool schemas plus every prior turn) overruns what a model that size can hold, and it starts looping instead of finishing. Against a frontier model, projection makes the worker cheaper. Against a small local model, it's the difference between finishing two-thirds of the suite and finishing none of it.
Why the split, not just a smaller budget
I could have tried to make one agent behave differently depending on what it was doing: full history for chat turns, projected history for tool-call turns, inside a single context. I didn't, because the two needs actually conflict. Projection is a lossy, reconstructed view of reality, right for a headless worker that only needs its durable observations, and wrong for a conversation where your exact phrasing is the thing being interpreted. Trying to serve both out of one growing-then-shrinking context risked getting neither one right. Splitting the roles means synaxi-chat keeps full fidelity for talking to you, and synaxi-worker runs on constant-space context for the part of the job that actually benefits from it, without either one compromising the other.
What's still open
Turn count is the real cost now, not tokens. Parts 1 and 2 were entirely about token ratios. This run is the first time turns, not tokens, are the metric moving in the wrong direction. I want to understand whether the 1.8x turn overhead is a fixed cost of reconstructing state, or something the eviction policy can reduce with better tuning.
The pathological loop case needs a bound. t4_btree's 118-turn run is a reminder that a bounded context doesn't guarantee a bounded task. I want a hard stop, or a mechanism that recognizes when the worker is re-deriving the same state repeatedly and hands it more of the world rather than less.
Cache-hit ceiling. Projection tops out around 92% against the baseline's 97%, because the world genuinely grows as the worker touches more files; it will never be a perfectly frozen prefix the way a linear transcript can be. I don't yet know how close to 97% a smarter eviction policy could get.
One model, one suite. These are Haiku 4.5 and gemma4 on the same 31 tasks used in Parts 1 and 2. Treat the ratios as directional. The next thing I want is a second frontier model and a second, independently built task suite, so I can tell the difference between "this is how projection behaves" and "this is how projection behaves on tasks I happened to write."
The pattern from Parts 1 and 2 holds up outside the simulation: a coding agent needs to know where things stand right now, not remember its own history, to do good work. Splitting the roles was what let me prove that on the real path without breaking the half of the session that actually needs to remember what you said.
Synaxi is the Mac app that reduces token usage on your existing Claude Code sessions, no configuration required.