Agents Aren't Chatbots (Part 1)

Claude Code is built on the same foundation as every other chat AI: a messages array that grows with each turn. That's not a flaw, it's the model. Conversation history is how these systems were designed and trained, and it works.

But agents are a different use case than chatbots. A coding agent working through a multi-step task isn't really having a conversation. It's executing a process. And I started wondering whether the chat format, inherited almost by accident, is actually the right way to frame what it's doing.

So I ran an experiment: throw out the conversation history entirely, replace it with a projection of current reality, and see if the agent still succeeds. Does it need the history, or does it just need to know where things stand right now?

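After 11 tasks and about a week of iteration, the answer is that it doesn't need the history. The agent succeeded on every task with only the current state, and on longer tasks it did so with fewer tokens than baseline.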

The hypothesis

The chat interface was designed for humans having conversations. Agents inherited it. Every framework, every harness, every tool builds on top of an interface that was never designed for long-running autonomous tasks.

The problem isn't that the history is large. It's that it's additive. Each turn carries all prior turns. File contents read at turn 3 are re-sent at turn 50 even if that file hasn't been touched since. The model pays attention to a window of stale context on every request.

My hypothesis: an agent doesn't need history. It needs to know three things.

What is the goal.
What is the current state of the world.
What just happened.

If you reconstruct those three things from reality on every turn, rather than accumulating them from events, the context stops growing with the turn count. Turn 50 sends roughly the same number of tokens as turn 1. And the model gets a better signal, because it's always looking at what's actually true, not what was true ten turns ago.

What I built

The harness I built runs a coding task twice: once with a normal accumulating baseline, and once with projection mode. In projection mode, every API call gets a freshly constructed messages array built from:

The goal (static, never changes)
Current file contents (always the latest version, never stale)
The last tool call and its result

This isn't a different prompt format. It's the same native Anthropic message structure the model was trained on. File contents appear as read_file / tool_result pairs. Commands appear as run_command / tool_result pairs. The model wakes up each turn inside a familiar pattern. It just can't tell that the history is synthesised rather than accumulated.

The context is managed by a WorldCache: a token-weighted LRU cache of what the agent has observed. Files get cached when read, commands get cached when run. The cache is rebuilt from scratch before each API call. Entries evict when their token cost times their age exceeds the budget. A large file read 10 turns ago evicts before a small ls output read 2 turns ago.
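As a rough sketch, the eviction rule described above could look like the following. This is not the harness's code. The class layout, the `observe` and `evict` method names, and the four-characters-per-token estimate are all assumptions made for illustration, and the sketch keeps entries in memory between turns for brevity. It takes the rule literally: an entry is dropped once its token cost multiplied by its age in turns exceeds the budget.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    content: str
    tokens: int      # estimated token cost of the content
    last_seen: int   # turn on which the agent last observed this entry


class WorldCache:
    """Toy token-weighted cache: evict once tokens * age exceeds the budget."""

    def __init__(self, budget: int = 8000):
        self.budget = budget
        self.entries = {}  # key (file path or command string) -> Entry

    @staticmethod
    def estimate_tokens(text: str) -> int:
        # Crude assumption: roughly four characters per token.
        return max(1, len(text) // 4)

    def observe(self, key: str, content: str, turn: int) -> None:
        """Record a file or cacheable command output seen on `turn`."""
        self.entries[key] = Entry(content, self.estimate_tokens(content), turn)

    def evict(self, turn: int) -> list:
        """Drop and return the keys whose token cost times age exceeds the budget."""
        stale = [
            key for key, entry in self.entries.items()
            if entry.tokens * (turn - entry.last_seen) > self.budget
        ]
        for key in stale:
            del self.entries[key]
        return stale


cache = WorldCache(budget=8000)
cache.observe("big_module.py", "x" * 4000, turn=1)        # ~1000 tokens, age 10 at turn 11
cache.observe("ls", "diff.py\ntest_diff.py\n", turn=9)    # ~5 tokens, age 2 at turn 11
print(cache.evict(turn=11))  # ['big_module.py']
```

Under this reading, an entry observed on the current turn has age zero and is never evicted, however large it is. A variant that evicts the highest-scoring entries until the total fits the budget would behave differently and is just as plausible a reading.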

What the context looks like

Here's the structure of what gets sent to the API on turn 6 of a real task, after the agent has read test_diff.py and written diff.py:

Turn 6 context: full structure

[
  {
    "role": "user",
    "content": [{
      "type": "text",
      "text": "Implement diff.py with compute_lcs(a, b), a DiffLine dataclass with kind/content/old_lineno/new_lineno fields, and compute_diff(old_lines, new_lines) returning a list of DiffLines so all pytest tests pass.\n\nWorking directory: /tmp/sandbox_abc123"
    }]
  },
  {
    "role": "assistant",
    "content": [{
      "type": "tool_use",
      "id": "syn_74015",
      "name": "read_file",
      "input": { "path": "/tmp/sandbox_abc123/test_diff.py" }
    }]
  },
  {
    "role": "user",
    "content": [{
      "type": "tool_result",
      "tool_use_id": "syn_74015",
      "content": "import pytest\nfrom diff import compute_lcs, compute_diff, DiffLine\n\ndef test_lcs_basic():\n    assert compute_lcs([1,2,3], [1,2,3]) == [1,2,3]\n..."
    }]
  },
  {
    "role": "assistant",
    "content": [{
      "type": "tool_use",
      "id": "syn_8949",
      "name": "read_file",
      "input": { "path": "/tmp/sandbox_abc123/diff.py" }
    }]
  },
  {
    "role": "user",
    "content": [{
      "type": "tool_result",
      "tool_use_id": "syn_8949",
      "content": "from dataclasses import dataclass\n\n@dataclass\nclass DiffLine:\n    kind: str\n    content: str\n    old_lineno: int\n    new_lineno: int\n\ndef compute_lcs(a, b):\n    # current implementation\n    ..."
    }]
  },
  {
    "role": "assistant",
    "content": [{
      "type": "tool_use",
      "id": "tu_07",
      "name": "run_command",
      "input": { "command": "python3 -m pytest test_diff.py -v" }
    }]
  },
  {
    "role": "user",
    "content": [{
      "type": "tool_result",
      "tool_use_id": "tu_07",
      "content": ".F\nFAILED test_lcs_strings - assert 0 == 4\n1 failed, 1 passed"
    }]
  }
]

The synthesised read_file pairs at the top are the projection. The real last action is at the bottom. The model sees a seamless, familiar sequence: it read the test file, it read the implementation, it ran pytest, here's what happened. Now what?

At turn 20, the structure is identical. The file contents reflect the current state of diff.py, whatever it looks like now after the agent's edits. The history from turns 1 to 19 is gone. The model doesn't need it.
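A context with this shape could be assembled by a function along these lines. It is a minimal sketch, not the harness itself: `build_projection`, `tool_pair`, and the `last_action` dictionary layout are hypothetical names, and the `syn_` ids are simply numbered in order. Only the message and content-block fields mirror the structure shown above.

```python
from typing import Optional


def tool_pair(name: str, tool_input: dict, result: str, tool_id: str) -> list:
    """One assistant tool_use message followed by the matching user tool_result."""
    return [
        {"role": "assistant",
         "content": [{"type": "tool_use", "id": tool_id,
                      "name": name, "input": tool_input}]},
        {"role": "user",
         "content": [{"type": "tool_result", "tool_use_id": tool_id,
                      "content": result}]},
    ]


def build_projection(goal: str, files: dict, last_action: Optional[dict]) -> list:
    """Rebuild the messages array from the goal, current files, and last real action."""
    messages = [{"role": "user", "content": [{"type": "text", "text": goal}]}]
    for i, (path, content) in enumerate(files.items(), start=1):
        # Synthetic read_file pair: the model sees the file as if it had just read it.
        messages += tool_pair("read_file", {"path": path}, content, f"syn_{i}")
    if last_action is not None:
        # The one genuine tool call carried over from the previous turn.
        messages += tool_pair(last_action["name"], last_action["input"],
                              last_action["result"], last_action["id"])
    return messages


messages = build_projection(
    goal="Implement diff.py so all pytest tests pass.",
    files={"test_diff.py": "import pytest\n...", "diff.py": "from dataclasses import dataclass\n..."},
    last_action={"id": "tu_07", "name": "run_command",
                 "input": {"command": "python3 -m pytest test_diff.py -v"},
                 "result": "1 failed, 1 passed"},
)
print(len(messages))  # 7
```

Because the function takes current file contents as input and keeps nothing from earlier turns, calling it at turn 20 yields the same structure as at turn 6, with only the file bodies and the final pair changed.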

The results

I ran 11 tasks from the benchmark suite, a mix of T2 (moderate complexity) and T3 (complex, multi-step) tasks across code generation, debugging, and refactoring. Every task ran both baseline and projection with the same model (Haiku) and same turn limit.

Task	Baseline result	Projection result	Baseline tokens	Projection tokens	Ratio (projection / baseline)
t2_callback_to_async	✅	✅	10,021	15,429	1.54x
t2_failing_test_suite	✅	✅	9,346	19,872	2.13x
t2_lru_cache	✅	✅	5,114	7,137	1.40x
t3_async_pipeline_bug	✅	✅	12,848	19,802	1.54x
t3_concurrent_bounded_buffer	✅	✅	8,467	13,338	1.58x
t3_diff_engine	✅	✅	64,757	27,677	0.43x
t3_dijkstra_negative_weights	✅	✅	14,854	23,591	1.59x
t3_god_class	✅	✅	7,034	9,904	1.41x
t3_lfu_cache	✅	✅	21,687	18,424	0.85x
t3_persistent_queue	✅	✅	11,966	12,745	1.07x
t3_split_monolith	✅	✅	22,201	32,413	1.46x
Total	11/11	11/11	188,295	200,332	1.06x

11/11 pass rate on both sides. Projection never failed a task that baseline passed.

Aggregate token overhead: 1.06x, six percent above baseline across the full test suite.

And the direction matches what I predicted. The two tasks where projection beat baseline on tokens (t3_diff_engine at 0.43x, t3_lfu_cache at 0.85x) are among the longest baseline sessions by token count. t3_diff_engine took the baseline 19 tool calls to solve; projection solved it in 13 and used less than half the tokens. The baseline was carrying history. Projection was carrying current state.

The short, simple tasks cost more with projection because the overhead of reconstructing the world block each turn outweighs the savings from not accumulating history. There isn't much history to accumulate anyway. On longer sessions, that math inverts.

What the agent experiences

The agent's behaviour under projection is noticeably different, in ways that make sense once you understand the context structure.

Under baseline, the agent can refer back to its own prior reasoning. It said "I'll fix the LCS algorithm" three turns ago and can see that. Under projection, it can't. What it can see is the current state of the file it was working on. If the fix is in there, that's the evidence. If it's not, that's the evidence too.

This turns out to be fine, because the filesystem is a better memory than conversation history. The agent doesn't need to remember that it decided to fix compute_lcs. It just needs to read diff.py and see whether compute_lcs is correct. The file tells it everything the history would have told it, more reliably, because the file is always current.

Example: projection agent diagnosing a bug across turns

Here's a real sequence from t3_diff_engine. The agent is working on a text diff implementation. At turn 5, it runs pytest and sees a failure:

FAILED test_lcs_strings - assert 0 == 4

The agent's first theory is that the test passes "ABCBDAB".split() to compute_lcs. It thinks through what's happening and says so before it acts:

⟨hypothesis: "ABCBDAB".split() with no separator returns a single-element list,
not a list of characters — compute_lcs is receiving ['ABCBDAB'] instead of
['A','B','C','B','D','A','B'], which is why the LCS length is 0 instead of 4 |
action: verify with a quick python check before editing⟩

Then it runs the debug command:

run_command(python3 -c "print('ABCBDAB'.split())")
→ ['ABCBDAB']

Hypothesis confirmed. At turn 6, the world block shows the current diff.py (with the bug still in it) plus the cached ls output from earlier. The last tool result shows ['ABCBDAB']. The agent annotates again:

⟨hypothesis: test uses list("ABCBDAB") or split("") — need to check test file |
observation: test file is already in world state — checking now⟩

It reads the world block, sees test_diff.py is already there, identifies the test uses list("ABCBDAB") (not .split()), and writes the fix directly to diff.py.

At turn 7, the world block shows the updated diff.py — the version just written, not the old one. The agent runs pytest. All tests pass.

The annotation format serves a real function here: the hypothesis is in the output token stream, so the model is committing to a theory before acting on it. When the next turn arrives with a fresh context and no history, the last tool result either confirms or refutes that theory. The agent doesn't need to remember what it was thinking — it can read what happened.

What went wrong, and what it taught me

Not everything worked on the first try. A few things I had to fix:

The feature file parser was truncating multi-file tasks. My benchmark tasks are defined in Gherkin feature files with === filename === delimiters. When a file inside the block contained a Python docstring ("""..."""), the parser terminated the block early, thinking it had found the closing delimiter. The Dijkstra task has a graph implementation with a docstring on line 2, so test_graph.py was never being written to the sandbox. The agent spent 25 turns looking for tests that didn't exist. Fixing the regex fixed the task.
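The actual parser and feature files aren't shown here, so the following is a hypothetical reproduction of this class of bug rather than the real fix. The feature text, both regexes, and the `file_names` helper are invented for illustration. It assumes the multi-file block is wrapped in triple-quote delimiters, and contrasts a pattern that stops at the first triple quote with one that only accepts a delimiter standing alone on its line.

```python
import re

# Hypothetical feature-file fragment: two files inside one triple-quoted block.
FEATURE = '''\
Given the sandbox contains:
  """
  === graph.py ===
  class Graph:
      """Adjacency-list graph."""
  === test_graph.py ===
  def test_empty():
      assert True
  """
'''

# Naive: the block ends at the first triple quote after the opening one.
NAIVE = re.compile(r'"""\n(.*?)"""', re.DOTALL)

# Stricter: opening and closing delimiters must be alone on their own lines.
STRICT = re.compile(
    r'^[ \t]*"""[ \t]*\n(.*?)^[ \t]*"""[ \t]*$',
    re.DOTALL | re.MULTILINE,
)

# File headers of the form "=== filename ===" on their own line.
HEADER = re.compile(r'^[ \t]*=== (.+?) ===[ \t]*$', re.MULTILINE)


def file_names(block_pattern: re.Pattern) -> list:
    """Return the file names found inside the block matched by `block_pattern`."""
    block = block_pattern.search(FEATURE).group(1)
    return HEADER.findall(block)


print(file_names(NAIVE))   # ['graph.py']
print(file_names(STRICT))  # ['graph.py', 'test_graph.py']
```

The stricter pattern still breaks on a multi-line docstring whose closing quotes sit alone on a line, so a real parser would probably need to compare indentation or track the file headers instead.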

run_command results don't persist across turns. File contents go into the world cache because files are named, identifiable state. Command output doesn't have a natural identity; two runs of the same command might produce different results. But some commands do reveal stable state: ls, find, pytest --collect-only. I cache those too, keyed by the full command string. An agent that runs ls sees the result in its context the next turn, without running ls again.
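One way to express that rule is an allow-list checked before caching. This is a sketch under assumptions: the prefix tuple contains only the three commands named above, and `is_cacheable` and `cache_command` are hypothetical helper names.

```python
# Commands assumed to reveal stable state (the three named in the text).
STABLE_COMMAND_PREFIXES = ("ls", "find", "pytest --collect-only")


def is_cacheable(command: str) -> bool:
    """True if the command is exactly a stable prefix or starts with one plus a space."""
    cmd = command.strip()
    return any(cmd == p or cmd.startswith(p + " ") for p in STABLE_COMMAND_PREFIXES)


def cache_command(cache: dict, command: str, output: str) -> None:
    """Store output keyed by the full command string, but only for stable commands."""
    if is_cacheable(command):
        cache[command.strip()] = output


cache = {}
cache_command(cache, "ls -la", "diff.py\ntest_diff.py")
cache_command(cache, "python3 -m pytest test_diff.py -v", "1 failed, 1 passed")
print(list(cache))  # ['ls -la']
```

Keying by the full command string means `ls` and `ls -la` are separate entries, and prefix matching on whole words keeps `lsof` from being mistaken for `ls`. A prefix check this simple would miss `python3 -m pytest --collect-only`, so a real allow-list would need more care.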

The context format matters more than I expected. My first projection implementation wrapped everything in XML tags: <goal>, <world>, <last_step>. The agent would read a file, get the content, and then on the next turn say "let me check if that file exists." It didn't trust the XML block the way it trusted a tool_result. Switching to the native Anthropic message format, using synthesised read_file / tool_result pairs, fixed this completely. The model was trained on that format. It trusts it.

The pattern I'm seeing

The data supports the hypothesis, but with an important nuance.

Projection costs more on short sessions. For a task that takes 3 to 5 tool calls, the world reconstruction overhead exceeds the savings from not carrying history, because there's not much history to save on. You're paying to rebuild a world block that doesn't contain much, on every turn of a session that doesn't have many turns.

Projection costs less on long sessions. Once baseline sessions start accumulating substantial history (repeated file reads, multiple rounds of test output, tool results from early exploration), projection starts to win. t3_diff_engine's baseline session was 19 tool calls with a history that grew each time. Projection stayed flat. That's where the 0.43x ratio comes from.

The breakeven point in my data is around 10 to 15 tool calls. Below that, projection costs more. Above that, projection starts to save. And the further above it you go, the more it saves, because baseline history grows without bound while projection context stays constant.

At real session lengths (a Claude Code session working on a meaningful task often runs 50 to 100 turns), this difference is not marginal. A session that would cost $0.12 at baseline might cost $0.04 under projection, because the input token count at turn 100 is roughly the same as at turn 1 rather than 20x larger.
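The breakeven behaviour can be illustrated with a toy cost model. Everything in it is an assumption: it counts input tokens only, treats every step as adding the same number of tokens, and holds the world block at a fixed size. The numbers (200-token goal, 1,500-token world block, 300 tokens per step) are not measurements. They were picked so the crossover lands in the 10 to 15 range reported above.

```python
from typing import Optional


def baseline_total(turns: int, goal: int, per_turn: int) -> int:
    """Accumulating history: turn t re-sends the goal plus all t steps so far."""
    return sum(goal + per_turn * t for t in range(1, turns + 1))


def projection_total(turns: int, goal: int, world: int, per_turn: int) -> int:
    """Projection: every turn sends the goal, the world block, and the last step only."""
    return turns * (goal + world + per_turn)


def breakeven(goal: int, world: int, per_turn: int, max_turns: int = 1000) -> Optional[int]:
    """First turn count at which projection's cumulative input is strictly cheaper."""
    for n in range(1, max_turns + 1):
        if projection_total(n, goal, world, per_turn) < baseline_total(n, goal, per_turn):
            return n
    return None


print(breakeven(goal=200, world=1500, per_turn=300))  # 12
```

In this model the baseline's cumulative input grows quadratically with the turn count while projection's grows linearly, so the gap keeps widening past the crossover. The crossover itself moves with the ratio of world-block size to per-step size: a bigger world block pushes it later, chattier tool output pulls it earlier.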

Where this fits in a bigger picture

Before I get into what's next experimentally, it's worth stepping back and explaining why I'm exploring this at all and how it relates to the other things I'm building.

I think agent efficiency is a three-layer problem, and right now most tools only address one layer.

The first layer is model selection. Choosing the wrong model is the most expensive mistake you can make. No amount of optimization fixes a session that runs 80 turns on Sonnet when Haiku would have passed the task in 20. This is what synaxi-predict addresses: before the session starts, analyze the task and the repo, predict turn count and complexity, recommend the right model.

The second layer is context architecture. This is what this experiment is about. The chat accumulation model is an inheritance from chatbots, not a deliberate design choice for agents. Constant context, grounded in current reality rather than conversation history, is a different primitive. I don't know yet what the right version of this looks like, but the early results suggest it's worth pursuing.

The third layer is wire compression. Once you've selected the right model and constructed the right context, Synaxi compresses the request before it leaves your machine: stripping structural overhead, compacting JSON, deduplicating anything redundant. You're squeezing an already-optimized context rather than chasing an ever-growing one. The gains are smaller per request but the math is better because the target is constant.

The three layers are meant to work together. synaxi-predict's turn predictions can inform how aggressively the context engine should budget its world cache. The context engine's constant-size output makes Synaxi's compression ratios predictable and consistent across the full session. The actuals synaxi-predict collects get richer as the other two layers mature. Over time the system learns not just which model to pick, but which context strategy suits which kind of task.

I'm thinking about this as one thing, not three separate tools. The projection experiment is the middle layer taking shape.

What I'm testing next

There's a lot still to figure out on the constant context approach specifically.

LRU eviction tuning. My current token budget is fixed at 8,000 tokens. I don't know yet whether that's the right number, or whether a dynamic budget based on task complexity produces better results. The eviction policy (token cost x age) is a first guess. There's probably a better one, and the right answer might vary by task type.

Longer sessions. My benchmark tasks are bounded; they're designed to be solvable in 20 to 25 turns. I need to test on tasks where the baseline session would genuinely run to context limits. That's where the projection advantage should compound most visibly, and that's the case I care most about getting right.

Different approaches to world state. Projection is one method of constructing constant context but probably not the only one. The right structure for world state (what goes in it, how it's organized, how much of it the model actually uses) is an open question I want to run experiments on.

The FUSE layer. The harness simulates the projection mechanism at the API level by tracking tool calls. The more robust version lives in the filesystem layer. When Synaxi mounts a FUSE filesystem at the agent's working directory, every file read and write passes through the driver. The world cache becomes a live view of the filesystem rather than a reconstruction from tool call history. Files that changed since the agent last read them are silently updated. The agent never operates on stale data, not because I told it to re-read, but because the layer serving the file always serves fresh.

This is still early. But the core question has an answer: constant context works, agents succeed, and on the sessions that matter most the token savings are real. The rest is iteration.

Synaxi is the Mac app that reduces token usage on your existing Claude Code sessions, no configuration required.