Agents Aren't Chatbots (Part 2)

This is part 2 of a series. If you haven't read part 1, start there — it covers the hypothesis, the approach, and the first round of results. This post picks up directly from that data.

After the first round of results, the pattern was clear on long sessions but I only had a handful of T3 tasks to look at. I wanted to know what happened at real complexity: tasks that take 15, 20, 30 tool calls to complete. That's where the baseline context window grows into real money, and where constant context should compound most visibly.

So I ran more. 12 new tasks across T3 and T4 complexity. Here's what happened.

The animation

Before the numbers: here's what the two approaches actually look like as a session runs.

Baseline — accumulating context

Projection — constant context

0 tokens

Turn 0 / 22

The baseline bar keeps growing. The projection bar rises and then levels off once the world cache is warm. On the regex engine task that played out as 120,718 baseline tokens vs 56,545 projection. Same implementation, same tests, same pass.

The full data

Running 23 tasks total now across T2, T3, and T4 complexity:

Task	Complexity	B tokens	P tokens	Ratio	B tools	P tools
t2_callback_to_async	T2	10,021	15,429	1.54x	7	7
t2_failing_test_suite	T2	9,346	19,872	2.13x	6	8
t2_lru_cache	T2	5,114	7,137	1.40x	3	3
t3_async_pipeline_bug	T3	12,848	19,802	1.54x	7	9
t3_async_rate_limiter	T3	7,213	7,849	1.09x	4	3
t3_concurrency_race	T3	4,143	5,827	1.41x	3	3
t3_concurrent_bounded_buffer	T3	8,467	13,338	1.58x	4	5
t3_diff_engine	T3	64,757	27,677	0.43x	19	13
t3_dijkstra_negative_weights	T3	14,854	23,591	1.59x	7	8
t3_god_class	T3	7,034	9,904	1.41x	4	4
t3_lfu_cache	T3	21,687	18,424	0.85x	7	6
t3_mini_interpreter	T3	13,082	28,641	2.19x	6	11
t3_persistent_queue	T3	11,966	12,745	1.07x	6	5
t3_split_monolith	T3	22,201	32,413	1.46x	14	14
t3_trie_autocomplete	T3	14,996	22,318	1.49x	6	8
t4_btree	T4	73,551	39,051	0.53x	30	16
t4_consistent_hashing	T4	15,316	22,925	1.50x	8	8
t4_graph_algorithms	T4	12,807	20,615	1.61x	6	8
t4_job_scheduler	T4	12,658	24,895	1.97x	10	7
t4_lsm_storage	T4	44,554	24,278	0.54x	17	11
t4_pratt_parser	T4	37,150	35,595	0.96x	30	9
t4_regex_engine	T4	120,718	56,545	0.47x	22	14
t4_skiplist	T4	20,374	15,453	0.76x	8	7
Total		552,907	501,328	0.91x

23/23 pass rate.

Overall ratio across all 23 tasks: 0.91x. Projection now beats baseline in aggregate.

What changed with T4

The short tasks still cost more under projection. That hasn't changed and it makes sense. But the T4 results shifted the aggregate number and revealed something I didn't expect.

Look at the tool call column. On t4_pratt_parser: baseline took 30 tool calls, projection took 9. Same task, same tests, all 30 assertions passing in both cases. Projection didn't just use fewer tokens — the agent reached the solution more directly.

That's not supposed to happen. The projection agent has less context available, not more. Why would it be more efficient?

My current theory: the baseline agent carries its own prior reasoning in context. It said "I'll fix the grouping logic next" four turns ago and it can still see that. So it does. Then it says something else and it can see that too. The growing history creates a kind of inertia — the agent follows the thread of its own narration even when a more direct path exists. The projection agent doesn't have that thread. It looks at the current state of the file, the test output from the last run, and decides what to do next. Sometimes that produces a more direct route.

I'm not confident in this explanation. It's a hypothesis that needs more data. But the pattern is consistent enough across the T4 tasks to be worth taking seriously.

t4_pratt_parser — baseline transcript (30 tool calls, 37,150 tokens)

Turn 1  [context: 386 chars, 1 message]
  → run_command: find . -name "*.py" -type f

Turn 2  [context: 905 chars, 3 messages]
  → read_file: ./test_parser.py

Turn 3  [context: 3,937 chars, 5 messages]
  → read_file: ./parser.py

Turn 4  [context: 5,044 chars, 7 messages]
  → write_file: parser.py  [initial implementation]

Turn 5  [context: 17,421 chars, 9 messages]
  → run_command: python3 -m pytest test_parser.py -v

Turn 6  [context: 19,842 chars, 11 messages]
  → write_file: parser.py  [fix precedence]

Turn 7  [context: 33,156 chars, 13 messages]
  → run_command: python3 -m pytest test_parser.py -v

Turn 8  [context: 35,212 chars, 15 messages]
  → write_file: parser.py  [fix right-associativity]

Turn 9  [context: 49,019 chars, 17 messages]
  → run_command: python3 -m pytest test_parser.py -v

...14 more turns of iteration, context growing to 109,341 chars...

Turn 30 [context: 109,341 chars, 59 messages]
  → write_file: VERIFICATION.txt
  
Result: PASS ✅  30/30 tests  37,150 input tokens

t4_pratt_parser — projection transcript (9 tool calls, 35,595 tokens)

Turn 1  [context: 1 message — goal only]
  → run_command: find . -name "*.py"

Turn 2  [context: 5 messages — goal + cached find output]
  ⟨hypothesis: test_parser.py contains requirements | action: read_file(test_parser.py)⟩
  → read_file: test_parser.py

Turn 3  [context: 7 messages — goal + find + test file]
  ⟨hypothesis: parser.py has a stub to fill | action: read_file(parser.py)⟩
  → read_file: parser.py

Turn 4  [context: 9 messages — goal + find + test + parser stub]
  ⟨hypothesis: I have enough to write the full implementation | action: write_file⟩
  → write_file: parser.py  [complete Pratt parser, 380 lines]

Turn 5  [context: 9 messages — goal + find + test + NEW parser.py]
  → run_command: python3 -m pytest test_parser.py -v

Turn 6  [context: 9 messages — goal + NEW parser.py + pytest output]
  ⟨hypothesis: right-associativity for ** needs fixing | action: write_file⟩
  → write_file: parser.py  [fix exponentiation]

Turn 7  [context: 9 messages — goal + UPDATED parser.py + last pytest]
  → run_command: python3 -m pytest test_parser.py -v

Turn 8  [context: 9 messages — goal + UPDATED parser.py + new pytest]
  ⟨hypothesis: built-in functions need arity handling | action: write_file⟩
  → write_file: parser.py  [fix builtins]

Turn 9  [context: 9 messages — goal + FINAL parser.py + last pytest]
  → run_command: python3 -m pytest test_parser.py -v

Result: PASS ✅  30/30 tests  35,595 input tokens

The baseline ran 21 more tool calls for nearly the same token cost. The projection agent's context stayed at 9 messages throughout — goal, current file state, last action. Each turn it had everything it needed and nothing it didn't.

t4_regex_engine — baseline transcript (22 tool calls, 120,718 tokens)

Turn 1  [context: 301 chars, 1 message]
  → run_command: ls -la

Turn 2  [context: 1,021 chars, 3 messages]
  → read_file: regex.py
  → read_file: test_regex.py

Turn 3  [context: 7,656 chars, 5 messages]
  → write_file: regex.py  [initial NFA implementation, 11,898 bytes]

Turn 4  [context: 20,598 chars, 7 messages]
  → run_command: python -m pytest test_regex.py -v
  [FAIL: python not found]

Turn 5  [context: 21,070 chars, 9 messages]
  → run_command: which python3

...context grows with each iteration: 34k, 47k, 63k, 79k, 93k, 108k chars...

Turn 22 [context: ~119,000 chars, 43 messages]
  → run_command: python3 -m pytest test_regex.py -v

Result: PASS ✅  120,718 input tokens

t4_regex_engine — projection transcript (14 tool calls, 56,545 tokens)

Turn 1  [context: 1 message — goal only]
  → run_command: find . -type f -name "*.py"

Turn 2  [context: 5 messages — goal + find cached]
  ⟨hypothesis: test_regex.py has requirements | action: read_file(test_regex.py)⟩
  → read_file: test_regex.py

Turn 3  [context: 7 messages — goal + find + test file]
  → read_file: regex.py

Turn 4  [context: 9 messages — goal + test + regex stub]
  ⟨hypothesis: I have enough context to write full NFA implementation⟩
  → write_file: regex.py  [full implementation]

Turn 5  [context: 9 messages — goal + NEW regex.py + last action]
  → run_command: python3 -m pytest test_regex.py -v

...8 more turns of targeted iteration, context stays constant at ~9 messages...

Turn 14 [context: 9 messages — goal + FINAL regex.py + last pytest]
  → run_command: python3 -m pytest test_regex.py -v

Result: PASS ✅  56,545 input tokens  (0.47x baseline)

t4_btree — baseline vs projection summary

BASELINE
  30 tool calls
  73,551 input tokens
  Context at final turn: ~85,000 chars, 59 messages
  Result: PASS ✅

PROJECTION
  16 tool calls
  39,051 input tokens  (0.53x)
  Context at every turn: ~9 messages (constant)
  Result: PASS ✅

The breakeven line is becoming clearer

More data makes the pattern more precise. Tasks with fewer than around 10 baseline tool calls cost more under projection. Tasks above that line start going the other way.

The T4 wins aren't random. They're the tasks where the baseline accumulates the most historical content: repeated reads of large files, multiple rounds of test output, early exploration commands that stay in context forever. Projection doesn't carry any of that. It serves the current state of each file and nothing else.

The t4_mini_interpreter result (2.19x, 11 vs 6 tool calls) is the outlier worth understanding. Projection used nearly double the tool calls. That's the world cache filling up and the agent having to re-read files it had already read. The token budget is too small for that task's file sizes. The eviction policy is kicking out things that should stay. That's a tuning problem, not a fundamental one — but it's a real problem that needs solving before projection can be deployed confidently on tasks with large codebases.

What I'm building toward

These experiments are feeding directly into how I'm thinking about the three tools:

synaxi-predict picks the model before the session starts. The turn count predictions it makes are now also a signal for which context strategy to use. A task predicted at 20+ turns is a candidate for projection. A task predicted at 5 is probably better served by baseline with compression.

The projection engine (what these experiments are testing) replaces the accumulating history with constant context reconstructed from current reality. The data says this works at T4 complexity. The open questions are about the world cache: how big should it be, what should the eviction policy be, and how does it handle large codebases with many files.

Synaxi compresses the wire format of whatever context gets sent. Under projection, it's compressing a constant-size context rather than a growing one. The gains per request are smaller, but they compound across a session that doesn't grow.

The three layers address the same problem at different levels. Right model, right context structure, minimal wire overhead. The experiments here are filling in the middle layer.

More to come.

Synaxi is the Mac app that reduces token usage on your existing Claude Code sessions, no configuration required.

The animation

The full data

What changed with T4

The breakeven line is becoming clearer

What I'm building toward

Get new posts in your inbox.