This is part 2 of a series. If you haven't read part 1, start there — it covers the hypothesis, the approach, and the first round of results. This post picks up directly from that data.


After the first round of results, the pattern was clear on long sessions but I only had a handful of T3 tasks to look at. I wanted to know what happened at real complexity: tasks that take 15, 20, 30 tool calls to complete. That's where the baseline context window grows into real money, and where constant context should compound most visibly.

So I ran more. 12 new tasks across T3 and T4 complexity. Here's what happened.

The animation

Before the numbers: here's what the two approaches actually look like as a session runs.

[Animation: token count per turn over a 22-turn session, baseline (accumulating context) vs projection (constant context)]

The baseline bar keeps growing. The projection bar rises and then levels off once the world cache is warm. On the regex engine task that played out as 120,718 baseline tokens vs 56,545 projection. Same implementation, same tests, same pass.
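To see why the gap widens with session length, here is a minimal sketch of cumulative input tokens under each strategy. The per-turn sizes (`goal_tokens`, `tokens_per_turn`, `projected_tokens`) are made-up illustrative numbers, not measurements from these runs, and the model ignores the warm-up phase of the world cache.

```python
def baseline_input_tokens(turns: int, goal_tokens: int, tokens_per_turn: int) -> int:
    """Accumulating context: turn k resends the goal plus all k-1 earlier exchanges."""
    return sum(goal_tokens + (k - 1) * tokens_per_turn for k in range(1, turns + 1))


def projection_input_tokens(turns: int, goal_tokens: int, projected_tokens: int) -> int:
    """Constant context: every turn sends the goal plus a fixed-size projection."""
    return turns * (goal_tokens + projected_tokens)


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the shape of the two curves.
    for turns in (5, 10, 22):
        b = baseline_input_tokens(turns, goal_tokens=100, tokens_per_turn=500)
        p = projection_input_tokens(turns, goal_tokens=100, projected_tokens=2500)
        print(f"{turns:>2} turns  baseline={b:>7,}  projection={p:>7,}  ratio={p / b:.2f}x")
```

Under these assumptions the baseline total grows roughly with the square of the turn count while the projection total grows linearly, so projection costs more on short sessions and less on long ones.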

The full data

The full set is now 23 tasks across T2, T3, and T4 complexity:

(B = baseline, P = projection; Ratio = P tokens / B tokens; tools = tool calls)

Task                          Complexity  B tokens  P tokens  Ratio (P/B)  B tools  P tools
t2_callback_to_async          T2            10,021    15,429        1.54x        7        7
t2_failing_test_suite         T2             9,346    19,872        2.13x        6        8
t2_lru_cache                  T2             5,114     7,137        1.40x        3        3
t3_async_pipeline_bug         T3            12,848    19,802        1.54x        7        9
t3_async_rate_limiter         T3             7,213     7,849        1.09x        4        3
t3_concurrency_race           T3             4,143     5,827        1.41x        3        3
t3_concurrent_bounded_buffer  T3             8,467    13,338        1.58x        4        5
t3_diff_engine                T3            64,757    27,677        0.43x       19       13
t3_dijkstra_negative_weights  T3            14,854    23,591        1.59x        7        8
t3_god_class                  T3             7,034     9,904        1.41x        4        4
t3_lfu_cache                  T3            21,687    18,424        0.85x        7        6
t3_mini_interpreter           T3            13,082    28,641        2.19x        6       11
t3_persistent_queue           T3            11,966    12,745        1.07x        6        5
t3_split_monolith             T3            22,201    32,413        1.46x       14       14
t3_trie_autocomplete          T3            14,996    22,318        1.49x        6        8
t4_btree                      T4            73,551    39,051        0.53x       30       16
t4_consistent_hashing         T4            15,316    22,925        1.50x        8        8
t4_graph_algorithms           T4            12,807    20,615        1.61x        6        8
t4_job_scheduler              T4            12,658    24,895        1.97x       10        7
t4_lsm_storage                T4            44,554    24,278        0.54x       17       11
t4_pratt_parser               T4            37,150    35,595        0.96x       30        9
t4_regex_engine               T4           120,718    56,545        0.47x       22       14
t4_skiplist                   T4            20,374    15,453        0.76x        8        7
Total                                      552,907   501,328        0.91x

23/23 pass rate.

Overall ratio across all 23 tasks: 0.91x. Projection now beats baseline in aggregate.

What changed with T4

The short tasks still cost more under projection. That hasn't changed and it makes sense. But the T4 results shifted the aggregate number and revealed something I didn't expect.

Look at the tool call column. On t4_pratt_parser: baseline took 30 tool calls, projection took 9. Same task, same tests, all 30 assertions passing in both cases. Projection didn't just use fewer tokens — the agent reached the solution more directly.

That's not supposed to happen. The projection agent has less context available, not more. Why would it be more efficient?

My current theory: the baseline agent carries its own prior reasoning in context. It said "I'll fix the grouping logic next" four turns ago and it can still see that. So it does. Then it says something else and it can see that too. The growing history creates a kind of inertia — the agent follows the thread of its own narration even when a more direct path exists. The projection agent doesn't have that thread. It looks at the current state of the file, the test output from the last run, and decides what to do next. Sometimes that produces a more direct route.

I'm not confident in this explanation. It's a hypothesis that needs more data. But the pattern is consistent enough across the T4 tasks to be worth taking seriously.

t4_pratt_parser — baseline transcript (30 tool calls, 37,150 tokens)
Turn 1  [context: 386 chars, 1 message]
  → run_command: find . -name "*.py" -type f

Turn 2  [context: 905 chars, 3 messages]
  → read_file: ./test_parser.py

Turn 3  [context: 3,937 chars, 5 messages]
  → read_file: ./parser.py

Turn 4  [context: 5,044 chars, 7 messages]
  → write_file: parser.py  [initial implementation]

Turn 5  [context: 17,421 chars, 9 messages]
  → run_command: python3 -m pytest test_parser.py -v

Turn 6  [context: 19,842 chars, 11 messages]
  → write_file: parser.py  [fix precedence]

Turn 7  [context: 33,156 chars, 13 messages]
  → run_command: python3 -m pytest test_parser.py -v

Turn 8  [context: 35,212 chars, 15 messages]
  → write_file: parser.py  [fix right-associativity]

Turn 9  [context: 49,019 chars, 17 messages]
  → run_command: python3 -m pytest test_parser.py -v

...20 more turns of iteration (turns 10 to 29), context growing to 109,341 chars...

Turn 30 [context: 109,341 chars, 59 messages]
  → write_file: VERIFICATION.txt
  
Result: PASS ✅  30/30 tests  37,150 input tokens

t4_pratt_parser — projection transcript (9 tool calls, 35,595 tokens)
Turn 1  [context: 1 message — goal only]
  → run_command: find . -name "*.py"

Turn 2  [context: 5 messages — goal + cached find output]
  ⟨hypothesis: test_parser.py contains requirements | action: read_file(test_parser.py)⟩
  → read_file: test_parser.py

Turn 3  [context: 7 messages — goal + find + test file]
  ⟨hypothesis: parser.py has a stub to fill | action: read_file(parser.py)⟩
  → read_file: parser.py

Turn 4  [context: 9 messages — goal + find + test + parser stub]
  ⟨hypothesis: I have enough to write the full implementation | action: write_file⟩
  → write_file: parser.py  [complete Pratt parser, 380 lines]

Turn 5  [context: 9 messages — goal + find + test + NEW parser.py]
  → run_command: python3 -m pytest test_parser.py -v

Turn 6  [context: 9 messages — goal + NEW parser.py + pytest output]
  ⟨hypothesis: right-associativity for ** needs fixing | action: write_file⟩
  → write_file: parser.py  [fix exponentiation]

Turn 7  [context: 9 messages — goal + UPDATED parser.py + last pytest]
  → run_command: python3 -m pytest test_parser.py -v

Turn 8  [context: 9 messages — goal + UPDATED parser.py + new pytest]
  ⟨hypothesis: built-in functions need arity handling | action: write_file⟩
  → write_file: parser.py  [fix builtins]

Turn 9  [context: 9 messages — goal + FINAL parser.py + last pytest]
  → run_command: python3 -m pytest test_parser.py -v

Result: PASS ✅  30/30 tests  35,595 input tokens

The baseline ran 21 more tool calls for nearly the same token cost. The projection agent's context reached 9 messages by turn 4 and stayed there: goal, current file state, last action. Each turn it had everything it needed and nothing it didn't.
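The difference between the two context shapes can be sketched in a few lines. This is an illustration under stated assumptions, not the projection engine itself: `Session`, `baseline_context`, and `projection_context` are hypothetical names, and the real engine evidently sends more than this sketch does (9 messages in the transcripts above).

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    goal: str
    history: list[str] = field(default_factory=list)     # every message so far
    files: dict[str, str] = field(default_factory=dict)  # path -> current contents
    last_result: str = ""                                 # output of the latest tool call

    def record(self, action: str, result: str) -> None:
        """Log one tool call and its result."""
        self.history += [action, result]
        self.last_result = result


def baseline_context(s: Session) -> list[str]:
    """Accumulating strategy: the goal plus the full message history."""
    return [s.goal, *s.history]


def projection_context(s: Session) -> list[str]:
    """Constant strategy: the goal, the current file contents, the latest result."""
    current_files = [f"{path}:\n{body}" for path, body in sorted(s.files.items())]
    return [s.goal, *current_files, s.last_result]


if __name__ == "__main__":
    s = Session(goal="make test_parser.py pass")
    for attempt in range(1, 4):
        s.files["parser.py"] = f"# parser, revision {attempt}"
        s.record(f"write_file parser.py (rev {attempt})", "ok")
        s.record("run pytest", f"{attempt} failures remaining")
    print(len(baseline_context(s)), len(projection_context(s)))
```

Each rewrite of `parser.py` adds messages to the baseline context, while the projected context only ever holds the latest revision and the latest test output.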

t4_regex_engine — baseline transcript (22 tool calls, 120,718 tokens)
Turn 1  [context: 301 chars, 1 message]
  → run_command: ls -la

Turn 2  [context: 1,021 chars, 3 messages]
  → read_file: regex.py
  → read_file: test_regex.py

Turn 3  [context: 7,656 chars, 5 messages]
  → write_file: regex.py  [initial NFA implementation, 11,898 bytes]

Turn 4  [context: 20,598 chars, 7 messages]
  → run_command: python -m pytest test_regex.py -v
  [FAIL: python not found]

Turn 5  [context: 21,070 chars, 9 messages]
  → run_command: which python3

...context grows with each iteration: 34k, 47k, 63k, 79k, 93k, 108k chars...

Turn 22 [context: ~119,000 chars, 43 messages]
  → run_command: python3 -m pytest test_regex.py -v

Result: PASS ✅  120,718 input tokens

t4_regex_engine — projection transcript (14 tool calls, 56,545 tokens)
Turn 1  [context: 1 message — goal only]
  → run_command: find . -type f -name "*.py"

Turn 2  [context: 5 messages — goal + find cached]
  ⟨hypothesis: test_regex.py has requirements | action: read_file(test_regex.py)⟩
  → read_file: test_regex.py

Turn 3  [context: 7 messages — goal + find + test file]
  → read_file: regex.py

Turn 4  [context: 9 messages — goal + test + regex stub]
  ⟨hypothesis: I have enough context to write full NFA implementation⟩
  → write_file: regex.py  [full implementation]

Turn 5  [context: 9 messages — goal + NEW regex.py + last action]
  → run_command: python3 -m pytest test_regex.py -v

...8 more turns of targeted iteration, context stays constant at ~9 messages...

Turn 14 [context: 9 messages — goal + FINAL regex.py + last pytest]
  → run_command: python3 -m pytest test_regex.py -v

Result: PASS ✅  56,545 input tokens  (0.47x baseline)

t4_btree — baseline vs projection summary
BASELINE
  30 tool calls
  73,551 input tokens
  Context at final turn: ~85,000 chars, 59 messages
  Result: PASS ✅

PROJECTION
  16 tool calls
  39,051 input tokens  (0.53x)
  Context at every turn: ~9 messages (constant)
  Result: PASS ✅

The breakeven line is becoming clearer

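More data makes the pattern more precise. Tasks with fewer than around 10 baseline tool calls mostly cost more under projection (14 of 16 in this set). Tasks at or above that line mostly go the other way (5 of 7). The line is fuzzy: t3_lfu_cache and t4_skiplist come out cheaper at 7 and 8 calls, while t3_split_monolith and t4_job_scheduler still cost more at 14 and 10.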
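One way to check the breakeven claim against the table is to flag every task that does not fit a simple threshold rule. The sketch below assumes a cutoff of exactly 10 baseline tool calls, which is a reading of "around 10" rather than a fitted value; `tasks_against_rule` is a hypothetical helper name.

```python
# (task, baseline tool calls, projection/baseline token ratio), copied from the results table.
RESULTS = [
    ("t2_callback_to_async", 7, 1.54), ("t2_failing_test_suite", 6, 2.13),
    ("t2_lru_cache", 3, 1.40), ("t3_async_pipeline_bug", 7, 1.54),
    ("t3_async_rate_limiter", 4, 1.09), ("t3_concurrency_race", 3, 1.41),
    ("t3_concurrent_bounded_buffer", 4, 1.58), ("t3_diff_engine", 19, 0.43),
    ("t3_dijkstra_negative_weights", 7, 1.59), ("t3_god_class", 4, 1.41),
    ("t3_lfu_cache", 7, 0.85), ("t3_mini_interpreter", 6, 2.19),
    ("t3_persistent_queue", 6, 1.07), ("t3_split_monolith", 14, 1.46),
    ("t3_trie_autocomplete", 6, 1.49), ("t4_btree", 30, 0.53),
    ("t4_consistent_hashing", 8, 1.50), ("t4_graph_algorithms", 6, 1.61),
    ("t4_job_scheduler", 10, 1.97), ("t4_lsm_storage", 17, 0.54),
    ("t4_pratt_parser", 30, 0.96), ("t4_regex_engine", 22, 0.47),
    ("t4_skiplist", 8, 0.76),
]


def tasks_against_rule(results, threshold=10):
    """Tasks where 'projection wins iff baseline tool calls >= threshold' fails."""
    exceptions = []
    for task, baseline_tools, ratio in results:
        predicted_win = baseline_tools >= threshold
        actual_win = ratio < 1.0  # ratio below 1 means projection used fewer tokens
        if predicted_win != actual_win:
            exceptions.append(task)
    return exceptions


if __name__ == "__main__":
    misses = tasks_against_rule(RESULTS)
    print(f"{len(RESULTS) - len(misses)} of {len(RESULTS)} tasks fit the rule")
    print("exceptions:", ", ".join(misses))
```

Moving `threshold` up or down is a quick way to see how sensitive the rule is to where the line is drawn.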

The T4 wins aren't random. They're the tasks where the baseline accumulates the most historical content: repeated reads of large files, multiple rounds of test output, early exploration commands that stay in context forever. Projection doesn't carry any of that. It serves the current state of each file and nothing else.

The t3_mini_interpreter result (2.19x, 11 vs 6 tool calls) is the outlier worth understanding. Projection used nearly double the tool calls. That's the world cache filling up and the agent having to re-read files it had already read. The token budget is too small for that task's file sizes, so the eviction policy is kicking out things that should stay. That's a tuning problem, not a fundamental one, but it's a real problem that needs solving before projection can be deployed confidently on tasks with large codebases.
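The re-read failure mode is easy to reproduce in a toy model. The sketch below assumes a token-budgeted cache with least-recently-used eviction; the post does not say which eviction policy the world cache actually uses, so `WorldCache`, `count_rereads`, and the file sizes are all hypothetical.

```python
from collections import OrderedDict


class WorldCache:
    """Token-budgeted file cache with least-recently-used eviction (an assumed policy)."""

    def __init__(self, budget_tokens: int) -> None:
        self.budget = budget_tokens
        self.entries: OrderedDict[str, int] = OrderedDict()  # path -> size in tokens
        self.seen: set[str] = set()
        self.rereads = 0

    def read(self, path: str, size_tokens: int) -> bool:
        """Return True on a cache hit; on a miss, load the file and evict as needed."""
        if path in self.entries:
            self.entries.move_to_end(path)    # mark as most recently used
            return True
        if path in self.seen:
            self.rereads += 1                 # the agent spends another tool call
        self.seen.add(path)
        self.entries[path] = size_tokens
        while sum(self.entries.values()) > self.budget and len(self.entries) > 1:
            self.entries.popitem(last=False)  # drop the least recently used file
        return False


def count_rereads(budget_tokens: int, accesses: list[tuple[str, int]]) -> int:
    """Replay a sequence of (path, size) reads and count how many were repeats."""
    cache = WorldCache(budget_tokens)
    for path, size in accesses:
        cache.read(path, size)
    return cache.rereads


if __name__ == "__main__":
    # Three hypothetical 3,000-token files, each needed twice.
    pattern = [("lexer.py", 3000), ("parser.py", 3000), ("eval.py", 3000)] * 2
    for budget in (6000, 9000):
        print(f"budget={budget:,}  rereads={count_rereads(budget, pattern)}")
```

With a budget that holds only two of the three files, every second-round access is a miss. With room for all three, there are no re-reads at all, which is why the budget and the eviction policy need tuning together.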

What I'm building toward

These experiments are feeding directly into how I'm thinking about the three tools:

synaxi-predict picks the model before the session starts. The turn count predictions it makes are now also a signal for which context strategy to use. A task predicted at 20+ turns is a candidate for projection. A task predicted at 5 is probably better served by baseline with compression.

The projection engine (what these experiments are testing) replaces the accumulating history with constant context reconstructed from current reality. The data says this works at T4 complexity. The open questions are about the world cache: how big should it be, what should the eviction policy be, and how does it handle large codebases with many files.

Synaxi compresses the wire format of whatever context gets sent. Under projection, it's compressing a constant-size context rather than a growing one. The gains per request are smaller, but they compound across a session that doesn't grow.

The three layers address the same problem at different levels. Right model, right context structure, minimal wire overhead. The experiments here are filling in the middle layer.
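The turn-count signal described above could be wired up as a simple cutoff. This is a sketch under assumptions: `choose_context_strategy` is a hypothetical function, not part of synaxi-predict, and treating everything below 20 predicted turns as baseline is a simplification, since only the 20+ and 5-turn cases are stated.

```python
def choose_context_strategy(predicted_turns: int, projection_cutoff: int = 20) -> str:
    """Pick a context strategy from a predicted turn count.

    The default cutoff of 20 mirrors the "20+ turns" rule of thumb; the handling
    of the range between 5 and 20 turns is an assumption of this sketch.
    """
    if predicted_turns < 0:
        raise ValueError("predicted_turns must be non-negative")
    if predicted_turns >= projection_cutoff:
        return "projection"
    return "baseline+compression"


if __name__ == "__main__":
    for turns in (5, 12, 22):
        print(turns, "->", choose_context_strategy(turns))
```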

More to come.


Synaxi is the Mac app that reduces token usage on your existing Claude Code sessions, no configuration required.