This is part 2 of a series. If you haven't read part 1, start there — it covers the hypothesis, the approach, and the first round of results. This post picks up directly from that data.
After the first round of results, the pattern was clear on long sessions but I only had a handful of T3 tasks to look at. I wanted to know what happened at real complexity: tasks that take 15, 20, 30 tool calls to complete. That's where the baseline context window grows into real money, and where constant context should compound most visibly.
So I ran more. 12 new tasks across T3 and T4 complexity. Here's what happened.
The animation
Before the numbers: here's what the two approaches actually look like as a session runs.
The baseline bar keeps growing. The projection bar rises and then levels off once the world cache is warm. On the regex engine task that played out as 120,718 baseline tokens vs 56,545 projection. Same implementation, same tests, same pass.
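The shape of those two bars can be reproduced with a toy cost model. This is a minimal sketch, not the harness used for these experiments: every number in it (goal size, growth per turn, warm cache size, per-turn projection overhead) is a made-up assumption, chosen only to show why an append-only history gets expensive faster than a capped one.

```python
def baseline_context(turns, goal=300, per_turn_growth=1_500):
    # Baseline: every tool call and result stays in the history, so the
    # context sent on turn n grows roughly linearly with n.
    return [goal + per_turn_growth * n for n in range(turns)]


def projection_context(turns, goal=300, per_turn_growth=1_500,
                       warm_size=6_000, overhead=1_500):
    # Projection: context grows while the cache warms up, then is capped.
    # `overhead` is an assumed fixed per-turn cost for the projection scaffolding.
    return [overhead + min(goal + per_turn_growth * n, warm_size)
            for n in range(turns)]


def session_ratio(turns):
    # Each turn re-sends its whole context, so session cost is the sum over turns.
    return sum(projection_context(turns)) / sum(baseline_context(turns))


for turns in (5, 10, 20, 30):
    b = sum(baseline_context(turns))
    p = sum(projection_context(turns))
    print(f"{turns:>2} turns: baseline={b:>8,} projection={p:>8,} ratio={p / b:.2f}")
```

Under these assumptions a 5-turn session costs more with projection and a 30-turn session costs far less, because the baseline total grows roughly with the square of the turn count while the projection total grows linearly.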
The full data
The full set is now 23 tasks across T2, T3, and T4 complexity:
| Task | Complexity | Baseline tokens | Projection tokens | Ratio (P/B) | Baseline tool calls | Projection tool calls |
|---|---|---|---|---|---|---|
| t2_callback_to_async | T2 | 10,021 | 15,429 | 1.54x | 7 | 7 |
| t2_failing_test_suite | T2 | 9,346 | 19,872 | 2.13x | 6 | 8 |
| t2_lru_cache | T2 | 5,114 | 7,137 | 1.40x | 3 | 3 |
| t3_async_pipeline_bug | T3 | 12,848 | 19,802 | 1.54x | 7 | 9 |
| t3_async_rate_limiter | T3 | 7,213 | 7,849 | 1.09x | 4 | 3 |
| t3_concurrency_race | T3 | 4,143 | 5,827 | 1.41x | 3 | 3 |
| t3_concurrent_bounded_buffer | T3 | 8,467 | 13,338 | 1.58x | 4 | 5 |
| t3_diff_engine | T3 | 64,757 | 27,677 | 0.43x | 19 | 13 |
| t3_dijkstra_negative_weights | T3 | 14,854 | 23,591 | 1.59x | 7 | 8 |
| t3_god_class | T3 | 7,034 | 9,904 | 1.41x | 4 | 4 |
| t3_lfu_cache | T3 | 21,687 | 18,424 | 0.85x | 7 | 6 |
| t3_mini_interpreter | T3 | 13,082 | 28,641 | 2.19x | 6 | 11 |
| t3_persistent_queue | T3 | 11,966 | 12,745 | 1.07x | 6 | 5 |
| t3_split_monolith | T3 | 22,201 | 32,413 | 1.46x | 14 | 14 |
| t3_trie_autocomplete | T3 | 14,996 | 22,318 | 1.49x | 6 | 8 |
| t4_btree | T4 | 73,551 | 39,051 | 0.53x | 30 | 16 |
| t4_consistent_hashing | T4 | 15,316 | 22,925 | 1.50x | 8 | 8 |
| t4_graph_algorithms | T4 | 12,807 | 20,615 | 1.61x | 6 | 8 |
| t4_job_scheduler | T4 | 12,658 | 24,895 | 1.97x | 10 | 7 |
| t4_lsm_storage | T4 | 44,554 | 24,278 | 0.54x | 17 | 11 |
| t4_pratt_parser | T4 | 37,150 | 35,595 | 0.96x | 30 | 9 |
| t4_regex_engine | T4 | 120,718 | 56,545 | 0.47x | 22 | 14 |
| t4_skiplist | T4 | 20,374 | 15,453 | 0.76x | 8 | 7 |
| Total | | 564,857 | 504,324 | 0.89x | | |
23/23 pass rate.
Overall ratio across all 23 tasks: 0.89x. Projection now beats baseline in aggregate.
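For reading the table: the ratio is projection tokens divided by baseline tokens, so anything below 1.0 means projection was cheaper. A small sketch using four rows from the table (the `ratio` helper name is just for illustration):

```python
def ratio(baseline_tokens: int, projection_tokens: int) -> float:
    """Projection cost relative to baseline; below 1.0 means projection was cheaper."""
    return projection_tokens / baseline_tokens


# (baseline tokens, projection tokens), copied from the table
rows = {
    "t2_lru_cache": (5_114, 7_137),
    "t3_diff_engine": (64_757, 27_677),
    "t4_pratt_parser": (37_150, 35_595),
    "t4_regex_engine": (120_718, 56_545),
}

for task, (b, p) in rows.items():
    r = ratio(b, p)
    cheaper = "projection" if r < 1.0 else "baseline"
    print(f"{task:<18} {r:.2f}x  cheaper: {cheaper}")
```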
What changed with T4
The short tasks still cost more under projection. That hasn't changed and it makes sense. But the T4 results shifted the aggregate number and revealed something I didn't expect.
Look at the tool call column. On t4_pratt_parser: baseline took 30 tool calls, projection took 9. Same task, same tests, all 30 assertions passing in both cases. Projection didn't just use fewer tokens — the agent reached the solution more directly.
That's not supposed to happen. The projection agent has less context available, not more. Why would it be more efficient?
My current theory: the baseline agent carries its own prior reasoning in context. It said "I'll fix the grouping logic next" four turns ago and it can still see that. So it does. Then it says something else and it can see that too. The growing history creates a kind of inertia — the agent follows the thread of its own narration even when a more direct path exists. The projection agent doesn't have that thread. It looks at the current state of the file, the test output from the last run, and decides what to do next. Sometimes that produces a more direct route.
I'm not confident in this explanation. It's a hypothesis that needs more data. But the pattern is consistent enough across the T4 tasks to be worth taking seriously.
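To make the difference concrete, here is a simplified sketch of the two context strategies. It is an illustration under assumptions, not the actual projection engine: the class names, the message tuples, and the three-part projected context (goal, current files, last result) are all hypothetical, and the real engine keeps more than this.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BaselineSession:
    """Append-only history: every message is re-sent on every turn."""
    goal: str
    history: list = field(default_factory=list)

    def record(self, action: str, result: str) -> None:
        # The agent's own narration stays visible for the rest of the session.
        self.history += [("assistant", action), ("tool", result)]

    def context(self) -> list:
        return [("user", self.goal)] + self.history


@dataclass
class ProjectionSession:
    """Context rebuilt each turn from current state, not from the transcript."""
    goal: str
    files: dict = field(default_factory=dict)  # path -> latest contents
    last_result: str = ""

    def record(self, action: str, result: str,
               path: Optional[str] = None, contents: Optional[str] = None) -> None:
        # `action` is deliberately not stored: earlier narration is dropped.
        if path is not None:
            self.files[path] = contents        # overwrite, old versions are gone
        self.last_result = result

    def context(self) -> list:
        msgs = [("user", self.goal)]
        msgs += [("file", f"{p}\n{c}") for p, c in self.files.items()]
        if self.last_result:
            msgs.append(("tool", self.last_result))
        return msgs


baseline = BaselineSession("make test_parser.py pass")
projection = ProjectionSession("make test_parser.py pass")

for attempt in range(1, 4):  # three write-then-test cycles
    source = f"# parser.py, attempt {attempt}"
    baseline.record(f"write_file parser.py (attempt {attempt})", "ok")
    baseline.record("run pytest", f"pytest output {attempt}")
    projection.record("write_file", "ok", path="parser.py", contents=source)
    projection.record("run pytest", f"pytest output {attempt}")

print(len(baseline.context()), len(projection.context()))  # prints: 13 3
```

After three cycles the baseline context still contains all three versions of the plan and all three test outputs. The projected context contains only the latest file and the latest test run, so there is no earlier narration for the agent to follow.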
t4_pratt_parser — baseline transcript (30 tool calls, 37,150 tokens)
Turn 1 [context: 386 chars, 1 message]
→ run_command: find . -name "*.py" -type f
Turn 2 [context: 905 chars, 3 messages]
→ read_file: ./test_parser.py
Turn 3 [context: 3,937 chars, 5 messages]
→ read_file: ./parser.py
Turn 4 [context: 5,044 chars, 7 messages]
→ write_file: parser.py [initial implementation]
Turn 5 [context: 17,421 chars, 9 messages]
→ run_command: python3 -m pytest test_parser.py -v
Turn 6 [context: 19,842 chars, 11 messages]
→ write_file: parser.py [fix precedence]
Turn 7 [context: 33,156 chars, 13 messages]
→ run_command: python3 -m pytest test_parser.py -v
Turn 8 [context: 35,212 chars, 15 messages]
→ write_file: parser.py [fix right-associativity]
Turn 9 [context: 49,019 chars, 17 messages]
→ run_command: python3 -m pytest test_parser.py -v
...20 more turns of iteration, context growing to 109,341 chars...
Turn 30 [context: 109,341 chars, 59 messages]
→ write_file: VERIFICATION.txt
Result: PASS ✅ 30/30 tests 37,150 input tokens
t4_pratt_parser — projection transcript (9 tool calls, 35,595 tokens)
Turn 1 [context: 1 message — goal only]
→ run_command: find . -name "*.py"
Turn 2 [context: 5 messages — goal + cached find output]
⟨hypothesis: test_parser.py contains requirements | action: read_file(test_parser.py)⟩
→ read_file: test_parser.py
Turn 3 [context: 7 messages — goal + find + test file]
⟨hypothesis: parser.py has a stub to fill | action: read_file(parser.py)⟩
→ read_file: parser.py
Turn 4 [context: 9 messages — goal + find + test + parser stub]
⟨hypothesis: I have enough to write the full implementation | action: write_file⟩
→ write_file: parser.py [complete Pratt parser, 380 lines]
Turn 5 [context: 9 messages — goal + find + test + NEW parser.py]
→ run_command: python3 -m pytest test_parser.py -v
Turn 6 [context: 9 messages — goal + NEW parser.py + pytest output]
⟨hypothesis: right-associativity for ** needs fixing | action: write_file⟩
→ write_file: parser.py [fix exponentiation]
Turn 7 [context: 9 messages — goal + UPDATED parser.py + last pytest]
→ run_command: python3 -m pytest test_parser.py -v
Turn 8 [context: 9 messages — goal + UPDATED parser.py + new pytest]
⟨hypothesis: built-in functions need arity handling | action: write_file⟩
→ write_file: parser.py [fix builtins]
Turn 9 [context: 9 messages — goal + FINAL parser.py + last pytest]
→ run_command: python3 -m pytest test_parser.py -v
Result: PASS ✅ 30/30 tests 35,595 input tokens
The baseline ran 21 more tool calls for nearly the same token cost. The projection agent's context levelled off at 9 messages from turn 4 onward: goal, current file state, last action. Each turn it had everything it needed and nothing it didn't.
t4_regex_engine — baseline transcript (22 tool calls, 120,718 tokens)
Turn 1 [context: 301 chars, 1 message]
→ run_command: ls -la
Turn 2 [context: 1,021 chars, 3 messages]
→ read_file: regex.py
→ read_file: test_regex.py
Turn 3 [context: 7,656 chars, 5 messages]
→ write_file: regex.py [initial NFA implementation, 11,898 bytes]
Turn 4 [context: 20,598 chars, 7 messages]
→ run_command: python -m pytest test_regex.py -v
[FAIL: python not found]
Turn 5 [context: 21,070 chars, 9 messages]
→ run_command: which python3
...context grows with each iteration: 34k, 47k, 63k, 79k, 93k, 108k chars...
Turn 22 [context: ~119,000 chars, 43 messages]
→ run_command: python3 -m pytest test_regex.py -v
Result: PASS ✅ 120,718 input tokens
t4_regex_engine — projection transcript (14 tool calls, 56,545 tokens)
Turn 1 [context: 1 message — goal only]
→ run_command: find . -type f -name "*.py"
Turn 2 [context: 5 messages — goal + find cached]
⟨hypothesis: test_regex.py has requirements | action: read_file(test_regex.py)⟩
→ read_file: test_regex.py
Turn 3 [context: 7 messages — goal + find + test file]
→ read_file: regex.py
Turn 4 [context: 9 messages — goal + test + regex stub]
⟨hypothesis: I have enough context to write full NFA implementation⟩
→ write_file: regex.py [full implementation]
Turn 5 [context: 9 messages — goal + NEW regex.py + last action]
→ run_command: python3 -m pytest test_regex.py -v
...8 more turns of targeted iteration, context stays constant at ~9 messages...
Turn 14 [context: 9 messages — goal + FINAL regex.py + last pytest]
→ run_command: python3 -m pytest test_regex.py -v
Result: PASS ✅ 56,545 input tokens (0.47x baseline)
t4_btree — baseline vs projection summary
BASELINE
30 tool calls
73,551 input tokens
Context at final turn: ~85,000 chars, 59 messages
Result: PASS ✅
PROJECTION
16 tool calls
39,051 input tokens (0.53x)
Context at every turn: ~9 messages (constant)
Result: PASS ✅
The breakeven line is becoming clearer
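More data makes the pattern more precise. Most tasks with fewer than around 10 baseline tool calls cost more under projection, and most tasks above that line go the other way. The line isn't clean yet: t3_lfu_cache (7 calls, 0.85x) and t4_skiplist (8 calls, 0.76x) win below it, while t3_split_monolith (14 calls, 1.46x) and t4_job_scheduler (10 calls, 1.97x) still lose at or above it.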
The T4 wins aren't random. They're the tasks where the baseline accumulates the most historical content: repeated reads of large files, multiple rounds of test output, early exploration commands that stay in context forever. Projection doesn't carry any of that. It serves the current state of each file and nothing else.
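One way to check the breakeven claim against the table is to split the tasks at 10 baseline tool calls and count how many on each side came out cheaper under projection. The threshold of 10 is the rough figure from the text, not a fitted value, and the helper names are only for illustration.

```python
# (task, baseline tool calls, projection/baseline token ratio), from the table
RESULTS = [
    ("t2_callback_to_async", 7, 1.54), ("t2_failing_test_suite", 6, 2.13),
    ("t2_lru_cache", 3, 1.40), ("t3_async_pipeline_bug", 7, 1.54),
    ("t3_async_rate_limiter", 4, 1.09), ("t3_concurrency_race", 3, 1.41),
    ("t3_concurrent_bounded_buffer", 4, 1.58), ("t3_diff_engine", 19, 0.43),
    ("t3_dijkstra_negative_weights", 7, 1.59), ("t3_god_class", 4, 1.41),
    ("t3_lfu_cache", 7, 0.85), ("t3_mini_interpreter", 6, 2.19),
    ("t3_persistent_queue", 6, 1.07), ("t3_split_monolith", 14, 1.46),
    ("t3_trie_autocomplete", 6, 1.49), ("t4_btree", 30, 0.53),
    ("t4_consistent_hashing", 8, 1.50), ("t4_graph_algorithms", 6, 1.61),
    ("t4_job_scheduler", 10, 1.97), ("t4_lsm_storage", 17, 0.54),
    ("t4_pratt_parser", 30, 0.96), ("t4_regex_engine", 22, 0.47),
    ("t4_skiplist", 8, 0.76),
]


def split(results, threshold):
    """Partition tasks by baseline tool calls: below vs at-or-above the threshold."""
    below = [r for r in results if r[1] < threshold]
    at_or_above = [r for r in results if r[1] >= threshold]
    return below, at_or_above


def wins(rows):
    """Count tasks where projection used fewer tokens than baseline."""
    return sum(1 for _, _, ratio in rows if ratio < 1.0)


below, at_or_above = split(RESULTS, threshold=10)
print(f"below 10 calls:       {wins(below)} of {len(below)} cheaper under projection")
print(f"10 calls or more:     {wins(at_or_above)} of {len(at_or_above)} cheaper under projection")
```

On this data, 2 of the 16 tasks below the threshold and 5 of the 7 at or above it came out cheaper under projection.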
The t3_mini_interpreter result (2.19x, 11 vs 6 tool calls) is the outlier worth understanding. Projection used nearly double the tool calls. That's the world cache filling up and the agent having to re-read files it had already read. The token budget is too small for that task's file sizes, so the eviction policy is kicking out things that should stay. That's a tuning problem, not a fundamental one, but it's a real problem that needs solving before projection can be deployed confidently on tasks with large codebases.
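The failure mode is easy to reproduce in miniature. The sketch below assumes a token-budgeted cache with least-recently-used eviction. That policy is an assumption for illustration, since the actual world cache policy isn't specified here, and the file names and sizes are invented.

```python
from collections import OrderedDict


class WorldCache:
    """Hypothetical file cache with a token budget and LRU eviction."""

    def __init__(self, budget_tokens: int):
        self.budget = budget_tokens
        self.entries = OrderedDict()  # path -> token count, least recent first
        self.seen = set()             # every path ever loaded
        self.rereads = 0              # loads of a file that was evicted earlier

    def used(self) -> int:
        return sum(self.entries.values())

    def read(self, path: str, tokens: int) -> None:
        if path in self.entries:
            self.entries.move_to_end(path)    # cache hit: mark as recently used
            return
        if path in self.seen:
            self.rereads += 1                 # evicted earlier, so the agent pays again
        self.seen.add(path)
        self.entries[path] = tokens
        # Evict least recently used files until the cache fits the budget again.
        while self.used() > self.budget and len(self.entries) > 1:
            self.entries.popitem(last=False)


def count_rereads(budget_tokens, accesses):
    cache = WorldCache(budget_tokens)
    for path, tokens in accesses:
        cache.read(path, tokens)
    return cache.rereads


# Invented working set: three files the agent cycles through twice.
accesses = [("test_interpreter.py", 3_000),
            ("interpreter.py", 5_000),
            ("lexer.py", 2_500)] * 2

print(count_rereads(8_000, accesses))   # budget smaller than the working set
print(count_rereads(12_000, accesses))  # budget that holds all three files
```

With the 8,000-token budget, every file in the second pass has already been evicted, so all three are re-read. With 12,000 tokens the whole working set fits and nothing is re-read. A budget slightly smaller than the working set is the worst case for LRU, which is one reason the eviction policy matters as much as the budget.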
What I'm building toward
These experiments are feeding directly into how I'm thinking about the three tools:
synaxi-predict picks the model before the session starts. The turn count predictions it makes are now also a signal for which context strategy to use. A task predicted at 20+ turns is a candidate for projection. A task predicted at 5 is probably better served by baseline with compression.
The projection engine (what these experiments are testing) replaces the accumulating history with constant context reconstructed from current reality. The data says this works at T4 complexity. The open questions are about the world cache: how big should it be, what should the eviction policy be, and how does it handle large codebases with many files.
Synaxi compresses the wire format of whatever context gets sent. Under projection, it's compressing a constant-size context rather than a growing one. The gains per request are smaller, but they compound across a session that doesn't grow.
The three layers address the same problem at different levels. Right model, right context structure, minimal wire overhead. The experiments here are filling in the middle layer.
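The routing idea described for synaxi-predict can be written down as a tiny rule. This is a sketch only: the function name and return labels are hypothetical, the 20 and 5 cutoffs are the two examples given above, and the range between them is left open because the data doesn't settle it yet.

```python
def choose_context_strategy(predicted_turns: int) -> str:
    """Pick a context strategy from a predicted turn count (illustrative cutoffs)."""
    if predicted_turns >= 20:
        return "projection"            # long session: constant context should pay off
    if predicted_turns <= 5:
        return "baseline+compression"  # short session: projection costs more here
    return "either"                    # open question: near the breakeven line


for turns in (5, 12, 25):
    print(turns, choose_context_strategy(turns))
```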
More to come.
Synaxi is the Mac app that reduces token usage on your existing Claude Code sessions, no configuration required.