The promise of agent swarms is real. Spin up a fleet of coding agents, point them at a backlog, and let them work. Tasks get done in parallel. Velocity goes up. You stop being the bottleneck.

The problem is that agents drift. Not in the way a junior engineer drifts, through boredom or distraction, but because language models are generative. Given the same task twice, they make different choices. Given a vague task, they make choices you didn't anticipate. Given a task with no exit criteria, they decide for themselves when they're done.

A single agent, human-supervised, is manageable. You catch the drift on review. A swarm running unattended is a different problem. The errors compound before anyone sees them.

The gap between capability and consistency

There's a version of this that works fine in demos. You give an agent a well-scoped task, it executes cleanly, you show the result. Everything looks great. The demo succeeds because the task was chosen to succeed.

Production workflows have messier inputs. Tasks come from a backlog, written by different people, at different levels of specificity. Some have clear acceptance criteria. Many don't. An agent left to interpret ambiguous requirements will produce confident, plausible, wrong output. There's no equivalent of a compiler error to catch it.

The gap between "capable of doing the task" and "consistently does the task correctly" is almost entirely a question of structure. Agents are capable. Structure is what's missing.

The instinct when an agent fails is to reach for a smarter model. Upgrade to Opus, tighten the prompt, try again. Sometimes that helps. But most agent failures aren't failures of intelligence; they're failures of structure. A more capable model given an ambiguous brief will produce a more elaborate wrong answer. The problem isn't the model. It's that nothing is holding the model accountable to a checkable output.

What deterministic progress gates actually are

The first instinct is to think of this as unit tests. Run the test suite, check the exit code, done. That's a fine gate for a workflow that's about writing code. But most real workflows aren't about writing code. They're about generating reports, migrating records, classifying documents, drafting contracts, enriching customer data, or producing any other output where the measure of success is specific to the business problem, not the test framework.

What actually works is scripts. Not tests in the software development sense, but purpose-built scripts that the agent calls, which evaluate the agent's output against a specific expectation, the same way every time.

The linter is the familiar example: the agent runs the linter, the linter reports violations, the agent can't decide the output "looks fine" and move on. The linter's judgment is not negotiable.

A more instructive example is a dataset diff. Suppose an agent's job is to recreate, in your cloud lakehouse, a dataset that currently lives in a legacy database. The data should be the same, but the dialect, types, and formats it starts with are all different. You write a script that takes two datasets as inputs (the legacy source and the output the agent just produced) and performs a cell-level comparison. The agent calls this script. The script reports exactly which cells differ, how many, and whether that number is zero. The agent doesn't get to narrate its way to a passing grade. Either the migration is cell-for-cell identical, or it isn't.
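To make that concrete, here is a minimal sketch of what such a comparison script might look like. It assumes both datasets have already been loaded into lists of dictionaries with values normalised to a common representation, and that every row carries a unique key column. The function name diff_cells and the row format are illustrative choices, not part of any particular tool.

```python
from typing import Any

Row = dict[str, Any]


def _index(rows: list[Row], key: str, label: str, problems: list[str]) -> dict[Any, Row]:
    """Index rows by key, recording any duplicated keys as problems."""
    indexed: dict[Any, Row] = {}
    for row in rows:
        k = row[key]
        if k in indexed:
            problems.append(f"{key}={k}: duplicated in {label}")
        indexed[k] = row
    return indexed


def diff_cells(source: list[Row], output: list[Row], key: str) -> list[str]:
    """Return one line per difference. An empty list means cell-for-cell identical."""
    problems: list[str] = []
    src = _index(source, key, "source", problems)
    out = _index(output, key, "output", problems)

    # Records present on one side only.
    for k in sorted(src.keys() - out.keys()):
        problems.append(f"{key}={k}: missing from output")
    for k in sorted(out.keys() - src.keys()):
        problems.append(f"{key}={k}: unexpected in output")

    # Cell-by-cell comparison for records present on both sides.
    for k in sorted(src.keys() & out.keys()):
        for col in sorted(src[k].keys() | out[k].keys()):
            a, b = src[k].get(col), out[k].get(col)
            if a != b:
                problems.append(f"{key}={k} column={col}: source={a!r} output={b!r}")
    return problems
```

Wrapped in a command-line script, the gate would print each problem and exit non-zero whenever the list is not empty, so the agent gets both the verdict and the detail it needs to fix the migration.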

This is the principle: break your problem down into gates of expectation. "The migrated data matches the source data at the cell level" is a gate. "The report contains exactly the fields defined in the schema" is a gate. "The API response for every test case matches the reference output" is a gate. Each one is a thing the agent must achieve before moving to the next step, verified by a script it calls rather than a judgment it makes.
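One way to picture a sequence of gates is as an ordered list of checks, each returning its failures, where the first failing check stops progress. The sketch below is an assumption about how you might wire that up, using the "report contains exactly the schema's fields" gate as the example. The names fields_gate and run_gates are hypothetical.

```python
from typing import Callable, Optional

# A gate takes no arguments and returns failure messages (an empty list means pass).
Gate = Callable[[], list[str]]


def fields_gate(report: dict, schema_fields: set[str]) -> list[str]:
    """Fail unless the report has exactly the fields the schema defines."""
    present = set(report)
    failures = [f"missing field: {name}" for name in sorted(schema_fields - present)]
    failures += [f"unexpected field: {name}" for name in sorted(present - schema_fields)]
    return failures


def run_gates(gates: list[tuple[str, Gate]]) -> tuple[Optional[str], list[str]]:
    """Run gates in order and stop at the first failure.

    Returns (name_of_failed_gate, failures), or (None, []) if every gate passed.
    """
    for name, gate in gates:
        failures = gate()
        if failures:
            return name, failures
    return None, []
```

Later gates never run while an earlier one is failing, which is what keeps the agent from skipping ahead on its own judgment.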

This matters because agents are very good at convincing themselves they've finished. A confident final message from an agent that failed is indistinguishable from one that succeeded. The script is what tells the difference.

Orchestration via task boards

The other half of the problem is coordination. A swarm of agents working independently will duplicate effort, create conflicts, and produce inconsistent outputs unless something is managing the work allocation.

Task boards, the kind any engineering team already uses, turn out to be a natural fit for this. They have a few properties that make them useful as agent orchestration infrastructure:

Tasks are discrete units of work. A card is a task. An agent picks it up, works it, closes it. The board is the source of truth for what's done and what isn't.

Status columns are progress gates. Moving a card from "In progress" to "Review" to "Done" can be made conditional on the agent actually passing each gate. The board enforces the workflow rather than relying on the agent's self-assessment.

Boards are already where your team tracks work. The agents are doing the work your team would otherwise do. Putting that work in the same place, tracked with the same tools, means the output of the swarm is legible to the humans who own the outcome.

The pattern that works well: agents are assigned cards from a backlog column. When they complete a task, they move the card to a review column. A gate runs (automated tests, a second agent reviewing the diff, a schema check) and only on a clean pass does the card move to done. Failures move the card back, with a comment from the agent explaining what it tried and where it stopped.
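The review step in that pattern can be sketched as a small function that owns the column transition. This is an illustration only: the Card class, the review function, and the column names are stand-ins for whatever your board's API provides, and here the recorded comment is the gate's own output rather than the agent's explanation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Card:
    card_id: str
    column: str = "Backlog"
    comments: list[str] = field(default_factory=list)


def review(card: Card, gate: Callable[[Card], list[str]]) -> Card:
    """Run the gate on a card in Review and move it according to the result."""
    if card.column != "Review":
        raise ValueError(f"card {card.card_id} is in {card.column!r}, not 'Review'")
    failures = gate(card)
    if failures:
        # A failed gate sends the card back and leaves a record of why.
        card.column = "In progress"
        card.comments.append("Gate failed: " + "; ".join(failures))
    else:
        # Only a clean pass reaches Done.
        card.column = "Done"
    return card
```

The point of the design is that the agent never sets "Done" itself. Only the code that ran the gate does.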

This gives you a workflow that's auditable at every step. You can see what each agent did, what gate it hit, and what the outcome was. The board is the log.

Defining the agent correctly

The structure only works if the agent definition is tight. Three things matter:

Clear entry criteria. The agent should know exactly what "ready to start" looks like. If a task requires a design doc before implementation, that doc should exist before the card enters the agent's column. An agent that starts from an incomplete brief will make up the missing parts.

Explicit exit criteria. The agent's definition should specify what done looks like, in terms of checkable outputs. Not "implement the feature" but "implement the feature such that these tests pass and the type checker exits clean." The agent has a target it either hits or doesn't.

A bounded scope. Agents that can reach anywhere tend to wander. Restricting tool access (read-only on some directories, no network on others, specific MCP servers scoped to the task domain) keeps the agent in the lane the task actually requires.

Together these three things turn an agent from a general-purpose assistant into a specialist with a defined job. Specialists produce consistent output. Generalists produce variable output.
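Exit criteria like "these tests pass and the type checker exits clean" reduce to a list of commands that must all return exit code zero. A minimal sketch of that check, using Python's standard subprocess module, might look like the following. The function name exit_criteria_met is hypothetical, and which commands you pass depends entirely on your project.

```python
import subprocess


def exit_criteria_met(commands: list[list[str]]) -> bool:
    """Run each command in order. Return True only if all of them exit with code 0."""
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Surface the tool's own output so the agent can see what to fix.
            print(f"FAILED (exit {result.returncode}): {' '.join(cmd)}")
            print(result.stdout + result.stderr)
            return False
    return True
```

For a Python project that uses pytest and mypy, for example, the call might be exit_criteria_met([["pytest", "-q"], ["mypy", "src"]]), assuming both tools are installed and src is where the code lives.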

A worked example

Here's what this looks like in practice. The scenario: a company is migrating customer records from a legacy CRM to a new platform. The schemas don't match. Field names differ, some fields have been split or merged, some values use different conventions. A human used to do this table by table, writing and testing transformation logic for each one. Now an agent does it.

The work is genuinely agentic. The agent has to read the source schema, understand what each field represents, find the corresponding field in the target schema, handle the cases where there's no clean mapping, write the transformation, and deal with the edge cases it discovers along the way. That's judgment work. You can't script it without essentially solving the problem yourself first.

But the output is verifiable. Once the agent believes it has migrated a table, you can check whether it actually did so correctly by running a cell-level diff between the source data and the transformed output, using the field mapping as the translation key.

---
name: crm-migrator
description: Migrates customer records from the legacy CRM schema to the new platform schema, table by table.
tools: Bash, Read, Write
---

You are a data migration agent. For each ticket you will be given a source
table name, the source schema, and the target schema.

Work through these steps in order. Do not proceed past a gate until it passes.

1. Move the ticket to "In Progress".
   tool: ticket_move(ticket_id, "In Progress")

2. Analyse both schemas. Identify how each source field maps to the target.
   Some mappings will be direct renames. Some will require transformation logic
   (unit conversions, date format changes, splitting a full_name field into
   first_name and last_name). Some source fields may have no target equivalent
   and should be dropped. Document your mapping decisions in mapping.json before
   writing any transformation code.

   Gate: call validate_mapping(ticket_id, "mapping.json", source_schema, target_schema)
   This checks that every target field is accounted for and that your mapping
   logic is internally consistent. If it fails, review the gaps and resubmit.

3. Write and run the transformation against the full source table.
   Produce the output as output.csv.

4. Validate that the transformation preserved all records correctly.
   Gate: call diff_datasets("source.csv", "output.csv", "mapping.json")
   This runs a cell-level diff using your mapping as the translation key. It
   checks that every source value appears in the expected target field with the
   expected transformation applied, and that no records were dropped or
   duplicated. If the diff reports mismatches, read the diff output, identify
   what your transformation got wrong, fix it, re-run, and call the gate again.
   Do not proceed until the diff reports zero mismatches.

5. Move the ticket to "Ready for Review". Add a comment summarising: number of
   records migrated, fields dropped, any mapping decisions that required
   judgment and why you made them.
   tool: ticket_move(ticket_id, "Ready for Review")
   tool: ticket_comment(ticket_id, summary)

The agent is doing real work: reading two schemas it has never seen, reasoning about how they relate, handling ambiguous cases, writing transformation code. None of that is predetermined. But the gate at step 4 doesn't ask the agent whether it thinks the transformation was correct. It runs diff_datasets, which produces a count of mismatching cells. That number is either zero or it isn't.
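The article doesn't show the internals of diff_datasets, so the following is only a guess at its core logic, under loud assumptions: the mapping format (target field to source field plus a named transform), the TRANSFORMS table, and the count_mismatches function are all invented for illustration. The idea is that the gate re-applies each named transformation independently of the agent's code and counts every cell that disagrees.

```python
from typing import Any, Callable

# Hypothetical transforms the gate can re-apply without running the agent's code.
TRANSFORMS: dict[str, Callable[[Any], Any]] = {
    "identity": lambda v: v,
    "lowercase": lambda v: v.lower(),
    "cents_to_units": lambda v: v / 100,
}


def count_mismatches(
    source: list[dict],
    output: list[dict],
    mapping: dict[str, dict[str, str]],
    source_key: str,
    target_key: str,
) -> int:
    """Count mismatching cells. Zero means the output matches the mapped source."""
    out_by_key = {row[target_key]: row for row in output}
    source_keys = {row[source_key] for row in source}
    mismatches = 0

    for src_row in source:
        out_row = out_by_key.get(src_row[source_key])
        if out_row is None:
            # A dropped record counts as one mismatch per mapped field.
            mismatches += len(mapping)
            continue
        for target_field, rule in mapping.items():
            expected = TRANSFORMS[rule["transform"]](src_row[rule["source"]])
            if out_row.get(target_field) != expected:
                mismatches += 1

    # Duplicated or invented output records also count against the agent.
    duplicates = len(output) - len(out_by_key)
    extras = len(out_by_key.keys() - source_keys)
    mismatches += (duplicates + extras) * len(mapping)
    return mismatches
```

Because the transforms live in the gate rather than in the agent's transformation code, a bug in the agent's code shows up as a non-zero count instead of being silently mirrored by the check.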

When the diff reports that 34 records have a mismatched created_at value, the agent reads the diff output, realises it converted timestamps to the wrong timezone, fixes the transformation, and runs the gate again. The ticket stays in "In Progress" until the diff is clean. The board reflects this honestly.

Nothing in this required a smarter model. It required knowing what "correctly migrated" means precisely enough to write a script that checks for it.

Consistency at scale

The value of this pattern compounds with scale. A single agent running an ad-hoc task is fine with loose structure. Ten agents running a sprint's worth of tasks in parallel need gates, because ten agents fail in ten independent ways, and you can't review all of them before they've moved on.

Deterministic gates short-circuit that problem. They don't require human review of every agent output to catch failures. They catch failures automatically, at the point where they occur, before the work is marked done. The humans review exceptions, not everything.

This is the operational model that makes agent swarms viable in production rather than just in demos. The agents do the work. The gates keep them honest. The task board makes the whole thing visible.

Where cost fits in

Running a swarm unattended means every model selection decision gets made without you. That's fine if you've thought about it in advance. It becomes a problem if each agent defaults to the most capable model available, because the aggregate cost of a sprint's worth of tasks at premium model rates is not a small number.

This is where synaxi-predict fits into a swarm setup. It predicts cost, turn count, and pass rate for each task before it runs, and recommends a model based on that prediction. In an unattended workflow, that recommendation needs to be acted on automatically. You can't have the swarm pausing to ask which model to use.

The auto mode we released today does exactly that. Set SYNAXI_PREDICT_AUTO=true in your project's .claude/settings.json, and each agent task gets the predicted best-value model without any prompt. The swarm keeps moving. The model selection isn't a guess; it's a prediction trained on 53,000 real agent runs.

Deterministic gates handle the quality side. Cost prediction handles the efficiency side. Between them, they close most of the gap between a promising agent demo and a workflow you'd actually run in production.


synaxi-predict is open source under MIT. Install it into any Claude Code session with /plugin marketplace add BeadW/synaxi-predict followed by /plugin install synaxi-predict. Auto mode is available from v0.3.3.

Synaxi reduces token costs on the request side, stripping schema duplication, stale history, and structural overhead from every outgoing Claude request. synaxi-predict reduces costs on the selection side. Together they cover both ends of the problem, whether you're running one agent or a hundred.