TL;DR
Part 1 gave me shapes. Part 2 gave me plumbing.
In Part 1, I was mostly drawing boxes and arrows. That felt like progress, which was strange given how new I am to AI-agent workflow orchestration. Boxes explain a lot. Real systems, unfortunately, have timeouts, missing context, missing memory slots, tool errors, wrong or non-specific handoffs, and at least one agent over-confidently solving the problem in a way nobody expected.
My current suspicion: multi-agent systems are not hard because agents are hard. They are hard because the coordination is hard.
Part 1 ended with a warning: “A bad multi-agent system just spreads confusion across multiple prompts”. Part 2 now asks: “where exactly does that confusion come from?”
The question behind this part:
Once we split work between agents, what must exist around them so the system does not become a polite distributed hallucination machine?
Agents vs. Workflows: Do Not Confuse the Actor with the Process
Before going deeper, here is a distinction I keep getting confused about.
An agent is a role: a set of instructions, tools, and permissions aimed at a specific job. A workflow is the process that decides which agent acts, when, with what input, and what happens to the output.
This matters because most failure modes I have been reading about are not agent failures. They are - basically - workflow failures. The planner was fine. The researcher was fine. Nobody told the researcher what the planner actually decided, so the researcher went on a confident solo run.
Swapping a smarter “model” into a broken workflow does not fix the workflow. It just produces more confident wrong answers.
The takeaway I am learning: design the workflow first, assign agents to it second.
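To make the actor/process split concrete for myself, here is a minimal sketch. Everything in it is hypothetical: the `Agent` class, the `run_workflow` function, and the lambda stand-ins that replace real model calls. It is not how any particular framework does it, just the smallest shape where the workflow owns ordering and input while the agents only own their role.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Agent:
    """A role: instructions, tools, and a way to act on some input."""
    name: str
    instructions: str
    tools: tuple[str, ...]
    act: Callable[[str], str]  # stand-in for a real model call (assumption)


def run_workflow(steps: list[Agent], request: str) -> list[tuple[str, str]]:
    """The workflow, not the agents, decides who acts, in what order, on what input."""
    history: list[tuple[str, str]] = []
    current_input = request
    for agent in steps:
        output = agent.act(current_input)
        history.append((agent.name, output))
        current_input = output  # the next agent receives exactly this, nothing implicit
    return history


# Toy agents: the lambdas only wrap their input so the data flow is visible.
planner = Agent("planner", "Break the request into a plan.", (), lambda text: f"plan[{text}]")
researcher = Agent("researcher", "Gather data for the plan.", ("search",), lambda text: f"findings[{text}]")

for name, output in run_workflow([planner, researcher], "Q3 revenue by region"):
    print(name, "->", output)
```

Swapping a better model into `planner.act` changes nothing about what the researcher receives. That part lives in `run_workflow`.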
The Coordination Tax
In Part 1, I treated adding agents like adding workers. More hands, more throughput. That was obviously naive!
Every agent I add is NOT just another worker.
It is another mouth to feed with context, another diary to read through, another permission to assign, another small committee meeting disguised as software architecture.
In practice, the coordination overhead scales faster than the headcount. If every agent can talk to every other agent, n agents have n(n-1)/2 possible pairwise interactions:
- 2 agents → 1 possible interaction
- 4 agents → 6 possible interactions
- 10 agents → 45 possible interactions
Each interaction is a place where context can leak, meaning can drift, and cost can spike. This is the coordination tax: the price you pay not for the agents themselves, but for the fact that they need to talk to each other.
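A quick way to check the numbers above, assuming the worst case where every agent may interact with every other one (a strict pipeline has fewer edges). The function name is my own:

```python
from math import comb


def pairwise_interactions(agent_count: int) -> int:
    """Number of distinct agent pairs, i.e. n * (n - 1) / 2."""
    return comb(agent_count, 2)


for n in (2, 4, 10):
    print(n, "agents ->", pairwise_interactions(n), "possible interactions")
```

The growth is quadratic, not linear, which is why "one more agent" is never just one more thing to coordinate.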
A multi-agent system becomes useful only when the separation of roles creates more clarity than the coordination between those roles destroys.
That is the budget/optimization equation I keep coming back to. And it rarely gets calculated before someone adds an nth agent "just in case".
Handoffs: Where Meaning Leaks
Of everything I have read so far, handoffs are the single biggest source of silent failure.
A handoff is the moment one agent finishes its work and passes the result to the next. It sounds simple. It does not look like it is, because what gets passed is not just data. It is meaning. And meaning compresses badly.
Let’s imagine a planner agent decides: “We should query the last 90 days of revenue, broken down by region, excluding trial accounts”. The planner passes a summary to the data agent. The summary says: “Get revenue data by region”. The data agent fetches all-time revenue for all account types. Technically responsive. Practically useless.
This is the plumbing issue.
Modern agent frameworks increasingly treat handoffs as a first-class concept. OpenAI’s Agents SDK describes handoffs as a way for one agent to delegate to another specialist. LangChain’s architecture docs describe implementations through dynamic agent configuration or distinct multi-agent subgraphs. Both are trying to formalize what otherwise becomes a game of telephone between language models.
From reading all of these, my understanding is that a good handoff needs:
- Structured output from the sender: not a free-text summary, but a defined shape (task description, constraints, expected output format)
- Explicit acceptance by the receiver: the next agent should be able to say “I do not have enough to start”
- A trace of what was passed: if something goes wrong downstream, you need to see what crossed the boundary
Without these, handoffs are just optimistic copy-paste between context windows.
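Here is how I would sketch those three requirements for the revenue example. This is not the handoff API of the OpenAI Agents SDK or LangChain. `HandoffPayload`, `receive_handoff`, and the required constraint keys are all names I made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HandoffPayload:
    """Structured output from the sender instead of a free-text summary."""
    task: str
    constraints: dict[str, str] = field(default_factory=dict)
    expected_output: str = ""


# Hypothetical constraint keys the data agent refuses to start without.
REQUIRED_CONSTRAINTS = ("time_window", "breakdown", "account_filter")

# A trace of everything that crossed the boundary, accepted or not.
handoff_trace: list[HandoffPayload] = []


def receive_handoff(payload: HandoffPayload) -> list[str]:
    """Record the payload, then return what is missing (empty list = accepted)."""
    handoff_trace.append(payload)
    missing = [key for key in REQUIRED_CONSTRAINTS if not payload.constraints.get(key)]
    if not payload.expected_output:
        missing.append("expected_output")
    return missing


vague = HandoffPayload(task="Get revenue data by region")
precise = HandoffPayload(
    task="Query revenue",
    constraints={
        "time_window": "last 90 days",
        "breakdown": "region",
        "account_filter": "exclude trial accounts",
    },
    expected_output="table: region, revenue",
)

print(receive_handoff(vague))    # the receiver says "I do not have enough to start"
print(receive_handoff(precise))  # nothing missing, the receiver can begin
```

The point of returning the missing fields instead of guessing is that the vague summary fails loudly at the boundary, not quietly three steps later.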
Shared State and Memory: Not Every Agent Needs the Whole Attic
When multiple agents work on the same task, they need some shared context. But “shared context” does not mean “dump everything into one giant memory pool”.
Memory is not one bucket. Some things are scratchpad - temporary working notes an agent uses mid-task. Some things are audit log - immutable records of what happened and why. Some things are project knowledge - facts that persist across the entire workflow. And some things should be forgotten before they become archaeological garbage.
The failure mode I keep seeing people complain about on social media: one agent hallucinates a detail. It writes that detail into shared memory. The next agent reads it, treats it as fact, and builds on it. By the time a human sees the final output, the hallucination has been laundered through three agents and looks perfectly plausible.
This is shared memory pollution, and it is worse than a single-agent hallucination because the error gets reinforced at every hop.
So here is my understanding so far of what shared state needs:
| Layer | Purpose | Lifetime | Who writes |
|---|---|---|---|
| Scratchpad | Working notes for a single agent | One task | The agent itself |
| Handoff payload | Structured input for the next agent | One transition | The sending agent |
| Workflow state | Decisions, constraints, accumulated facts | Entire workflow | Orchestrator or designated agents |
| Audit log | Immutable record of actions and outputs | Permanent | System (automatically) |
The principle: agents should read only what they need, write only what they own, and never silently overwrite something another agent depends on.
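A rough sketch of the "write only what you own" half of that principle, under my own assumptions: one owner per key, first writer wins ownership, and every attempt lands in an append-only audit log. The `WorkflowState` class is hypothetical, not taken from any framework.

```python
class WorkflowState:
    """Shared workflow state where each key has exactly one owning writer."""

    def __init__(self) -> None:
        self._values: dict[str, str] = {}
        self._owners: dict[str, str] = {}
        self.audit_log: list[tuple[str, str, str]] = []  # (agent_id, key, outcome)

    def write(self, agent_id: str, key: str, value: str) -> None:
        # The first agent to write a key becomes its owner (an assumption of this sketch).
        owner = self._owners.setdefault(key, agent_id)
        if owner != agent_id:
            self.audit_log.append((agent_id, key, "rejected"))
            raise PermissionError(f"{agent_id} cannot overwrite '{key}' owned by {owner}")
        self._values[key] = value
        self.audit_log.append((agent_id, key, "written"))

    def read(self, key: str) -> str:
        return self._values[key]


state = WorkflowState()
state.write("planner", "time_window", "last 90 days")

try:
    state.write("researcher", "time_window", "all time")  # silent overwrite attempt
except PermissionError as error:
    print("blocked:", error)

print(state.read("time_window"))
print(state.audit_log)
```

This does not stop a hallucinated value from being written by its rightful owner, but it does stop one agent from quietly rewriting a fact another agent depends on, and the rejected attempt stays visible.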
Verification: Who Will Watch the Watchers?
One of the patterns I find most interesting from Anthropic’s published workflows is the evaluator-optimizer loop: one model generates output, another evaluates and refines it against defined criteria. This is useful when the evaluation criteria are clear and iterative improvement has measurable value.
But I am slowly finding support for my earlier suspicion that “another agent reviewed it” does not automatically mean safety. Sometimes it is just two confident parrots wearing lab coats.
I think verification only works when:
- the verifier has different context or criteria than the generator (otherwise it is just self-confirmation with extra latency)
- the verification criteria are specific and testable (“Is the SQL syntactically valid?” beats “Does this look good?”)
- there is a defined exit condition (otherwise two agents can endlessly “improve” each other’s output, burning tokens in a polite infinite loop)
A verifier agent that just re-reads the output and says "looks good" is not verification. It is a rubber stamp with a temperature setting.
The architectural insight: verification adds value when the verifier checks something the generator could not check about itself - a different data source, a policy constraint, a structural rule.
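To see what "specific criteria plus a defined exit condition" could look like, here is a minimal sketch of a bounded generate-and-verify loop. It is my own toy version, not Anthropic's implementation: `refine_until_valid` is a made-up name, and the toy generator and verifier are plain functions standing in for model calls.

```python
from typing import Callable


def refine_until_valid(
    generate: Callable[[str], str],
    verify: Callable[[str], list[str]],
    max_rounds: int = 3,
) -> tuple[str, bool, int]:
    """Loop generator and verifier, stopping on a pass or after max_rounds (must be >= 1)."""
    feedback = ""
    draft = ""
    for round_number in range(1, max_rounds + 1):
        draft = generate(feedback)
        problems = verify(draft)
        if not problems:
            return draft, True, round_number
        feedback = "; ".join(problems)  # the rejection carries a reason
    return draft, False, max_rounds  # exit condition reached: escalate instead of looping


def toy_generate(feedback: str) -> str:
    query = "SELECT region, SUM(amount) FROM revenue"
    if "trial" in feedback:
        query += " WHERE account_type != 'trial'"
    return query + " GROUP BY region"


def toy_verify(draft: str) -> list[str]:
    # A specific, testable rule the generator did not check on its own.
    if "account_type != 'trial'" not in draft:
        return ["must exclude trial accounts"]
    return []


draft, passed, rounds = refine_until_valid(toy_generate, toy_verify)
print(passed, rounds)
print(draft)
```

The verifier here checks a policy constraint rather than re-reading the draft for vibes, and the loop cannot run forever: a draft that never passes comes back with `passed` set to `False` after `max_rounds`.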
Observability: Without Traces, the System Becomes Fog
This is the section I underestimated the most before reading production case studies.
With a single agent, debugging is linear: what went in, what came out, why the output looks wrong. With multiple agents, the failure might have originated three steps back, in an agent whose output looked perfectly fine in isolation.
OpenAI’s Agents SDK tracing docs describe capturing LLM generations, tool calls, handoffs, guardrail checks, and custom events. Microsoft’s Agent Framework overview emphasizes session-based state, filters, telemetry, and type safety. Both point in the same direction: you need traces, not just logs.
Based on what I have read so far, here is what I would consider a minimum trace event for each agent step:
| Field | Why |
|---|---|
| agent_id | Which agent acted |
| step_type | LLM call, tool call, handoff, verification |
| input_summary | What the agent received (or a hash/reference) |
| output_summary | What the agent produced |
| tokens_used | Cost attribution per step |
| latency_ms | Time spent, including waiting |
| parent_trace_id | Link to the orchestrator or previous agent |
| decision_reason | Why this agent was invoked (routing logic) |
| status | Success, failure, timeout, retry |
Without this, debugging a five-agent workflow is archaeology without a dig site map. You know something is buried. You just do not know where to start digging.
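As a sketch, the trace event above maps onto a small record type. The field names follow the table; the `trace_id` field is my own addition (without it `parent_trace_id` has nothing to point at), and `lineage` is a hypothetical helper, not part of any tracing SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceEvent:
    trace_id: str  # assumption: not in the table, needed so parents can be referenced
    agent_id: str
    step_type: str
    input_summary: str
    output_summary: str
    tokens_used: int
    latency_ms: int
    parent_trace_id: Optional[str]
    decision_reason: str
    status: str


def lineage(events: dict[str, TraceEvent], trace_id: str) -> list[str]:
    """Walk parent links from a step back to the root and return the agent ids."""
    chain: list[str] = []
    seen: set[str] = set()
    current = events.get(trace_id)
    while current is not None and current.trace_id not in seen:
        seen.add(current.trace_id)  # guard against accidental cycles
        chain.append(current.agent_id)
        parent = current.parent_trace_id
        current = events.get(parent) if parent else None
    return chain


# Made-up example values, only to show the linking.
steps = [
    TraceEvent("t1", "planner", "llm_call", "user request", "plan", 900, 400, None, "entry point", "success"),
    TraceEvent("t2", "researcher", "tool_call", "plan", "findings", 1500, 700, "t1", "plan needs data", "success"),
    TraceEvent("t3", "verifier", "verification", "draft", "rejected", 600, 300, "t2", "draft ready", "failure"),
]
events = {event.trace_id: event for event in steps}

print(lineage(events, "t3"))
```

Starting from the failed verifier step, the parent links give the path back through the researcher to the planner, which is exactly the "three steps back" question a flat log cannot answer.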
The frameworks are converging here. MLflow supports auto-tracing for multi-agent systems with nested span capture. CrewAI, LangGraph, and others expose callbacks and hooks. The tooling exists. The hard part is deciding to use it before the first production incident forces you to.
Permission Boundaries: The Boring Part That Prevents Expensive Comedy
This section is short because the principle is simple and the consequences of ignoring it are not.
Tools should be assigned by role, not by convenience.
A research agent does not need write access to a database. A drafting agent does not need to call billing APIs. A verification agent should not be able to deploy code.
In a single-agent system, this is manageable because there is one set of permissions to review. In a multi-agent system, every agent is a separate attack surface with its own tools, its own context, and its own ability to misinterpret instructions creatively.
The minimum I would want:
- Per-agent tool allow-lists: each agent can only call the tools it needs
- Read vs. write separation: most agents should default to read-only
- Sensitive action gating: anything with real-world consequences (sending emails, modifying data, spending money) should require explicit confirmation - from a human or from a designated approver agent with its own audit trail
Permission boundaries are not exciting architecture. They are the seatbelt. Nobody talks about them until the crash.
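The three minimums can be sketched as one small check that runs before any tool call. The agent names, tool names, and the `authorize` function are all hypothetical. A real system would enforce this in the layer that actually executes tools, not in the prompt.

```python
from typing import Optional

# Per-agent allow-lists: anything not listed is denied by default.
ALLOWED_TOOLS: dict[str, set[str]] = {
    "researcher": {"search_docs", "read_database"},
    "drafter": {"read_database"},
    "verifier": {"read_database", "run_sql_lint"},
    "operator": {"read_database", "send_email"},
}

# Tools with real-world consequences need an explicit approver.
SENSITIVE_TOOLS: set[str] = {"send_email", "write_database", "charge_card"}


def authorize(agent_id: str, tool: str, approved_by: Optional[str] = None) -> bool:
    """Allow a call only if the tool is on the agent's list and, if sensitive, approved."""
    if tool not in ALLOWED_TOOLS.get(agent_id, set()):
        return False
    if tool in SENSITIVE_TOOLS and approved_by is None:
        return False
    return True


print(authorize("researcher", "read_database"))                   # read access on its list
print(authorize("researcher", "write_database"))                  # not on its list
print(authorize("operator", "send_email"))                        # sensitive, no approver
print(authorize("operator", "send_email", approved_by="human"))   # sensitive, approved
```

Default-deny matters more than the exact lists: an agent that is not in `ALLOWED_TOOLS` at all gets nothing, so adding a new agent never silently grants access.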
Cost and Latency: Every Agent Sends an Invoice
Every agent call costs something. Hosted APIs charge per token. I had a misconception about local models after using a few through Ollama: I felt absolute power running DeepSeek from the command line.
But eventually it got to me that local models are not free either: they cost hardware time, memory pressure, context window limits, and throughput.
So maybe avoid saying “local is free”. Say “local shifts the cost from invoice to machine”.
The coordination tax shows up in the bill in several ways:
- Context duplication: each agent needs enough context to do its job, so the same information often gets sent multiple times
- Verification overhead: a verifier agent adds at least one extra model call per workflow
- Retry loops: if a handoff fails or a tool errors, the retry multiplies cost
- Latency stacking: in a sequential pipeline, each agent adds 100–500ms of latency; five agents can add 2+ seconds before anyone sees a result
Costs grow faster than agent count. A task that costs $0.10 for a single agent might cost $1.50 for a four-agent system - not because four agents cost four times as much, but because coordination context, retries, and verification add overhead on top of every per-agent call.
So the practical question becomes: is the quality improvement from splitting this into multiple agents worth the cost and latency increase?
Sometimes yes. Often not.
Always worth asking before adding agent number five.
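A back-of-the-envelope sketch of how duplication, retries, and latency stack up. This is a deliberately crude model with made-up inputs: the token counts are placeholders I picked, the 500ms per step is the upper end of the range mentioned above, and `estimate_pipeline` is a hypothetical helper. It assumes a sequential pipeline where every call re-sends the shared context and every verifier rejection re-runs the worker and the verifier once.

```python
def estimate_pipeline(
    shared_context_tokens: int,
    per_agent_tokens: int,
    agent_count: int,
    latency_ms_per_call: int,
    verifier_retries: int = 0,
) -> tuple[int, int, int]:
    """Return (model_calls, total_tokens, total_latency_ms) for a sequential pipeline."""
    # Each retry means one more worker call and one more verifier call.
    model_calls = agent_count + 2 * verifier_retries
    # Context duplication: the shared context is paid for on every call.
    total_tokens = model_calls * (shared_context_tokens + per_agent_tokens)
    # Latency stacking: sequential calls add up.
    total_latency_ms = model_calls * latency_ms_per_call
    return model_calls, total_tokens, total_latency_ms


single = estimate_pipeline(2000, 500, agent_count=1, latency_ms_per_call=500)
four = estimate_pipeline(2000, 500, agent_count=4, latency_ms_per_call=500, verifier_retries=1)

print("single agent:", single)  # (1, 2500, 500)
print("four agents: ", four)    # (6, 15000, 3000)
```

With these placeholder numbers, four agents and a single retry already mean six calls and six times the tokens of the single-agent run. The real ratio depends entirely on how much context each step actually needs, which is the thing worth measuring before adding an agent.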
Putting It Together: A Reference Flow
Here is the general shape of what a coordinated multi-agent workflow looks like when these concerns are addressed. No code - just plumbing.
flowchart TD
U["👤 User Request"] --> P["📋 Planner Agent<br/><i>decomposes task, sets constraints</i>"]
P -->|structured handoff| R["🔍 Research / Context Agent<br/><i>gathers data, reads sources</i>"]
R -->|findings + evidence| W["✏️ Worker / Drafting Agent<br/><i>produces output</i>"]
W -->|draft output| V["✅ Verifier Agent<br/><i>checks against criteria</i>"]
V -->|"pass ✓"| F["📤 Final Response to User"]
V -->|"fail ✗ - with reason"| W
P -.->|logs| T["📊 Trace Log + Decision Diary"]
R -.->|logs| T
W -.->|logs| T
V -.->|logs| T
Each arrow is a handoff. Each handoff has a structured payload. Each agent logs its actions. The verifier can reject and loop back, but with a defined exit condition (not infinite retries). The trace log captures every decision for debugging and audit.
This is not the only valid shape. Part 1 covered several others. But this one captures the coordination concerns discussed above: planning, context passing, drafting, verification, and observability - all with explicit boundaries rather than hopeful assumptions.
What I Learnt So Far / Ready for a Project?
Multi-agent systems are not primarily an AI problem. They are a systems engineering problem. Here is what I am taking away:
- Handoffs need structure, not summaries. Free-text delegation is where meaning leaks silently.
- Memory should be layered. Scratchpad, handoff payload, workflow state, and audit log are not the same bucket.
- Verification needs criteria. A second agent re-reading the output is a rubber stamp, not a safety net.
- Traces beat logs. Without parent-child span linking, debugging a five-agent failure is guesswork.
- Permissions belong per role. Every agent is a separate surface that can misinterpret instructions creatively.
- Cost scales with coordination, not headcount. The bill comes from context duplication and retries, not from adding one more model call.
Part 1 gave me the what of multi-agent systems.
Part 2 showed me the plumbing - how I can orchestrate them.
In Part 3, I will try to build something real - and find out how much of this theory survives actual tool calls, actual handoffs, and actual agents.
What I read on tech last week
- LangChain’s overview of subagents, skills, handoffs, and router patterns with practical performance comparisons
- Five-layer breakdown for production-grade multi-agent systems
- Seven production failure patterns and the exponential scaling of coordination costs of multi-agent systems
- Practical tracing, metric capture, and state handoff monitoring for CrewAI workflows