Multi-agent orchestration patterns and the failures they amplify

The five multi-agent orchestration patterns (supervisor, sequential pipeline, swarm, debate, and blackboard), mapped to how errors cascade in each and to the failure modes each one amplifies.

By LatentEval Published 2026-04-01 Updated 2026-05-25

You are choosing how your agents will coordinate, or you inherited that choice from whichever framework you started with, and the decision is quietly a reliability decision. A supervisor routing workers, a chain passing outputs down a line, a swarm of peers handing control around, a debate that votes, a blackboard everyone reads and writes: each pattern buys you a coordination capability and hands you a specific way errors travel. Most teams pick for capability and meet the failure behavior later, in production. This page keys the failure modes to the orchestration pattern that amplifies them, so you can read the reliability tax before you commit.

Each orchestration pattern amplifies a different family of failures

The pattern decides where control lives, and in an agent system control flow is also failure flow: an output moves along the same edge a fault does. The failure-mode names in the fourth column below (MAST failure modes it amplifies) are drawn from MAST, the first empirically grounded taxonomy of multi-agent LLM failures. The mapping from pattern to mode is our synthesis, grounded in that taxonomy and in how faults propagate across each shape. Treat it as an analytic framework whose per-pattern rates are not yet measured.

Orchestration pattern	Where control lives	How an error cascades	MAST failure modes it amplifies	Reliability judgment
Supervisor	A central agent plans, delegates to workers, and merges their returns	The merge is the amplifier: the supervisor integrates a worker return into its plan without re-checking it, and every subtask it then derives stands on that unverified claim	Disobey task specification, ignored input, no or incomplete verification	Blast radius concentrates at the hub: cheap to instrument at one choke point, but that choke point is the largest single failure surface. Verify each return before the merge folds it in
Sequential pipeline	Control moves stage to stage; each agent owns one step and hands to the next	Every handoff is one-directional, so a stage that drifts or drops a detail hands the next stage a corrupted premise it cannot question, and the deviation accumulates through the stages that remain	Task derailment, information withholding, reasoning-action mismatch	Lowest breadth, deepest reach; end-to-end success decays with length. The easiest pattern to gate, one acceptance check per stage
Swarm	Peers hand control to one another at runtime; no central authority fixes the path	The fault follows whichever handoff chain the agents pick at runtime, so the reached set shifts from run to run and resists reproduction	Disobey role specification, conversation reset, unaware of termination conditions	Adaptivity bought with attribution: you cannot easily name the hop that introduced the fault. Safe only if every handoff is typed and logged
Debate / voting	Several agents argue or vote; an aggregation step selects the answer	Agents seeded from the same context share a blind spot, so the aggregation step reads near-duplicate answers as independent agreement and stamps the shared error as the winner, outvoting any dissent	Incorrect verification, ignored input, information withholding	The aggregation meant to raise reliability can manufacture false agreement; its reach scales with how correlated the voters are. Needs independent seeding and an out-of-panel check
Blackboard	Agents coordinate implicitly by reading and writing a shared state store	One write lands in a store the whole system shares, so a poisoned entry becomes common ground for the next round and every agent that acts on it carries the fault forward	Step repetition, loss of conversation history, no or incomplete verification	Widest reach and highest fault correlation of any pattern; a single bad write is topology-wide. The efficiency of shared state is exactly its exposure

Read the last column as a shopping list of liabilities, one per pattern. You are picking a failure geometry, not only a coordination strategy. A supervisor gives you one place to watch and one place to lose everything. A chain gives you cheap gates and a success rate that erodes with every stage you add. A swarm gives you runtime flexibility and takes your ability to say what went wrong.

Two patterns invert the usual intuition. Debate is sold as an error-correcting mechanism, yet it correlates failures whenever the panel shares a prompt, so the vote can ratify a shared mistake as consensus. Blackboards are sold as efficient shared memory, yet that shared surface is precisely what lets a single bad write land across the whole topology in one hop.

What the pattern actually selects: where control lives

Multi-agent orchestration is the layer that decides which agent runs when, what each one is handed, and who resolves a disagreement. Those decisions are what the five patterns name, and each answer sets a different exposure. A supervisor centralizes the “who decides” question into one agent; a swarm distributes it across peers; a blackboard hides it inside a shared store. The reliability consequences follow from that placement before any single agent’s quality enters the picture.

The failure vocabulary here is not speculative. MAST derived 14 recurring failure modes in 3 categories from 150 execution traces read by expert human annotators, validated by high inter-annotator agreement (kappa 0.88), then scaled annotation to a 1,600+ trace dataset across 7 frameworks with an LLM-as-judge pipeline (Cemri et al., Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657, preprint, as of 2026-07). The rigorous agreement statistic belongs to the 150 human-annotated traces; the 1,600+ dataset was model-annotated at scale, so the two carry different evidential weight.

This page does not re-derive the mechanics of propagation, because two companion pages already own them. For why a fluent, well-formed, wrong output slips past every structural gate and becomes a trusted premise, read the reframe of ASI08 as error propagation across agent systems; for the full failure-class view, the multi-agent failure-mode taxonomy maps each mode to how it travels across a wiring. What is new here is the axis: the same error-propagation lens keyed on the orchestration decision itself, so the choice of pattern carries its failure profile with it.

Multi-agent workflows are the same map with the routing pinned down

A workflow is an orchestration pattern with its control flow committed ahead of time. Anthropic draws the line cleanly: “Workflows are systems where LLMs and tools are orchestrated through predefined code paths,” whereas “Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks” (Anthropic, Building effective agents, as of 2026-07). The reliability reading of that distinction is direct: a workflow trades adaptivity for a smaller, more testable failure set, because a path fixed in code cannot wander the way a self-directing agent can.

Multi-agent workflows are therefore those same patterns with the runtime freedom dialed down. The common workflow shapes each inherit the cascade geometry of the pattern they resemble.

Workflow shape	Control flow	Cascade geometry it inherits
Prompt chaining	A fixed linear sequence, decided in code	The sequential-pipeline geometry: depth compounding, so gate every step and keep the chain short
Routing	A classifier sends input down one of several predetermined branches	A shallow supervisor geometry: a misroute commits the whole branch, so the routing decision is the point to verify
Parallelization	Independent subtasks run concurrently, and their outputs are then merged (sectioning) or voted on (voting)	Sectioning keeps faults isolated with little cross-talk; voting inherits the debate correlation risk when branches share context
Orchestrator-workers	A central LLM delegates dynamically and synthesizes the results	The supervisor geometry, with blast radius concentrated at the synthesis step
Evaluator-optimizer	One agent generates, another critiques, in a loop	A pipeline with a verification hop: a bounded loop helps, an unbounded one becomes its own cascade

In that same guidance, the orchestrator-workers shape is defined as a workflow in which “a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results,” which is the supervisor pattern with its control flow scripted. The practical lesson is that pinning the routing down is itself a containment move: it shrinks the set of paths a fault can take, which is why a reliability-conscious team often earns its way up the ladder, starting at the fixed-workflow end and adding dynamic delegation only where the extra capability pays for the wider failure surface.

Reading the reliability tax before you commit

Each pattern’s tax is a different metric to watch, and naming it tells you where to spend. Each read below names the measurement that pattern needs most; the canonical definition of each lives in the glossary.

A supervisor concentrates exposure at the hub, so the number that governs it is how far a single trusted merge reaches. That is blast radius read at one node, and it is the whole story of orchestrator-worker reliability: one verified return upstream of the merge caps the fan-out.
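One way to make blast radius concrete is to treat the system as a directed graph of data-flow edges and count which agents a fault can reach. The sketch below is illustrative only: `blast_radius`, the `verified_edges` parameter, and the toy topologies are hypothetical names, and it assumes an idealized check that always stops the fault on a verified edge.

```python
from collections import deque

def blast_radius(edges, fault_node, verified_edges=frozenset()):
    """Return the agents a fault can reach from fault_node.

    edges maps each agent to the agents that consume its output.
    verified_edges holds (sender, receiver) pairs where a check is
    assumed (idealized) to stop the fault before it crosses.
    """
    reached = {fault_node}
    queue = deque([fault_node])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if (node, nxt) in verified_edges:
                continue  # the check blocks propagation on this edge
            if nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    return reached - {fault_node}

# Toy supervisor: workers return to the hub, the hub derives new subtasks.
supervisor = {
    "w1": ["sup"], "w2": ["sup"], "w3": ["sup"],
    "sup": ["w1", "w2", "w3"],
}
# Toy pipeline: each stage feeds only the next one.
pipeline = {"a": ["b"], "b": ["c"], "c": ["d"]}

print(sorted(blast_radius(supervisor, "w1")))                   # ['sup', 'w2', 'w3']
print(sorted(blast_radius(supervisor, "w1", {("w1", "sup")})))  # []
print(sorted(blast_radius(pipeline, "b")))                      # ['c', 'd']
```

In this toy model a single unverified worker return reaches every other agent through the hub, while verifying that one return before the merge reduces the reached set to nothing.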

A sequential pipeline pays in depth. Because every stage builds on the output before it, the failure to watch is compounding error propagation down the line, and a single passing run does not establish reliability across repeated runs of the whole chain.
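The decay with length can be sketched with simple arithmetic. The block below assumes, as a simplification the source does not state, that stages fail independently and share one per-stage success rate; `end_to_end_success` and the 0.95 figure are illustrative, not measurements.

```python
def end_to_end_success(per_stage_success: float, stages: int) -> float:
    """Probability that every stage succeeds.

    Assumes independent stages with an identical success rate, which is
    a simplification: real stages differ and their errors can correlate.
    """
    if not 0.0 <= per_stage_success <= 1.0:
        raise ValueError("per_stage_success must be in [0, 1]")
    if stages < 0:
        raise ValueError("stages must be non-negative")
    return per_stage_success ** stages

for n in (1, 3, 5, 10):
    print(f"{n:>2} stages -> {end_to_end_success(0.95, n):.3f}")
#  1 stages -> 0.950
#  3 stages -> 0.857
#  5 stages -> 0.774
# 10 stages -> 0.599
```

Under these assumptions a stage that looks reliable in isolation still leaves a ten-stage chain succeeding only about six times in ten, which is the arithmetic behind keeping chains short and gating every step.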

A swarm pays in attribution. When the handoff path changes every run, pinpointing which hop introduced a fault becomes the expensive problem, so failure attribution is the capability you must build in before the pattern is defensible.
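A minimal sketch of what building attribution in could look like: every handoff is recorded as a typed, ordered entry, and a check replays the log to find the first hop whose payload breaks an invariant. The `Handoff` and `HandoffLog` names, the `budget` field, and the agent names are hypothetical, not part of any framework.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass(frozen=True)
class Handoff:
    seq: int        # position in the run, assigned by the log
    sender: str
    receiver: str
    payload: dict   # copied at record time so later mutation cannot rewrite history

class HandoffLog:
    def __init__(self) -> None:
        self._records: List[Handoff] = []

    def record(self, sender: str, receiver: str, payload: dict) -> Handoff:
        entry = Handoff(len(self._records), sender, receiver, dict(payload))
        self._records.append(entry)
        return entry

    def first_violation(self, invariant: Callable[[dict], bool]) -> Optional[Handoff]:
        """Return the earliest handoff whose payload fails the invariant."""
        for entry in self._records:
            if not invariant(entry.payload):
                return entry
        return None

log = HandoffLog()
log.record("planner", "researcher", {"budget": 100})
log.record("researcher", "writer", {"budget": 250})   # the fault enters here
log.record("writer", "reviewer", {"budget": 250})

culprit = log.first_violation(lambda p: p["budget"] <= 100)
print(culprit.seq, culprit.sender, "->", culprit.receiver)  # 1 researcher -> writer
```

The design choice worth noting is that the log, not the agents, assigns the sequence number, so the order of hops stays trustworthy even when the path differs on every run.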

A debate or voting panel pays in correlation. The failure the pattern hides is a shared blind spot dressed as agreement, the mechanism behind multi-agent debate failure, and the reason consensus and voting reliability turns on voter independence.
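The effect of voter correlation on a majority vote can be sketched with a toy simulation. The model is an assumption made for illustration: with probability `shared_fraction` the whole panel falls into the same draw (a shared blind spot), and otherwise the voters answer independently. The function name and parameters are hypothetical.

```python
import random

def majority_vote_accuracy(n_voters, p_correct, shared_fraction,
                           trials=20000, seed=0):
    """Estimate how often a majority vote is correct under a toy correlation model."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        if rng.random() < shared_fraction:
            # Shared blind spot: one draw decides every voter's answer.
            shared_answer = rng.random() < p_correct
            votes = [shared_answer] * n_voters
        else:
            # Independent voters: each one draws separately.
            votes = [rng.random() < p_correct for _ in range(n_voters)]
        if sum(votes) > n_voters / 2:
            wins += 1
    return wins / trials

independent = majority_vote_accuracy(5, 0.7, shared_fraction=0.0)
correlated = majority_vote_accuracy(5, 0.7, shared_fraction=1.0)
print(f"independent panel: {independent:.3f}")
print(f"fully correlated panel: {correlated:.3f}")
```

In this model five independent voters at 0.7 accuracy should land near the analytic value of about 0.84, while a fully correlated panel collapses to roughly the single-voter rate of 0.7: the vote adds agents without adding evidence.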

A blackboard pays globally. One bad write is topology-wide, so the architecture-level number that matters is the store’s cascade resistance, its tendency to damp a poisoned entry against its tendency to broadcast it.

None of these patterns is reliability-optimal on its own, and real systems are hybrids: an orchestrator whose workers are short chains, all writing to a shared board the orchestrator also reads. Decompose the hybrid into its primitives, find the tax each primitive levies, and let the matrix set the priorities.

Measure the pattern before you refactor it

Which pattern is more reliable for your workload is an empirical question you settle by injecting a fault and watching where it goes. Introduce a controlled fault at a known point in a candidate pattern, trace the set of agents it reaches, and repeat until the containment rate carries a confidence interval instead of a lone number. To compare two candidate patterns, put them on the same cascade-resistance axis, which tracks production reliability more faithfully than a single-run task score. The method itself is covered in agent reliability testing, and turning the traces into numbers you can defend is covered in how to measure agent reliability.
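One common way to attach an interval to a containment rate is a percentile bootstrap over the per-run outcomes. The sketch below assumes each fault-injection run is reduced to a single contained-or-not flag; the function name is hypothetical and the 40 outcomes are made-up placeholder data, not measured results.

```python
import random

def bootstrap_containment_interval(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point_estimate, lower, upper) for the containment rate.

    outcomes: one 0/1 flag per injected-fault run, 1 meaning the fault
    stayed contained. Uses a percentile bootstrap with a fixed seed.
    """
    if not outcomes:
        raise ValueError("need at least one run")
    rng = random.Random(seed)
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n   # resample runs with replacement
        for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(outcomes) / n, lower, upper

# Placeholder data for illustration only: 31 contained runs out of 40.
runs = [1] * 31 + [0] * 9
point, lower, upper = bootstrap_containment_interval(runs)
print(f"containment rate {point:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

Comparing two candidate patterns then means comparing two intervals produced the same way, so a difference that sits inside the overlap is not treated as a real reliability gap.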

The reliability profiler this site points toward is a pre-launch instrument designed to place a fault inside a chosen orchestration pattern, follow its reach across the agents, and report a containment rate with a bootstrap interval. No per-pattern containment rate has been measured yet, so this page reports none. Until the profiler ships, the matrix above is the working map and the failure-mode taxonomy is the mechanism under its columns; the research program will carry the measured per-pattern rates, each with its interval, as the profiler produces them.