Multi-agent debate failure
Multi-agent debate failure is what happens when a panel of debating agents talks itself into a shared wrong answer: the rounds meant to catch the mistake instead ratify it and return it as agreement. Debate was introduced to improve factuality and reasoning by having model instances critique each other across rounds (Du et al., ICML 2024; arXiv:2305.14325). The correction only holds when the debaters actually disagree, so an objection can move the panel off a wrong answer instead of rehearsing it.
Correlation breaks the premise. Agents sharing one base model, prompt, and context window draw from a single distribution, so a confident early claim meets no genuine dissent, and every round pulls the remaining agents toward it by persuasion. What compounds across the rounds is conformity, not evidence, so the final transcript shows a unanimity the process manufactured. Call the result hallucinated consensus: agreement no reader can separate from a correct result.
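The cost of correlation can be put in rough numbers with a toy model. The sketch below is an illustration under stated assumptions, not a measurement of any real system: each agent is right with probability p, and with probability rho the whole panel copies one shared draw instead of answering independently. The names majority_correct, independent_majority, p, and rho are hypothetical, and the panel size n is assumed odd.

```python
from math import comb


def independent_majority(n, p):
    """Probability that a majority of n independent agents is correct (n odd)."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


def majority_correct(n, p, rho):
    """Toy mixture model (an assumption, not an empirical claim).

    With probability rho the panel shares a single draw, so the majority is
    right exactly when that one draw is right (probability p). Otherwise the
    agents answer independently.
    """
    return rho * p + (1 - rho) * independent_majority(n, p)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        print(f"rho={rho:.1f}  P(majority correct)={majority_correct(5, 0.6, rho):.5f}")
```

In this model a fully correlated panel (rho = 1) is no more accurate than one agent, yet it is always unanimous, which is the gap between agreement and evidence described above.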
What separates debate from a one-shot vote is the loop between rounds. A vote fixes each ballot before any agent sees another; debate lets agents read the emerging position and revise toward it. That read-and-revise loop is the correction debate promises, and under correlation it is also the channel that carries the early error across the panel instead of catching it.
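The contrast between the two procedures can be sketched in a few lines. This is a deliberately crude model: the revision rule (every agent adopts whichever answer has the most confidence-weighted support in the previous round) is a hypothetical stand-in for persuasion, and the ballots, confidences, and function names are invented for illustration.

```python
from collections import Counter, defaultdict


def one_shot_vote(ballots):
    """Majority over ballots that were fixed before any agent saw another."""
    return Counter(answer for answer, _ in ballots).most_common(1)[0][0]


def debate(ballots, rounds=3):
    """Read-and-revise loop under a hypothetical conformity rule.

    Each round, agents read the current positions and move to the answer
    with the highest total confidence behind it. Confidence is kept fixed.
    """
    positions = list(ballots)
    for _ in range(rounds):
        support = defaultdict(float)
        for answer, confidence in positions:
            support[answer] += confidence
        leader = max(support, key=support.get)
        positions = [(leader, confidence) for _, confidence in positions]
    return Counter(answer for answer, _ in positions).most_common(1)[0][0]


# Assume the correct answer is "A". One agent opens with a confident wrong "B".
ballots = [("B", 0.95), ("A", 0.40), ("A", 0.50)]

if __name__ == "__main__":
    print("one-shot vote:", one_shot_vote(ballots))
    print("debate result:", debate(ballots))
```

With these made-up ballots the fixed vote keeps the two correct answers in the majority, while the loop hands the panel to the single confident claim. No new evidence enters between rounds; only the positions move.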
A debate’s cascade resistance decides which way it breaks: independent seeding and an outside verifier let a lone fault be argued down, while correlation lets error propagation collapse the panel onto the shared error. The multi-agent failure-mode map treats debate as its own convergence topology, where a coupled vote can amplify one agent’s error instead of exposing it.
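One way to picture the outside verifier is as a gate between the panel and the caller. The sketch below assumes a task where an answer can be checked without reading the transcript (here, a made-up arithmetic question); gated_consensus and check_product are hypothetical names, not part of any debate framework.

```python
from collections import Counter


def gated_consensus(answers, verifier):
    """Return the panel's majority answer only if an outside check confirms it.

    `verifier` is any callable that judges an answer independently of the
    debate. A rejected consensus comes back as None so the caller can
    escalate instead of reporting it.
    """
    answer, votes = Counter(answers).most_common(1)[0]
    if verifier(answer):
        return answer, votes
    return None, votes


def check_product(claimed):
    # Hypothetical verifier: recomputes 17 * 3 rather than trusting the panel.
    return claimed == 17 * 3


if __name__ == "__main__":
    print(gated_consensus([41, 41, 41], check_product))  # unanimous but wrong
    print(gated_consensus([51, 51, 41], check_product))  # split but correct
```

The gate ignores how strongly the panel agrees, so a unanimous wrong answer is stopped while a contested correct one passes. That only helps on tasks where such an independent check exists.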