Failure modes · Propagation · Eval rigor
AI-agent reliability, measured beyond the eval.
Independent research on how agentic systems fail: how errors cascade across agents, how far they propagate, and whether the evaluations meant to certify them hold up.
Three ways in
Research, Everyday AI, and AI for Builders
One site, three doorways onto the same question: whether agentic systems are reliable enough to trust, and how you would know.
Research
Independent, cited analysis of how agentic systems fail: how errors cascade across agents, how far they propagate, and whether the evaluations meant to certify them hold up.
Everyday AI
For people who rely on AI day to day: what to trust, what to double-check, and where agents quietly act against your interests.
AI for Builders
For engineers shipping agents: reliability patterns, real failure modes, and the gaps that leaderboard benchmarks leave uncovered.
Featured research
Start with a flagship pillar
Long-form, cited analyses. One entry point per reliability theme. Or read the full library and the plain-language glossary behind it.
AI agent evaluation that follows the whole trajectory
AI agent evaluation breaks when it scores the final answer and skips the path. Evaluate the trajectory, catch early-step corruption, and report pass rates with intervals.
Error propagation and cascade containment in agent systems
OWASP's ASI08 files cascading failures under security. Reframed as error-propagation engineering, one fault becomes two measurable quantities, blast radius and containment rate, mapped to topologies.
How to test agent reliability beyond a single eval run
A single eval run samples agent reliability once. Rigorous testing measures it across many runs with confidence intervals, statistical power, pass^k, and fault injection for cascade propagation.
When RAG retrieves wrong chunks: failure modes and containment
RAG pipeline failure modes are containment failures at the retrieval-to-generation boundary: five modes mapped to how each propagates, its detection signal, and the gate that holds it.
Why multi-agent LLM systems fail, and how to contain it
Why multi-agent LLM systems fail, grounded in the MAST failure taxonomy and mapped to how each failure propagates across agent topologies and the containment levers that bound the blast radius.
Start here
Essential reading
A short path across the site, from the research spine to the two reader lanes.
- Reliability testing · How to test agent reliability beyond a single eval run
A single eval run samples agent reliability once. Rigorous testing measures it across many runs with confidence intervals, statistical power, pass^k, and fault injection for cascade propagation.
- Eval rigor · LLM-as-a-judge bias, and the tests that catch it
LLM-as-a-judge bias is systematic, measurable distortion in an evaluator. A per-bias map pairs each bias with a detection test and a correction, so you can tell when a judge's ranking would flip.
- Everyday AI · You can't hear the difference. Your AI sounds just as sure when it's wrong.
A language model's confidence reads like clean handwriting: the page stays just as neat whether the claim underneath is solid or hollow. Why right and wrong arrive in the same voice.
- Everyday AI · Your AI reads your calendar. A stranger's invite can give it orders.
The path that let a calendar invite hijack Perplexity's Comet assistant is closed. The class of attack behind it is still open. What to change on any assistant that reads your inbox.
- AI for Builders · Your dashboard stays green while your users still get wrong answers.
LangChain's State of Agent Engineering survey of 1,340 practitioners: 89% run observability, but only 52.4% run offline evals. An offline eval is what scores whether the answer was right.
Independent research, built and maintained by LatentEval. We show our methods and cite our sources. See our methodology and disclaimer.