Failure modes · Propagation · Eval rigor

AI-agent reliability, measured beyond the eval.

Independent research on how agentic systems fail: how errors cascade across agents, how far they propagate, and whether the evaluations meant to certify those systems hold up.

Three ways in

Research, Everyday AI, and AI for Builders

One site, three doorways onto the same question: whether agentic systems are reliable enough to trust, and how you would know.

Featured research

Start with a flagship pillar

Long-form, cited analyses, with one entry point per reliability theme. You can also read the full library and the plain-language glossary behind it.

Eval rigor

AI agent evaluation that follows the whole trajectory

AI agent evaluation breaks when it scores the final answer and skips the path. Evaluate the trajectory, catch early-step corruption, and report pass rates with intervals.

Propagation & containment

Error propagation and cascade containment in agent systems

OWASP's ASI08 files cascading failures under security. Reframed as error-propagation engineering, a single fault yields two measurable quantities, blast radius and containment rate, each mapped to agent topologies.

Reliability testing

How to test agent reliability beyond a single eval run

A single-run eval samples agent reliability once. Rigorous testing measures it across many runs with confidence intervals, statistical power, pass^k, and fault injection for cascade propagation.
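
As a rough illustration of two of the quantities named above, the sketch below estimates pass^k (the chance that all k independent attempts at a task succeed) from n recorded runs, and puts a Wilson score interval around the plain pass rate. The function names and the example counts are hypothetical, and the combinatorial estimator assumes the runs are independent and identically distributed; treat it as one common way to compute these numbers, not as the method used in the linked analysis.

```python
from math import comb, sqrt


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k from n runs with c successes.

    Uses the combinatorial estimator C(c, k) / C(n, k): the fraction of
    size-k subsets of the n runs in which every run succeeded.
    """
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n > 0, 0 <= c <= n, and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def wilson_interval(c: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of c successes in n runs.

    z = 1.96 corresponds to roughly 95% coverage.
    """
    if n <= 0 or not 0 <= c <= n:
        raise ValueError("require n > 0 and 0 <= c <= n")
    p = c / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    n_runs, successes = 10, 8  # hypothetical counts for one task
    for k in (1, 2, 4):
        print(f"pass^{k} estimate: {pass_hat_k(n_runs, successes, k):.3f}")
    low, high = wilson_interval(successes, n_runs)
    print(f"pass rate 95% interval: [{low:.3f}, {high:.3f}]")
```

With only ten runs the interval around an 80% pass rate is wide, which is the practical argument for reporting an interval alongside any single-run number.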

Pattern reliability

When RAG retrieves wrong chunks: failure modes and containment

RAG pipeline failure modes are containment failures at the retrieval-to-generation boundary: five modes mapped to how each propagates, its detection signal, and the gate that holds it.

Multi-agent failures

Why multi-agent LLM systems fail, and how to contain it

An account of why multi-agent LLM systems fail, grounded in the MAST failure taxonomy. Each failure is mapped to how it propagates across agent topologies and to the containment levers that bound its blast radius.

Browse all research → Read the glossary →

Start here

Essential reading

A short path across the site, from the research spine to the two reader lanes.

  1. Reliability testing · How to test agent reliability beyond a single eval run

    A single-run eval samples agent reliability once. Rigorous testing measures it across many runs with confidence intervals, statistical power, pass^k, and fault injection for cascade propagation.

  2. Eval rigor · LLM-as-a-judge bias, and the tests that catch it

    LLM-as-a-judge bias is systematic, measurable distortion in an evaluator. A per-bias map pairs each bias with a detection test and a correction, so you can tell when a judge's ranking would flip.

  3. Everyday AI · You can't hear the difference. Your AI sounds just as sure when it's wrong.

    A language model's confidence reads like clean handwriting: the page stays just as neat whether the claim underneath is solid or hollow. Why right and wrong arrive in the same voice.

  4. Everyday AI · Your AI reads your calendar. A stranger's invite can give it orders.

    The path that let a calendar invite hijack Perplexity's Comet assistant is closed. The class of attack behind it is still open. What to change on any assistant that reads your inbox.

  5. AI for Builders · Your dashboard stays green while your users still get wrong answers.

    In LangChain's State of Agent Engineering survey of 1,340 practitioners, 89% run observability but only 52.4% run offline evals. An offline eval is what scores whether the answer was right.

Independent research, built and maintained by LatentEval. We show our methods and cite our sources. See our methodology and disclaimer.