AI agent reliability, from consistency to containment

AI agent reliability breaks down into five properties: consistency, robustness, predictability, safety, and error propagation. This page maps where each one is measured.

By LatentEval. Published 2026-04-01. Updated 2026-05-25.

AI agent reliability comes down to five measurable properties: consistency across repeated runs, robustness when a dependency degrades, predictability across similar tasks, safety when the worst case fires, and how far one agent’s error propagates before something contains it. An agent can pass its eval every time someone happens to watch and still fail on any one of them: a different answer on the next identical run, a collapse when one tool times out, a divergent path through two near-identical tasks, a reach past its permissions, or a wiring that lets one bad step quietly poison every agent downstream. Those are distinct problems with distinct instruments. This page organizes them and routes you to the asset that operationalizes each.

The most common mistake is treating reliability as a synonym for accuracy. Accuracy asks whether an answer is correct on a given run. Reliability asks whether it stays correct across runs, under degraded dependencies, and once it is composed with other agents. A model can be more accurate this quarter than last and no more reliable, because the two properties move on different curves.
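The gap between the two can be sketched in a few lines. This is a minimal illustration, assuming pass^k means the probability that k independent runs of the same task all succeed, estimated with the usual combinatorial estimator. The function names and both outcome tables are hypothetical and are not measurements of any real agent.

```python
from math import comb

def pass_hat_k(successes: int, runs: int, k: int) -> float:
    """Estimate the chance that k runs, drawn without replacement, all succeed."""
    if not 0 < k <= runs:
        raise ValueError("need 0 < k <= runs")
    return comb(successes, k) / comb(runs, k)

def summarize(successes_per_task: list[int], runs: int, k: int) -> tuple[float, float]:
    """Return (accuracy, mean pass^k) over a set of tasks."""
    n_tasks = len(successes_per_task)
    accuracy = sum(successes_per_task) / (runs * n_tasks)
    reliability = sum(pass_hat_k(c, runs, k) for c in successes_per_task) / n_tasks
    return accuracy, reliability

# Hypothetical data: 5 tasks, 10 runs each, successes counted per task.
steady = [10, 10, 10, 10, 0]  # four tasks always pass, one never does
flaky = [8, 8, 8, 8, 8]       # every task passes 8 runs out of 10

steady_acc, steady_rel = summarize(steady, runs=10, k=5)
flaky_acc, flaky_rel = summarize(flaky, runs=10, k=5)

print(f"steady: accuracy={steady_acc:.2f} pass^5={steady_rel:.2f}")  # 0.80, 0.80
print(f"flaky:  accuracy={flaky_acc:.2f} pass^5={flaky_rel:.2f}")    # 0.80, 0.22
```

Both toy agents score the same accuracy, yet the flaky one clears five consecutive runs of a task far less often. A single pass rate hides that difference.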

The five properties “reliable” actually names

Reliability decomposes into five measurable properties. Four of them come from a growing effort to put agent reliability on a scientific footing; the fifth is the one this lane exists to add. The table is the map. For each property it states the question you are forced to answer, the failure you inherit when you leave it unmeasured, where in this toolkit it gets operationalized, and our read on how well the field measures it today.

Reliability property	Question it forces	Failure signature if unmeasured	Where the toolkit operationalizes it	Our read on the field’s practice
Consistency	Does the same input produce the same result across repeated runs?	A single green run hides a wide, unreported pass-rate distribution	Measurement of reliability@k and pass^k, reported with intervals	Usually shipped as one pass rate; the distribution is rarely shown
Robustness	Does the agent hold up when a dependency degrades, stalls, or returns stale data?	A transient tool fault surfaces as a confident wrong answer	Rigorous fault-injection testing	Exercised by happy-path evals far more than by injected faults
Predictability	Does behavior stay stable across similar tasks and over time?	Plans drift; near-identical inputs take divergent paths	Topology-aware production reliability budgeting	Watched in production more than bounded at design time
Safety	When the agent fails, how catastrophic is the worst outcome?	A rare destructive action, such as an unintended data-deleting write, is averaged in with benign errors	Named here; the propagation layer extends its severity across the topology	Severity is usually collapsed into an average error rate rather than tracked apart from frequency
Propagation & containment	When one agent fails, how far does the error travel before something stops it?	A local fault becomes a system-wide cascade	Failure-mode taxonomy, cascade containment, and the blast-radius vocabulary	The layer few combine with the other four, and where this lane concentrates

The last column records a judgment rather than a measurement. It marks where instrumentation is mature and where it is mostly aspiration, and it explains why the toolkit below leans toward the properties the field currently under-measures: consistency under repetition, robustness under injected faults, and propagation across a topology. If your team measures nothing else, those three are where a bad surprise in production is most likely hiding.

Where four of these come from, and why we add a fifth

The first four properties are not ours to claim. A study accepted at ICML 2026 sets out to make agent reliability a measured science and decomposes it into exactly these dimensions: “twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety” (Rabanser et al., Towards a Science of AI Agent Reliability, arXiv:2602.16666, accepted ICML 2026, as of 2026-07). Its headline result is a warning to anyone betting on the next model to fix reliability: across the models it evaluated, recent capability gains produced only small improvements in reliability. Buying a stronger model does not buy you a more reliable system.

That framework stops at the edge of a single agent. It measures whether one agent is consistent, robust, predictable, and in-bounds. It says nothing about what happens when that agent is one node in a system and its failure travels. That is the fifth property, and it is where this lane concentrates: propagation and containment.

The fifth property is load-bearing because single-agent reliability does not compose. MAST, the first multi-agent-system failure taxonomy, was built from 150 execution traces annotated by human experts (kappa = 0.88), then applied at scale to a dataset of more than 1,600 traces across seven frameworks annotated by an LLM-as-judge pipeline. The failures fall into “14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification” (Cemri et al., Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657, preprint, as of 2026-07). Two of those three categories describe how agents mislead each other and fail to catch it. Neither is a single-agent capability problem. A system built from four consistent, robust agents can still fail when a fault in one propagates unchecked into the other three.

Measuring that propagation is a separate discipline from scoring one agent. It means tracing how one agent’s mistake becomes another’s trusted input and pinning a system failure back to the step that actually caused it, then bounding how far the next fault gets. The arithmetic behind why a chain of individually strong agents ends up an unreliable system is worked through in the cascade-containment pillar; this hub points at it rather than repeating the math.

The differentiation is the combination. Anyone can score a single agent. The work is measuring the four single-agent dimensions with confidence intervals and run-to-run variance rather than a lone pass rate, and adding the propagation layer on top. Per-agent rigor and the propagation layer are each shallow on their own; together they are the map.

The toolkit, and why each asset belongs

The assets below are the operational half of that map. Each answers one of the five properties, and each earns its place by carrying something that cannot be regenerated from a keyword and the public web: a cited taxonomy, a worked measurement method, a budgeting model. Start with the group that matches the property you are weakest on.

The failure modes and how errors spread. The multi-agent failure-mode taxonomy grounds the propagation property. It maps MAST’s failure modes to how each one travels across a topology and which lever contains it. The cascade-containment pillar goes deeper on error propagation itself: how a single fault fans out, and how far you let it reach before a gate stops it. Because almost none of this is new to systems engineering, the microservices-to-agents translation checks which proven resilience patterns port cleanly to agents and which quietly break when the components are language models.
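As a rough illustration of fan-out and gating, the sketch below walks a small agent graph from the faulty node and stops at any gated agent. Everything in it is an assumption made for the example: the topology, the agent names, the `reached_agents` helper, and above all the idea that a gate catches a bad input every time. A real gate catches faults with some probability, and the glossary's own definitions take precedence over this simplification.

```python
from collections import deque

def reached_agents(edges: dict[str, list[str]], origin: str, gates: set[str]) -> set[str]:
    """Return the agents a fault at `origin` reaches, excluding the origin.

    Simplifying assumption: a gated agent always rejects the bad input,
    so it is not corrupted and forwards nothing bad downstream.
    """
    reached: set[str] = set()
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child in gates or child in reached or child == origin:
                continue
            reached.add(child)
            queue.append(child)
    return reached

# Hypothetical topology: a planner fans out to two workers that feed a writer.
topology = {
    "planner": ["researcher", "coder"],
    "researcher": ["writer"],
    "coder": ["writer"],
    "writer": ["publisher"],
}

ungated = reached_agents(topology, "planner", gates=set())
gated = reached_agents(topology, "planner", gates={"writer"})

print(sorted(ungated))  # ['coder', 'publisher', 'researcher', 'writer']
print(sorted(gated))    # ['coder', 'researcher']
```

Placing one gate at the point where the branches rejoin halves the reach of a planner fault in this toy graph. That is the kind of placement question the containment work addresses.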

Testing and measurement. The reliability-testing pillar is the entry point for robustness and consistency. It covers fault injection and the statistical machinery that turns a single run into a measurement you can defend. The measurement explainer is where consistency becomes concrete: reliability@k and pass^k reported with intervals, so a pass rate arrives as a distribution instead of a coin flip.
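A minimal sketch of how the two pieces fit together: inject a fault into a tool call, repeat the run many times, and report the pass rate with an interval. The tool, the agent, the fault rates, and the run count are all hypothetical stand-ins. The Wilson score interval is one common choice for a binomial proportion, and it is not necessarily the interval the measurement explainer uses.

```python
import math
import random

def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = successes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def flaky_tool(query: str, rng: random.Random, fault_rate: float) -> str:
    """Hypothetical tool that times out with probability `fault_rate`."""
    if rng.random() < fault_rate:
        raise TimeoutError("injected fault")
    return f"result for {query}"

def toy_agent(query: str, rng: random.Random, fault_rate: float, retries: int) -> bool:
    """Stand-in agent: passes if the tool answers within `retries` extra attempts."""
    for _ in range(retries + 1):
        try:
            flaky_tool(query, rng, fault_rate)
            return True
        except TimeoutError:
            continue
    return False

def measure(fault_rate: float, retries: int, runs: int = 200,
            seed: int = 0) -> tuple[int, tuple[float, float]]:
    """Run the toy agent `runs` times and return (passes, Wilson interval)."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    passes = sum(toy_agent("q", rng, fault_rate, retries) for _ in range(runs))
    return passes, wilson_interval(passes, runs)

clean_passes, clean_ci = measure(fault_rate=0.0, retries=0)
faulty_passes, faulty_ci = measure(fault_rate=0.3, retries=0)
retry_passes, retry_ci = measure(fault_rate=0.3, retries=2)

for label, passes, ci in [("clean", clean_passes, clean_ci),
                          ("faulty", faulty_passes, faulty_ci),
                          ("faulty+retry", retry_passes, retry_ci)]:
    print(f"{label}: {passes}/200 passes, interval=({ci[0]:.3f}, {ci[1]:.3f})")
```

Even the clean condition, with 200 passes out of 200, yields a lower bound below 1.0. A perfect score on a finite number of runs is still an estimate.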

Production. The production playbook is the predictability asset. It lays out a topology-aware way to spend a fixed reliability budget where it moves the end-to-end number most, rather than hardening whichever agent is easiest to reach.

The vocabulary. The propagation property needs precise terms, and each has a canonical entry in the reliability glossary: how far a fault reaches before something stops it, the fraction of faults held to a single hop, and whether a topology damps a fault or amplifies it. Link to these when you need them rather than redefining them in place; the definition lives once, in the glossary, so the numbers stay in one home.

Safety is the one property in the map without a dedicated asset here. Rabanser et al. define it as the severity of the worst outcome when an agent fails, split into a compliance component for staying within its constraints and a harm component that separates how bad a violation is from how often it happens. That severity is a single-agent measure. What the propagation layer adds is reach: how far that worst outcome travels once the agent is one node in a topology, which is what blast radius quantifies. We name safety to keep the map complete and carry its severity into the propagation work rather than restating a single-agent metric that paper already defines.
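To see why severity has to be tracked apart from frequency, consider the toy comparison below. It is an illustration only. The severity scale, the `Outcome` record, and both run logs are hypothetical, and none of it reproduces the metric definitions in the cited paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Outcome:
    violated: bool  # did the run break a constraint?
    severity: int   # hypothetical scale: 0 = harmless ... 3 = destructive

def safety_profile(outcomes: list[Outcome]) -> dict[str, float]:
    """Report how often violations happen separately from how bad the worst one is."""
    violations = [o for o in outcomes if o.violated]
    return {
        "violation_rate": len(violations) / len(outcomes),
        "worst_severity": max((o.severity for o in violations), default=0),
        "mean_severity": sum(o.severity for o in outcomes) / len(outcomes),
    }

# Hypothetical run logs, 100 runs each.
nuisance = [Outcome(True, 1)] * 3 + [Outcome(False, 0)] * 97
destructive = [Outcome(True, 3)] * 1 + [Outcome(False, 0)] * 99

nuisance_profile = safety_profile(nuisance)
destructive_profile = safety_profile(destructive)

print(nuisance_profile)
print(destructive_profile)
```

The two logs share the same mean severity, and the second even has the lower violation rate. Only the worst-severity field shows that it contains a destructive action.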

Where to start

Pick by symptom. If your eval passes but you cannot say how often it passes, the gap is consistency, and the measurement work is the first stop. If it passes until a dependency hiccups, the gap is robustness, and fault-injection testing is where to go. If a wrong answer from one agent keeps turning into three, the gap is containment, and the failure-mode map plus the full research index hold the propagation work. Most teams carry all three gaps and have measured none of them; the map exists so you can attack the one that is costing you, in the order it is costing you.

This site points toward a reliability profiler still in development: an instrument designed to inject a controlled fault, trace how far it cascades across a multi-agent topology, and report a containment rate with a confidence interval. That is design intent. No containment number has been measured, so this page asserts none. The profiler operationalizes the same map this page draws, and the research program records each containment run under its method and interval as the instrument completes it.