
Make AI agents reliable in production: a budget playbook

Reliable AI agents in production start with the compounding math: 0.95 per step over 20 steps is ~36% end to end. A topology-aware playbook of levers: retry, fallback, checkpoint, human-gate, degrade.

By LatentEval. Published 2026-04-01. Updated 2026-05-25.

Your agent workflow clears every demo and misses its reliability target the week it meets real traffic, and the next sprint comes down to one question: harden each step, or build machinery around the steps you already have. The compounding math settles most of that argument before you touch code. A workflow that chains twenty steps at 0.95 reliability each succeeds end to end about 36% of the time. Reliability in production is a budgeting problem, not an accuracy problem, and this page is the playbook for spending the budget.

The five production levers, and when each earns its place

The levers below are the whole playbook on one screen. Each one moves a specific term in the reliability budget: retry and fallback raise the effective reliability of a step, checkpoint caps how much a late failure costs to undo, and the human gate and graceful degradation bound how far a fault travels before a user acts on it. The judgment column carries the decision; the other columns set it up.

Lever	What it buys (the budget term it moves)	Its cost	When to reach for it	Judgment
Retry with jitter	Raises effective per-step reliability against transient faults, pushing that step’s p toward 1	Added latency and compute; on a non-idempotent step it forks state	A timeout, a rate-limit, or a flaky tool call on an idempotent step	Nearly free on transient faults, actively harmful on a confident wrong answer; gate it behind a fault-type check
Fallback	Recovers a fraction f of a step’s failures via an alternate path, lifting effective p to 1-(1-p)(1-f)	A second path to build and keep correct; a silent quality drop if the fallback is worse	A genuinely adequate alternative exists for the step	Highest leverage on long chains, because the gain multiplies at every step
Checkpoint	Caps rework and bounds cascade depth by resuming from known-good state	State management and staleness; checkpointing bad state persists the fault	Long, expensive chains where restarting from zero is costly	Lowers recovery cost and leaves correctness untouched; only ever checkpoint validated state
Human gate	Contains blast radius at the exit, catching semantic errors that structural gates wave through	Throughput and latency collapse; alert fatigue erodes the gate over time	High-cost, irreversible actions at low volume	The only lever that reliably catches a confident wrong answer, and the one that scales worst
Graceful degradation	Bounds user-facing blast radius by shipping a correct partial when the full answer would be wrong	Product complexity; you must define an acceptable partial and detect when to fall to it	User-facing paths where a partial answer beats a confident wrong one	The last edge before a fault reaches the user; turns an uncontained error into a contained one

Two patterns run through the column. Retry and fallback are the levers that raise reliability; the other three cap what a failure costs once it has happened. A too-low end-to-end number means you are short on the first kind. A rare failure that is catastrophic when it lands means you are short on the second.

The second pattern is a warning about the lever teams reach for first. Most agent faults are semantic rather than transient, and a retry re-runs a confident wrong answer through the same process. On a non-idempotent step it hands the next agent two authoritative versions to choose between. Adding jitter to backoff is standard practice for remote clients (Exponential Backoff And Jitter, AWS Architecture Blog, primary, as of 2026-07), but jitter only helps the transient case; where retry ports cleanly from service engineering and where it misleads is worked out in the companion piece on porting microservices resilience patterns to agents.
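The fault-type gate described above can be sketched in a few lines of Python. This is a minimal illustration, not a production client: TransientFault, retry_with_jitter, and every parameter name are hypothetical, and the delay schedule is the "full jitter" variant (a uniform random wait up to a capped exponential backoff). Only faults the caller has classified as transient are retried, and only on steps declared idempotent.

```python
import random
import time


class TransientFault(Exception):
    """Hypothetical marker for timeouts, rate limits, and flaky tool calls."""


def retry_with_jitter(step, *, idempotent, max_attempts=4, base_delay=0.5,
                      max_delay=8.0, sleep=time.sleep, rng=random.random):
    """Run step(), retrying only transient faults, and only on idempotent steps.

    Any other exception propagates on the first attempt. A confident wrong
    answer raises nothing at all, so this wrapper never sees it.
    """
    # A non-idempotent step gets exactly one attempt, so a retry cannot fork state.
    attempts = max_attempts if idempotent else 1
    for attempt in range(attempts):
        try:
            return step()
        except TransientFault:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the fault
            # Full jitter: uniform wait in [0, capped exponential backoff).
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
```

The sleep and rng parameters exist only so the wrapper can be exercised without real waiting. Nothing in this sketch detects a semantic fault; that job belongs to the other levers.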

The compounding math that forces the budget

The reason a budget exists at all is multiplication. Picture a workflow as steps running in series, where each step either passes its input through clean or corrupts it. Give every step an independent success probability p, and a chain of N steps lands its end-to-end result with probability p raised to the N. Every figure in this section is deterministic arithmetic on one assumption, independent and uniform per-step reliability, computed from stated assumptions and not measured on any running system.

At 0.95 per step, a twenty-step chain returns 0.95^20 ≈ 0.36. That is the number the arithmetic forces, and it is why an agent that is 95% reliable per step, and looks shippable in isolation, yields a roughly one-in-three end-to-end success rate once twenty such steps run in series. Push each step to 0.99 and the same chain clears 0.99^20 ≈ 0.82; let it slip to 0.90 and you drop to 0.90^20 ≈ 0.12.
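The arithmetic is short enough to check yourself. A minimal Python sketch, assuming only the independent, uniform per-step model stated above; the function name end_to_end is illustrative.

```python
def end_to_end(p: float, n: int) -> float:
    """Success probability of n independent steps in series, each with reliability p."""
    return p ** n


# Reproduce the three twenty-step figures quoted in the text.
for p in (0.90, 0.95, 0.99):
    print(f"p = {p:.2f}, N = 20 -> {end_to_end(p, 20):.2f}")
# p = 0.90, N = 20 -> 0.12
# p = 0.95, N = 20 -> 0.36
# p = 0.99, N = 20 -> 0.82
```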

Now turn the equation around, and the compounding curve becomes a budget. To hold a twenty-step chain to a 90% end-to-end target, every step has to clear 0.90^(1/20) ≈ 0.995 reliability; a 99% target demands 0.99^(1/20) ≈ 0.9995 per step. That is the per-step reliability budget, computed from the stated assumptions and not measured. Buying its last fraction with model accuracy alone is where most teams stall, and the levers above are how you meet the budget without it.
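The inversion is the same one-liner run backwards: take the N-th root of the end-to-end target. A small sketch under the same independence assumption; per_step_budget is an illustrative name.

```python
def per_step_budget(target: float, n: int) -> float:
    """Per-step reliability each of n independent series steps must clear
    so that the whole chain meets the end-to-end target."""
    return target ** (1.0 / n)


for target in (0.90, 0.99):
    p = per_step_budget(target, 20)
    # Express the budget as the failure allowance per step as well.
    print(f"target = {target:.2f}, N = 20 -> p >= {p:.4f} "
          f"(failure allowance {1 - p:.4%} per step)")
```

Read as a failure allowance, the 90% target leaves each step roughly five failures per thousand executions, and the 99% target leaves about one in two thousand.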

The independence assumption is a simplification, but not in the direction you might guess. For a series chain at a fixed per-step rate, coupling the steps’ failures (as shared state does) does not push the all-steps-clean rate below the curve, because correlated failures concentrate on fewer runs. What shared state worsens is the cost when a fault lands: it arrives in correlated clusters that hit many steps at once, and it can add a common-cause mode that lowers the effective per-step rate you should budget against. How far a fault travels once it starts, and what fraction your architecture holds to a single hop, is the amplification side of the same story, worked out with the geometric-reach math in error propagation and cascade containment. This page stays on the budget you set against that reach.
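One way to see the direction of that effect is a toy common-cause model. Everything in it is an assumption made for illustration, not a description of any measured system: with probability q a shared fault fails every step at once, and the function names are hypothetical. The first variant holds each step's marginal reliability fixed at p; the second adds the shared fault on top of the independent rate, which is the case that lowers the per-step number you should budget against.

```python
def all_clean_independent(p: float, n: int) -> float:
    """All-steps-clean rate for n independent steps at reliability p."""
    return p ** n


def all_clean_shared_fault(p: float, n: int, q: float) -> float:
    """Same marginal per-step reliability p, but failures are coupled:
    with probability q one shared fault fails every step together."""
    if not 0.0 <= q <= 1.0 - p:
        raise ValueError("q must lie in [0, 1 - p] to keep the marginal at p")
    p_conditional = p / (1.0 - q)  # per-step rate when the shared fault is absent
    return (1.0 - q) * p_conditional ** n


def all_clean_added_fault(p: float, n: int, q: float) -> float:
    """Shared fault added on top of the independent rate: the effective
    per-step reliability drops to (1 - q) * p, and so does the chain."""
    return (1.0 - q) * p ** n


p, n, q = 0.95, 20, 0.03
print(f"independent:            {all_clean_independent(p, n):.2f}")
print(f"coupled, same marginal: {all_clean_shared_fault(p, n, q):.2f}")
print(f"common cause added:     {all_clean_added_fault(p, n, q):.2f}")
```

With these illustrative inputs the coupled chain at the same marginal rate comes out near 0.64 against 0.36 for the independent one, because the failures pile onto the same runs. Adding the shared fault on top pulls the chain just under the independent figure, to about 0.35.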

Topology decides how the budget composes

The chain is the worst case, and it is the assumption baked into 0.95^20. The same per-step reliability composes very differently once the wiring changes, which is what makes the budget topology-aware rather than a single curve. Three compositions cover most production systems.

Topology	How reliability composes	End-to-end at 0.95 per step, 20 steps (computed)	Judgment: what it costs to hit target
Sequential chain (every step must hold)	p raised to the N; each step multiplies	0.95^20 ≈ 0.36	Brutal at depth; the only moves are raising p or cutting N
Supervisor with fallback (recover a failed step in place)	1-(1-p)(1-f) per step, then raised to the N	at f = 0.5: 0.975^20 ≈ 0.60	One recovery path lifts end-to-end by about two thirds (0.36 to 0.60); the most reliability per unit of effort
Redundant fan-out (any of K parallel attempts wins)	1-(1-p)^K per step, then raised to the N	at K = 2: 0.9975^20 ≈ 0.95	Pays latency and compute to raise reliability; wasteful on already-reliable steps

Supervisor with fallback is the highest-leverage row, and the composition arithmetic shows why. A single alternate path that recovers half of a step’s failures moves a twenty-step workflow from 0.36 to 0.60 end to end, computed from the stated assumptions. Redundant fan-out buys more still, though its gains come from spending latency and duplicated compute while the accuracy of each attempt holds flat, so it earns its place only where a step is expensive to get wrong and cheap to run twice.

One caution the table compresses: fan-out that requires every branch to succeed composes like a chain, p raised to the K, and is punishing rather than protective. Only the redundant, any-of-K form buys reliability. Naming which shape your parallelism has is the difference between spending the budget and burning it.
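The four compositions, including the all-of-K shape from the caution above, fit in a few lines of Python. This is a sketch on the same stated assumptions: steps are independent, parallel attempts are independent of each other, and f is the fraction of a step's failures the fallback recovers. The function names are illustrative.

```python
def chain(p: float, n: int) -> float:
    """n steps in series; every step must hold."""
    return p ** n


def with_fallback(p: float, f: float) -> float:
    """Effective per-step reliability when a fallback recovers fraction f of failures."""
    return 1 - (1 - p) * (1 - f)


def any_of_k(p: float, k: int) -> float:
    """Redundant fan-out: the step succeeds if any of k independent attempts does."""
    return 1 - (1 - p) ** k


def all_of_k(p: float, k: int) -> float:
    """Fan-out where every one of k branches must succeed."""
    return p ** k


p, n = 0.95, 20
print(f"sequential chain:      {chain(p, n):.2f}")                      # 0.36
print(f"fallback, f = 0.5:     {chain(with_fallback(p, 0.5), n):.2f}")  # 0.60
print(f"any-of-K, K = 2:       {chain(any_of_k(p, 2), n):.2f}")         # 0.95
print(f"all-of-K, K = 2:       {chain(all_of_k(p, 2), n):.2f}")         # 0.13
```

The last line is the punishing case: requiring both of two branches at every step turns a twenty-step chain into the equivalent of a forty-step one.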

Where to spend: match the lever to the leak

A budget is only useful spent where the money leaks. The diagnosis is a two-step read. First, classify the dominant fault at the leaking step: is it failing transiently, or returning a confident wrong answer? The empirical failure record splits along roughly the same seam. MAST, the first empirically grounded taxonomy of multi-agent LLM failures, developed from 150 expert-annotated traces (kappa = 0.88) and scaled to a 1,600+ trace dataset across 7 frameworks via an LLM-as-judge pipeline, sorts failures into system-design, inter-agent-misalignment, and verification categories (Cemri et al., Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657, preprint, as of 2026-07). Transient faults are the minority; most of the catalog is semantic, the class retry cannot touch.

Second, match the lever to that fault and to the wiring. A transient fault on an idempotent step is the one case where retry with jitter is close to free. A semantic fault needs a fallback that reaches a different result, a checkpoint so a late catch does not cost the whole run, or a gate that stops the wrong answer from being read downstream at all.

Which of those you buy depends on the topology. A deep chain wants stage-level fallbacks and a checkpoint at the expensive midpoint. A supervisor wants a meaning check on each worker return before it relays the result. A user-facing path wants graceful degradation as its final edge, so a caught fault leaves as a smaller correct answer and never ships as the wrong one. The failure-class view, which mode bites hardest in which topology, is mapped in the pillar on why multi-agent systems fail; it names the leaks this budget is spent against.

Prove the budget holds: measure it

A budget you set and never check is a guess with decimal places. Three measurements tell you whether the levers are meeting it, and each carries a canonical definition in the reliability glossary.

Containment rate. The share of injected faults a lever holds to one hop, read with its interval. A number stripped of its interval proves nothing.
Blast radius. How deep and how wide a single fault gets before a lever arrests it. Log blast radius per injection point as a distribution, and resist collapsing it into one headline figure.
Cascade resistance. How hard a topology damps a fault weighed against how hard it amplifies one. A cascade-resistance reading is what separates a chain whose gates hold from one that compounds under load.

All three come from one procedure: inject a controlled fault at a chosen point, follow it hop by hop across the wiring, and run the injection often enough that the containment rate arrives with error bars. The count of runs falls out of two things, how noisy the results are and how tight a bound you need. A single injection reports one outcome; a rate needs the repetition and the interval behind it. Propagation numbers carry the same intervals eval scores do, which is the footing that lets a lever’s measured gain mean anything.
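As an illustration of what "a rate with its interval" means in practice, here is a generic percentile bootstrap over repeated injection outcomes. It is a sketch under stated assumptions, not the method of any particular tool: each run is reduced to a single boolean (True when the fault was held to one hop), the function name and parameters are hypothetical, and the sample data at the bottom is synthetic.

```python
import random


def bootstrap_containment_interval(outcomes, n_resamples=2000,
                                   confidence=0.95, seed=0):
    """Return (rate, lower, upper) for the containment rate.

    outcomes: sequence of bools, one per injection run.
    The interval is a percentile bootstrap: resample the runs with
    replacement, recompute the rate each time, and read off the quantiles.
    """
    if not outcomes:
        raise ValueError("need at least one injection run")
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    lower = rates[int(alpha * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return sum(outcomes) / n, lower, upper


# Synthetic example only: 50 injections, 42 held to one hop.
runs = [True] * 42 + [False] * 8
rate, lower, upper = bootstrap_containment_interval(runs)
print(f"containment rate {rate:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

Rerunning the sketch with fewer runs widens the interval, which is the practical form of the trade-off above between how noisy the results are and how tight a bound you need.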

To be explicit about our own stake: the site points toward a reliability profiler still ahead of launch, designed to inject controlled faults and report a bootstrap interval around the containment rate it measures. That is design intent and nothing more. Because no run has produced that number yet, this page claims none of its own, and every outside figure it cites lands on a source you can open and verify.

Who should skip reliability budgeting

This playbook earns its keep only under compounding. If you run a single agent with one tool call and no hand-offs, the chain is one step long, 0.95^1 is still 0.95, and there is no budget to spend; harden the one step and move on. The same holds for a short workflow whose only failures are transient timeouts, where retry with jitter alone closes the gap and the rest of the levers are overhead.

Budgeting starts paying when steps multiply, when the faults are semantic rather than transient, and when one wrong output reaches something you cannot take back. If that describes your system, the budget is already being spent, whether or not you are the one choosing where. The research program collects the propagation and containment measurements behind these levers, each shipped with its method and its interval the moment it exists.