Reliability testing

How to measure agent reliability past a single pass rate

How to measure agent reliability with metrics that capture consistency, not just capability: pass@k versus pass^k, a reliability@k suite aggregate, and a confidence interval on every rate.

Your agent cleared the eval suite on the first run, and someone on the call wants to know whether that makes it reliable enough to ship. The pass rate you just quoted does not answer that. It tells you the agent can do the task once; it says nothing about whether the same task succeeds the next hundred times a user triggers it, or how far that rate would move if you re-ran the suite under a different seed. So the number you report next has to capture how consistently the agent repeats that success, which is something the first-run pass rate never measured.

The measurements a single pass rate leaves out

Reliability and capability are two different measurements, and a single pass rate reports only the second. A number that answers the capability question can look excellent while consistency is failing, so the first move is to stop reporting one metric and start reporting a set. The table below lists five quantities that together describe consistency: for each one, the question it answers, how it is computed on stated inputs, and when to reach for it.

| Metric | What it answers | How to compute (on stated inputs) | When to reach for it |
| --- | --- | --- | --- |
| pass@k | Can the agent solve the task at all, given k tries? | Unbiased estimator over n ≥ k samples with c correct: 1 − C(n−c, k) / C(n, k) (Chen et al., 2021) | You allow retries or best-of-n sampling and want the capability ceiling |
| pass^k | Does it succeed on every one of k runs? | Fraction of k-run windows with zero failures; under independence, p^k for a stated per-run rate p | Production reruns the same task and one failure is visible to a user |
| reliability@k | What share of the suite survives k runs clean? | Mean of pass^k across the task suite, reported with a confidence interval | The headline consistency number you report for the whole system |
| Per-run rate p̂ + CI | How sure are you of any single rate? | p̂ = c/n; 95% interval ≈ p̂ ± 1.96·√(p̂(1−p̂)/n), Wilson near 0 or 1 | Before you trust or quote any single pass number |
| Variance across seeds | How much does the number move on a re-run? | Re-run the suite under new seeds; report the between-seed spread or a resampled standard error | Before ranking two systems whose scores sit close together |

Read the first two rows against each other, because pass@k and pass^k answer two different questions about the same agent. Everything below is how you turn the right one into a number you can put an interval on.

pass@k and pass^k answer different questions

pass@k comes from code generation, where you sample the model several times and keep any solution that works. With k samples drawn per problem, a problem counts as solved if any one of them passes, and pass@k is the fraction of problems solved (Chen et al., 2021, arXiv preprint). Estimated from a finite budget of n ≥ k samples with c correct, the unbiased form is 1 − C(n−c, k) / C(n, k); plugging an empirical pass@1 into 1 − (1−p̂)^k instead is biased, which the same paper shows and corrects. Repeated sampling is potent: Codex solved 70.2% of the HumanEval problems with 100 samples per problem, against 28.8% from a single sample, point estimates the paper reports without confidence intervals.
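The estimator in the paragraph above can be sketched in a few lines of Python. This is a minimal sketch, not the paper's reference code: the function names `pass_at_k` and `plug_in_pass_at_k` and the counts in the demo are illustrative assumptions, and the formula is computed directly with binomial coefficients.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct (needs n >= k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, so the estimate is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def plug_in_pass_at_k(n: int, c: int, k: int) -> float:
    """Biased shortcut: substitute the empirical pass@1 into 1 - (1 - p)^k."""
    p_hat = c / n
    return 1.0 - (1.0 - p_hat) ** k


if __name__ == "__main__":
    n, c, k = 10, 3, 5  # illustrative counts, not measured data
    print(f"unbiased pass@{k}: {pass_at_k(n, c, k):.4f}")
    print(f"plug-in  pass@{k}: {plug_in_pass_at_k(n, c, k):.4f}")
```

With 3 correct out of 10 samples and k = 5, the unbiased form gives about 0.917 and the plug-in shortcut about 0.832, so the shortcut understates pass@k on these illustrative counts.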

That potency is the trap when you care about reliability. pass@k rewards an agent that gets there once in many tries. Production inverts the incentive: it runs the same task again and again, and one failure in a hundred is a failure a user sees. The reliability-facing counterpart is pass^k, the probability that an agent succeeds on all k independent attempts. Under independence with a stated per-run success probability p, pass@k equals 1 − (1−p)^k while pass^k equals p^k.

Take p = 0.8 as a stated, schematic per-run rate and set k = 3. pass@3 is 1 − 0.2³ = 0.99, and pass^3 is 0.8³ = 0.51 (computed from stated assumptions, not measured).

The same agent reads as 99% capable and 51% reliable from the same per-run rate, decided entirely by which metric you chose to print. The gap only widens with k: capability climbs toward one as tries accumulate, while consistency decays geometrically. That decay is the same multiplication that compounds per-step reliability across a multi-agent chain into a much lower end-to-end number, and the chain analysis is where the full compounding arithmetic lives. pass^k is that curve applied to repeats of one task instead of hops across many.
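To see the gap widen, you can tabulate both closed forms for the schematic p = 0.8 across a few values of k. This is a sketch under the independence assumption stated above; the function names and the choice of k values are illustrative, and nothing here is measured.

```python
def pass_at_k_indep(p: float, k: int) -> float:
    """Probability of at least one success in k independent runs."""
    return 1.0 - (1.0 - p) ** k


def pass_pow_k_indep(p: float, k: int) -> float:
    """Probability that all k independent runs succeed (pass^k)."""
    return p ** k


if __name__ == "__main__":
    p = 0.8  # stated, schematic per-run rate
    print(" k   pass@k   pass^k")
    for k in (1, 3, 5, 10):
        at_k = pass_at_k_indep(p, k)
        pow_k = pass_pow_k_indep(p, k)
        print(f"{k:2d}   {at_k:.3f}    {pow_k:.3f}")
```

At k = 1 the two columns agree at 0.8; every row after that moves them further apart.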

reliability@k: the number you report for the whole system

One task is not a system, so the metric you hand a stakeholder has to aggregate. We define reliability@k as the mean of pass^k across your task suite: the expected fraction of representative tasks that survive k consecutive runs with zero failures. It answers the question a leader is actually asking: how much of what we ship holds up under repetition.

A thresholded variant is often more faithful when the stakes are uneven. Set a bar per task, count a task as reliable only when its pass^k clears it, and report the fraction of tasks that clear the bar. A payment action might demand pass^k above 0.99; a draft-email action might accept far less. Averaging pass^k treats those the same, so the thresholded form keeps a few catastrophic tasks from hiding behind many easy ones.
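Both aggregates can be sketched as follows. Several things here are assumptions for illustration: the task names, the run counts, and the per-task thresholds are hypothetical, and the per-task pass^k estimate uses the fraction of k-run subsets with zero failures, C(c, k) / C(n, k), which is one reasonable estimator and not the only one. A task is counted as clearing its bar when its estimate is at or above the threshold.

```python
from math import comb
from statistics import fmean


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Fraction of k-run subsets (from n runs, c successes) with zero failures."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def reliability_at_k(per_task: dict[str, float]) -> float:
    """Mean of per-task pass^k across the suite."""
    return fmean(per_task.values())


def thresholded_reliability(per_task: dict[str, float],
                            thresholds: dict[str, float]) -> float:
    """Fraction of tasks whose pass^k is at or above that task's own bar."""
    cleared = sum(1 for task, value in per_task.items()
                  if value >= thresholds[task])
    return cleared / len(per_task)


if __name__ == "__main__":
    k = 3
    # Hypothetical (runs, successes) per task; not measured data.
    runs = {"payment": (20, 19), "draft_email": (20, 16), "search": (20, 18)}
    # Hypothetical per-task bars reflecting uneven stakes.
    bars = {"payment": 0.99, "draft_email": 0.40, "search": 0.70}

    per_task = {task: pass_pow_k(n, c, k) for task, (n, c) in runs.items()}
    print(f"reliability@{k} (mean):        {reliability_at_k(per_task):.3f}")
    print(f"reliability@{k} (thresholded): "
          f"{thresholded_reliability(per_task, bars):.3f}")
```

In this hypothetical suite the payment task misses its 0.99 bar even though its pass^3 estimate is the highest of the three, which is the case the thresholded form is meant to surface.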

A rate without an interval is one sample standing in for a distribution

Every rate above is an estimate from a finite number of runs, which means it carries uncertainty, and a rate reported without that uncertainty is one sample standing in for a distribution. Evaluations are experiments, and the statistics of experiments apply directly: every reported score needs the interval around it (Miller, 2024, arXiv preprint).

Suppose you observe 16 successes in 20 runs of a task. The point estimate is p̂ = 0.8. For a binary outcome the standard error is √(p̂(1−p̂)/n) = √(0.8 × 0.2 / 20) ≈ 0.09, so a 95% normal-approximation interval on the per-run rate is roughly 0.8 ± 0.18, or [0.62, 0.98] (computed from stated assumptions, not measured). Near 0 or 1 that approximation breaks down, and a Wilson interval is the safer default.
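The two intervals mentioned above can be computed side by side. This is a minimal sketch using only the standard library; the function names are illustrative, z = 1.96 is the usual 95% normal quantile, and clipping the normal interval to [0, 1] is a cosmetic choice that does not repair its poor coverage near the boundaries.

```python
from math import sqrt

Z_95 = 1.96  # two-sided 95% normal quantile


def normal_interval(c: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Normal-approximation (Wald) interval for a binomial rate c/n."""
    p_hat = c / n
    half = z * sqrt(p_hat * (1.0 - p_hat) / n)
    # Clip so the reported endpoints stay valid probabilities.
    return max(0.0, p_hat - half), min(1.0, p_hat + half)


def wilson_interval(c: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Wilson score interval for a binomial rate c/n."""
    p_hat = c / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    c, n = 16, 20  # the stated example: 16 successes in 20 runs
    lo, hi = normal_interval(c, n)
    print(f"normal: [{lo:.2f}, {hi:.2f}]")
    lo, hi = wilson_interval(c, n)
    print(f"wilson: [{lo:.2f}, {hi:.2f}]")
```

For 16 of 20 the Wilson interval comes out near [0.58, 0.92]: it is pulled toward 0.5 and stays inside [0, 1] without clipping, which is why it behaves better when p̂ sits near an endpoint.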

Now carry that interval through pass^3. Because p³ is monotone in p, the endpoints map straight across: pass^3 lands somewhere in [0.62³, 0.98³], which is [0.24, 0.94]. The reliability figure you were about to report as 51% is, on twenty runs, consistent with anything from 24% to 94%. That interval, not the point estimate, is the number you can defend, and it is precisely what a bare pass rate omits. It is also the same interval discipline a containment rate carries by definition rather than as an afterthought. Tightening it takes more runs, and since the width shrinks in proportion to 1/√n, halving the interval costs roughly four times the samples.
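The endpoint mapping and the sample-size cost can both be checked with a short sketch. The function names are illustrative, the endpoints 0.62 and 0.98 are the rounded values from the text, and the run-count formula assumes the normal approximation with the per-run rate held at the stated 0.8.

```python
from math import ceil


def pass_pow_k_interval(lo: float, hi: float, k: int) -> tuple[float, float]:
    """Map an interval on p through p**k (valid because p**k is monotone in p)."""
    return lo ** k, hi ** k


def runs_for_half_width(p: float, half_width: float, z: float = 1.96) -> int:
    """Runs needed so that z * sqrt(p(1-p)/n) is at most half_width."""
    return ceil(z * z * p * (1.0 - p) / half_width ** 2)


if __name__ == "__main__":
    lo, hi = pass_pow_k_interval(0.62, 0.98, 3)
    print(f"pass^3 interval: [{lo:.2f}, {hi:.2f}]")

    p = 0.8  # stated, schematic per-run rate
    for half_width in (0.18, 0.09):
        n = runs_for_half_width(p, half_width)
        print(f"half-width {half_width:.2f} needs about {n} runs")
```

Halving the half-width from 0.18 to 0.09 moves the requirement from about 19 runs to about 76, the fourfold cost described above.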

The variance a single run hides

Even that interval is optimistic, because it treats sampling within a single run as the only source of randomness. Re-run the same suite under a different random seed, a different prompt ordering, or a different sampling temperature, and the point estimate shifts on its own. A single run cannot observe that component of the variance at all, so a confidence interval computed from one run understates how much the headline can move.

The fix is to resample: run the whole suite several times under different seeds, then compute the spread of the suite-level means across those runs. Report that between-seed standard error alongside the within-run one, so a difference between two systems has to clear the noise before it counts as a result. Leaving the between-seed component out understates how much two close scores overlap, which makes it the deciding factor when you compare two architectures on their cascade resistance and the scores sit close together.
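A sketch of the between-seed calculation follows. The suite means are hypothetical numbers, not measurements, and the function names are illustrative. The comparison rule is also an assumption: it treats the two systems' seed runs as independent and asks whether the difference in means exceeds 1.96 combined standard errors, which is a rough screen and no substitute for a proper test when there are few seeds.

```python
from math import sqrt
from statistics import fmean, stdev


def between_seed_summary(suite_means: list[float]) -> dict[str, float]:
    """Mean, between-seed spread, and standard error of the mean across seeds."""
    if len(suite_means) < 2:
        raise ValueError("need at least two seeded runs to estimate spread")
    spread = stdev(suite_means)  # sample standard deviation across seeds
    return {
        "mean": fmean(suite_means),
        "spread": spread,
        "standard_error": spread / sqrt(len(suite_means)),
    }


def difference_clears_noise(a: list[float], b: list[float],
                            z: float = 1.96) -> bool:
    """True if the gap between two systems exceeds z combined standard errors."""
    sa, sb = between_seed_summary(a), between_seed_summary(b)
    se_diff = sqrt(sa["standard_error"] ** 2 + sb["standard_error"] ** 2)
    return abs(sa["mean"] - sb["mean"]) > z * se_diff


if __name__ == "__main__":
    # Hypothetical suite-level scores, one per seed.
    system_a = [0.78, 0.82, 0.80, 0.76, 0.84]
    system_b = [0.79, 0.83, 0.81, 0.77, 0.85]
    print(between_seed_summary(system_a))
    print("B vs A clears the noise:", difference_clears_noise(system_a, system_b))
```

In this hypothetical pair the one-point gap in means is smaller than the seed-to-seed noise, so the comparison reports that the difference does not clear it.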

How to report a reliability number that holds

The reporting rule is short. Quote pass^k whenever production reruns the same task and a single failure reaches a user, and reserve pass@k for the genuine best-of-n case where retries are allowed. Aggregate pass^k to reliability@k across a representative suite, thresholded when the stakes are uneven. Attach a confidence interval to every rate, then widen it with the seed-to-seed spread before you compare systems or promise a figure to a stakeholder. If the interval is too wide to decide on, the fix is more runs; rounding the point estimate only hides the width.

This site is working toward a reliability profiler that will estimate these metrics for your own agent against field norms and report each with a confidence interval. That is design intent; no measured number is claimed here. The vocabulary behind the metrics lives in the reliability glossary, and the published runs, with their methods and intervals, are where numbers like the ones above stop being schematic and start being measurements you can cite.