reliability@k and pass^k

pass^k is the probability an agent solves all k runs of one task (closed form p^k). reliability@k is the lane's suite-level aggregate of pass^k: the mean across a representative task suite. It is the consistency counterpart to pass@k (best-of-k capability), not its inverse.

By LatentEval. Published 2026-04-01.

pass^k is the probability that an agent system succeeds on all k independent runs of the same task; given a stated per-run success rate p, its closed form is p^k. reliability@k is the lane’s suite-level report of this consistency view: the mean of pass^k across a representative task suite. For any task with p below 1, the per-task number falls as k grows, because demanding success on more runs is a harder bar. reliability@k is the consistency counterpart to pass@k, the capability metric from code generation, where k samples are drawn and a task counts as solved if any one passes the unit tests (Chen et al., 2021, arXiv:2107.03374, preprint).
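The two definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lane's implementation: the function names `pass_hat_k` and `reliability_at_k` are hypothetical, the per-run success rates in `suite` are made-up inputs rather than measurements, and runs are assumed independent so that the closed form p^k applies.

```python
from statistics import mean


def pass_hat_k(p: float, k: int) -> float:
    """pass^k: probability that all k independent runs of one task succeed,
    given a stated per-run success rate p (closed form p**k)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return p ** k


def reliability_at_k(per_task_p: list[float], k: int) -> float:
    """reliability@k: mean of the per-task pass^k values across a suite."""
    if not per_task_p:
        raise ValueError("the suite must contain at least one task")
    return mean(pass_hat_k(p, k) for p in per_task_p)


# Hypothetical per-run success rates for a three-task suite (assumed, not measured).
suite = [0.9, 0.5, 1.0]
print(round(reliability_at_k(suite, 3), 3))  # mean of 0.729, 0.125 and 1.0
```

The aggregate is taken over per-task pass^k values, not by raising the suite's average success rate to the k-th power. The two differ whenever tasks have different success rates: here the mean of pass^3 is about 0.618, while the average rate of 0.8 cubed is 0.512.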

pass@k measures the best of k tries; reliability@k measures the tries you cannot afford to lose.

The gap is operational. At a per-run success rate of 0.9, with runs assumed independent, pass@3 reaches 1 - 0.1^3 = 0.999 while pass^3 is 0.9^3 = 0.729 (computed from stated assumptions, not measured). The same system succeeds with probability 0.999 when one success in three is enough, but gets all three runs right with probability only 0.729.
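The arithmetic above extends to other values of k. The sketch below assumes a known per-run success rate and independent runs, so both metrics reduce to closed forms. It is not the sample-based pass@k estimator from Chen et al., and the function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Best-of-k: probability that at least one of k independent runs succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """All-of-k: probability that every one of k independent runs succeeds."""
    return p ** k


p = 0.9  # stated per-run success rate (an assumption, not a measurement)
for k in (1, 3, 5, 10):
    print(f"k={k:>2}  pass@k={pass_at_k(p, k):.3f}  pass^k={pass_hat_k(p, k):.3f}")
```

At k = 1 the two metrics coincide with the single-run pass rate; as k grows, pass@k climbs toward 1 while pass^k decays toward 0.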

A single-run pass rate reports neither. See how to measure agent reliability for the k-run estimator, agent reliability testing for the protocol, and eval confidence interval for the interval every k-run number still needs.