Eval confidence interval

An eval confidence interval is the range a procedure produces that, across repeated runs of a suite, brackets a metric's true value a stated fraction of the time (say 95%); its width combines a task-set term (closed-form binomial, or bootstrap) with the seed-to-seed spread, which one run omits.

An eval confidence interval is the range around an eval metric, say a pass rate, that a procedure produces to bracket the metric’s true value in a stated fraction of repeated runs of the same suite (say 95% of such intervals cover it). You estimate it from two components: a closed-form binomial interval over the finite task set (or a bootstrap when a closed form is awkward), widened by the run-to-run spread you get from re-running under independent seeds. Run the suite once and report only the score, and you have a point estimate with no width; a single run also gives you no handle on the seed-to-seed spread.
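One way to fold both components into a single interval is a two-level bootstrap that resamples seeds and tasks together. The sketch below is an illustration under stated assumptions, not a prescribed method: the function name `bootstrap_pass_rate_ci`, the `outcomes[seed][task]` layout, and the synthetic pass/fail data are all hypothetical, and with only a handful of seeds the seed-level resampling tends to understate the true spread.

```python
import random

def bootstrap_pass_rate_ci(outcomes, n_boot=2000, level=0.95, rng_seed=0):
    """Percentile bootstrap interval for a pass rate.

    outcomes[s][t] is 1 if task t passed under seed s, else 0
    (a hypothetical layout chosen for this sketch).
    """
    rng = random.Random(rng_seed)
    n_seeds, n_tasks = len(outcomes), len(outcomes[0])
    stats = []
    for _ in range(n_boot):
        # Resample seeds (run-to-run spread) and tasks (task-set term)
        # with replacement, then score the resampled grid.
        seeds = [rng.randrange(n_seeds) for _ in range(n_seeds)]
        tasks = [rng.randrange(n_tasks) for _ in range(n_tasks)]
        total = sum(outcomes[s][t] for s in seeds for t in tasks)
        stats.append(total / (n_seeds * n_tasks))
    stats.sort()
    alpha = 1.0 - level
    lo = stats[int((alpha / 2) * (n_boot - 1))]
    hi = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi

# Synthetic data for illustration only (not measured):
# 5 seeds x 100 tasks, each cell passing with probability 0.8.
data_rng = random.Random(1)
synthetic = [[1 if data_rng.random() < 0.8 else 0 for _ in range(100)]
             for _ in range(5)]
print(bootstrap_pass_rate_ci(synthetic))
```

The percentile method is the simplest choice here; it shares the weakness of any bootstrap at very small sample sizes, so it complements the closed-form interval and does not replace it.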

A point estimate presents one draw from a distribution as a constant. Two agents scoring 82% and 79% on a single run each are statistically tied when either score swings several points on a re-run.
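As a rough illustration of why 82% and 79% can be a tie, the sketch below puts a normal-approximation 95% interval on the difference between two pass rates. The task count of 100 per agent is a hypothetical assumption (the example above does not state one), the two runs are treated as independent samples, and only the task-set term is counted; seed variance would widen the interval further.

```python
import math

def diff_interval(p_a, p_b, n_a, n_b, z=1.96):
    """Normal-approximation interval for p_a - p_b (unpaired samples)."""
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    d = p_a - p_b
    return d - z * se, d + z * se

# Hypothetical: both agents scored on 100 tasks, one run each.
lo, hi = diff_interval(0.82, 0.79, 100, 100)
print(f"difference interval: ({lo:.3f}, {hi:.3f})")
print("interval contains zero:", lo <= 0.0 <= hi)
```

Under these assumptions the interval straddles zero, so the three-point gap does not separate the agents. If both agents run the same tasks, a paired comparison would be the tighter analysis.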

The variance has two sources: the finite set of tasks the suite samples, and the stochasticity of the run itself (temperature, tool nondeterminism). Re-running under independent seeds exposes the second; the first has a closed form. For a pass rate of 0.80 over 100 tasks, the normal-approximation 95% interval is 0.80 ± 0.078 (1.96 × sqrt(0.8 × 0.2 / 100) ≈ 0.078; computed from stated assumptions, not measured), and seed variance widens it further. The normal approximation is unreliable near 0 or 1 and at small task counts, where a boundary-respecting interval (Wilson or exact) is the safer default.
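The closed-form term can be sketched in a few lines. The example below computes the normal-approximation interval from the worked numbers above and, for comparison, the Wilson score interval. The function names are illustrative, and both functions cover only the task-set term, not seed variance.

```python
import math

def normal_interval(p, n, z=1.96):
    """Normal-approximation (Wald) interval for a pass rate p over n tasks."""
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(p, n, z=1.96):
    """Wilson score interval; stays within [0, 1] up to rounding."""
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# The worked case from the text: pass rate 0.80 over 100 tasks.
print("normal:", normal_interval(0.80, 100))
print("wilson:", wilson_interval(0.80, 100))

# Near the boundary: 10 of 10 tasks passed.
print("normal:", normal_interval(1.0, 10))   # zero width
print("wilson:", wilson_interval(1.0, 10))   # lower bound well below 1
```

At 10 of 10 the normal approximation collapses to zero width, which claims certainty from ten tasks; the Wilson interval keeps a lower bound well below 1, which is why it is the safer default near the boundaries.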

Report the interval next to the score, as the containment rate already does and as reliability@k numbers still require. The general convention for reporting a rate with its interval lives in “how to measure agent reliability”. When an LLM judge produces the score, the interval also has to absorb the judge’s measured error: “reporting LLM-as-a-judge evaluations” coins a more specific “eval-CI” for that bias-corrected, calibration-aware interval, a specialized variant of the general interval defined here. Companion explainers cover statistical significance and how many runs you need.