How many runs a reliable eval needs to catch a regression

How many runs a reliable eval needs is a power calculation set by the regression you must catch, your target power, and the baseline pass rate. Includes a runs-needed table and the formula behind it.

Before you can trust a green re-run, you have to decide the smallest regression you would actually roll back a release for, then buy enough runs to see it. That number is not a matter of taste. It falls out of a power calculation with three inputs: the effect size you must catch (how many points of pass rate the regression costs), the power you want (the probability the eval flags a real regression of that size), and the baseline pass rate you are measuring against. Fix those three and the number of runs is determined. A single pass rate, however green, answers a different question from “would this eval have caught the regression you care about?”

Runs needed to catch a regression, by effect size and power

The table below is the working artifact of this page. It assumes a baseline pass rate of 90%, a one-sided test at the conventional 5% false-alarm rate, and reports the runs a candidate build needs so that a regression of the stated size trips the test with the stated probability. The values are computed from the sample-size-for-a-proportion formula on those stated assumptions, not measured on any system.

Regression to catch (from a 90% baseline) | Runs at 80% power | Runs at 90% power | What the number tells you
10 points (90% → 80%) | ~70 | ~100 | A gross regression is cheap to catch; roughly a hundred runs or fewer cover it
5 points (90% → 85%) | ~250 | ~360 | The common “did the last change hurt us” band; budget hundreds
2 points (90% → 88%) | ~1,500 | ~2,100 | Fine regressions demand four-figure run counts
1 point (90% → 89%) | ~5,700 | ~8,000 | Single-point drops are effectively undetectable at any realistic budget

Two things jump out of the columns. Detectable resolution is expensive: halving the regression you want to catch roughly quadruples the runs, because the count scales with the inverse square of the effect size. Power costs too, though less steeply: moving from 80% to 90% power adds between 40% and 50% more runs, depending on the row.

The operational reading is blunt. If your CI budget funds 100 eval runs per build, this table says you can catch a 10-point collapse and almost nothing subtler, so report that as the regression your eval is actually powered to detect.
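The budget question can also be run in reverse. The sketch below inverts the one-sided sample-size formula from the next section by bisection, to estimate the smallest drop a fixed run count can flag. The function names `runs_required` and `min_detectable_drop` are illustrative, and the 90% baseline, 5% false-alarm rate, and 80% power are the table's stated assumptions, not measurements.

```python
from math import sqrt
from statistics import NormalDist


def runs_required(p0, p1, alpha=0.05, power=0.80):
    """Un-rounded one-sided run count for detecting a drop from p0 to p1."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    spread = z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
    return (spread / (p0 - p1)) ** 2


def min_detectable_drop(n_runs, p0=0.90, alpha=0.05, power=0.80, tol=1e-4):
    """Smallest drop (as a fraction of pass rate) that n_runs can flag."""
    lo, hi = tol, p0 - tol  # bracket: a tiny drop up to a near-total collapse
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if runs_required(p0, p0 - mid, alpha, power) > n_runs:
            lo = mid  # this drop needs more runs than the budget allows
        else:
            hi = mid  # this drop fits the budget; try a smaller one
    return hi


for budget in (30, 100, 250, 1000):
    print(budget, "runs ->", round(100 * min_detectable_drop(budget), 1), "points")
```

Under these assumptions a 100-run budget resolves a drop of around 8 points, which agrees with the table's reading that only the 10-point row is comfortably covered.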

The math behind the table

The count comes from the standard sample-size formula for a proportion, the same one an engineer uses to size a defect-rate acceptance test (NIST/SEMATECH e-Handbook, 7.2.4.2 Sample sizes required, primary, as of 2026-07). Write p0 for the baseline pass rate, p1 for the degraded rate you want to be able to flag, and δ for the gap between them, defined “as the change in the proportion defective that we are interested in detecting” (same source). For a one-sided test at significance α and power 1 − β:

n = ( z(1−α) × √(p0·q0) + z(1−β) × √(p1·q1) )² / δ² where q = 1 − p and δ = p0 − p1.

Take the 5-point row. Baseline p0 = 0.90, degraded p1 = 0.85, so δ = 0.05. At α = 0.05 the critical value z(0.95) is 1.645; for 80% power z(0.80) is 0.842. The numerator is (1.645 × √0.09 + 0.842 × √0.1275)² = (0.493 + 0.301)² ≈ 0.630, and dividing by δ² = 0.0025 gives ≈ 252 runs. That is the arithmetic under the “~250” cell, and every other cell is the same formula with different inputs, reproducible by hand.
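The same arithmetic can be scripted. This is a minimal sketch of the formula above using only the Python standard library; `runs_needed` is a name chosen here for illustration, and the defaults mirror the table's assumptions (one-sided test, α = 0.05, 80% power). Because it rounds up and uses unrounded z-values, a cell can land a run or two above the hand calculation.

```python
from math import ceil, sqrt
from statistics import NormalDist


def runs_needed(p0, p1, alpha=0.05, power=0.80):
    """One-sided run count for flagging a drop from baseline p0 to degraded p1."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # about 1.645 at alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)       # about 0.842 at 80% power
    delta = p0 - p1
    numerator = (z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))) ** 2
    return ceil(numerator / delta ** 2)


# Rebuild the rows of the 90%-baseline table at both power levels.
for p1 in (0.80, 0.85, 0.88, 0.89):
    print(f"90% -> {p1:.0%}:",
          runs_needed(0.90, p1, power=0.80),
          runs_needed(0.90, p1, power=0.90))
```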

One assumption is doing quiet work here: this count treats the baseline as a fixed, well-established reference, such as a large historical run you trust. Compare two equally noisy runs instead, a fresh candidate against a fresh baseline, and the two-sample version of the formula roughly doubles the per-arm budget, taking the 5-point row from about 250 to about 540 runs on each side.
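For the fresh-versus-fresh comparison, one common textbook form of the two-sample formula pools the two rates under the null. The sketch below assumes that pooled form and equal runs per arm; the page does not say which two-sample variant it used, so treat `runs_per_arm` as an illustration that lands near the quoted figure, not as the page's exact method.

```python
from math import ceil, sqrt
from statistics import NormalDist


def runs_per_arm(p0, p1, alpha=0.05, power=0.80):
    """Per-arm runs for a one-sided two-sample comparison of pass rates."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p0 + p1) / 2  # pooled rate under the null (assumed form)
    delta = p0 - p1
    null_term = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    alt_term = z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))
    return ceil((null_term + alt_term) ** 2 / delta ** 2)


print(runs_per_arm(0.90, 0.85))  # runs needed on each side of the comparison
```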

The formula is also a normal approximation, and it runs optimistic in exactly the regime evals live in: high pass rates near the ceiling and small effect sizes. NIST adds a continuity correction of 1/δ (another 20 runs on the 5-point row), and near a pass rate of 1 a boundary-respecting method such as Wilson or an exact binomial, the same correction the eval confidence interval leans on, pushes the count higher still. Treat every number in the table as a planning floor that those corrections only raise.
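One way to see how optimistic the approximation is: apply the 1/δ correction, then compute the power of an exact one-sided binomial test at the planned run count and compare it with the nominal target. The sketch below does both. The helper names are hypothetical, and the term-by-term sum is only practical for a few hundred runs; larger counts would need log-space arithmetic.

```python
from math import ceil, comb


def with_continuity_correction(n, delta):
    """Add the 1/delta continuity correction to a normal-approximation count."""
    return n + ceil(1 / delta - 1e-9)  # small guard against float round-off


def binom_cdf(c, n, p):
    """P(X <= c) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))


def exact_power(n, p0, p1, alpha=0.05):
    """Power of the exact one-sided test that flags a low pass count."""
    # Find the largest pass-count threshold c whose false-alarm rate
    # under the baseline p0 still stays at or below alpha.
    c = -1
    while c + 1 <= n and binom_cdf(c + 1, n, p0) <= alpha:
        c += 1
    # Power is the chance the degraded build lands at or below that threshold.
    return binom_cdf(c, n, p1) if c >= 0 else 0.0


n_plain = 252
n_corrected = with_continuity_correction(n_plain, 0.05)
print(n_plain, round(exact_power(n_plain, 0.90, 0.85), 3))
print(n_corrected, round(exact_power(n_corrected, 0.90, 0.85), 3))
```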

Why a higher baseline pass rate is cheaper to defend

The baseline rate does real work in the formula: it sets the variance you are fighting. A proportion’s variance, p(1 − p), is largest at 50% and shrinks toward either edge, so catching the same absolute drop costs fewer runs when your eval already passes most of the time. Holding the regression at 5 points and power at 80%, the runs needed move with the baseline like this.

Baseline pass rate | Runs to catch a 5-point drop | Why it moves
50% | ~620 | Maximum variance; the most expensive place to measure
70% | ~530 | Still deep in the high-variance middle
80% | ~420 | Variance easing as the rate climbs
90% | ~250 | The common target band for a shipped agent
95% | ~150 | Near the ceiling, the same drop is four times cheaper than at 50%

The practical takeaway is counterintuitive. The better your system already scores, the fewer runs you need to notice it slipping by a fixed margin. A suite hovering around 50% is the hardest thing to certify, which is one reason a saturated-looking benchmark can still be the more measurable one.
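The baseline sweep can be reproduced by holding the drop at 5 points and printing the variance term next to the run count. This is a sketch under the same assumptions as the table (one-sided, α = 0.05, 80% power); `runs_for_drop` is an illustrative name.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z_ALPHA = NormalDist().inv_cdf(0.95)  # one-sided test at alpha = 0.05
Z_BETA = NormalDist().inv_cdf(0.80)   # 80% power


def runs_for_drop(baseline, drop=0.05):
    """Runs needed to flag a fixed absolute drop from the given baseline."""
    degraded = baseline - drop
    spread = (Z_ALPHA * sqrt(baseline * (1 - baseline))
              + Z_BETA * sqrt(degraded * (1 - degraded)))
    return ceil((spread / drop) ** 2)


for baseline in (0.50, 0.70, 0.80, 0.90, 0.95):
    variance = baseline * (1 - baseline)  # p(1 - p), largest at 50%
    print(f"{baseline:.0%}  variance={variance:.4f}  runs={runs_for_drop(baseline)}")
```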

Sizing the runs, and the test that comes after

Two questions sit back to back and get confused. Sizing the runs is the prospective question this page answers: before spending any compute, how many runs make a δ-sized regression visible with a chosen probability. Asking whether an observed gap between two pass rates is larger than run-to-run noise would explain is a significance test, run on data you already have. The two share a distribution but invert the unknown. Here you fix the effect and the power and solve for the number of runs; a significance test fixes the runs and the data and solves for how surprised to be.

Size the runs first. A significance test on an underpowered eval keeps returning “not significant” whether or not a regression is present, and a starved null looks identical to a true one from the outside.
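To make the contrast concrete, here is a sketch of the retrospective side: a one-sided z-test of an observed pass count against a fixed baseline. The choice of test and the name `regression_p_value` are assumptions made for illustration, since the page does not prescribe a particular significance test. The two calls use roughly the same observed pass rate at different run counts.

```python
from math import sqrt
from statistics import NormalDist


def regression_p_value(passes, runs, p0):
    """One-sided p-value for 'the true pass rate is below the baseline p0'."""
    p_hat = passes / runs
    standard_error = sqrt(p0 * (1 - p0) / runs)  # spread under the null
    z = (p_hat - p0) / standard_error
    return NormalDist().cdf(z)  # small when the observed rate sits well below p0


# About 85-87% observed against a 90% baseline, at two budgets.
print(regression_p_value(26, 30, 0.90))    # 30 runs: not significant at 0.05
print(regression_p_value(214, 252, 0.90))  # 252 runs: significant at 0.05
```

The 30-run result says nothing about whether a regression exists; the eval was simply never sized to see one.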

What more runs cannot buy you

Power only sizes the noise. Adding runs shrinks the interval around a pass rate and lifts your odds of catching a genuine drop, but it does nothing to an error that points the same direction on every run. A judge that rewards verbosity, a benchmark that has drifted, a leaked test item: these shift the number identically whether you run the suite ten times or ten thousand, so a bias-corrected score has to come first, before the runs-needed math means anything. Uncorrected bias of this kind is one of the ways an eval number reads green while lying, and the judge-specific version, with its calibration-aware interval, lives in reporting an LLM-as-a-judge score.

The formula also assumes runs are independent draws. Shared seeds, cached tool responses, or a fixed prompt order correlate them, which shrinks the effective sample below the count you paid for and widens the true interval past what n alone predicts. Varying the seed and re-running is how you expose that correlation, the discipline behind eval reproducibility and the fault-injection protocol that produces reliability numbers you can put an interval on.
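A rough way to put a number on that shrinkage is the standard design-effect adjustment from survey sampling, which this page does not itself use. The sketch assumes runs fall into equal-sized groups that share a seed or cache, with a correlation `rho` between runs in the same group; both parameters are hypothetical inputs you would have to estimate from your own data.

```python
def effective_runs(n_runs, group_size, rho):
    """Approximate independent-run equivalent of n_runs correlated runs.

    Assumes equal-sized groups (for example, runs sharing a seed) with
    within-group correlation rho; uses the design effect 1 + (m - 1) * rho.
    """
    design_effect = 1 + (group_size - 1) * rho
    return n_runs / design_effect


# 250 paid-for runs in groups of 5 that share a cache, moderately correlated.
print(effective_runs(250, group_size=5, rho=0.5))
# Fully independent runs (groups of one) keep their full count.
print(effective_runs(250, group_size=1, rho=0.5))
```

If the effective count falls below the runs-needed figure, the eval is underpowered even though the nominal run count looks sufficient.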

Runs-needed sizes one question and leaves another open. reliability@k and pass^k ask a separate thing, the probability an agent clears every one of k runs, and the k-run reliability estimator is where that figure and its interval get produced. Size your runs with the power formula, then report the resulting rate with the interval those pages require.

The rule to take back to your eval budget

Pick the smallest regression you would actually block a release for and treat it as your δ. Set power to 0.8 as a working default, read the runs off the table, and check that number against the compute you can spend per build. If the runs fit, you have an eval powered to catch what you care about. If they do not, you have a decision to make in the open: accept a larger detectable δ, drop to lower power, or split the difference, and then publish the regression size your eval can actually detect rather than implying it catches everything. An eval that quietly runs 30 times and calls a build clean is powered to detect almost nothing, and saying so out loud is worth more than the green check.

The reliability profiler this site is building toward is designed to run this power calculation against field-normal baselines and hand back the runs-needed number with the interval it implies; that is design intent, pre-launch, with no measured figure to report here. For the vocabulary these numbers lean on, the reliability glossary fixes each term, and the research index turns schematic arithmetic like the tables above into measured runs carrying their own intervals.