Eval rigor

Is your eval difference statistically significant?

Two eval runs a few points apart. Separate a real gain from run-to-run noise with a paired McNemar test on the same items: a p-value and a confidence interval on the pass-rate delta.

Two runs on the same suite: 87 percent for the new prompt, 80 percent for the baseline, and a “+7 points” line half-written into the release note. One question decides whether those seven points mean anything: is the difference a real improvement, or the same system landing differently twice? Real release gates get built on deltas this size. An audit of two public LLM leaderboards found that many reported pairwise rankings never clear a paired significance test at the usual thresholds, with 11 of 40 comparisons on one and 4 of 9 adjacent-rank pairs on another unresolved at 5 percent significance and 80 percent power (Resolution Diagnostics for Paired LLM Evaluation, arXiv:2605.30315, preprint, single-study, as of 2026-07).

A reported delta is a hypothesis to test, not a number to eyeball. When both runs scored the same items pass or fail, the right instrument is McNemar’s test, and it converts your two runs into a p-value plus a confidence interval on the pass-rate delta. This page is that procedure, worked on numbers you can re-run.

Pick the test that matches your eval design

Which significance test you run is fixed by the shape of your data: whether the two runs scored the same items, and whether each item is scored pass/fail or on a continuous scale. Choose the wrong one and the p-value answers a question you did not ask. The table below is the decision surface, and the rest of the page works its top row end to end.

| Test | Data shape it fits | What it answers | When to reach for it |
| --- | --- | --- | --- |
| McNemar’s test (exact binomial or chi-squared) | Paired pass/fail on the same items, two runs or two systems | Is the pass-rate delta bigger than the disagreements alone would produce by chance? | Default for comparing two runs on one frozen suite; use the exact binomial when the discordant count is small |
| Two-proportion z-test | Two independent samples (different items, or unshared runs) | Do two separate pass rates differ? | Only when you cannot score the same items twice; it cannot cancel shared item difficulty, so it is weaker |
| Paired t-test or paired bootstrap | Paired continuous scores (rubric 0 to 1, similarity, latency) on the same items | Is the mean score difference nonzero? | When the metric is not binary; bootstrap when the score distribution is skewed |
| Cochran’s Q, then pairwise McNemar | Paired binary outcomes across 3+ runs or systems | Does any configuration differ from the others? | Comparing more than two prompts or models; follow a significant Q with corrected pairwise tests |
| Bootstrap on the delta | Any paired metric with an awkward closed form (compound scores, ratios) | A distribution-free interval on the delta | When you want a CI without leaning on a normal approximation |

The row that fits an ordinary eval regression is the first one. You froze a suite, ran two prompts or two model versions against the identical items, and scored each item pass or fail. That is paired binary data, and McNemar’s test is the standard tool for it (McNemar, 1947): it discards every item both runs agreed on and rides entirely on the discordant pairs, the items one run passed and the other failed.

Work the numbers on one suite

Take a frozen suite of 200 items, a baseline run, and a candidate run, each item scored pass or fail. Cross-tabulate the two runs against each other and every item lands in one of four cells: both pass, both fail, or one of the two mixed outcomes that carry the whole test.

| Candidate run | Baseline passed | Baseline failed |
| --- | --- | --- |
| Candidate passed | 157 (both pass) | 17 (candidate gained) |
| Candidate failed | 3 (candidate lost) | 23 (both fail) |

The margins give the headline: the baseline passed 160 of 200 (0.80), the candidate passed 174 (0.87), a delta of +7 points. But 180 of the 200 items agreed in both runs. The comparison lives on the 20 that disagreed: 17 items the candidate flipped from fail to pass, and 3 it flipped the other way, a net of 14 items.
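The cross-tabulation can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function name `cross_tab` and the synthetic per-item lists are assumptions, arranged only so that they reproduce the worked counts.

```python
from collections import Counter

def cross_tab(baseline, candidate):
    """Count the four paired outcomes for two pass/fail runs over the same items."""
    if len(baseline) != len(candidate):
        raise ValueError("both runs must score the same items in the same order")
    # Key order is (candidate_passed, baseline_passed).
    cells = Counter(zip(candidate, baseline))
    return {
        "both_pass": cells[(True, True)],
        "candidate_gained": cells[(True, False)],
        "candidate_lost": cells[(False, True)],
        "both_fail": cells[(False, False)],
    }

# Synthetic per-item results (hypothetical), laid out to match the worked table.
baseline = [True] * 157 + [False] * 17 + [True] * 3 + [False] * 23
candidate = [True] * 157 + [True] * 17 + [False] * 3 + [False] * 23

table = cross_tab(baseline, candidate)
n = len(baseline)
baseline_rate = sum(baseline) / n
candidate_rate = sum(candidate) / n
discordant = table["candidate_gained"] + table["candidate_lost"]

print(table)
print(f"baseline {baseline_rate:.2f}, candidate {candidate_rate:.2f}, discordant {discordant}")
```

Keeping the two runs as aligned per-item lists, rather than as two pass rates, is what makes the paired analysis possible at all.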

McNemar’s statistic reads only those two discordant cells. Write b for the 17 items the candidate gained and c for the 3 it lost. The chi-squared form is (b − c)² / (b + c) = (17 − 3)² / 20 = 9.8, which on one degree of freedom gives a two-sided p of about 0.002. With only 20 discordant pairs, b + c is under 25, so the chi-squared approximation is shaky and the exact binomial test is the one to trust here; mlxtend’s evaluation docs make the rule explicit, recommending the exact version “for sample sizes < 25 since chi-squared is not well-approximated by the chi-squared distribution” (mlxtend evaluate documentation, as of 2026-07). The exact test asks whether a split as lopsided as 3 of 20 discordant flips going the wrong way is surprising under a fair coin, and returns a two-sided p ≈ 0.003.
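Both p-values can be reproduced with the standard library alone. The sketch below assumes the uncorrected chi-squared form used in the text (no continuity correction) and a two-sided exact test that doubles the smaller binomial tail; the function names are illustrative, not any library's API.

```python
from math import comb, erfc, sqrt

def mcnemar_chi2(b, c):
    """Uncorrected McNemar chi-squared statistic and its two-sided p on 1 df."""
    stat = (b - c) ** 2 / (b + c)
    # For 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
    return stat, erfc(sqrt(stat / 2))

def mcnemar_exact(b, c):
    """Two-sided exact binomial p: the discordant flips against a fair coin."""
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

gained, lost = 17, 3  # discordant cells from the worked table
stat, p_chi2 = mcnemar_chi2(gained, lost)
p_exact = mcnemar_exact(gained, lost)

print(f"chi-squared {stat:.1f}, p {p_chi2:.3f}")  # chi-squared 9.8, p 0.002
print(f"exact p {p_exact:.4f}")                   # exact p 0.0026
```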

The p-value says the delta is unlikely to be noise. The interval says how big it is. The 95 percent Wald confidence interval on the pass-rate difference is +7 points give or take about 4.3, or [+2.7, +11.3] points, and it clears zero, so the gain survives. That width rests on a large-sample normal approximation. In the same under-25 discordant regime that ruled out the chi-squared test, a Wald interval undercovers, so the interval consistent with the exact test is a score-based one for paired proportions (Newcombe’s method, 1998), and that is the version to quote once the discordant count is this small.
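The Wald figure can be reproduced by hand or with a short sketch like the one below. It covers only the large-sample Wald interval for a paired difference in proportions; it does not implement Newcombe's score interval, and the function name is an assumption.

```python
from math import sqrt

def paired_wald_ci(b, c, n, z=1.96):
    """Wald interval on the paired pass-rate delta (b - c) / n.

    b and c are the two discordant counts; n is the total number of items.
    """
    delta = (b - c) / n
    se = sqrt(b + c - (b - c) ** 2 / n) / n
    return delta, delta - z * se, delta + z * se

delta, lo, hi = paired_wald_ci(17, 3, 200)
print(f"delta {delta * 100:+.1f} pt, 95% CI [{lo * 100:+.1f}, {hi * 100:+.1f}] pt")
# delta +7.0 pt, 95% CI [+2.7, +11.3] pt
```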

The unpaired test reverses that verdict on the identical data. Treat the two runs as independent samples and run a two-proportion z-test, the comparison NIST’s handbook frames for large samples with “the normal approximation to the binomial to develop a test similar to testing whether two normal means are equal” (NIST/SEMATECH e-Handbook 7.3.3, as of 2026-07). That test returns z ≈ 1.89 and p ≈ 0.06, and its interval on the delta runs from −0.2 to +14.2 points, swallowing zero.
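For comparison, the unpaired figures follow from the usual textbook formulas. This sketch assumes a pooled standard error for the z statistic and an unpooled one for the interval, which is the combination that reproduces the numbers quoted here; the function name is illustrative.

```python
from math import erfc, sqrt

def two_proportion_z(x1, n1, x2, n2, z_crit=1.96):
    """Unpaired two-proportion z-test plus a Wald interval on p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_pooled
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail
    se_unpooled = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return z, p_value, diff - z_crit * se_unpooled, diff + z_crit * se_unpooled

# Candidate passed 174 of 200, baseline passed 160 of 200.
z, p_unpaired, lo_unpaired, hi_unpaired = two_proportion_z(174, 200, 160, 200)
print(f"z {z:.2f}, p {p_unpaired:.2f}")  # z 1.89, p 0.06
print(f"95% CI [{lo_unpaired * 100:+.1f}, {hi_unpaired * 100:+.1f}] pt")  # [-0.2, +14.2] pt
```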

| Analysis | What it uses | Two-sided p | 95% CI on the +7 pt delta | Verdict at 0.05 |
| --- | --- | --- | --- | --- |
| Unpaired two-proportion z-test | Both runs as independent samples | ≈ 0.06 | [−0.2, +14.2] pt | Not significant; the interval includes zero |
| Paired McNemar (exact p; Wald CI) | Only the 20 discordant items | ≈ 0.003 | [+2.7, +11.3] pt | Significant; the delta is bounded above zero |

Same 200 items, same +7 points, opposite conclusions. Every figure in this section is arithmetic on the counts in the table above, computed from stated assumptions and not measured on any system.

Why pairing flips a borderline result into a clear one

The two tests disagree because they are estimating different variances. The unpaired test treats each run as a fresh draw and so carries the full spread of item difficulty in both samples: easy items and hard items, lumped into one noisy pass rate per run. Most of that spread is common to both runs, since a hard item tends to be hard for both, and the unpaired test has no way to subtract it.

Pairing subtracts it by construction. An item both runs pass, or both fail, tells you nothing about which run is better, so McNemar throws it out and the shared difficulty leaves with it. What remains is the signal: among items where the two runs disagreed, did the candidate win more often than a coin would predict? Seen that way, the 90 percent of items on which the runs agreed is not wasted data; it is the shared variation that pairing let you cancel.
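The cancellation can be checked on the worked counts. The sketch below treats each item as a pair of correlated pass/fail outcomes and shows that subtracting twice their covariance turns the unpaired standard error into the paired Wald one; the variable names are assumptions made for illustration.

```python
from math import sqrt

n = 200
both_pass, gained, lost, both_fail = 157, 17, 3, 23

p_cand = (both_pass + gained) / n  # 0.87
p_base = (both_pass + lost) / n    # 0.80
p_both = both_pass / n             # share of items both runs passed

var_cand = p_cand * (1 - p_cand)
var_base = p_base * (1 - p_base)
# Per-item covariance between the two runs' pass indicators.
cov = p_both - p_cand * p_base

se_unpaired = sqrt((var_cand + var_base) / n)          # ignores the covariance
se_paired = sqrt((var_cand + var_base - 2 * cov) / n)  # subtracts shared difficulty

print(f"covariance {cov:.3f}")
print(f"SE unpaired {se_unpaired:.4f}, SE paired {se_paired:.4f}")
```

On these counts the paired standard error is roughly 0.6 of the unpaired one, which is the whole gap between the two verdicts.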

This is the same lever that makes a paired A/B test efficient in any domain: hold the nuisance variation fixed and measure only the effect. It is also why the number of runs a suite needs to reach a target power drops sharply once you pair the comparison against a shared, frozen input set rather than resampling independently. The interval you end on is the paired analogue of putting a confidence interval on a single pass rate: the point estimate is the delta, and the width admits that both runs are stochastic.

What to report, and where the p-value goes quiet

Report the interval, and the p-value beside it. A p of 0.003 says the delta is unlikely to be noise; whether it is worth shipping is a separate question. Significance and effect size answer different things, and a large enough suite will flag a half-point gain as significant while the interval shows it is trivially small. The decision-grade number is Newcombe’s score interval on the paired delta, approximated by the hand-reproducible [+2.7, +11.3] point Wald figure; both clear zero, so the conclusion holds, and that interval names the floor and the ceiling of what you actually bought.
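The gap between significance and effect size is easy to see with invented numbers. The counts below are hypothetical, not taken from any run: a 100,000-item suite where the candidate gains 1,500 items and loses 1,000, for a half-point delta.

```python
from math import erfc, sqrt

# Hypothetical large suite (illustrative counts only).
n, gained, lost = 100_000, 1_500, 1_000

# Uncorrected McNemar chi-squared and its two-sided p on 1 df.
stat = (gained - lost) ** 2 / (gained + lost)
p_value = erfc(sqrt(stat / 2))

# Wald interval on the paired delta.
delta = (gained - lost) / n
se = sqrt(gained + lost - (gained - lost) ** 2 / n) / n
lo, hi = delta - 1.96 * se, delta + 1.96 * se

print(f"chi-squared {stat:.0f}, p {p_value:.1e}")
print(f"delta {delta * 100:.2f} pt, 95% CI [{lo * 100:.2f}, {hi * 100:.2f}] pt")
```

The p-value is vanishingly small, yet the interval says the gain is between about 0.4 and 0.6 points, which is the number a shipping decision should be made on.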

Two upstream steps keep the test honest. If a judge model scored the items, its errors bias each run’s pass rate before you ever difference them, so bias-correct the judge before you compare its scores; a shared judge bias can cancel in the delta, but an unstable one inflates the discordant counts and muddies the test. And the whole procedure assumes the two runs are otherwise comparable, which holds only when the suite is reproducible across seeds and environments rather than drifting under you between the baseline and candidate runs. The general habit of never quoting a rate without its interval is worked through in the note on reporting a reliability number with its interval.

When McNemar is the wrong call

The test earns its answer only inside its assumptions, and three boundaries decide whether it applies at all.

The first is the design. McNemar needs the same items scored twice; if the two runs saw different samples, the pairing is fictional and the unpaired two-proportion test is the honest choice, at the cost of the power you just saw pairing recover. The second is the metric. A pass/fail label is what McNemar consumes; a continuous score, a rubric average, or a latency wants a paired t-test or a paired bootstrap on the mean difference instead. The third is the count of systems. Comparing three or more prompts pairwise inflates the false-positive rate, so screen with Cochran’s Q first and correct the follow-up comparisons.
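For the continuous-metric case, a paired bootstrap on the mean difference might look like the sketch below. It is a plain percentile bootstrap under stated assumptions: the rubric scores are invented, and the resample count, seed, and function name are illustrative choices.

```python
import random
from statistics import mean

def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval on the mean per-item score difference."""
    if len(baseline) != len(candidate):
        raise ValueError("both runs must score the same items in the same order")
    # Resample per-item differences so the pairing is preserved.
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi

# Hypothetical rubric scores on a 0-to-1 scale for ten shared items.
baseline_scores = [0.62, 0.70, 0.55, 0.81, 0.48, 0.66, 0.73, 0.59, 0.77, 0.64]
candidate_scores = [0.68, 0.74, 0.57, 0.85, 0.55, 0.69, 0.80, 0.60, 0.79, 0.71]

estimate, lo, hi = paired_bootstrap_ci(baseline_scores, candidate_scores)
print(f"mean difference {estimate:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Resampling the differences, rather than each run separately, is what keeps the comparison paired.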

One caveat rides along even when the design is right. McNemar treats the discordant pairs as independent draws in which, under the null, a fail-to-pass flip and a pass-to-fail flip are equally likely. Correlated items break that independence, most commonly when your suite has near-duplicate cases that move together, which lets a handful of correlated flips masquerade as many independent ones. Cluster the suite and test at the cluster level when that is a risk. The paired-leaderboard audit above found exactly this kind of inflation: its unresolved-pair count rose once real subject-level clustering was accounted for (arXiv:2605.30315, as of 2026-07).
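One simple way to test at the cluster level is to collapse each cluster to its net direction and run the same exact binomial on clusters instead of items. The sketch below is one such option under stated assumptions, not the audit's method: the cluster ids and the data are hypothetical, and clusters with no net change are dropped.

```python
from collections import defaultdict
from math import comb

def cluster_sign_test(items):
    """Exact two-sided sign test on clusters.

    items: iterable of (cluster_id, baseline_passed, candidate_passed).
    Returns (clusters_won, clusters_lost, p_value).
    """
    net = defaultdict(int)
    for cluster_id, base_ok, cand_ok in items:
        net[cluster_id] += int(cand_ok) - int(base_ok)
    wins = sum(1 for v in net.values() if v > 0)
    losses = sum(1 for v in net.values() if v < 0)
    n = wins + losses
    if n == 0:
        return wins, losses, 1.0
    tail = sum(comb(n, i) for i in range(min(wins, losses) + 1)) / 2 ** n
    return wins, losses, min(1.0, 2 * tail)

# Hypothetical suite: five near-duplicates in cluster "a" all flip to pass together.
items = (
    [("a", False, True)] * 5
    + [("b", False, True)]
    + [("c", True, False)]
    + [("d", True, True)]
)
wins, losses, p_cluster = cluster_sign_test(items)
print(wins, losses, p_cluster)  # 2 1 1.0
```

At the item level these data look like six gains against one loss; at the cluster level they are two against one, and the apparent evidence largely disappears.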

The decision rule

Before you write a delta into a release gate, run the test its design demands. For two runs over the same frozen suite scored pass or fail, that is McNemar: read the discordant pairs, take the exact binomial p when they number under 25, and report the confidence interval on the pass-rate difference next to it. Ship on the interval rather than the point estimate, and treat a significant but tiny delta as what it is, a real effect too small to matter.
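As a sketch, the rule can be written as a small gate that takes the p-value and the interval already computed. The `min_effect_pts` threshold is a hypothetical parameter standing in for whatever your team treats as the smallest delta worth shipping; nothing in the procedure fixes its value.

```python
def release_verdict(p_value, ci_low_pts, ci_high_pts, alpha=0.05, min_effect_pts=1.0):
    """Turn a paired test result into a gate decision.

    min_effect_pts is an assumed, team-chosen threshold in percentage points.
    """
    if p_value >= alpha or ci_low_pts <= 0:
        return "hold: delta not distinguishable from noise"
    if ci_high_pts < min_effect_pts:
        return "hold: real effect, too small to matter"
    if ci_low_pts >= min_effect_pts:
        return "ship: whole interval clears the minimum effect"
    return "judgment call: significant, but the interval straddles the minimum effect"

print(release_verdict(0.003, 2.7, 11.3))  # paired result from the worked example
print(release_verdict(0.06, -0.2, 14.2))  # unpaired result on the same data
```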

The reliability profiler this site is building toward is designed to run this paired comparison across a frozen suite and report the delta with its interval against field norms. That capability is design intent; pre-launch, no such comparison has been measured, so no significance figure here describes it, and every number above is arithmetic you can reproduce from the stated counts. The reliability vocabulary defines each metric the test consumes, and the research program posts each measured delta beside the test and interval that produced it.