Eval reproducibility

Eval reproducibility is getting the same result from an evaluation re-run on the same data and the same parameters; it breaks when uncontrolled non-determinism such as sampling temperature, an unpinned seed, or a drifting judge model moves the score while the declared inputs stay fixed.

The term carries the computational sense fixed by the National Academies: “obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis” (Reproducibility and Replicability in Science, National Academies, 2019). For an LLM eval, that means the score should not move when nothing you controlled changed.
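As a minimal sketch of that expectation, the check below runs the same evaluation several times on identical data and parameters and reports whether the score moved. The names rerun_is_stable, run_eval, and toy_eval are hypothetical, introduced only for illustration; a real harness would call the model where the stand-in scorer is.

```python
from typing import Callable, Sequence


def rerun_is_stable(
    run_eval: Callable[[Sequence[str], dict], float],
    data: Sequence[str],
    params: dict,
    runs: int = 2,
    tol: float = 0.0,
) -> bool:
    """Return True if repeated runs on the same data and params agree within tol."""
    scores = [run_eval(data, params) for _ in range(runs)]
    return max(scores) - min(scores) <= tol


# Hypothetical stand-in for a real eval: a deterministic exact-match scorer.
def toy_eval(data: Sequence[str], params: dict) -> float:
    expected = params["expected"]
    hits = sum(1 for item, want in zip(data, expected) if item == want)
    return hits / len(data)


stable = rerun_is_stable(toy_eval, ["a", "b", "c"], {"expected": ["a", "b", "x"]})
```

A tolerance of zero demands identical scores; loosening tol is a design choice that trades strictness for robustness to small numerical noise.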

Three sources move it anyway. Sampling temperature above zero makes generation stochastic; an unpinned random seed reshuffles the draws; and a judge model that is quietly reversioned between runs scores the same output differently, a failure known as judge drift. Pinning temperature and seed narrows the first two, yet batched floating-point execution is not guaranteed to be bit-identical from run to run, so even a frozen pipeline can still flip a borderline case.
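One way to make the declared inputs checkable is to record them and hash the record, so that a changed judge version shows up before the scores are compared. The sketch below assumes a hypothetical EvalConfig with made-up field values; it is not any particular framework's API. The last lines illustrate why pinning is not sufficient on its own: floating-point addition is not associative, so a different summation order can change the low bits of a result.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalConfig:
    # All field names and values here are illustrative assumptions.
    temperature: float
    seed: int
    judge_model: str  # a full version identifier, not a floating alias


def fingerprint(config: EvalConfig) -> str:
    """Hash the declared inputs so two runs can be compared before their scores are."""
    payload = json.dumps(asdict(config), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


run_a = EvalConfig(temperature=0.0, seed=1234, judge_model="judge-2024-01-01")
run_b = EvalConfig(temperature=0.0, seed=1234, judge_model="judge-2024-06-01")

# False: the judge version differs, so the two runs are not comparable.
same_declared_inputs = fingerprint(run_a) == fingerprint(run_b)

# Floating-point addition is not associative: regrouping the same three
# numbers changes the result in the last bit on IEEE 754 doubles.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
order_matters = left != right
```

Storing the fingerprint next to each score makes judge drift visible as a mismatch in declared inputs instead of an unexplained change in results.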

Reproducibility is the precondition for every other eval-rigor claim you make.

It is distinct from construct validity, which asks whether the eval measures the capability it claims to, and from eval confidence intervals, which quantify the run-to-run variance that weak reproducibility produces. A score that shifts on identical inputs is exactly the variation an interval reports; settling how many runs it takes to trust the result is the work of measuring agent reliability.
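To make the relationship to intervals concrete, here is a rough sketch that summarizes repeated runs on identical inputs as a mean and a normal-approximation interval. The function name run_to_run_interval and the scores are made up for illustration, and the normal approximation is an assumption that is crude for a handful of runs.

```python
import math
import statistics


def run_to_run_interval(scores, z=1.96):
    """Mean and a rough normal-approximation interval over repeated run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


# Made-up scores from five hypothetical re-runs on identical inputs.
scores = [0.81, 0.79, 0.80, 0.82, 0.78]
mean, low, high = run_to_run_interval(scores)

# A perfectly reproducible eval collapses the interval to a single point.
_, fixed_low, fixed_high = run_to_run_interval([0.80, 0.80, 0.80])
```

The wider the interval on fixed inputs, the weaker the reproducibility; when every run returns the same score, the interval has zero width.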