Eval rigor

Bias-correct your LLM-as-a-judge eval before reporting it

An LLM judge is an imperfect classifier, so its raw pass rate is biased. Correct it with the judge's sensitivity and specificity, then report a calibration-aware confidence interval.

Reporting an LLM judge’s pass rate honestly takes two moves: bias-correct the raw rate with the judge’s measured sensitivity and specificity, then report a confidence interval that carries both the test-set and the calibration-set uncertainty. A judge is an imperfect classifier, so its raw pass rate is the judge’s verdict, not the model’s accuracy, and the two differ by a bias you can measure and remove. Whether that pass rate is headed for a model card, a release gate, or a slide, this page walks through the procedure for reporting it end to end.

The reporting procedure, worked end to end

Reporting an LLM-judge score correctly is a five-step recipe, and every step produces a number you can compute from data you already have or can cheaply collect. The table below sequences the full procedure: each step, the quantity it produces, a worked value on deliberately round inputs, and what breaks if you skip it. We call the statistical recipe plug-in bias correction (distinct from the per-bias judge-bias correction protocol) and the interval it ends on an eval-CI, so the correction and its uncertainty keep separate names. The statistics are standard: a bias-adjusted estimator long used in epidemiology (the Rogan-Gladen correction), applied to LLM judges and given a calibration-aware interval by Lee and colleagues (How to Correctly Report LLM-as-a-Judge Evaluations, ICML 2026, as of 2026-07).

| Step | Quantity it produces | Worked schematic value | Judgment: what breaks if you skip it |
| --- | --- | --- | --- |
| 1. Calibrate the judge | Sensitivity q̂₁ = P(judge says pass given truly pass) and specificity q̂₀ = P(judge says fail given truly fail), from a human-labeled set | q̂₁ = 0.90, q̂₀ = 0.85 | You assume a perfect judge, and every number below inherits the judge’s error as invisible bias |
| 2. Read the naive rate | p̂ = judge-passes / test items | p̂ = 700 / 1,000 = 0.70 | Where most reports stop; a biased estimate of true accuracy whenever q̂₀ + q̂₁ < 2 |
| 3. Bias-correct | θ̂ = (p̂ + q̂₀ − 1) / (q̂₀ + q̂₁ − 1) | θ̂ = 0.55 / 0.75 ≈ 0.73 | Recovers the true rate; needs q̂₀ + q̂₁ > 1 (a judge better than a coin) or the denominator collapses |
| 4. Build the eval-CI | Propagate test-set and calibration variance through the estimator (delta method) | SE ≈ 0.037; 95% CI ≈ [0.66, 0.81] | An interval from the test set alone (±2.8 pp) is far too tight and centered on the wrong number |
| 5. Allocate calibration labels | Split the human-label budget by each class’s weighted variance, adaptively | ≈ 2.3× more labels on the pass class here | An even split leaves the dominant variance term wide and wastes labeled data |

Every worked number on this page is arithmetic on the schematic inputs in the table above, not a measurement of any system. Those inputs are round on purpose so you can re-run the numbers yourself.
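Step 1 of the table can be sketched in a few lines. This is a minimal illustration, assuming the calibration set arrives as (human_pass, judge_pass) boolean pairs; that data shape and the function name `calibrate_judge` are hypothetical, not something the recipe prescribes. The schematic set below is built to reproduce the table's round inputs, with 100 human labels in each class.

```python
from typing import Iterable, Tuple


def calibrate_judge(pairs: Iterable[Tuple[bool, bool]]) -> Tuple[float, float, int, int]:
    """Estimate sensitivity and specificity from (human_pass, judge_pass) pairs.

    Returns (q1_hat, q0_hat, m1, m0), where m1 and m0 are the numbers of
    human-labeled passes and fails. The pair format is an assumption of
    this sketch.
    """
    m1 = m0 = true_pos = true_neg = 0
    for human_pass, judge_pass in pairs:
        if human_pass:
            m1 += 1
            true_pos += int(judge_pass)      # judge agrees on a real pass
        else:
            m0 += 1
            true_neg += int(not judge_pass)  # judge agrees on a real fail
    if m1 == 0 or m0 == 0:
        raise ValueError("need at least one human-labeled pass and one fail")
    return true_pos / m1, true_neg / m0, m1, m0


# Schematic calibration set: 100 true passes (90 judged pass),
# 100 true fails (85 judged fail). Not a measurement of any system.
pairs = (
    [(True, True)] * 90 + [(True, False)] * 10
    + [(False, False)] * 85 + [(False, True)] * 15
)
q1_hat, q0_hat, m1, m0 = calibrate_judge(pairs)
print(f"q1_hat={q1_hat:.2f} q0_hat={q0_hat:.2f} m1={m1} m0={m0}")
```

Keeping the per-class counts alongside the rates matters later: the interval needs to know how many labels each rate was estimated from.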

Why more test data can’t rescue a biased judge

Run-to-run variance is one kind of uncertainty: re-run a suite and the score moves, which is the problem the reporting-a-rate-with-its-interval note works through for a single pass rate. A biased judge is a different problem, and it does not shrink with more test items. A judge that mislabels some real passes as fails and some real fails as passes shifts the expected pass rate away from the truth by a fixed amount set by its error rates, so a larger test set only converges more tightly onto the wrong number.

Write the judge’s true positive rate (its sensitivity) as q₁ and its true negative rate (its specificity) as q₀. If the genuinely-passing fraction of outputs is θ, then the rate at which the judge says pass is p = (q₀ + q₁ − 1)·θ + (1 − q₀), because it passes a real pass with probability q₁ and false-passes a real fail with probability 1 − q₀. Lee and colleagues formalize exactly this relationship for LLM judges and show the naive score carries positive bias at low θ and negative bias at high θ (ICML 2026, as of 2026-07).

The two sides meet at a single crossover. With q₀ = 0.85 and q₁ = 0.90, the naive rate equals the truth only at θ = 0.60, overstating accuracy below that point and understating it above (computed from the stated assumptions, not measured). Our worked p̂ = 0.70 sits in the understating region, which is why the correction raises the estimate.

This is why a judge that agrees with humans on, say, 88% of items is not a clean bill of health. Agreement and accuracy are different quantities, and a judge at that agreement level can move a reported score by several points in a direction that depends entirely on where the true rate sits.
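The relationship above is easy to check numerically. The sketch below, with hypothetical function names, evaluates the linear map from θ to the naive rate, solves for the crossover, and computes judge-human agreement for a labeled set with a given share of true passes. It uses only the schematic rates from the text.

```python
def naive_rate(theta: float, q0: float, q1: float) -> float:
    """Expected judge pass rate p when the true pass fraction is theta."""
    return (q0 + q1 - 1.0) * theta + (1.0 - q0)


def crossover(q0: float, q1: float) -> float:
    """True rate at which the naive rate happens to equal the truth (p == theta)."""
    false_pass = 1.0 - q0  # probability the judge passes a real fail
    false_fail = 1.0 - q1  # probability the judge fails a real pass
    if false_pass + false_fail == 0.0:
        raise ValueError("a perfect judge has no bias and no single crossover")
    return false_pass / (false_pass + false_fail)


def agreement(pass_fraction: float, q0: float, q1: float) -> float:
    """Judge-human agreement on a set with the given share of true passes."""
    return q1 * pass_fraction + q0 * (1.0 - pass_fraction)


Q0, Q1 = 0.85, 0.90  # schematic specificity and sensitivity from the text

print(f"crossover theta = {crossover(Q0, Q1):.2f}")
for theta in (0.20, 0.60, 0.90):
    p = naive_rate(theta, Q0, Q1)
    print(
        f"theta={theta:.2f}  naive={p:.3f}  bias={p - theta:+.3f}  "
        f"agreement={agreement(theta, Q0, Q1):.3f}"
    )
```

Under these schematic rates, agreement stays between 0.85 and 0.90 for any mix of passes and fails, while the bias runs from about +10 points at θ = 0.20 to about −7.5 points at θ = 0.90. An 88% agreement figure corresponds here to a labeled set with 60% true passes, which says nothing about which direction the reported score is off.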

The correction: invert the judge

Bias correction inverts that linear relationship. Solve p = (q₀ + q₁ − 1)·θ + (1 − q₀) for θ and you get the plug-in estimator θ̂ = (p̂ + q̂₀ − 1) / (q̂₀ + q̂₁ − 1), with the judge’s sensitivity and specificity estimated from the calibration set. On the worked inputs that is (0.70 + 0.85 − 1) / (0.85 + 0.90 − 1) = 0.55 / 0.75 ≈ 0.73.

The denominator q̂₀ + q̂₁ − 1 is the judge’s informativeness, the same quantity as Youden’s J. It has to be positive, meaning the judge beats a coin, or the inversion is undefined. It also does the damage in the next step: the smaller it is, the harder a weak judge amplifies every source of uncertainty, because everything gets divided by it.
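A minimal sketch of the plug-in estimator follows, with the positivity check on the denominator made explicit. The function name `bias_correct` is hypothetical. The loop afterwards illustrates the amplification point: an error of one point in p̂ becomes 1/J points in θ̂, using two weaker judges invented purely for illustration.

```python
def bias_correct(p_hat: float, q0_hat: float, q1_hat: float) -> float:
    """Plug-in (Rogan-Gladen) estimate of the true pass rate."""
    informativeness = q0_hat + q1_hat - 1.0  # Youden's J
    if informativeness <= 0.0:
        raise ValueError("judge is no better than a coin; the inversion is undefined")
    return (p_hat + q0_hat - 1.0) / informativeness


theta_hat = bias_correct(p_hat=0.70, q0_hat=0.85, q1_hat=0.90)
print(f"theta_hat = {theta_hat:.3f}")

# Amplification: everything is divided by J, so a weaker judge magnifies error.
# The second and third judges are illustrative values, not from the text.
for q0, q1 in ((0.85, 0.90), (0.70, 0.70), (0.55, 0.60)):
    j = q0 + q1 - 1.0
    print(f"J = {j:.2f}: one point of error in p_hat becomes {1.0 / j:.1f} points in theta_hat")
```

The sketch returns the raw value, which can fall outside [0, 1] when p̂ is below 1 − q̂₀ or above q̂₁. Whether to clip in that case is a choice this page does not specify, so the code leaves it to the caller.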

The eval-CI: an interval that admits you measured the judge

The point estimate moved. The interval has to move further, because θ̂ now depends on three estimated quantities, each carrying its own sampling error: the naive rate p̂ from the n test items, and the two calibration rates q̂₀ and q̂₁ from the human-labeled set. Propagating all three through the estimator with the delta method gives a standard error that decomposes cleanly:

Var(θ̂) ≈ [ Var(p̂) + (1 − θ)²·Var(q̂₀) + θ²·Var(q̂₁) ] / (q₀ + q₁ − 1)²

On the worked inputs, with n = 1,000 test items and 100 human labels in each calibration class, the standard error is about 0.037, so the 95% eval-CI is roughly [0.66, 0.81]. Report the test set alone, as if the judge were perfect, and you would quote 0.70 with a 95% interval of about [0.67, 0.73]. The corrected interval is centered three points higher and about two and a half times wider (computed from the stated assumptions, not measured).
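The worked interval can be reproduced with a short sketch of the delta-method formula above. It assumes each rate's sampling variance is the usual binomial rate·(1 − rate)/count, evaluated at the estimates, and a normal critical value of 1.96; those plug-in choices and the function names are assumptions of this sketch rather than details the text spells out.

```python
import math


def binomial_var(rate: float, count: int) -> float:
    """Plug-in sampling variance of a proportion estimated from `count` items."""
    return rate * (1.0 - rate) / count


def eval_ci(p_hat, n, q0_hat, m0, q1_hat, m1, z=1.96):
    """Return (theta_hat, se, lower, upper) for the bias-corrected rate."""
    j = q0_hat + q1_hat - 1.0
    if j <= 0.0:
        raise ValueError("judge is no better than a coin")
    theta_hat = (p_hat + q0_hat - 1.0) / j
    var_theta = (
        binomial_var(p_hat, n)                                # test-set term
        + (1.0 - theta_hat) ** 2 * binomial_var(q0_hat, m0)   # specificity term
        + theta_hat ** 2 * binomial_var(q1_hat, m1)           # sensitivity term
    ) / j ** 2
    se = math.sqrt(var_theta)
    return theta_hat, se, theta_hat - z * se, theta_hat + z * se


# Schematic inputs from the text: 1,000 test items, 100 labels per class.
theta_hat, se, lo, hi = eval_ci(0.70, 1000, 0.85, 100, 0.90, 100)
print(f"corrected: {theta_hat:.2f}, SE {se:.3f}, 95% CI [{lo:.2f}, {hi:.2f}]")

# The naive interval treats the judge as perfect: test-set term only.
naive_half = 1.96 * math.sqrt(binomial_var(0.70, 1000))
print(f"naive:     0.70, 95% CI [{0.70 - naive_half:.2f}, {0.70 + naive_half:.2f}]")
```

This plain normal interval is not clipped to [0, 1] and can misbehave when θ̂ is near a boundary, so treat it as a way to re-run the schematic numbers rather than a production implementation.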

| Reporting property | Naive judge pass rate | Plug-in bias-corrected eval-CI | Which to trust |
| --- | --- | --- | --- |
| Point estimate | 0.70 | 0.73 | The corrected one; the gap is the judge’s bias, which more test data will not remove |
| What the interval covers | test-set sampling only | test-set sampling + estimated sensitivity + estimated specificity | Corrected; a real report carries all three sources |
| 95% interval half-width | ±2.8 pp | ±7.3 pp | The wider one is honest; the tight one is false precision |
| Assumption about the judge | judge is perfect | judge measured on a calibration set | Corrected; no LLM judge is perfect |
| Under test/calibration shift | biased | unbiased (Lee et al., ICML 2026) | Corrected; the calibration mix need not match the test mix |

The last row is the plug-in’s structural advantage. Sensitivity and specificity are defined conditional on the true label, so they do not depend on how many passes happen to be in a given set, and the correction stays unbiased even when the calibration set’s mix of passes and fails differs from the test set’s (Lee et al., ICML 2026, as of 2026-07). A raw agreement number never survives that shift.

The base mechanics of putting an interval on a single proportion are the same ones the companion note covers, including why a Wilson interval beats the textbook normal one near 0 or 1. NIST makes that boundary point directly: a method that “produces a lower limit which is an impossible value for the parameter for which the interval is constructed is an inferior approach” (NIST/SEMATECH e-Handbook 7.2.4.1, as of 2026-07). The eval-CI adds the two calibration terms on top of that base.
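The boundary problem is easy to see on a small example. The sketch below compares the textbook normal (Wald) interval with the Wilson score interval for a single proportion, using the standard formulas for both. The counts in the example, 1 pass out of 20, are made up for illustration.

```python
import math


def wald_interval(k: int, n: int, z: float = 1.96):
    """Textbook normal-approximation interval for a proportion k/n."""
    p = k / n
    half = z * math.sqrt(p * (1.0 - p) / n)
    return p - half, p + half


def wilson_interval(k: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion k/n."""
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


# Illustrative counts near the boundary: 1 pass in 20 items.
wald_lo, wald_hi = wald_interval(1, 20)
wilson_lo, wilson_hi = wilson_interval(1, 20)
print(f"Wald:   [{wald_lo:+.3f}, {wald_hi:.3f}]")
print(f"Wilson: [{wilson_lo:+.3f}, {wilson_hi:.3f}]")
```

On these counts the Wald lower limit is negative, an impossible value for a rate, while the Wilson limits stay inside [0, 1].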

Where to spend your human labels

The eval-CI also tells you how to make itself tighter, and the answer is rarely an even split of your human labels. Look at the two calibration terms in the variance: the specificity term is weighted by (1 − θ)² and the sensitivity term by θ². When the true rate is high, θ² dwarfs (1 − θ)², so the sensitivity estimate q̂₁ dominates the width. On the worked inputs the sensitivity term is about 62% of the total variance, the specificity term about 12%, and the test-set term the remaining 27% or so, even though both classes got the same 100 labels.
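The split of the variance across its three terms can be re-derived with the sketch below. It assumes binomial plug-in variances for each rate, as in the worked standard error, and the function name `variance_shares` is hypothetical. The common factor 1/(q₀ + q₁ − 1)² cancels when taking shares, so it is left out.

```python
def variance_shares(p_hat, n, q0_hat, m0, q1_hat, m1):
    """Fraction of Var(theta_hat) contributed by each estimated quantity."""
    j = q0_hat + q1_hat - 1.0
    if j <= 0.0:
        raise ValueError("judge is no better than a coin")
    theta_hat = (p_hat + q0_hat - 1.0) / j
    terms = {
        "test set": p_hat * (1.0 - p_hat) / n,
        "specificity": (1.0 - theta_hat) ** 2 * q0_hat * (1.0 - q0_hat) / m0,
        "sensitivity": theta_hat ** 2 * q1_hat * (1.0 - q1_hat) / m1,
    }
    total = sum(terms.values())
    return {name: value / total for name, value in terms.items()}


# Schematic inputs from the text: 1,000 test items, 100 labels per class.
shares = variance_shares(0.70, 1000, 0.85, 100, 0.90, 100)
for name, share in shares.items():
    print(f"{name:12s} {share:.0%}")
```

Running the same function on a pilot's rough estimates shows which term is worth buying down first.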

So move labels toward the variance. For a fixed human budget, the total variance is minimized when the class counts are in proportion to the square root of each class’s weighted per-label variance: θ·√(q₁(1 − q₁)) for the pass class against (1 − θ)·√(q₀(1 − q₀)) for the fail class. On the worked inputs that is roughly 2.3 to 1 in favor of the pass class: about 140 pass labels and 60 fail labels out of a 200-label budget (computed from the stated assumptions, not measured).
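The allocation arithmetic can be sketched as follows, assuming binomial per-label variances and treating θ, q₀, and q₁ as known for the moment. The function names are hypothetical. The second function recomputes the standard error so the even split and the reallocated split can be compared on the same 200-label budget.

```python
import math


def allocate_labels(theta: float, q0: float, q1: float, budget: int):
    """Split a label budget in proportion to sqrt(weight * per-label variance)."""
    w_pass = theta * math.sqrt(q1 * (1.0 - q1))
    w_fail = (1.0 - theta) * math.sqrt(q0 * (1.0 - q0))
    if w_pass + w_fail == 0.0:
        raise ValueError("no calibration variance to allocate against")
    m_pass = round(budget * w_pass / (w_pass + w_fail))
    return m_pass, budget - m_pass


def corrected_se(theta, p_hat, n, q0, m_fail, q1, m_pass):
    """Delta-method standard error of the bias-corrected rate."""
    j = q0 + q1 - 1.0
    var = (
        p_hat * (1.0 - p_hat) / n
        + (1.0 - theta) ** 2 * q0 * (1.0 - q0) / m_fail
        + theta ** 2 * q1 * (1.0 - q1) / m_pass
    ) / j ** 2
    return math.sqrt(var)


theta = 0.55 / 0.75  # worked point estimate from the text
m_pass, m_fail = allocate_labels(theta, q0=0.85, q1=0.90, budget=200)
se_even = corrected_se(theta, 0.70, 1000, 0.85, 100, 0.90, 100)
se_alloc = corrected_se(theta, 0.70, 1000, 0.85, m_fail, 0.90, m_pass)
print(f"pass labels: {m_pass}, fail labels: {m_fail}")
print(f"SE with even split: {se_even:.4f}, with reallocation: {se_alloc:.4f}")
```

On the schematic inputs the reallocation moves the standard error from about 0.037 to about 0.035 (computed from the stated assumptions, not measured). The gain is modest here because the test-set term is untouched by the label split.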

Because θ is the very thing you are estimating, do this adaptively. Label a small pilot, get rough sensitivity, specificity, and θ̂, then spend the rest of the budget where the pilot says the interval is widest. Lee and colleagues use exactly this adaptive allocation to buy tighter intervals from the same number of human labels.
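One way the pilot-then-spend loop might look is sketched below. This is not the allocation algorithm from Lee and colleagues, only a two-stage illustration under stated assumptions: rates from the pilot are add-one smoothed so a tiny pilot cannot report exactly 0 or 1, θ̂ is clipped to [0, 1], and all names and the pilot counts are hypothetical.

```python
import math


def smoothed_rate(successes: int, trials: int) -> float:
    """Add-one smoothing (an assumption of this sketch) keeps a small pilot
    from reporting a rate of exactly 0 or 1, which would zero its variance."""
    return (successes + 1) / (trials + 2)


def plan_remaining_labels(pilot_tp, pilot_m1, pilot_tn, pilot_m0, p_hat, total_budget):
    """Return (extra_pass_labels, extra_fail_labels) to collect after the pilot."""
    remaining = total_budget - pilot_m1 - pilot_m0
    if remaining < 0:
        raise ValueError("pilot already exceeds the total budget")
    q1 = smoothed_rate(pilot_tp, pilot_m1)  # rough sensitivity
    q0 = smoothed_rate(pilot_tn, pilot_m0)  # rough specificity
    j = q0 + q1 - 1.0
    if j <= 0.0:
        raise ValueError("pilot suggests the judge is no better than a coin")
    theta = min(1.0, max(0.0, (p_hat + q0 - 1.0) / j))  # rough, clipped theta_hat
    w_pass = theta * math.sqrt(q1 * (1.0 - q1))
    w_fail = (1.0 - theta) * math.sqrt(q0 * (1.0 - q0))
    target_pass = round(total_budget * w_pass / (w_pass + w_fail))
    # Top the pass class up toward its target, then give the rest to fails.
    extra_pass = min(remaining, max(0, target_pass - pilot_m1))
    return extra_pass, remaining - extra_pass


# Hypothetical pilot: 20 labels per class, 18 correct passes, 17 correct fails,
# with the schematic naive rate 0.70 and a 200-label total budget.
extra_pass, extra_fail = plan_remaining_labels(18, 20, 17, 20, p_hat=0.70, total_budget=200)
print(f"collect {extra_pass} more pass labels and {extra_fail} more fail labels")
```

In practice the human label is not known before labeling, so steering labels toward the pass class needs a proxy such as sampling from items the judge passed. The text does not say how that is done, so the sketch stops at the target counts.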

This is the calibration-side analogue of sizing your run count to the decision, which the reliability-testing pillar covers for the test set. One budget buys test items, the other buys human labels, and both should be spent against the term that actually widens your interval. A corrected per-task rate is also the right input to any suite-level aggregate: a metric like reliability@k treats each run’s pass or fail as ground truth, so feeding it a judge’s uncorrected labels lets the same bias into the aggregate one task at a time. Correction belongs upstream of the aggregation.

What the plug-in does not fix

Two limits keep the recipe in its lane. The Rogan-Gladen correction and its interval tend to overcover when the calibration set is small, so on a handful of human labels the eval-CI runs conservative. Chen and colleagues document this and propose estimators that recover the lost efficiency (Efficient Inference for Noisy LLM-as-a-Judge Evaluation, arXiv:2601.05420, preprint, not yet peer-reviewed, as of 2026-07). Those newer prediction-powered estimators report materially narrower intervals than the plain plug-in on the same data, so treat the recipe here as a conservative baseline that a stronger estimator improves on.

The correction also only removes the bias you can measure. It assumes the calibration labels are themselves correct and that sensitivity and specificity are stable across the population you are scoring. And one failure the math cannot touch is a judge grading output from its own model family, where its errors skew systematically kind toward output that looks like its own. Keep the judge independent of the system under test, for the same reason a verifier that produced an answer cannot be trusted to grade it, worked through in the analysis of why verification failures escape a multi-agent system.

The reporting rule

The rule is short enough to adopt on your next eval. Do not report a judge’s raw pass rate as the model’s accuracy: correct it with the judge’s measured sensitivity and specificity, attach an eval-CI that carries the calibration uncertainty, and put your next human label on the class that interval says is widest. Without the correction and the interval, a judge’s pass rate describes the classifier, and the release gate reads it as the model’s accuracy.

The reliability profiler this site points toward is designed to run this correction and report the interval against field norms; pre-launch, that is design intent, and no corrected number has been measured, so the page puts forward none. The vocabulary behind these metrics and the research program are where this goes next: the first pins the terms, and the second carries that correction from schematic to a measured rate.