Eval rigor

LLM evals: which methods to trust and where they lie

LLM evals report whether a model passed. Whether that score is valid is a separate question. This hub maps the eval methods and the four ways an eval number lies, each routed to its fix.

By LatentEval | Published 2026-04-01 | Updated 2026-05-25

A green eval number certifies exactly one thing: on this run, against this benchmark, graded by this method, the model passed. It does not certify that the number is valid or that it would survive a re-run, and those two properties are what a release actually rests on. An eval number lies in four measurable ways, and every one of them leaves the score reading green: a biased judge scoring a proxy, variance across seeds, calibration drift as the evaluator shifts underneath you, and a construct-invalid benchmark. This hub sorts the four, treats your choice of eval method as the decision that sets which of them you inherit, and routes each failure to the page that works it through in full. Read it as the orientation layer over the cluster; every failure mode below has a deeper page waiting.

The four ways an eval number lies

Every eval failure that matters reduces to one of four modes, and not one of them shows up as a red score. Each corrupts the number while the number still reads green, which is why a passing eval and a valid eval are separate claims you have to check separately. The table is this hub’s working reference. For each mode it gives the property it quietly breaks, the tell you can already see in the numbers you have, and the verdict: whether more runs fix it and which page owns the correction.

Eval failure mode	What it actually breaks	The tell in numbers you already have	Judgment: fixable by more runs, and who owns the fix
False positives from a biased judge	Construct validity of the score: the judge rewards a proxy (answer order, length, or its own style) that tracks quality on easy pairs and diverges from it on the close ones	The verdict flips when you swap answer order, match length, or mask which model wrote which; aggregate agreement stays high while the decisive pairs move	A systematic error, so more runs only estimate it more precisely. Correct it before you trust the ranking, using the judge-bias map
Variance across seeds	Reliability of the score: the same suite returns a different number on a re-run, so one pass rate reports where a single seed happened to land while hiding the spread across seeds	Re-run under a new seed, prompt order, or temperature and the headline shifts; two close systems reorder	A noise error, so repetition plus a confidence interval both sizes and shrinks it. Owned by measuring reliability past a single pass rate
Calibration drift	Temporal validity: the evaluator changes underneath you, so a threshold that certified one thing last quarter certifies something else now	A judge model updated under the same name moves its swap-consistency and length slope; a saturating benchmark stops separating systems	A maintenance task, so re-calibrate on a schedule and re-report through a human-labeled calibration set when the evaluator changes
Construct-invalid benchmarks	Construct validity of the whole test: the benchmark scores a proxy for reliability that saturates while the real failures persist	The leaderboard rank does not transfer to your task; the metric climbs while production incidents hold steady	A benchmark-choice problem; statistics cannot rescue it. Validate the benchmark against your decision, which is the beyond-eval move covered in trajectory-level agent evaluation

Read the judgment column as a router. It tells you whether the failure is the kind more runs can shrink and where to go to remove it. The split it turns on, noise against validity, is the organizing idea of the whole cluster, so it is worth making explicit before the methods.
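The first tell in the table, a verdict that flips when answer order is swapped, can be checked mechanically. The sketch below is a minimal illustration under stated assumptions: the `Judge` signature, the `"A"`/`"B"` return convention, and the two toy judges are all hypothetical and stand in for whatever pairwise grader you actually run.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature (an assumption for this sketch):
# takes (prompt, answer_a, answer_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def swap_consistency(judge: Judge, pairs: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of pairs whose winning answer stays the same after swapping order."""
    if not pairs:
        raise ValueError("need at least one pair")
    consistent = 0
    for prompt, first, second in pairs:
        original = judge(prompt, first, second)  # "A" means `first` won
        swapped = judge(prompt, second, first)   # "A" now means `second` won
        # The same underlying answer wins only if the label changes with the swap.
        if original != swapped:
            consistent += 1
    return consistent / len(pairs)


# Toy judges, for illustration only.
def always_first(prompt: str, a: str, b: str) -> str:
    """Pure position bias: whatever is shown first wins."""
    return "A"


def longer_wins(prompt: str, a: str, b: str) -> str:
    """Pure length bias: the longer answer wins regardless of position."""
    return "A" if len(a) >= len(b) else "B"
```

Note what the toy judges show: `always_first` scores zero on this check, while `longer_wins` scores perfectly on pairs of unequal length even though it is biased. Swap-consistency detects position bias only; length and self-preference need their own tests (matching length, masking authorship).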

Validity or reliability: the axis that decides your response

Group the four modes by what they break and they fall into two families, and the family sets your move.

Reliability failures are noise. Variance across seeds moves the number without a consistent direction, so repeating the eval and reporting a confidence interval both sizes the wobble and shrinks it with more runs. That statistical care is what turns a raw score into a measurement you can defend. Evaluations are experiments, and the literature on running experiments applies to them directly (Miller, Adding Error Bars to Evals, arXiv:2411.00640, preprint, as of 2026-07).
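As a minimal sketch of what "repeat and report an interval" can look like, the function below takes one pass rate per seed and returns the mean with a normal-approximation 95% interval. The function name and the sample rates in the usage note are hypothetical, and the `z=1.96` default is an assumption that is loose for a handful of seeds, where a t-based multiplier would be wider.

```python
import math
import statistics
from typing import Sequence, Tuple


def seed_interval(pass_rates: Sequence[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Return (mean, low, high) for per-seed pass rates.

    Uses the standard error of the mean with a normal approximation.
    With few seeds this understates the width; treat it as a floor.
    """
    n = len(pass_rates)
    if n < 2:
        raise ValueError("need at least two seeds to estimate spread")
    mean = statistics.mean(pass_rates)
    sem = statistics.stdev(pass_rates) / math.sqrt(n)  # sample std dev / sqrt(n)
    return mean, mean - z * sem, mean + z * sem
```

For example, `seed_interval([0.70, 0.74, 0.66, 0.72, 0.68])` centers on 0.70, and the interval narrows as seeds are added because the standard error falls with the square root of the number of runs.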

Validity failures are systematic. A biased judge or a construct-invalid benchmark moves the number in one consistent direction, so more runs estimate that error more precisely and never remove it. The correction is structural, a debiasing step or a better-chosen benchmark, applied before you trust the score at all. Here is the distinction in one line: the score measures whether the model passed; validity asks whether passing meant anything.
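The contrast between the two families can be made concrete with a toy simulation. Everything here is an assumption for illustration: the score model (true score plus a fixed bias plus Gaussian noise), the function name, and the numbers. It is deterministic math on stated assumptions, not a measurement of any system.

```python
import random
import statistics


def simulate_mean_score(true_score: float, bias: float, noise_sd: float,
                        n_runs: int, seed: int = 0) -> float:
    """Average n_runs simulated eval scores.

    Each run is modeled as true_score + bias + Gaussian noise. Averaging
    shrinks the noise term; the bias term passes through unchanged.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    runs = [true_score + bias + rng.gauss(0.0, noise_sd) for _ in range(n_runs)]
    return statistics.mean(runs)
```

Under this model, raising `n_runs` pulls the average toward `true_score + bias`, not toward `true_score`. With a nonzero `bias`, a very large run count yields a very precise estimate of the wrong number.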

This is why the reliability lane reads an eval result as two quantities, its center and its spread, and asks a validity question about the center before trusting either. The spread work is the multi-trial view, reliability@k and the interval every rate carries. The center work is judge and benchmark validity, and it connects to the deeper question of whether the systems these evals grade are reliable at all, which is its own discipline.

Calibration drift sits across the seam. It begins as a validity property (the evaluator was calibrated against some reference) and becomes a reliability problem the moment the evaluator shifts under a fixed name, so a number that was valid last quarter now measures a slightly different thing. Treat it as ongoing maintenance with a re-calibration cadence.
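One way to put a re-calibration cadence into practice is to re-score a fixed human-labeled calibration set on a schedule and compare the judge's agreement with the stored baseline. The sketch below assumes binary pass/fail labels; the function names and the `tolerance` default are hypothetical, and a real threshold should account for sampling noise on a small calibration set.

```python
from typing import Sequence


def agreement(judge_labels: Sequence[int], human_labels: Sequence[int]) -> float:
    """Fraction of calibration items where the judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def has_drifted(baseline_agreement: float, current_labels: Sequence[int],
                human_labels: Sequence[int], tolerance: float = 0.05) -> bool:
    """Flag drift when agreement moves more than `tolerance` from the baseline.

    `tolerance` is an illustrative default, not a recommended value.
    """
    current = agreement(current_labels, human_labels)
    return abs(current - baseline_agreement) > tolerance
```

The human labels stay fixed across checks; only the judge's labels are regenerated. A flag here is the cue to re-calibrate and re-report, since the threshold no longer certifies what it certified at baseline.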

Eval methods, and the failure mode each one inherits

The choice that sets which of these failures you inherit is the eval method itself. Five families cover most of what teams run. What each family is and what it answers is the companion explainer’s job; the cut here is narrower. The table routes each family straight to the taxonomy above: the failure modes it is structurally immune to, and the one it still hands you to manage. One mode sits outside the table because it is universal: variance across seeds rides on every method equally, a property of the stochastic system under test that no choice of grader removes. The outcome-versus-step-versus-trajectory cut beneath these families is developed in full in how the granularity of an eval changes what it can see.

Eval method	Structurally immune to	The failure mode it still hands you
Programmatic / exact-match (unit tests, string or numeric match)	Judge bias and calibration drift: no evaluator model to bias or shift under you	A construct-validity gap in the check, where a right answer reached by a wrong route or a brittle format still passes green
Reference-based (compare to a gold answer or trace)	Judge bias: grading is mechanical against a fixed target	Construct-invalid coverage, where legitimate alternatives the reference never anticipated score as failures and a trace counts inherited errors as fresh
LLM-as-a-judge	Little structurally; it reaches coverage no exact check can	Biased-judge false positives, plus calibration drift as the judge model shifts under a fixed name
Human evaluation	Judge-model bias and calibration drift, since a person grades in place of a model	Seed-like spread from annotator disagreement at the small samples humans can cover
Trajectory / process eval	Endpoint-only blindness: it sees an early wrong step a plausible ending papers over	A construct-invalid definition of a valid path, plus the build cost of encoding one

The method you pick decides which failure mode you sign up for. An LLM judge buys coverage no exact check can reach and imports the judge’s bias and calibration drift with it, which is why the judge-specific reference exists on its own and why its scores need bias-corrected reporting with a calibration-aware interval before they enter a model card. Human labels are both the ground truth the rest approximate and the calibration set that makes a judge trustworthy. Process eval catches the failure the others structurally miss, which is the eval-side face of the system failure taxonomy for multi-agent pipelines.
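To illustrate how a human-labeled calibration set can correct a judge's raw pass rate, the sketch below uses the Rogan-Gladen estimator from diagnostic testing. This is a swapped-in technique chosen for illustration; the member page on bias-corrected reporting may use a different method. The function name is hypothetical, and `sensitivity` and `specificity` are assumed to have been measured by comparing the judge against human labels.

```python
def bias_corrected_pass_rate(observed_rate: float, sensitivity: float,
                             specificity: float) -> float:
    """Rogan-Gladen point estimate of the true pass rate.

    sensitivity: P(judge says pass | human says pass)
    specificity: P(judge says fail | human says fail)
    Both are assumed to come from a human-labeled calibration set.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge is no better than chance on the calibration set")
    corrected = (observed_rate + specificity - 1.0) / denom
    return min(1.0, max(0.0, corrected))  # clip to a valid rate
```

As a check on the arithmetic: if the true pass rate is 0.60 and the judge has sensitivity 0.90 and specificity 0.80, the judge reports 0.60 × 0.90 + 0.40 × 0.20 = 0.62, and the correction recovers 0.60. This is a point estimate only; a calibration-aware interval would also have to carry the uncertainty in the sensitivity and specificity estimates, which this sketch does not do.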

The cluster map: which page answers which eval question

These pages group because each hardens a different stage of turning an eval into a number you can defend, from choosing the method to reporting the interval. This hub covers evals broadly; the judge-bias reference below is scoped to LLM judges specifically, which is the one distinction worth holding as you read. The list is also the reading order.

What LLM evals are, and the eval types. The starting-point explainer: the four eval instruments and the benchmark-vs-product-eval line, before the rigor this hub adds.
Trajectory-level agent evaluation. The cluster pillar. It scores the path an agent took, including its final answer, and it catches early-step corruption, reward hacking, and interval-less pass rates.
The LLM-as-a-judge bias map. The judge-scoped reference: position, verbosity, self-preference, format, and social biases, each with a detection test and a correction you can run before trusting a ranking.
Reporting a judge score after you bias-correct it. Turns a judge’s raw pass rate into a bias-corrected accuracy with a calibration-aware confidence interval.
Measuring reliability past a single pass rate. pass^k, reliability@k, and the interval every rate needs, including the variance a single run cannot see.
The fault-injection protocol behind the numbers. How controlled faults produce reliability numbers you can put an interval on.
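For readers who want a feel for the multi-trial metrics before following the links, here is a minimal sketch of pass^k under one common reading: the probability that all k independent trials of a task pass. That reading is an assumption, since this hub does not define the metric, and the combinatorial estimator and function name below are illustrative; the member page is the authority on the definition it uses.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that k trials all pass, from c passes in n trials.

    Counts the fraction of size-k subsets of the n observed trials
    in which every trial passed: C(c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    return comb(c, k) / comb(n, k)  # comb(c, k) is 0 when k > c
```

With 6 passes in 8 trials, `pass_hat_k(8, 6, 1)` is the plain pass rate of 0.75, while `pass_hat_k(8, 6, 2)` drops to 15/28. The gap between the two is the variance a single run cannot see.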

The shared vocabulary these pages lean on, fault injection among the terms, is defined once across the reliability glossary, so each page can reference a metric by the angle it needs instead of re-deriving it.

A note on the numbers you will meet down these links. The load-bearing empirical claims resolve to their sources: a strong judge matching human preference at over 80% agreement, about the rate humans agree with each other (Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, arXiv:2306.05685, peer-reviewed, NeurIPS 2023 Datasets and Benchmarks Track, as of 2026-07), and the construct-validity critique of treating a benchmark as a general measure of progress, which set out “to reveal the construct validity issues in their framing as the functionally ‘general’ broad measures of progress” (Raji et al., AI and the Everything in the Whole Wide World Benchmark, arXiv:2111.15366, peer-reviewed, NeurIPS 2021 Datasets and Benchmarks track, as of 2026-07). The worked reliability arithmetic on the member pages is deterministic math on stated assumptions, labeled as such, not a measurement of any system.

The eval you can stand behind

You arrived deciding whether one score was trustworthy enough to ship behind. That decision resolves into two prior ones you can now name: is the number valid, meaning free of a biased judge and built on a benchmark that measures what you care about, and is it stable, meaning reported with the interval that says how far it moves on a re-run. Answer both, and the score becomes evidence you can stand behind when the ship decision is on the table.

This site points toward a reliability profiler, a pre-launch instrument whose design intent is to run these validity and stability checks against your own eval and report each correction with a confidence interval. It is pre-launch and contributes no number of its own here; every figure on this page is a citation you can follow to its source.

The research index is where the schematic arithmetic on the member pages becomes measured runs with their intervals, and the glossary fixes each term those pages share. Start with the method you run today, then find its failure mode in the table above and follow it to the fix.