
What LLM evals are, and what each type can certify

LLM evals cover four instruments: offline benchmarks, LLM-as-judge scoring, human review, and online evaluation. Each answers a different question. This page also covers the line between benchmark and product evals and the rigor behind a trustworthy score.

Someone on your team says the evals pass, and you have to decide what that sentence buys you. Behind the word “eval” sit at least four different instruments: a public benchmark score, a grader model’s verdict, a human’s read, and a dashboard of live traffic. They answer different questions, they fail in different ways, and they license different decisions. Before you trust any of them to gate a release, you need to know which one you are actually holding and what it can certify.

What an LLM eval actually is

An LLM eval is a repeatable procedure that scores a language model, or the system built around it, against criteria you can check: a known answer, a rubric, a constraint, or a stated preference. The scoring can come from an exact match, another model, a person, or production telemetry, but the shape stays constant. You fix a set of inputs, define what counts as good, and produce a number you can compare across versions.
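A minimal sketch of that shape, assuming an exact-match criterion: a fixed case set, a scorer that defines "good", and a single comparable number. The names `Case`, `run_eval`, and `toy_system` are hypothetical and exist only for illustration; a real harness would call your model or pipeline where `toy_system` stands.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    """One fixed input together with its known answer."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Criterion: output equals the known answer, ignoring case and outer whitespace."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    system: Callable[[str], str],
    cases: list[Case],
    scorer: Callable[[str, str], bool] = exact_match,
) -> float:
    """Score `system` on a fixed case set and return the pass rate in [0, 1]."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    passed = sum(scorer(system(c.prompt), c.expected) for c in cases)
    return passed / len(cases)


def toy_system(prompt: str) -> str:
    """Hypothetical stand-in for a model or pipeline call."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


CASES = [Case("2+2", "4"), Case("capital of France", "paris"), Case("3*3", "9")]
score = run_eval(toy_system, CASES)
print(f"pass rate: {score:.2f}")  # 2 of 3 cases pass
```

Swapping `exact_match` for a rubric grader, a human label lookup, or a telemetry check changes the instrument without changing the shape.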

That definition is deliberately broad, because in practice people call all four instruments “the eval” and then argue past each other. The confusion is not academic. It decides what a passing score actually permits.

The first fork settles most of what follows. Are you measuring the model, or the product you built around it?

Benchmark eval vs product eval: the distinction that sets what a pass means

A benchmark eval scores a model against a fixed, shared task set that everyone runs, so the result is a capability ranking. A product eval scores your system against your own workload, so the result is a fitness verdict. The two get conflated constantly, and the conflation is where teams ship a model that tops a leaderboard and fails on their traffic.

| Dimension | Benchmark eval | Product eval | Which one your sign-off needs |
| --- | --- | --- | --- |
| What it asks | “Is this model capable in general?” | “Does my system do its specific job?” | Product; a strong benchmark score is only a prerequisite |
| Unit under test | The model in isolation | The whole pipeline: retrieval, tools, prompts, orchestration | Product; users hit the assembled system |
| Where the data comes from | A frozen, shared public dataset | Cases and traces from your own workload | Product; a generic set rarely matches your distribution |
| How it goes wrong | Contamination and saturation: the test leaks into training, top models cluster near the ceiling | Drift and spec-gaming: your distribution shifts, or the system games the metric without doing the task | Both, though product failures are the ones that reach users |
| What a pass licenses | “Worth integrating and testing” | “Safe to ship this build” | Product; only it gates a release |

A benchmark tells you a model is capable in the abstract; it does not tell you that the system you built around that model is reliable on your work. That gap is why capability can keep improving while reliability lags: the benchmark measures the component, and reliability is a property of the assembled system under your conditions.

Contamination sharpens the point on the benchmark side. Once a public test set is old enough to have leaked into training data, a high score can reflect memorized answers, so it no longer tracks capability, and the ranking silently stops discriminating between the top models. A product eval on data the model has never seen is the antidote, which is why the serious work of evaluation moves onto your own task distribution the moment the decision matters.

The four instruments people call “the eval”

Both a benchmark and a product eval still have to be scored somehow, and there the second distinction appears: who or what does the grading. Four scoring instruments dominate, and each answers a genuinely different question at a different cost.

| Eval type | The question it actually answers | Characteristic failure mode | Judgment: what it cannot certify on its own |
| --- | --- | --- | --- |
| Offline benchmark | “How does this model rank on a fixed, shared task set?” | Contamination and saturation: the set leaks into training and top scores cluster at the ceiling | Whether the model behaves on your inputs, your tools, and your task distribution |
| LLM-as-judge | “Does a grader model think each output meets the rubric, cheaply and at scale?” | The judge is an imperfect classifier: sensitivity and specificity errors bias the raw rate, on top of position, verbosity, and self-preference biases | True accuracy, until the judge is calibrated against human labels and its bias is removed |
| Human review | “Do people with the right context agree the output is good?” | Slow, costly, and noisy: inter-rater disagreement and small samples; it does not scale to every release | Reliability across the full input distribution or across repeated runs, since a person scores a slice |
| Online / production eval | “How is the deployed system doing on real traffic now?” | Lagged and confounded: you see the failure after users do, and attribution across a pipeline is hard | How a candidate change will behave before it reaches users, because it measures the past |

No single row is the eval. Each is a partial view, and treating any one as the whole is how a green dashboard still ships a broken release.

The instruments also lean on each other. An LLM judge is only trustworthy once a human-labeled set has measured how often it agrees with people, so human review is the calibration substrate underneath judge-based scoring rather than a competing option. Online eval, in turn, is the only instrument that sees the real distribution, but it sees it too late to gate the build that caused a regression. The design question is which instrument owns which decision.

The one failure every eval type shares

Whatever instrument produces the number, it produces it from a stochastic system, so that number carries a spread no single run reveals. Re-run the same suite and the score moves. Report it as a constant and the next re-run can quietly reorder your leaderboard.
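To make the spread concrete, here is a small simulation under a deliberately simple assumption: every task in the suite passes independently with the same fixed probability. Real suites have correlated and uneven tasks, so treat this as a sketch of the sampling noise alone. The 70% rate, the 100-task suite, and the function name are illustrative values, not measurements.

```python
import random
import statistics


def simulate_suite(true_rate: float, n_tasks: int, rng: random.Random) -> float:
    """One suite run: each task passes independently with probability `true_rate`."""
    return sum(rng.random() < true_rate for _ in range(n_tasks)) / n_tasks


rng = random.Random(0)  # seeded so the sketch is reproducible
scores = [simulate_suite(true_rate=0.70, n_tasks=100, rng=rng) for _ in range(20)]
spread = max(scores) - min(scores)

print(
    f"min={min(scores):.2f} max={max(scores):.2f} "
    f"stdev={statistics.stdev(scores):.3f}"
)
```

Under this model the standard error of a single run is sqrt(0.7 × 0.3 / 100), about 4.6 points, so two runs of the same unchanged system can differ by several points with nothing having changed.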

The size of that gap is measured, not hypothetical. On tau-bench, a tool-use benchmark, “even state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail)” (Yao et al., tau-bench, arXiv:2406.12045, preprint, as of 2026-07). An agent that clears a task on one attempt clears it on all eight far less often, so a headline read at a single try describes a best case, and the number you can depend on is the one measured across repeated runs.

The general form of that repeated-trial question is pass^k, the probability a system solves a task on every one of k independent tries; reliability@k aggregates that across a representative suite, and tau-bench’s pass^8 is one published instance of pass^k. For a working definition, the rule is narrow and firm: a trustworthy eval reports a rate with a confidence interval, and any comparison between two systems that lacks one stays undecided.
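One way to estimate pass^k from repeated trials is sketched below. For a task run n times with c successes, the ratio C(c, k) / C(n, k) is the fraction of k-sized subsets of trials that are all successes, and it is an unbiased estimate of p^k when trials are independent with success probability p. Whether a particular benchmark computes its published figure in exactly this way is an assumption here, and the function names and the sample results are hypothetical.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k for one task from n trials with c successes.

    C(c, k) / C(n, k) is the share of k-subsets of the trials in which
    every trial succeeded.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def suite_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over a suite of (n_trials, n_successes) pairs."""
    if not results:
        raise ValueError("suite is empty")
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


# Hypothetical suite: four tasks, eight trials each.
results = [(8, 8), (8, 6), (8, 4), (8, 0)]
pass1 = suite_pass_hat_k(results, 1)  # (1 + 0.75 + 0.5 + 0) / 4 = 0.5625
pass8 = suite_pass_hat_k(results, 8)  # only the first task passes all eight: 0.25

print(f"pass^1 = {pass1:.4f}, pass^8 = {pass8:.4f}")
```

In this toy suite the single-try rate is above 56% while the all-eight rate is 25%, which is the same direction of gap the tau-bench quote describes.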

The LLM-as-judge row carries a second, distinct source of error that no amount of test data removes. A judge model is an imperfect classifier, and “imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores” (Lee et al., How to Correctly Report LLM-as-a-Judge Evaluations, ICML 2026, peer-reviewed, as of 2026-07). The fix is to calibrate the judge against human labels, then correct the score and attach a calibration-aware interval, while watching for the position, verbosity, and self-preference biases a judge brings to a rubric. Calibration and an interval are what separate a score from a measurement.
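The correction step can be sketched with a standard misclassification adjustment (often called the Rogan-Gladen estimator). This is one common form, not necessarily the exact procedure in the cited paper, and it treats the human labels as ground truth. The function names, the calibration data, and the 0.62 raw rate are hypothetical.

```python
def judge_rates(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Sensitivity and specificity of the judge, treating human labels as truth."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    true_pos = sum(h and j for h, j in zip(human, judge))
    true_neg = sum((not h) and (not j) for h, j in zip(human, judge))
    n_pos = sum(human)
    n_neg = len(human) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("calibration set needs both passing and failing examples")
    return true_pos / n_pos, true_neg / n_neg


def corrected_rate(raw_rate: float, sensitivity: float, specificity: float) -> float:
    """Adjust the judge's raw pass rate for its measured error rates."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge is no better than chance; correction is undefined")
    estimate = (raw_rate + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))  # clamp to a valid rate


# Hypothetical calibration set: 10 outputs humans passed, 10 they failed.
human = [True] * 10 + [False] * 10
judge = [True] * 9 + [False] + [True] * 2 + [False] * 8
sens, spec = judge_rates(human, judge)  # 0.9 and 0.8

# Hypothetical raw pass rate the judge reports on the full eval set.
corrected = corrected_rate(0.62, sens, spec)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} corrected={corrected:.2f}")
```

Here a raw judge rate of 0.62 corresponds to a corrected rate of about 0.60. The sketch stops at the point estimate; a calibration-aware interval would also have to carry the uncertainty in the sensitivity and specificity estimates, which a 20-example calibration set leaves large.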

Which eval answers your question

The four instruments form a stack, with each layer owning a different decision. An offline benchmark filters which models are worth integrating at all. Product evals on your own task distribution decide whether a specific build is safe to ship. An LLM judge scales the grading for those evals, once human review has calibrated it. Online eval catches what all of them missed, after the fact.

The trap is answering a product question with a benchmark number, or a reliability question with a single run. A leaderboard rank cannot tell you your pipeline holds together, and one pass rate cannot tell you how often it holds. When an agent takes multiple steps, the harder version of product eval scores the whole path it took, including every intermediate action; the pillar on evaluating agents across the whole trajectory works that out. The mechanics of sizing your run count and reading the resulting interval are covered in the piece on how many runs a decision actually needs.

Start from the question, then pick the instrument

Before you accept an eval as evidence, ask three things: which of the four instruments produced it, whether it measured the model or your system, and what interval sits around the number. If the answer to the last one is “none,” you are holding a ranking that a re-run can overturn, and the next step is to repeat the eval enough to put a bound on it.
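As a sketch of what "an interval around the number" can look like, the code below computes a Wilson score interval for a pass rate and checks whether two systems' intervals overlap. The choice of the Wilson interval, the 95% level (z = 1.96), and the example counts are assumptions made for illustration; the interval also assumes the cases are independent.

```python
from math import sqrt


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical results: system A passes 70 of 100 cases, system B passes 76 of 100.
interval_a = wilson_interval(70, 100)
interval_b = wilson_interval(76, 100)
undecided = intervals_overlap(interval_a, interval_b)

print(f"A: {interval_a[0]:.3f} to {interval_a[1]:.3f}")
print(f"B: {interval_b[0]:.3f} to {interval_b[1]:.3f}")
print("undecided" if undecided else "separated")
```

With 100 cases each, a six-point lead leaves the two intervals overlapping, so the comparison stays undecided until more cases or more runs narrow them. Overlap is a conservative screen; a test on the difference between the two rates is the sharper tool once the decision is close.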

The reliability profiler this site points toward is a pre-launch instrument, designed to run repeated-trial evals and report a score with its interval against field norms. No measured number exists for it yet, so this page states none, and each figure cited above carries a reference you can follow. The wider view, with each eval method set against the four ways a score can still read green while lying, is mapped in the LLM evals hub. The canonical definitions behind these metrics live in the lane’s vocabulary, and the measured versions of the schematic figures above take shape in the research program.