Eval rigor

Is your LLM-as-a-judge reliable? Test the evaluator

An LLM-as-a-judge is a fallible evaluator. Its reliability breaks along three axes: agreement, calibration, and bias. Each has a test and a correction, and this hub routes to all three.

By LatentEval | Published 2026-04-01 | Updated 2026-05-25

An LLM-as-a-judge is an LLM standing in for a human grader, scoring another model’s outputs pass or fail so a bake-off, a release gate, or a leaderboard can move without a human in the loop. What decides whether to trust its verdict is not how confident the judge sounds but how reliable it is as an instrument: whether it agrees with humans, holds calibration, and resists the biases that swing close calls. This hub answers that reliability question and routes you to the test, the correction, and the vocabulary behind each.

A judge is a measuring instrument, and you do not publish a measurement without characterizing the instrument first. Judge reliability separates into three properties: how well the judge agrees with a human reference, whether its pass rate is a calibrated estimate of true accuracy, and how much systematic bias tilts its close calls. Each property has its own test, its own correction, and its own page below.

The reason teams skip the question is a single number. A strong LLM judge matched human preferences on MT-Bench at an 85% agreement rate, above the 81% humans reach with each other (Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, NeurIPS 2023 Datasets and Benchmarks Track, arXiv:2306.05685). That single figure is why so much grading got handed to a model. It is reported as a bare point estimate with no confidence interval, and it is an average across easy and hard comparisons alike, which is exactly why one headline cannot certify a judge for the decisions you care about.

This hub is scoped to the reliability of the LLM doing the grading. The broader problem of evaluating an agent’s whole trajectory has its own pillar; the subject here is the grader itself.

The judge-reliability toolkit

These three resources are the working set for characterizing a judge before you trust its output. They belong together because each closes a different gap the 85% headline leaves open: one settles whether a distortion decides your close calls, one settles whether the number you report is unbiased, and one gives the vocabulary the other two are written in.

Resource	The reliability question it settles	What it hands you	Reach for it when
Per-bias detection tests (bias pillar)	Does a systematic distortion decide your close calls?	A test plus a correction per bias: position, verbosity, self-preference, format, and more	Any pairwise or open-ended judging, before you trust a close ranking
The reporting correction (reporting page)	Is the pass rate you are about to publish an unbiased estimate of true accuracy?	Sensitivity and specificity bias-correction, and a calibration-aware interval	Any judge score headed for a model card, a release gate, or a slide
Eval-reliability vocabulary (glossary + metrics)	What do agreement, calibration, and the aggregate metrics actually mean here?	Canonical definitions, including the suite-level metric that consumes judge labels	When a verdict feeds an aggregate, an SLA, or a post-mortem

Follow the routing by the failure you suspect. If your comparisons are close and a ranking hangs on them, start with each documented bias and the test that exposes it. If a pass rate is about to leave your team as a headline, start with the procedure that will bias-correct that score and attach a calibrated interval. And when a judge’s verdict flows into an aggregate, look at the suite-level metric that reads each verdict as ground truth: that is where the judge’s error is silently absorbed into a summary statistic. The full reliability vocabulary sits underneath all three.

The three axes a judge fails along

The synthesis below is the hub’s organizing claim: judge reliability resolves into three separable axes, and a judge can pass any one while failing the others. Each row names what the axis asks of the judge, how you measure it, and the one thing that goes wrong when you ignore it.

Reliability axis	What it asks of the judge	How you measure it	Why it moves the verdict
Human agreement	Do the judge’s verdicts match a trusted human reference?	Agreement rate against human labels, or a chance-corrected coefficient such as Cohen’s kappa when one verdict dominates	A judge that diverges from humans is scoring something other than the quality you asked it to score
Calibration	Is the judge’s pass rate an unbiased estimate of true accuracy?	The judge’s sensitivity and specificity on a small human-labeled set, then a bias-corrected rate with its interval	An uncalibrated judge shifts the reported score by a fixed bias; more test items estimate the shifted score more precisely but do not remove the shift
Bias	Do systematic distortions tilt the close calls?	Controlled perturbations: swap order, match length, mask identity, inject a cue, vary formatting	Bias concentrates on the near-ties that decide a ranking, so it can flip a leaderboard while the average looks clean

Human agreement is the axis the field usually reports, and it is the shallowest. An agreement rate counts how often the judge and a human land on the same verdict, but raw agreement inflates whenever one verdict dominates, because two graders who both say pass most of the time will match often by chance alone. A chance-corrected coefficient like Cohen’s kappa exists to strip that free credit out. Either way, agreement measures resemblance to a reference; whether the reported number is right is a separate question.
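To make the gap between raw agreement and a chance-corrected coefficient concrete, here is a minimal sketch in plain Python. The function name `agreement_and_kappa` and the label lists are hypothetical illustrations, not data from any evaluation; the formula is the standard two-rater Cohen's kappa.

```python
from collections import Counter

def agreement_and_kappa(judge, human):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq = Counter(judge)
    human_freq = Counter(human)
    # Chance agreement: probability both graders emit the same label if each
    # drew independently from its own marginal label distribution.
    expected = sum(judge_freq[k] * human_freq[k] for k in judge_freq) / (n * n)
    if expected == 1.0:
        # Both graders always emit one identical label; kappa is undefined.
        return observed, float("nan")
    return observed, (observed - expected) / (1.0 - expected)

# Hypothetical labels: both graders say "pass" on 90 of 100 items.
human = ["pass"] * 90 + ["fail"] * 10
judge = ["pass"] * 85 + ["fail"] * 5 + ["pass"] * 5 + ["fail"] * 5

observed, kappa = agreement_and_kappa(judge, human)
print(f"raw agreement = {observed:.2f}, kappa = {kappa:.2f}")
```

On this toy set the graders match on 90 of 100 items, yet 82 of those matches are expected by chance from the marginals alone, so kappa lands well below the raw rate. That is the sense in which a dominant verdict hands out free credit.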

Calibration is that second, harder axis. Treat the judge as an imperfect classifier and its raw pass rate becomes a biased estimate of true accuracy, off by an amount set by its own error rates. The correction is mechanical once you have measured the judge on a human-labeled calibration set, and the worked reporting procedure carries the arithmetic and the interval end to end. The point to hold here is that this bias does not wash out with a bigger test set.
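The arithmetic behind that claim can be sketched as follows. This uses one standard form of the sensitivity and specificity correction, often called the Rogan-Gladen estimator; whether the reporting page uses this exact form is an assumption, and the function names and the numbers (0.70, 0.95, 0.80) are hypothetical.

```python
def expected_observed_rate(true_rate, sensitivity, specificity):
    """Pass rate an imperfect judge reports in expectation.

    sensitivity: P(judge says pass | output truly passes)
    specificity: P(judge says fail | output truly fails)
    """
    return true_rate * sensitivity + (1.0 - true_rate) * (1.0 - specificity)

def corrected_pass_rate(observed_rate, sensitivity, specificity):
    """Invert the relation above to recover an estimate of the true rate."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge is no better than chance; correction undefined")
    estimate = (observed_rate + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))  # clip to a valid proportion

# Hypothetical judge measured on a human-labeled calibration set.
sens, spec = 0.95, 0.80
true_rate = 0.70

raw = expected_observed_rate(true_rate, sens, spec)
fixed = corrected_pass_rate(raw, sens, spec)
print(f"raw = {raw:.3f}, corrected = {fixed:.3f}")
```

The expected raw rate depends only on the true rate and the judge's error rates, not on the number of test items, which is why a larger test set narrows the spread around the shifted value without moving it back. The sketch covers the point estimate only; a calibration-aware interval also has to carry the uncertainty in the measured sensitivity and specificity.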

Bias is the axis that decides close comparisons. Position, verbosity, self-preference, and formatting each push a verdict in a consistent direction, and a controlled perturbation is what tells taste apart from error. The per-bias map pairs every documented distortion with the swap, length-match, identity-mask, or cue injection that catches it, so this hub points at that page rather than re-running the tests here.
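As an illustration of what one controlled perturbation looks like, here is a minimal sketch of a position-swap check. The `judge` callable, its `"first"`/`"second"` return convention, and `toy_judge` are all hypothetical stand-ins; a real judge would be a model call, and the bias pillar remains the reference for the actual tests.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical interface: (prompt, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]
Pair = Tuple[str, str, str]  # (prompt, answer_a, answer_b)

def position_swap_report(judge: Judge, pairs: Iterable[Pair]) -> Dict[str, object]:
    """Judge every pair in both orders and report order sensitivity."""
    total = consistent = first_slot_wins = 0
    flipped: List[Pair] = []
    for prompt, a, b in pairs:
        total += 1
        v1 = judge(prompt, a, b)  # A shown first
        v2 = judge(prompt, b, a)  # B shown first
        winner1 = "A" if v1 == "first" else "B"
        winner2 = "B" if v2 == "first" else "A"
        if winner1 == winner2:
            consistent += 1
        else:
            flipped.append((prompt, a, b))  # verdict moved with the order
        first_slot_wins += (v1 == "first") + (v2 == "first")
    if total == 0:
        raise ValueError("no pairs supplied")
    return {
        "consistency": consistent / total,
        # Each answer sits in the first slot exactly once, so an
        # order-insensitive judge scores exactly 0.5 here.
        "first_slot_win_rate": first_slot_wins / (2 * total),
        "flipped": flipped,
    }

def toy_judge(prompt: str, first: str, second: str) -> str:
    """Toy judge: prefers a clearly longer answer, otherwise the first slot."""
    if len(second) - len(first) > 10:
        return "second"
    return "first"

pairs = [
    ("q1", "short", "a much longer answer here"),
    ("q2", "answer one", "answer two"),
]
report = position_swap_report(toy_judge, pairs)
print(report["consistency"], report["first_slot_win_rate"], len(report["flipped"]))
```

The pairs that land in `flipped` are the ones whose verdict depended on presentation order rather than content, which makes them natural candidates for escalation to a jury or a human.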

Why a high agreement score is not a clean bill of health

The three axes are independent: a judge can post a strong agreement rate and still be miscalibrated, because agreement counts matched verdicts while calibration asks whether the resulting rate is unbiased. Those are different questions with different answers.

The same judge can agree with humans in aggregate and still be biased on the pairs that matter, because bias hides where the comparisons are close and the aggregate is dominated by the comparisons that are not. Averaging over a suite conceals exactly the region a ranking is decided in.
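A small sketch shows how an aggregate can hide the close region. The counts below are invented for illustration, and the split into `"clear"` and `"close"` strata is an assumption: in practice it might come from human preference margins or annotator disagreement, neither of which is specified here.

```python
from collections import defaultdict

def agreement_by_stratum(records):
    """records: iterable of (stratum, judge_label, human_label) tuples.

    Returns agreement per stratum plus an "overall" entry, so no stratum
    may itself be named "overall".
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for stratum, judge_label, human_label in records:
        match = int(judge_label == human_label)
        for key in (stratum, "overall"):
            totals[key] += 1
            hits[key] += match
    return {name: hits[name] / totals[name] for name in totals}

# Hypothetical suite: 900 clear comparisons, 100 close ones.
records = (
    [("clear", "A", "A")] * 882 + [("clear", "A", "B")] * 18
    + [("close", "A", "A")] * 55 + [("close", "A", "B")] * 45
)
rates = agreement_by_stratum(records)
print({name: round(rate, 3) for name, rate in rates.items()})
```

With these made-up counts the overall figure sits near 94% while agreement on the close stratum is barely above a coin flip, and the close stratum is the one a ranking is decided in.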

So a judge that clears one axis has told you nothing about the other two. The 85% headline is an agreement figure. It does not report an interval, it does not correct for the judge’s classifier error, and it does not isolate the distortions that flip close calls. Reading it as a certificate of reliability is reading one axis as if it were three.

That is why the toolkit is three tests: each one settles a single axis. You run the agreement check to confirm the judge is scoring the right construct at all, the calibration correction to make the reported number unbiased, and the bias tests to protect the decisions the average papers over. Skip any one and a defect on that axis ships unmeasured.

Where an unreliable judge leaks into the rest of the system

A judge is a shared dependency, so its error is correlated across everything that reads its verdicts. The same evaluator gates the release, orders the leaderboard, feeds the suite-level aggregate, and gets pointed at post-mortems to say which step failed. One bias in the grader is not one wrong grade; it is the same wrong grade repeated in every place a downstream number trusts the judge, so the error propagates into each of those numbers at once.

Two leaks are worth naming because they are easy to miss. When a judge’s per-task verdicts roll up into an aggregate, feeding it uncorrected labels lets the bias into the summary one task at a time, which is why correction belongs upstream of the roll-up rather than bolted onto the headline. And when a judge is used to attribute a failure to the responsible step in a trace, a self-preferring grader will systematically excuse output that looks like its own, so the post-mortem points at the wrong agent. The same conflict makes a model a poor grader of its own family, which is why an independent verifier matters in a multi-agent system as much as it does for a judge.

This is where judge reliability meets the reliability of the systems a judge grades. An evaluator you cannot defend contaminates every reliability claim built on top of it, so the judge is the first instrument to characterize, ahead of the systems it grades.

How to run the three checks

Run them in the order the decision needs. First confirm the judge agrees with a human reference on your task, because a judge that scores the wrong construct is not worth correcting. Then correct and report the pass rate with its calibration-aware interval, so the number that leaves your team is unbiased and carries its uncertainty. Then run the bias tests on the close pairs, and escalate the comparisons that move under a test to a jury or a human.

Every number in that sequence carries an interval or it does not ship, the same discipline the lane applies when you report a reliability rate with its spread and when you size a run count to the decision. A judge score reported as a constant is the same false precision those pages warn against, one layer up.
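For a plain proportion such as an agreement rate, one common way to attach an interval is the Wilson score interval, sketched below. The counts are hypothetical, and this is not the calibration-aware interval the reporting procedure describes: a corrected pass rate also has to account for uncertainty in the judge's measured error rates.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical: the same 85% agreement rate at two sample sizes.
small = wilson_interval(85, 100)
large = wilson_interval(850, 1000)
print(f"n=100:  {small[0]:.3f} to {small[1]:.3f}")
print(f"n=1000: {large[0]:.3f} to {large[1]:.3f}")
```

The same point estimate supports very different claims at the two sample sizes, which is the practical reason a bare 85% cannot be read without its interval.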

The site points toward a reliability profiler designed to characterize a judge on all three axes at once, reporting agreement, calibration, and each bias correction with a confidence interval; it is pre-launch, so that is design intent and no such number is claimed here.

From this hub, the two member pages are where the schematic gives way to method: the bias pillar turns the bias axis into a test-and-correction map, and the reporting procedure turns the calibration axis into a corrected rate with its interval. The research index logs the rates those methods yield, intervals included, as each evaluation clears.