Eval rigor

LLM-as-a-judge bias, and the tests that catch it

LLM-as-a-judge bias is systematic, measurable distortion in an evaluator. A per-bias map pairs each bias with a detection test and a correction, so you can tell when a judge's ranking would flip.

By LatentEval Published 2026-04-01 Updated 2026-05-25

A strong LLM judge agrees with human experts about 85% of the time, higher than two humans agree with each other, and that result is what convinced most teams to hand grading to a model (Zheng et al., 2023, arXiv:2306.05685). The average is real. It is also why the audit stopped. The real decision is narrower than that average. What matters is whether the specific ranking your judge produced this week would survive a swap of answer order, a match on length, and a mask over which model wrote what. A judge that scores 85% on average can still be wrong on exactly the comparisons you care about, because its errors have a direction you can measure and remove.

That direction is bias, and bias is correctable. The rest of this page is the correction: for each documented way an LLM judge is wrong, the test that exposes it and the adjustment that removes it.

The judge-bias-correction map

Skip the walkthrough of what “LLM-as-a-judge” means; the useful artifact is a per-bias map that tells you which distortion your judge is carrying and what to do about each one. We call the practice judge-bias correction: run a controlled test for each bias, apply the matching fix, and report how much the ranking moved. The table below inventories those distortions one bias at a time: what each rewards, how to detect it, the matching correction, and how far it bends a ranking.

Bias	What the judge actually rewards	Detection test	Correction	Verdict: how much it distorts a ranking
Position	The slot an answer sits in; most judges lean toward the first	Swap-consistency: score (A, B), then score (B, A); disagreement is bias	Randomize order, run both directions, keep only verdicts consistent both ways	High on close pairs, near zero on lopsided ones; the cheapest to catch, so catch it first
Verbosity / length	Token count as a proxy for effort	Length-controlled pairs: regress score on length, or compare length-matched rewrites	Length-control the score after the fact, or hold length fixed in the rubric	Heavy on open-ended tasks; length-control moved a public leaderboard’s human correlation from 0.94 to 0.98
Self-preference	Text the judge recognizes as its own	Identity-masking: strip provenance, then run a cross-family judge panel on the same pairs	Blind the author, judge with a different model family, or use a jury and vote	Real but partly earned; the fix isolates the unearned part
Bandwagon	A stated majority or popularity signal	Cue injection: add “90% of reviewers prefer A”, measure the flip rate	Strip vote counts and popularity cues from the prompt and the candidate text	Low if the prompt is clean, dangerous the moment you paste in votes
Authority	Citations, credentials, confident sourcing	Cue injection: attach a fabricated citation to the weaker answer, measure the flip	Remove authority markers before judging; score the evidence and ignore its packaging	Bites hardest on factuality tasks where a citation reads as proof
Format / style	Markdown, headers, lists, visual structure	Style-controlled pairs: same content, one formatted, one plain prose	Normalize formatting across candidates, or use a substance-only rubric	Largest surface bias measured; above position bias for every judge in the cited study
Sentiment / tone	A positive or confident emotional register	Tone-controlled pairs: rewrite the register, hold the substance constant	Score claims and evidence while ignoring affect; judge on a de-toned copy	Quiet but present on subjective tasks; negligible on verifiable ones

The verdict column is the one that changes what you do. It ranks the biases by how much they actually distort a decision, and that ranking turns on your task and the closeness of the comparison: a distortion that vanishes on lopsided pairs can still decide every close one. A biased judge is a single evaluator whose distortion reaches every comparison it touches, so its blast radius is the whole leaderboard it produces, every grade at once.

A caveat on the magnitudes: every bias figure on this page is a point estimate that its source reports without a confidence interval. That holds for position consistency, win-rate gaps, robustness rates, and style-bias scores alike. Treat them as directional. That gap is itself the finding for an eval-rigor page, because the literature that measures judge bias mostly reports it the way it warns you not to report eval scores, as a single number with no interval. So the discipline lands on the correction: when you remove a bias, report the change in your ranking with an interval, the same way a reliability rate means little without the interval around it.

A reliability profiler is what this site is building toward, an instrument whose design intent includes running these judge-bias tests against your own evaluator and reporting each correction with a confidence interval; it has not launched, so no such number is measured, and this page carries none. The full reporting method, how to fold a set of bias tests into one corrected ranking with its interval, is the rigor-bridge companion to this page and is not yet published.

Why an 85% judge still ships biased rankings

The agreement headline is an average over every comparison in the suite, and most comparisons are lopsided. When one answer is clearly better, a judge lands the verdict whether it is biased or not, so the easy pairs inflate the aggregate. Systematic bias concentrates on the close pairs, which are precisely the ones a ranking turns on. A judge can therefore be 85% right across the whole suite and still order the top few contenders wrong, because the pairs it gets wrong are the pairs that decide the leaderboard. Averaging hides exactly the region you are trying to resolve.
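The arithmetic behind that claim is short enough to write out. The sketch below uses an assumed split (80 lopsided pairs, 20 close ones) and assumed per-group agreement counts; none of these numbers are measured, they only show how an 85% headline can coexist with worse-than-chance agreement on the close pairs.

```python
# Illustrative numbers only: the split and the agreement counts are assumptions.
groups = {
    "lopsided": {"pairs": 80, "agreeing": 76},  # 95% agreement with humans
    "close": {"pairs": 20, "agreeing": 9},      # 45% agreement with humans
}

total_pairs = sum(g["pairs"] for g in groups.values())
total_agreeing = sum(g["agreeing"] for g in groups.values())
overall_agreement = total_agreeing / total_pairs

print(f"overall agreement: {overall_agreement:.2f}")
for name, g in groups.items():
    rate = g["agreeing"] / g["pairs"]
    print(f"{name:>8}: {rate:.2f} on {g['pairs']} pairs")
```

Reporting agreement per group, not only in aggregate, is what exposes the region the average hides.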

This is a construct-validity problem: the question is whether the judge measures quality or something that only travels with it. A biased judge is scoring a proxy (position, length, or formatting) that tracks quality on easy pairs and diverges from it on hard ones. The correlation with human preference looks strong in aggregate and breaks down where the proxy and the construct part ways. An evaluator can be accurate on average and invalid on the decisions that matter.

Bias and noise are two different failures, and they need two different instruments. Noise is run-to-run variance: re-run the same judgment and the verdict wanders, so a confidence interval sizes it and more runs shrink it. Bias is systematic: a controlled test moves the verdict in a consistent direction every time, so more runs only estimate the bias more precisely; they never remove it. A confidence interval catches the noise, and a controlled swap, length-match, or identity mask catches the bias. You need both, and reporting one while skipping the other leaves half the error uncounted. There is a second requirement hiding in the word bias itself: a preference is only a bias when it diverges from a reference the judge is supposed to match. That reference is a human label or a verifiable ground truth. Without one, a judge that prefers longer answers is just a judge with a taste, and you cannot tell taste from error.
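A small simulation makes the difference visible. It is a sketch under stated assumptions: the reference score, the size of the bias, and the noise level are all hypothetical, and the judge is modeled as reference plus a fixed offset plus Gaussian noise.

```python
import random
import statistics


def simulate_judge_scores(true_score, bias, noise_sd, runs, rng):
    """Each run returns the reference score plus a fixed bias plus fresh noise."""
    return [true_score + bias + rng.gauss(0.0, noise_sd) for _ in range(runs)]


rng = random.Random(0)
true_score = 5.0  # hypothetical reference (human label or ground truth)
bias = 0.8        # hypothetical systematic offset, e.g. a length preference
noise_sd = 1.0    # hypothetical run-to-run standard deviation

for runs in (10, 100, 10_000):
    scores = simulate_judge_scores(true_score, bias, noise_sd, runs, rng)
    mean_error = statistics.fmean(scores) - true_score
    std_error = statistics.stdev(scores) / runs ** 0.5
    print(f"runs={runs:>6}  mean error={mean_error:+.2f}  std error={std_error:.3f}")
```

As the run count grows, the standard error shrinks toward zero while the mean error settles near the injected offset. More runs buy precision about the bias, not freedom from it.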

Position bias: does the verdict survive a swap?

Position bias is the base case, and the one with the cleanest test. “Position bias is when an LLM exhibits a propensity to favor certain positions over others” (Zheng et al., 2023). Show the same judge the same two answers in the opposite order and a position-biased judge changes its mind. In the original study GPT-4 was the steadiest judge tested, holding its verdict on more than 60% of order-swapped pairs, while weaker judges flipped far more often, and most judges leaned toward whichever answer appeared first. Later work put a robustness number on it: on the CALM framework’s robustness rate, where “a higher RR indicates that the model’s judgment is less affected by the bias,” position robustness ran from about 0.57 for an older ChatGPT judge to 0.83 for a stronger Claude judge (Justice or Prejudice?, arXiv:2410.02736, preprint).

The detection test is swap-consistency: score the pair as (A, B), then score it again as (B, A), and treat any disagreement as position bias rather than as signal. Running the judge twice with the order flipped is a controlled perturbation of the evaluator, the same fault-injection logic you would point at an agent, aimed at the grader instead. Suppose a judge flips its verdict on 30 of 100 order-swapped pairs. Its swap-consistency is 70 out of 100. If only 40 of those pairs were genuinely close calls and the flips concentrate there, the judge is inconsistent on up to 30 of the 40 pairs that actually decide the ranking, or 75% of them (computed from stated assumptions, not measured).
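The worked numbers can be reproduced with two small metrics. This is a minimal sketch: the `SwapResult` record and the `close_call` flag are hypothetical names, and the data assumes the extreme case where all 30 flips land among the 40 close pairs.

```python
from dataclasses import dataclass


@dataclass
class SwapResult:
    pair_id: int
    winner_ab: str    # winner when shown as (A, B): "A" or "B"
    winner_ba: str    # winner when shown as (B, A), mapped back to "A" or "B"
    close_call: bool  # hypothetical flag, e.g. derived from human labels


def swap_consistency(results):
    """Share of pairs where the verdict is the same in both orders."""
    consistent = sum(r.winner_ab == r.winner_ba for r in results)
    return consistent / len(results)


def flip_rate_on_close(results):
    """Share of close-call pairs where the verdict changes with the order."""
    close = [r for r in results if r.close_call]
    flips = sum(r.winner_ab != r.winner_ba for r in close)
    return flips / len(close)


# Assumed data: 100 pairs, the first 40 are close calls, the first 30 flip.
results = [
    SwapResult(i, "A", "B" if i < 30 else "A", close_call=i < 40)
    for i in range(100)
]

print(f"swap-consistency: {swap_consistency(results):.2f}")
print(f"flip rate on close pairs: {flip_rate_on_close(results):.2f}")
```

The headline consistency and the close-pair flip rate are computed from the same verdicts; only the second one tells you how exposed the ranking is.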

The correction has three parts. Randomize order so no candidate gets a structural advantage across a suite. Run every decisive comparison in both directions. Keep only the verdicts the judge gives consistently both ways, and send the pairs where it flips to a tie or to a human. Position bias collapses on lopsided pairs, where the better answer wins in either slot, and bites hardest on close ones, which is exactly where a ranking gets decided. That is why it earns the first correction on every pairwise task: it is the cheapest to run and it protects the comparisons most likely to be wrong.
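The both-direction step can be sketched as a wrapper around any pairwise judge. The judge interface here is an assumption: a callable that takes the two answers in display order and returns "first" or "second". The two toy judges are stand-ins for a model call, not real evaluators.

```python
from typing import Callable

# Assumed interface: judge(first_answer, second_answer) -> "first" or "second".
Judge = Callable[[str, str], str]


def both_direction_verdict(judge: Judge, answer_a: str, answer_b: str) -> str:
    """Return "A" or "B" if the judge agrees with itself in both orders, else "tie"."""
    first_pass = "A" if judge(answer_a, answer_b) == "first" else "B"
    second_pass = "B" if judge(answer_b, answer_a) == "first" else "A"
    return first_pass if first_pass == second_pass else "tie"


def always_first(first: str, second: str) -> str:
    """Toy judge with pure position bias: the first slot always wins."""
    return "first"


def longer_wins(first: str, second: str) -> str:
    """Toy judge that ignores position and prefers the longer answer."""
    return "first" if len(first) > len(second) else "second"


print(both_direction_verdict(always_first, "short", "a much longer answer"))
print(both_direction_verdict(longer_wins, "short", "a much longer answer"))
```

The position-biased judge contradicts itself and the pair is returned as a tie, ready to escalate. The second judge survives the swap, which shows only that it is free of position bias; its length preference needs its own test.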

Length and verbosity: is the score paying for words?

Judges reward length as a stand-in for effort, and the effect is easy to weaponize. Zheng et al. built a “repetitive list” attack that padded an answer with redundant restatements and measured how often a judge then preferred the longer version; they found all the judges tested were prone to verbosity bias, though GPT-4 resisted it better than the others (Zheng et al., 2023). CALM’s robustness numbers agree on the ordering: verbosity robustness sat higher than position robustness for strong judges, around 0.90 to 0.95, so length sways them less than order does, though not to zero (Justice or Prejudice?, arXiv:2410.02736, preprint).

The detection test is a length-controlled comparison. Either regress the judge’s score on answer length across your suite and look for a slope, or build length-matched rewrites that hold content fixed while varying only word count. The correction with the strongest published backing is length-control after the fact. AlpacaEval’s length-controlled variant fits a model that answers a counterfactual, “What would the preference be if the model’s and baseline’s output had the same length?”, then conditions its prediction on zero length difference (Dubois et al., 2024, arXiv:2404.04475, peer-reviewed, COLM 2024). The payoff is measurable: length-controlling “increases the Spearman correlation with LMSYS Chatbot Arena from 0.94 to 0.98” and hardens the metric against models that pad their outputs to win.
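The regression check is the simpler of the two detection tests to sketch. The code below fits an ordinary least-squares slope of judge score on token count; the data is hypothetical and assumed to come from answers of comparable quality, since a slope on unmatched answers can reflect real quality differences. It is a detection sketch only, not the fitted model AlpacaEval uses for its length-controlled win rate.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical data: token counts and judge scores for comparable answers.
lengths = [120, 180, 240, 300, 360, 420]
scores = [6.1, 6.4, 6.9, 7.1, 7.6, 7.8]

slope = ols_slope(lengths, scores)
print(f"score change per 100 extra tokens: {slope * 100:+.2f}")
```

A slope near zero on quality-matched answers suggests the judge is not paying for words; a clearly positive one is the signal to apply a length correction before trusting the ranking.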

On open-ended tasks this is often the correction that moves a ranking the most. A corrected win rate belongs next to a consistency metric carrying its own interval, because a debiased point estimate with no spread has only traded one false precision for a quieter one.

Self-preference: is the judge grading its own homework?

When the model that produced an answer is also the model grading it, a distinct bias appears. Self-preference is the case where “an LLM evaluator scores its own outputs higher than others’ while human annotators consider them of equal quality” (Panickssery et al., 2024, arXiv:2404.13076). The mechanism is recognition: the same work reports a “linear correlation between self-recognition capability and the strength of self-preference bias,” so a judge that can tell its own text apart is the judge most likely to reward it. The complication is that not all of that preference is illegitimate. Stronger models often prefer themselves partly because their answers are genuinely better, so the correction isolates the unearned part instead of deleting the whole gap.

Sizing it precisely is hard, and the field is candid about that. Zheng et al. observed GPT-4 rating its own answers with a 10% higher win rate and Claude-v1 with a 25% higher win rate, then flagged that “due to limited data and small differences, our study cannot determine whether the models exhibit a self-enhancement bias” (Zheng et al., 2023). A point estimate the authors themselves decline to certify is a good reason to correct by design instead of arguing about the magnitude.

The detection test is identity-masking: strip any provenance from the candidates, then run a panel of judges from different model families over the same pairs and watch for a judge that systematically favors the family it belongs to. The correction follows the same line: blind the author, prefer a judge from a different family than the systems under test, or use a jury and vote so no single model’s self-preference decides the outcome. This matters most when a judge is used to attribute failures, because a judge that flatters its own family will misassign blame in a post-mortem. The exposure is concrete: MAST scaled its failure taxonomy to more than 1,600 traces with an LLM-as-judge annotation pipeline, so any self-preference in that judge rides straight into the 1,600-trace dataset its prevalence figures rest on.
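One way to read the panel results is to compare how often a family's answers win under its own judge with how often they win under third-party judges. The sketch below assumes provenance has already been masked and that verdicts arrive as tuples; the family names are hypothetical. A third-party panel is only an approximation of the human reference, so treat the gap as a flag for escalation, not a measurement of the unearned share.

```python
from collections import defaultdict


def self_preference_gaps(verdicts):
    """verdicts: iterable of (judge_family, family_a, family_b, winning_family).

    Returns {family: own_rate - neutral_rate}: the family's win rate under its
    own judge minus its win rate under judges from families outside the pair.
    """
    own = defaultdict(lambda: [0, 0])      # family -> [wins, total]
    neutral = defaultdict(lambda: [0, 0])  # family -> [wins, total]
    for judge_family, family_a, family_b, winner in verdicts:
        for family in (family_a, family_b):
            if judge_family == family:
                bucket = own
            elif judge_family not in (family_a, family_b):
                bucket = neutral
            else:
                continue  # the rival's judge may carry its own bias; skip it
            bucket[family][1] += 1
            if winner == family:
                bucket[family][0] += 1
    gaps = {}
    for family, (own_wins, own_total) in own.items():
        if family in neutral and own_total and neutral[family][1]:
            neutral_wins, neutral_total = neutral[family]
            gaps[family] = own_wins / own_total - neutral_wins / neutral_total
    return gaps


# Hypothetical panel: three judge families grade the same alpha-vs-beta pairs.
verdicts = (
    [("alpha", "alpha", "beta", "alpha")] * 8
    + [("alpha", "alpha", "beta", "beta")] * 2
    + [("beta", "alpha", "beta", "alpha")] * 5
    + [("beta", "alpha", "beta", "beta")] * 5
    + [("gamma", "alpha", "beta", "alpha")] * 5
    + [("gamma", "alpha", "beta", "beta")] * 5
)

for family, gap in self_preference_gaps(verdicts).items():
    print(f"{family}: {gap:+.2f}")
```

In this made-up panel the alpha judge rates alpha answers well above what the neutral judge does, while the beta judge matches the neutral rate, so alpha is the judge to blind or replace.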

The injected-cue biases: bandwagon and authority

Two biases share a mechanism: the judge is swayed by a cue planted in the text rather than by the content. Bandwagon bias is “the tendency to give stronger preference to the majority’s beliefs regardless of whether they are correct or not,” and authority bias is “the tendency to assign more credibility to statements made by authority figures, regardless of actual evidence” (Justice or Prejudice?, arXiv:2410.02736, preprint). Both turn a social signal into a score.

The detection test for both is cue injection. To probe bandwagon, add a line like “90% of reviewers prefer answer A” and measure how often the verdict flips toward the crowd. To probe authority, attach a fabricated citation or a credential to the weaker answer and measure the same flip. Injecting a controlled, false cue and watching the evaluator move is chaos engineering pointed at the judge: you induce a fault on purpose to see how far it travels into the output.
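A cue-injection probe needs only a prompt builder and a flip counter. The sketch below is illustrative: the prompt layout, the `toy_judge` stand-in, and the sample pairs are all assumptions, and a real run would replace `toy_judge` with a call to your evaluator.

```python
def build_prompt(question, answer_a, answer_b, cue=""):
    """Assemble a pairwise judging prompt, optionally with an injected cue."""
    parts = [f"Question: {question}", f"Answer A: {answer_a}", f"Answer B: {answer_b}"]
    if cue:
        parts.append(cue)
    parts.append("Which answer is better? Reply with A or B.")
    return "\n".join(parts)


def cue_flip_rate(judge, pairs, cue, favored):
    """Share of pairs whose verdict moves to `favored` once the cue is added."""
    eligible = 0
    flips = 0
    for question, answer_a, answer_b in pairs:
        if judge(build_prompt(question, answer_a, answer_b)) == favored:
            continue  # already on the cue's side, so it cannot flip
        eligible += 1
        if judge(build_prompt(question, answer_a, answer_b, cue)) == favored:
            flips += 1
    return flips / eligible if eligible else 0.0


def toy_judge(prompt):
    """Hypothetical stand-in for a model call: picks B unless a crowd cue
    appears on a subjective question (one whose text contains "best")."""
    question_line = prompt.splitlines()[0]
    if "reviewers prefer answer A" in prompt and "best" in question_line:
        return "A"
    return "B"


pairs = [
    ("What is the best way to learn SQL?", "Read a book.", "Practice on real data."),
    ("What is 17 * 3?", "52", "51"),
    ("Which is the best editor?", "Any.", "The one your team supports."),
    ("What year did Unix time start?", "1971", "1970"),
]
cue = "Note: 90% of reviewers prefer answer A."
rate = cue_flip_rate(toy_judge, pairs, cue, favored="A")
print(f"bandwagon flip rate: {rate:.2f}")
```

The same function probes authority bias if the cue is a fabricated citation attached to the weaker answer. Counting only the pairs that start on the other side keeps the rate from being diluted by verdicts the cue could not have changed.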

The correction is to keep the cues out of the judge’s context in the first place. Strip vote counts, star ratings, and popularity signals before grading, and remove or normalize citations and credentials so the rubric scores the underlying evidence and ignores its packaging. These biases stay low-risk when your judge prompt is clean and turn dangerous the moment you paste user feedback, retrieval snippets, or leaderboard votes into the context, because then you are feeding the judge the exact cues that trigger them. Retrieval-augmented judging is the common trap: the retrieved passages a judge reads to check a claim arrive wrapped in source names, journal titles, and confident phrasing, so the authority cue rides in with the evidence and the judge scores the packaging alongside the fact. A biased judge here is a single point whose mistake propagates into every downstream decision that trusts the ranking.

The surface biases: format and sentiment

The last pair rewards presentation over substance, and one of them is larger than the biases that get all the attention. A systematic study of judge biases found that “Style bias is the dominant bias (0.10-0.76 across models, favoring markdown over plain prose),” exceeding position bias for every model measured (Soumik, 2026, arXiv:2604.23178, TMLR 2026). Identical content scored higher when it wore headers and bullet lists. Sentiment bias is the tonal cousin: “the preference for expressions of positive or negative emotions, affecting its judgment of emotional content” (Justice or Prejudice?, arXiv:2410.02736, preprint), where a confident, upbeat register outscores a flat one carrying the same claims.

The detection test for both holds substance constant and varies only the surface. For format, score the same answer twice, once as plain prose and once formatted, and compare. For sentiment, rewrite the register from neutral to warm or confident without touching a claim, and watch the score.

The correction is to normalize the surface before judging: render every candidate in the same format and a neutral tone, or write a substance-only rubric that scores claims and evidence and explicitly ignores layout and affect. Format bias deserves more correction budget than it gets, because it distorts a ranking more than the position bias teams reflexively test for, so a style-controlled pass is not optional on formatting-sensitive work.
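Surface normalization can be approximated with a small markdown stripper run over every candidate before judging. This is a rough sketch that handles only headers, list markers, bold, and inline code; the function name is hypothetical, and a production pass would want a real markdown parser.

```python
import re


def to_plain_prose(text: str) -> str:
    """Strip common markdown structure so candidates share one plain format."""
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)       # headers
    text = re.sub(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]+", "", text, flags=re.MULTILINE)  # list markers
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)                               # bold
    text = re.sub(r"`([^`]+)`", r"\1", text)                                      # inline code
    text = re.sub(r"\s*\n\s*", " ", text)                                         # join lines
    return text.strip()


formatted = "## Answer\n\n- **Fast**: uses a cache\n- Safe: validates `input`\n"
print(to_plain_prose(formatted))
```

Running the same normalizer over both sides of a pair also gives you the style-controlled test for free: score the raw and normalized versions and compare.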

Turning tests into a corrected ranking

A list of biases becomes a method only when it produces a number you can defend. The fix is not a better judge; it is a correction protocol run before you trust the ranking. It has five steps.

1. Pick the biases your task actually exposes. Any pairwise comparison exposes position. Open-ended generation exposes length and format. Factuality tasks expose authority. Subjective tasks expose sentiment. A same-family judge exposes self-preference.
2. Run each detection test as a controlled perturbation (swap the order, match the length, mask the identity, inject the cue) and record the flip rate as a continuous number so a near-miss is visible.
3. Apply the matching correction from the map: randomize and require both-direction agreement, length-control, blind and diversify the judge, strip the cues, normalize the surface.
4. Re-rank on the corrected judgments and report the change with an interval, bootstrapped over your items and runs, so the corrected ranking carries the uncertainty the raw score hid.
5. Keep the judge only where the corrected and uncorrected rankings agree; escalate the pairs that move to a human or a jury.

Step four is where this connects to the rest of reliability measurement. A correction is only finished when it is reported the way any reliability result should be reported, with the spread that says whether the change is real. Suppose the raw judge gives model A a 60% win rate and the length-controlled judge gives it 52%. Bootstrapping over items might put a 95% interval of roughly six points on each rate, so the corrected two-point lead over an even 50% split sits inside the noise and the correct call is a tie (computed from stated assumptions, not measured). The correction did its job precisely by dissolving a lead that was never real. The reporting-method companion that specifies step four in full, the rigor bridge from a set of bias tests to a single defensible ranking, is not yet published; this page names the tests, that one will formalize the arithmetic.
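A percentile bootstrap over items is one common way to produce that interval. The sketch below is not the unpublished reporting method; it assumes 250 items with a corrected 52% win rate, and the interval width depends entirely on that assumed sample size.

```python
import random


def bootstrap_win_rate_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a win rate, resampling items."""
    rng = random.Random(seed)
    n = len(outcomes)
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    low = rates[int((alpha / 2) * n_resamples)]
    high = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical corrected outcomes: 1 = model A wins the pair, 0 = it loses.
corrected = [1] * 130 + [0] * 120  # 52% win rate over an assumed 250 items

low, high = bootstrap_win_rate_ci(corrected)
verdict = "tie" if low <= 0.5 <= high else "lead"
print(f"corrected win rate: {sum(corrected) / len(corrected):.2f}")
print(f"95% interval: [{low:.3f}, {high:.3f}] -> {verdict}")
```

Because the interval straddles an even split, the code reports a tie. If your design includes repeated judge runs per item, resample those too so the interval covers run-to-run noise as well as item sampling.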

What to do with the pairs that move

The pairs where the corrected and uncorrected verdicts disagree are the output of this whole exercise, and they need a destination. Three options, in rising cost. The cheapest is to mark them ties and let the interval absorb them: a pair the judge cannot rank consistently is not a pair you should report a winner on. The next is a jury, a panel of judges from different model families voting on the pair, which averages out any one model’s self-preference and turns a single flip into a majority signal. The most expensive, and the ground truth the other two approximate, is human adjudication on the subset that survived every cheaper filter.
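The jury option reduces to a majority vote with an explicit tie. A minimal sketch, assuming each panel member returns "A", "B", or "tie" for the pair:

```python
from collections import Counter


def jury_verdict(votes):
    """Return the answer with a strict majority of votes, else "tie".

    Expects a non-empty list of votes, one per judge on the panel.
    """
    counts = Counter(votes)
    top, top_count = counts.most_common(1)[0]
    return top if top_count > len(votes) / 2 else "tie"


print(jury_verdict(["A", "B", "A"]))
print(jury_verdict(["A", "B"]))
```

Requiring a strict majority means a split panel falls back to a tie, which keeps the jury from manufacturing a winner the individual judges could not agree on.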

Route the volume accordingly. Most pairs are lopsided and survive every test unchanged, so they stay with the automated judge. A minority flip under one test and route to the jury. A smaller minority flip under several and route to a human. This is the same triage the reliability work applies elsewhere: spend the expensive review where the automated signal is weakest and let the strong signals stand elsewhere. The result is an evaluator you can defend line by line, because every verdict either survived the bias tests or was escalated past them.
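The routing rule is simple enough to encode directly. In this sketch the test names and destination labels are hypothetical; the thresholds (zero flips, one flip, several) follow the triage described above.

```python
from collections import Counter


def route_pair(flips_by_test):
    """Pick a destination from how many bias tests moved the verdict."""
    flipped = sum(1 for moved in flips_by_test.values() if moved)
    if flipped == 0:
        return "automated_judge"
    if flipped == 1:
        return "jury"
    return "human"


# Hypothetical test results for three pairs.
pairs = {
    "pair-001": {"swap": False, "length": False, "identity": False},
    "pair-002": {"swap": True, "length": False, "identity": False},
    "pair-003": {"swap": True, "length": True, "identity": False},
}

routes = {pair_id: route_pair(tests) for pair_id, tests in pairs.items()}
print(routes)
print(Counter(routes.values()))
```

Logging the route alongside each verdict is what makes the evaluator defensible line by line: every pair records either which tests it survived or where it was escalated.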

When correcting the judge is not worth it

Judge-bias correction costs runs and engineering time, and there are tasks where it barely moves the answer. If your evaluation is verifiable (unit tests, exact match, a checkable numeric result), the judge is not the source of truth and its biases never reach the score. If your comparisons are lopsided, where the stronger answer wins in any order and any format, a swap test alone confirms the ranking and you can stop. And if you ship a single model with no self-grading and clean judge prompts, self-preference and the injected-cue biases are not yet your exposure. Correction is a reliability spend, and like any reliability budget it belongs where the leak is; spreading it evenly across biases that cannot touch your task wastes it. A judge’s bias profile also drifts. Swap the judge model, or let a provider update it under the same name, and the swap-consistency and length slopes you measured last quarter can move, so a correction that passed once has to be re-run whenever the evaluator changes.

The decision rule is short. Before you trust a ranking from an LLM judge, run the swap. If the pair is close, add the length match and the identity mask. Correct what moves and report the corrected ranking with its interval. If nothing moves, you have earned the judge for that comparison. If something does, you have caught a bias that a single 85% agreement score would have shipped, and you have caught it on the comparison that mattered. The vocabulary behind these tests lives in the reliability glossary, the reliability of the systems a judge grades is its own discipline, and the published runs and their methods are where the correction numbers, with their intervals, stop being schematic.